[Commit] mint/doc draft1.txt, NONE, 1.1 fig1.fig, NONE, 1.1 regex.txt, NONE, 1.1

Mon Nov 29 06:53:30 PST 2004

Committed by: jlamar

Update of /local/src/CVS/mint/doc
In directory home.keithp.com:/tmp/cvs-serv27318/doc

Added Files:
	draft1.txt fig1.fig regex.txt 
Log Message:
Initial version.

--- NEW FILE: draft1.txt ---
+ MINT [James, Emma]
  + Design
    + LOLA/FLO-style split between generator + lexer and
      between generator + parser [fig]
    + Lexer and parser isolated in libraries + Generates Canonical
    LR(1) parsers + Produces ASTs
  + Implementation
    + Code base + Bootstrapping
  + Achieving goals

Design

There are three elements in the design of MINT that differentiate it
from traditional parser and lexer generators such as YACC and LEX.
First, MINT implements a canonical LR(1) parser rather than SLR or
LALR.  Processor speeds and memory capacities have grown enough in the
twenty years since parsing techniques were first developed that it was
deemed unecessary to implement a more complex algorithm in order to
achieve a (possibly modest) table size reduction. Second, rather than
targetting a specific output language such as C, MINT generates
language independent lexing and parsing tables, which can then be
loaded by lexer/parser libraries (the MINT runtime) implemented in the
desired target language. Third, MINT does not allow one to execute
arbitary actions during the parsing of input. Instead, MINT constructs
an abstract syntax tree. Grammar rules may additionally allow the
special action of changing the lexer state (also known as start
condition in flex-like parser generators.) This restriction on actions
is due to the difficulty of specifying arbitrary actions when the
target language is unknown, as it is in the MINT architecture.

[fig1: MINT architecture]

The design of MINT is split into two main sections: the MINT parser
and lexer generator, implemented in Nickle, and the MINT runtime,
which is currently implemented in Nickle but may be extended to other
languages. The output parsing and lexing tables are generated in a
simple line-oriented ASCII format that is easily parsed. Other formats
such as XML were considered, but dismissed as overly complicated for
this initial proof-of-concept application. Additionally, since the
lexer passes tokens to the parser, they need to agree on a standard
set of token specifications.

As stated above, MINT generates an abstract syntax tree after
successfully parsing a string. The advantage of this is that MINT can
parser generators targetting arbitrary back-ends written in multiple
languages.

Implementation

The lexer generator uses Thompson's construction and an NFA to DFA
conversion to create the lexing table. Minimum-state DFA and table
compaction methods were not used.

The lexer is a standard DFA based algorithm with three enhancements:
lexer states (aka "start conditions"), lookahead patterns, and
arbitrary size character sets. Because no code generation is performed
at at lexer generation time, the lexer cannot perform arbitrary
actions upon a rule match. The only allowed actions are "return a
token" and "none." Additionally the lexer may change its own state
upon matching a lexeme. MINT implements lexer states that are similar
to the "start conditions" present in FLEX. At any point in time, the
lexer is in one and only one state. Lexer matching rules only match if
the lexer is in the correct state. Rules may match in more than one
state. This is typically used to implement different "modes" of
lexing-- for example, "string mode," or "comment mode." The lexer's
state may be changed internally, by the lexer's own rule actions, or
externally by a call from the parser. Lookahead patterns are
implemented subject to the same restrictions as for FLEX. That is,
lookahead patterns such as "ab*/b" are not allowed. Finally, the lexer
can handle character sets of arbitrary size, and this should be
particularly useful with Unicode. Because a lookup table would grow
impractically large with such a large character set, MINT's lexer
implements the lexing table as an adjacency list and transitions are
matched based on character ranges.

The LR(1) parser generator algorithm is taken straight from the
classic text by Aho, Sethi, and Ullman. We decided to implement LR(1)
as opposed to SLR or LALR due to the simplicity of this algorithm and
the fact that memory sizes are large enough now that the compactness
of the tables is not so much of an issue. [cite table size statistics
for different algorithms] The parser generator is additionally
augmented with "precedence rules" to resolve shift/reduce conflicts.

Achieving goals

MINT was able to generate parsers for the regular expression and BNF
syntaxes fairly easily. Attempting to generate a parse table for a
small Nickle subset proved to be more difficult, however.

--- NEW FILE: fig1.fig ---
#FIG 3.2
Portrait
Center
Inches
Letter  
100.00
Single
-2
1200 2
6 5203 3039 5718 3586
6 5258 3155 5663 3470
4 0 0 50 0 4 10 0.0000 2 150 405 5258 3275 lexing\001
4 0 0 50 0 4 10 0.0000 2 120 330 5258 3470 table\001
-6
2 2 0 1 0 7 50 0 -1 4.000 0 0 -1 0 0 5
	 5203 3039 5718 3039 5718 3586 5203 3586 5203 3039
-6
6 3000 2636 7080 2786
4 0 0 50 0 4 10 0.0000 2 120 930 6150 2756 MINT runtime\001
4 0 0 50 0 4 10 0.0000 2 120 645 5107 2756 table files\001
4 0 0 50 0 4 10 0.0000 2 150 1530 3000 2756 parser/lexer generator\001
-6
6 5200 4037 5715 4585
6 5202 4138 5712 4483
4 0 0 50 0 4 10 0.0000 2 120 330 5202 4483 table\001
4 0 0 50 0 4 10 0.0000 2 150 510 5202 4258 parsing\001
-6
2 2 0 1 0 7 50 0 -1 4.000 0 0 -1 0 0 5
	 5200 4037 5715 4037 5715 4585 5200 4585 5200 4037
-6
2 1 0 1 0 7 50 0 -1 0.000 0 0 -1 1 0 2
	1 1 2.00 51.54 103.08
	 4655 3296 5203 3296
2 1 0 1 0 7 50 0 -1 0.000 0 0 -1 1 0 2
	1 1 2.00 51.54 103.08
	 4655 4295 5203 4295
2 1 0 1 0 7 50 0 -1 0.000 0 0 -1 1 0 2
	1 1 2.00 51.54 103.08
	 5718 3296 6266 3296
2 1 0 1 0 7 50 0 -1 0.000 0 0 -1 1 0 2
	1 1 2.00 51.54 103.08
	 5718 4295 6266 4295
2 1 0 1 0 7 50 0 -1 0.000 0 0 -1 1 0 2
	1 1 2.00 51.54 103.08
	 6620 3586 6620 4037
2 1 0 1 0 7 50 0 -1 6.000 0 0 -1 1 0 4
	1 1 2.00 51.54 103.08
	 3560 3747 3689 3747 3689 3296 3914 3296
2 1 0 1 0 7 50 0 -1 6.000 0 0 -1 1 0 4
	1 1 2.00 51.54 103.08
	 3560 3844 3689 3844 3689 4295 3914 4295
2 1 0 1 0 7 50 0 -1 6.000 0 0 -1 1 0 2
	1 1 2.00 51.54 103.08
	 2432 3779 2819 3779
2 4 1 1 0 7 50 0 -1 4.000 0 0 3 0 0 5
	 4816 4810 4816 2813 2593 2813 2593 4810 4816 4810
2 4 1 1 0 7 50 0 -1 4.000 0 0 3 0 0 5
	 7275 4800 7275 2813 6000 2813 6000 4800 7275 4800
2 4 1 1 0 7 50 0 -1 4.000 0 0 3 0 0 5
	 5850 4800 5850 2813 4977 2813 4977 4800 5850 4800
2 1 0 1 0 7 50 0 -1 0.000 0 0 -1 1 0 2
	1 1 2.00 51.54 103.08
	 7548 3300 7009 3300
2 2 0 1 0 7 50 0 -1 4.000 0 0 -1 0 0 5
	 2819 3554 3560 3554 3560 4101 2819 4101 2819 3554
2 2 0 1 0 7 50 0 -1 4.000 0 0 -1 0 0 5
	 3914 3039 4655 3039 4655 3586 3914 3586 3914 3039
2 2 0 1 0 7 50 0 -1 4.000 0 0 -1 0 0 5
	 3914 4037 4655 4037 4655 4585 3914 4585 3914 4037
2 2 0 1 0 7 50 0 -1 4.000 0 0 -1 0 0 5
	 6266 3039 7007 3039 7007 3586 6266 3586 6266 3039
2 2 0 1 0 7 50 0 -1 4.000 0 0 -1 0 0 5
	 6266 4037 7007 4037 7007 4585 6266 4585 6266 4037
2 1 0 1 0 7 50 0 -1 0.000 0 0 -1 1 0 2
	1 1 2.00 51.54 103.08
	 7009 4275 7557 4275
4 0 0 50 0 4 10 0.0000 2 120 600 1800 3900 grammar\001
4 0 0 50 0 4 10 0.0000 2 150 345 1800 3750 input\001
4 0 0 50 0 4 10 0.0000 2 150 690 7596 4315 parse tree\001
4 0 0 50 0 4 10 0.0000 2 120 435 2850 4022 parser\001
4 0 0 50 0 4 10 0.0000 2 120 600 2850 3872 grammar\001
4 0 0 50 0 4 10 0.0000 2 120 375 2850 3722 MINT\001
4 0 0 50 0 4 10 0.0000 2 120 330 3975 3260 lexer\001
4 0 0 50 0 4 10 0.0000 2 150 675 3975 3455 generator\001
4 0 0 50 0 4 10 0.0000 2 120 435 3975 4243 parser\001
4 0 0 50 0 4 10 0.0000 2 150 675 3975 4438 generator\001
4 0 0 50 0 4 10 0.0000 2 120 330 6471 3372 lexer\001
4 0 0 50 0 4 10 0.0000 2 120 435 6419 4341 parser\001
4 0 0 50 0 4 10 0.0000 2 150 795 7592 3355 input string\001

--- NEW FILE: regex.txt ---
R1 -> R2 PIPE R1 | R2
R2 -> R3 R2 | R3
R3 -> R4 STAR | R4 PLUS | R4 QMARK | R4 RANGE | R4
R4 -> CHAR | CCLASS | DOT | LPAREN R1 RPAREN

CCLASS -> LBRACKET (CHAR MINUS CHAR | CHAR)+ RBRACKET
       -> LBRACKET CARET (CHAR MINUS CHAR | CHAR)+ RBRACKET

RANGE  -> LBRACE NUMBER MINUS NUMBER RBRACE
       -> LBRACE NUMBER MINUS RBRACE
       -> LBRACE MINUS NUMBER RBRACE

/* handled by lexer */

CHAR   -> [^|*(){}[]-\]
       -> \[|*(){}[]-\]

NUMBER -> [0-9]+

PIPE      -> '|'
STAR      -> '*'
PLUS      -> '+'
MINUS     -> '-'
CARET     -> '^'
QMARK     -> '?'
DOT       -> '.'
LPAREN    -> '('
RPAREN    -> ')'
LBRACKET  -> '['
RBRACKET  -> ']'
LBRACE    -> '{'
RBRACE    -> '}'