blob: f05c481d7960f2e74897454eab1731adc4cbd6e3 [file] [log] [blame]
drh75897232000-05-29 14:26:00 +00001<html>
2<head>
3<title>The Lemon Parser Generator</title>
4</head>
5<body bgcolor=white>
6<h1 align=center>The Lemon Parser Generator</h1>
7
drh9bccde32016-03-19 18:00:44 +00008<p>Lemon is an LALR(1) parser generator for C.
9It does the same job as "bison" and "yacc".
10But lemon is not a bison or yacc clone. Lemon
drh75897232000-05-29 14:26:00 +000011uses a different grammar syntax which is designed to
drh9bccde32016-03-19 18:00:44 +000012reduce the number of coding errors. Lemon also uses a
13parsing engine that is faster than yacc and
14bison and which is both reentrant and threadsafe.
15(Update: Since the previous sentence was written, bison
16has also been updated so that it too can generate a
17reentrant and threadsafe parser.)
18Lemon also implements features that can be used
drh75897232000-05-29 14:26:00 +000019to eliminate resource leaks, making is suitable for use
20in long-running programs such as graphical user interfaces
21or embedded controllers.</p>
22
23<p>This document is an introduction to the Lemon
24parser generator.</p>
25
drhc5e56b32017-06-01 01:53:19 +000026<h2>Security Note</h2>
27
28<p>The language parser code created by Lemon is very robust and
29is well-suited for use in internet-facing applications that need to
30safely process maliciously crafted inputs.
31
32<p>The "lemon.exe" command-line tool itself works great when given a valid
33input grammar file and almost always gives helpful
34error messages for malformed inputs. However, it is possible for
35a malicious user to craft a grammar file that will cause
36lemon.exe to crash.
37We do not see this as a problem, as lemon.exe is not intended to be used
38with hostile inputs.
39To summarize:</p>
40
41<ul>
42<li>Parser code generated by lemon &rarr; Robust and secure
43<li>The "lemon.exe" command line tool itself &rarr; Not so much
44</ul>
45
drh75897232000-05-29 14:26:00 +000046<h2>Theory of Operation</h2>
47
48<p>The main goal of Lemon is to translate a context free grammar (CFG)
49for a particular language into C code that implements a parser for
50that language.
51The program has two inputs:
52<ul>
53<li>The grammar specification.
54<li>A parser template file.
55</ul>
56Typically, only the grammar specification is supplied by the programmer.
57Lemon comes with a default parser template which works fine for most
58applications. But the user is free to substitute a different parser
59template if desired.</p>
60
61<p>Depending on command-line options, Lemon will generate between
62one and three files of outputs.
63<ul>
64<li>C code to implement the parser.
65<li>A header file defining an integer ID for each terminal symbol.
66<li>An information file that describes the states of the generated parser
67 automaton.
68</ul>
69By default, all three of these output files are generated.
drh9bccde32016-03-19 18:00:44 +000070The header file is suppressed if the "-m" command-line option is
71used and the report file is omitted when "-q" is selected.</p>
drh75897232000-05-29 14:26:00 +000072
drh9bccde32016-03-19 18:00:44 +000073<p>The grammar specification file uses a ".y" suffix, by convention.
drh75897232000-05-29 14:26:00 +000074In the examples used in this document, we'll assume the name of the
drh9bccde32016-03-19 18:00:44 +000075grammar file is "gram.y". A typical use of Lemon would be the
drh75897232000-05-29 14:26:00 +000076following command:
77<pre>
78 lemon gram.y
79</pre>
drh9bccde32016-03-19 18:00:44 +000080This command will generate three output files named "gram.c",
81"gram.h" and "gram.out".
drh75897232000-05-29 14:26:00 +000082The first is C code to implement the parser. The second
83is the header file that defines numerical values for all
84terminal symbols, and the last is the report that explains
85the states used by the parser automaton.</p>
86
87<h3>Command Line Options</h3>
88
89<p>The behavior of Lemon can be modified using command-line options.
90You can obtain a list of the available command-line options together
91with a brief explanation of what each does by typing
92<pre>
93 lemon -?
94</pre>
95As of this writing, the following command-line options are supported:
96<ul>
drh9bccde32016-03-19 18:00:44 +000097<li><b>-b</b>
98Show only the basis for each parser state in the report file.
99<li><b>-c</b>
100Do not compress the generated action tables.
101<li><b>-D<i>name</i></b>
102Define C preprocessor macro <i>name</i>. This macro is useable by
103"%ifdef" lines in the grammar file.
104<li><b>-g</b>
105Do not generate a parser. Instead write the input grammar to standard
106output with all comments, actions, and other extraneous text removed.
107<li><b>-l</b>
drhdfe4e6b2016-10-08 13:34:08 +0000108Omit "#line" directives in the generated parser C code.
drh9bccde32016-03-19 18:00:44 +0000109<li><b>-m</b>
110Cause the output C source code to be compatible with the "makeheaders"
111program.
112<li><b>-p</b>
113Display all conflicts that are resolved by
114<a href='#precrules'>precedence rules</a>.
115<li><b>-q</b>
116Suppress generation of the report file.
117<li><b>-r</b>
118Do not sort or renumber the parser states as part of optimization.
119<li><b>-s</b>
120Show parser statistics before existing.
121<li><b>-T<i>file</i></b>
122Use <i>file</i> as the template for the generated C-code parser implementation.
123<li><b>-x</b>
124Print the Lemon version number.
drh75897232000-05-29 14:26:00 +0000125</ul>
drh75897232000-05-29 14:26:00 +0000126
127<h3>The Parser Interface</h3>
128
129<p>Lemon doesn't generate a complete, working program. It only generates
130a few subroutines that implement a parser. This section describes
131the interface to those subroutines. It is up to the programmer to
132call these subroutines in an appropriate way in order to produce a
133complete system.</p>
134
135<p>Before a program begins using a Lemon-generated parser, the program
136must first create the parser.
137A new parser is created as follows:
138<pre>
139 void *pParser = ParseAlloc( malloc );
140</pre>
141The ParseAlloc() routine allocates and initializes a new parser and
142returns a pointer to it.
drh9bccde32016-03-19 18:00:44 +0000143The actual data structure used to represent a parser is opaque &mdash;
drh75897232000-05-29 14:26:00 +0000144its internal structure is not visible or usable by the calling routine.
145For this reason, the ParseAlloc() routine returns a pointer to void
146rather than a pointer to some particular structure.
147The sole argument to the ParseAlloc() routine is a pointer to the
drh9bccde32016-03-19 18:00:44 +0000148subroutine used to allocate memory. Typically this means malloc().</p>
drh75897232000-05-29 14:26:00 +0000149
150<p>After a program is finished using a parser, it can reclaim all
151memory allocated by that parser by calling
152<pre>
153 ParseFree(pParser, free);
154</pre>
155The first argument is the same pointer returned by ParseAlloc(). The
156second argument is a pointer to the function used to release bulk
157memory back to the system.</p>
158
159<p>After a parser has been allocated using ParseAlloc(), the programmer
160must supply the parser with a sequence of tokens (terminal symbols) to
161be parsed. This is accomplished by calling the following function
162once for each token:
163<pre>
164 Parse(pParser, hTokenID, sTokenData, pArg);
165</pre>
166The first argument to the Parse() routine is the pointer returned by
167ParseAlloc().
168The second argument is a small positive integer that tells the parse the
169type of the next token in the data stream.
170There is one token type for each terminal symbol in the grammar.
171The gram.h file generated by Lemon contains #define statements that
172map symbolic terminal symbol names into appropriate integer values.
drh9bccde32016-03-19 18:00:44 +0000173A value of 0 for the second argument is a special flag to the
174parser to indicate that the end of input has been reached.
drh75897232000-05-29 14:26:00 +0000175The third argument is the value of the given token. By default,
176the type of the third argument is integer, but the grammar will
177usually redefine this type to be some kind of structure.
178Typically the second argument will be a broad category of tokens
drh9bccde32016-03-19 18:00:44 +0000179such as "identifier" or "number" and the third argument will
drh75897232000-05-29 14:26:00 +0000180be the name of the identifier or the value of the number.</p>
181
182<p>The Parse() function may have either three or four arguments,
drh45f31be2016-02-16 21:19:49 +0000183depending on the grammar. If the grammar specification file requests
184it (via the <a href='#extraarg'><tt>extra_argument</tt> directive</a>),
185the Parse() function will have a fourth parameter that can be
drh75897232000-05-29 14:26:00 +0000186of any type chosen by the programmer. The parser doesn't do anything
187with this argument except to pass it through to action routines.
188This is a convenient mechanism for passing state information down
189to the action routines without having to use global variables.</p>
190
191<p>A typical use of a Lemon parser might look something like the
192following:
193<pre>
194 01 ParseTree *ParseFile(const char *zFilename){
195 02 Tokenizer *pTokenizer;
196 03 void *pParser;
197 04 Token sToken;
198 05 int hTokenId;
199 06 ParserState sState;
200 07
201 08 pTokenizer = TokenizerCreate(zFilename);
202 09 pParser = ParseAlloc( malloc );
203 10 InitParserState(&sState);
204 11 while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){
205 12 Parse(pParser, hTokenId, sToken, &sState);
206 13 }
207 14 Parse(pParser, 0, sToken, &sState);
208 15 ParseFree(pParser, free );
209 16 TokenizerFree(pTokenizer);
210 17 return sState.treeRoot;
211 18 }
212</pre>
213This example shows a user-written routine that parses a file of
214text and returns a pointer to the parse tree.
drh9bccde32016-03-19 18:00:44 +0000215(All error-handling code is omitted from this example to keep it
drh75897232000-05-29 14:26:00 +0000216simple.)
217We assume the existence of some kind of tokenizer which is created
218using TokenizerCreate() on line 8 and deleted by TokenizerFree()
219on line 16. The GetNextToken() function on line 11 retrieves the
220next token from the input file and puts its type in the
221integer variable hTokenId. The sToken variable is assumed to be
222some kind of structure that contains details about each token,
223such as its complete text, what line it occurs on, etc. </p>
224
225<p>This example also assumes the existence of structure of type
226ParserState that holds state information about a particular parse.
227An instance of such a structure is created on line 6 and initialized
228on line 10. A pointer to this structure is passed into the Parse()
229routine as the optional 4th argument.
230The action routine specified by the grammar for the parser can use
231the ParserState structure to hold whatever information is useful and
232appropriate. In the example, we note that the treeRoot field of
233the ParserState structure is left pointing to the root of the parse
234tree.</p>
235
236<p>The core of this example as it relates to Lemon is as follows:
237<pre>
238 ParseFile(){
239 pParser = ParseAlloc( malloc );
240 while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){
241 Parse(pParser, hTokenId, sToken);
242 }
243 Parse(pParser, 0, sToken);
244 ParseFree(pParser, free );
245 }
246</pre>
247Basically, what a program has to do to use a Lemon-generated parser
248is first create the parser, then send it lots of tokens obtained by
249tokenizing an input source. When the end of input is reached, the
250Parse() routine should be called one last time with a token type
251of 0. This step is necessary to inform the parser that the end of
252input has been reached. Finally, we reclaim memory used by the
253parser by calling ParseFree().</p>
254
255<p>There is one other interface routine that should be mentioned
256before we move on.
257The ParseTrace() function can be used to generate debugging output
258from the parser. A prototype for this routine is as follows:
259<pre>
260 ParseTrace(FILE *stream, char *zPrefix);
261</pre>
262After this routine is called, a short (one-line) message is written
263to the designated output stream every time the parser changes states
264or calls an action routine. Each such message is prefaced using
265the text given by zPrefix. This debugging output can be turned off
266by calling ParseTrace() again with a first argument of NULL (0).</p>
267
268<h3>Differences With YACC and BISON</h3>
269
270<p>Programmers who have previously used the yacc or bison parser
271generator will notice several important differences between yacc and/or
272bison and Lemon.
273<ul>
274<li>In yacc and bison, the parser calls the tokenizer. In Lemon,
275 the tokenizer calls the parser.
276<li>Lemon uses no global variables. Yacc and bison use global variables
277 to pass information between the tokenizer and parser.
278<li>Lemon allows multiple parsers to be running simultaneously. Yacc
279 and bison do not.
280</ul>
281These differences may cause some initial confusion for programmers
282with prior yacc and bison experience.
283But after years of experience using Lemon, I firmly
284believe that the Lemon way of doing things is better.</p>
285
drh45f31be2016-02-16 21:19:49 +0000286<p><i>Updated as of 2016-02-16:</i>
287The text above was written in the 1990s.
288We are told that Bison has lately been enhanced to support the
289tokenizer-calls-parser paradigm used by Lemon, and to obviate the
290need for global variables.</p>
291
drh75897232000-05-29 14:26:00 +0000292<h2>Input File Syntax</h2>
293
294<p>The main purpose of the grammar specification file for Lemon is
295to define the grammar for the parser. But the input file also
296specifies additional information Lemon requires to do its job.
297Most of the work in using Lemon is in writing an appropriate
298grammar file.</p>
299
300<p>The grammar file for lemon is, for the most part, free format.
301It does not have sections or divisions like yacc or bison. Any
302declaration can occur at any point in the file.
303Lemon ignores whitespace (except where it is needed to separate
304tokens) and it honors the same commenting conventions as C and C++.</p>
305
306<h3>Terminals and Nonterminals</h3>
307
308<p>A terminal symbol (token) is any string of alphanumeric
drh9bccde32016-03-19 18:00:44 +0000309and/or underscore characters
drh75897232000-05-29 14:26:00 +0000310that begins with an upper case letter.
drhc8eee5e2011-07-30 23:50:12 +0000311A terminal can contain lowercase letters after the first character,
drh75897232000-05-29 14:26:00 +0000312but the usual convention is to make terminals all upper case.
313A nonterminal, on the other hand, is any string of alphanumeric
314and underscore characters than begins with a lower case letter.
315Again, the usual convention is to make nonterminals use all lower
316case letters.</p>
317
318<p>In Lemon, terminal and nonterminal symbols do not need to
319be declared or identified in a separate section of the grammar file.
320Lemon is able to generate a list of all terminals and nonterminals
321by examining the grammar rules, and it can always distinguish a
322terminal from a nonterminal by checking the case of the first
323character of the name.</p>
324
325<p>Yacc and bison allow terminal symbols to have either alphanumeric
326names or to be individual characters included in single quotes, like
327this: ')' or '$'. Lemon does not allow this alternative form for
328terminal symbols. With Lemon, all symbols, terminals and nonterminals,
329must have alphanumeric names.</p>
330
331<h3>Grammar Rules</h3>
332
333<p>The main component of a Lemon grammar file is a sequence of grammar
334rules.
335Each grammar rule consists of a nonterminal symbol followed by
drh9bccde32016-03-19 18:00:44 +0000336the special symbol "::=" and then a list of terminals and/or nonterminals.
drh75897232000-05-29 14:26:00 +0000337The rule is terminated by a period.
338The list of terminals and nonterminals on the right-hand side of the
339rule can be empty.
340Rules can occur in any order, except that the left-hand side of the
341first rule is assumed to be the start symbol for the grammar (unless
342specified otherwise using the <tt>%start</tt> directive described below.)
343A typical sequence of grammar rules might look something like this:
344<pre>
345 expr ::= expr PLUS expr.
346 expr ::= expr TIMES expr.
347 expr ::= LPAREN expr RPAREN.
348 expr ::= VALUE.
349</pre>
350</p>
351
drh9bccde32016-03-19 18:00:44 +0000352<p>There is one non-terminal in this example, "expr", and five
353terminal symbols or tokens: "PLUS", "TIMES", "LPAREN",
354"RPAREN" and "VALUE".</p>
drh75897232000-05-29 14:26:00 +0000355
356<p>Like yacc and bison, Lemon allows the grammar to specify a block
357of C code that will be executed whenever a grammar rule is reduced
358by the parser.
359In Lemon, this action is specified by putting the C code (contained
360within curly braces <tt>{...}</tt>) immediately after the
361period that closes the rule.
362For example:
363<pre>
364 expr ::= expr PLUS expr. { printf("Doing an addition...\n"); }
365</pre>
366</p>
367
368<p>In order to be useful, grammar actions must normally be linked to
369their associated grammar rules.
drh9bccde32016-03-19 18:00:44 +0000370In yacc and bison, this is accomplished by embedding a "$$" in the
drh75897232000-05-29 14:26:00 +0000371action to stand for the value of the left-hand side of the rule and
drh9bccde32016-03-19 18:00:44 +0000372symbols "$1", "$2", and so forth to stand for the value of
drh75897232000-05-29 14:26:00 +0000373the terminal or nonterminal at position 1, 2 and so forth on the
374right-hand side of the rule.
375This idea is very powerful, but it is also very error-prone. The
376single most common source of errors in a yacc or bison grammar is
377to miscount the number of symbols on the right-hand side of a grammar
drh9bccde32016-03-19 18:00:44 +0000378rule and say "$7" when you really mean "$8".</p>
drh75897232000-05-29 14:26:00 +0000379
380<p>Lemon avoids the need to count grammar symbols by assigning symbolic
381names to each symbol in a grammar rule and then using those symbolic
382names in the action.
383In yacc or bison, one would write this:
384<pre>
385 expr -> expr PLUS expr { $$ = $1 + $3; };
386</pre>
387But in Lemon, the same rule becomes the following:
388<pre>
389 expr(A) ::= expr(B) PLUS expr(C). { A = B+C; }
390</pre>
391In the Lemon rule, any symbol in parentheses after a grammar rule
392symbol becomes a place holder for that symbol in the grammar rule.
393This place holder can then be used in the associated C action to
394stand for the value of that symbol.<p>
395
396<p>The Lemon notation for linking a grammar rule with its reduce
397action is superior to yacc/bison on several counts.
398First, as mentioned above, the Lemon method avoids the need to
399count grammar symbols.
400Secondly, if a terminal or nonterminal in a Lemon grammar rule
401includes a linking symbol in parentheses but that linking symbol
402is not actually used in the reduce action, then an error message
403is generated.
404For example, the rule
405<pre>
406 expr(A) ::= expr(B) PLUS expr(C). { A = B; }
407</pre>
drh9bccde32016-03-19 18:00:44 +0000408will generate an error because the linking symbol "C" is used
drh75897232000-05-29 14:26:00 +0000409in the grammar rule but not in the reduce action.</p>
410
411<p>The Lemon notation for linking grammar rules to reduce actions
412also facilitates the use of destructors for reclaiming memory
413allocated by the values of terminals and nonterminals on the
414right-hand side of a rule.</p>
415
drh9bccde32016-03-19 18:00:44 +0000416<a name='precrules'></a>
drh75897232000-05-29 14:26:00 +0000417<h3>Precedence Rules</h3>
418
419<p>Lemon resolves parsing ambiguities in exactly the same way as
420yacc and bison. A shift-reduce conflict is resolved in favor
421of the shift, and a reduce-reduce conflict is resolved by reducing
422whichever rule comes first in the grammar file.</p>
423
424<p>Just like in
425yacc and bison, Lemon allows a measure of control
426over the resolution of paring conflicts using precedence rules.
427A precedence value can be assigned to any terminal symbol
drh9bccde32016-03-19 18:00:44 +0000428using the
429<a href='#pleft'>%left</a>,
430<a href='#pright'>%right</a> or
431<a href='#pnonassoc'>%nonassoc</a> directives. Terminal symbols
drh75897232000-05-29 14:26:00 +0000432mentioned in earlier directives have a lower precedence that
433terminal symbols mentioned in later directives. For example:</p>
434
435<p><pre>
436 %left AND.
437 %left OR.
438 %nonassoc EQ NE GT GE LT LE.
439 %left PLUS MINUS.
440 %left TIMES DIVIDE MOD.
441 %right EXP NOT.
442</pre></p>
443
444<p>In the preceding sequence of directives, the AND operator is
445defined to have the lowest precedence. The OR operator is one
446precedence level higher. And so forth. Hence, the grammar would
447attempt to group the ambiguous expression
448<pre>
449 a AND b OR c
450</pre>
451like this
452<pre>
453 a AND (b OR c).
454</pre>
455The associativity (left, right or nonassoc) is used to determine
456the grouping when the precedence is the same. AND is left-associative
457in our example, so
458<pre>
459 a AND b AND c
460</pre>
461is parsed like this
462<pre>
463 (a AND b) AND c.
464</pre>
465The EXP operator is right-associative, though, so
466<pre>
467 a EXP b EXP c
468</pre>
469is parsed like this
470<pre>
471 a EXP (b EXP c).
472</pre>
473The nonassoc precedence is used for non-associative operators.
474So
475<pre>
476 a EQ b EQ c
477</pre>
478is an error.</p>
479
480<p>The precedence of non-terminals is transferred to rules as follows:
481The precedence of a grammar rule is equal to the precedence of the
482left-most terminal symbol in the rule for which a precedence is
483defined. This is normally what you want, but in those cases where
484you want to precedence of a grammar rule to be something different,
485you can specify an alternative precedence symbol by putting the
486symbol in square braces after the period at the end of the rule and
487before any C-code. For example:</p>
488
489<p><pre>
490 expr = MINUS expr. [NOT]
491</pre></p>
492
493<p>This rule has a precedence equal to that of the NOT symbol, not the
494MINUS symbol as would have been the case by default.</p>
495
496<p>With the knowledge of how precedence is assigned to terminal
497symbols and individual
498grammar rules, we can now explain precisely how parsing conflicts
499are resolved in Lemon. Shift-reduce conflicts are resolved
500as follows:
501<ul>
502<li> If either the token to be shifted or the rule to be reduced
503 lacks precedence information, then resolve in favor of the
504 shift, but report a parsing conflict.
505<li> If the precedence of the token to be shifted is greater than
506 the precedence of the rule to reduce, then resolve in favor
507 of the shift. No parsing conflict is reported.
508<li> If the precedence of the token it be shifted is less than the
509 precedence of the rule to reduce, then resolve in favor of the
510 reduce action. No parsing conflict is reported.
511<li> If the precedences are the same and the shift token is
512 right-associative, then resolve in favor of the shift.
513 No parsing conflict is reported.
mistachkind5578432012-08-25 10:01:29 +0000514<li> If the precedences are the same the shift token is
drh75897232000-05-29 14:26:00 +0000515 left-associative, then resolve in favor of the reduce.
516 No parsing conflict is reported.
517<li> Otherwise, resolve the conflict by doing the shift and
518 report the parsing conflict.
519</ul>
520Reduce-reduce conflicts are resolved this way:
521<ul>
522<li> If either reduce rule
523 lacks precedence information, then resolve in favor of the
524 rule that appears first in the grammar and report a parsing
525 conflict.
526<li> If both rules have precedence and the precedence is different
527 then resolve the dispute in favor of the rule with the highest
528 precedence and do not report a conflict.
529<li> Otherwise, resolve the conflict by reducing by the rule that
530 appears first in the grammar and report a parsing conflict.
531</ul>
532
533<h3>Special Directives</h3>
534
535<p>The input grammar to Lemon consists of grammar rules and special
536directives. We've described all the grammar rules, so now we'll
537talk about the special directives.</p>
538
539<p>Directives in lemon can occur in any order. You can put them before
540the grammar rules, or after the grammar rules, or in the mist of the
541grammar rules. It doesn't matter. The relative order of
542directives used to assign precedence to terminals is important, but
543other than that, the order of directives in Lemon is arbitrary.</p>
544
545<p>Lemon supports the following special directives:
546<ul>
drhf2340fc2001-06-08 00:25:18 +0000547<li><tt>%code</tt>
548<li><tt>%default_destructor</tt>
549<li><tt>%default_type</tt>
drh75897232000-05-29 14:26:00 +0000550<li><tt>%destructor</tt>
drh9bccde32016-03-19 18:00:44 +0000551<li><tt>%endif</tt>
drh75897232000-05-29 14:26:00 +0000552<li><tt>%extra_argument</tt>
drh9bccde32016-03-19 18:00:44 +0000553<li><tt>%fallback</tt>
554<li><tt>%ifdef</tt>
555<li><tt>%ifndef</tt>
drh75897232000-05-29 14:26:00 +0000556<li><tt>%include</tt>
557<li><tt>%left</tt>
558<li><tt>%name</tt>
559<li><tt>%nonassoc</tt>
560<li><tt>%parse_accept</tt>
561<li><tt>%parse_failure </tt>
562<li><tt>%right</tt>
563<li><tt>%stack_overflow</tt>
564<li><tt>%stack_size</tt>
565<li><tt>%start_symbol</tt>
566<li><tt>%syntax_error</tt>
drh9bccde32016-03-19 18:00:44 +0000567<li><tt>%token_class</tt>
drh75897232000-05-29 14:26:00 +0000568<li><tt>%token_destructor</tt>
569<li><tt>%token_prefix</tt>
570<li><tt>%token_type</tt>
571<li><tt>%type</tt>
drh9bccde32016-03-19 18:00:44 +0000572<li><tt>%wildcard</tt>
drh75897232000-05-29 14:26:00 +0000573</ul>
574Each of these directives will be described separately in the
575following sections:</p>
576
drh9bccde32016-03-19 18:00:44 +0000577<a name='pcode'></a>
drhf2340fc2001-06-08 00:25:18 +0000578<h4>The <tt>%code</tt> directive</h4>
579
drh9bccde32016-03-19 18:00:44 +0000580<p>The %code directive is used to specify addition C code that
drhf2340fc2001-06-08 00:25:18 +0000581is added to the end of the main output file. This is similar to
drh9bccde32016-03-19 18:00:44 +0000582the <a href='#pinclude'>%include</a> directive except that %include
583is inserted at the beginning of the main output file.</p>
drhf2340fc2001-06-08 00:25:18 +0000584
585<p>%code is typically used to include some action routines or perhaps
drh9bccde32016-03-19 18:00:44 +0000586a tokenizer or even the "main()" function
587as part of the output file.</p>
drhf2340fc2001-06-08 00:25:18 +0000588
drh9bccde32016-03-19 18:00:44 +0000589<a name='default_destructor'></a>
drhf2340fc2001-06-08 00:25:18 +0000590<h4>The <tt>%default_destructor</tt> directive</h4>
591
592<p>The %default_destructor directive specifies a destructor to
593use for non-terminals that do not have their own destructor
594specified by a separate %destructor directive. See the documentation
drh9bccde32016-03-19 18:00:44 +0000595on the <a name='#destructor'>%destructor</a> directive below for
596additional information.</p>
drhf2340fc2001-06-08 00:25:18 +0000597
598<p>In some grammers, many different non-terminal symbols have the
599same datatype and hence the same destructor. This directive is
600a convenience way to specify the same destructor for all those
601non-terminals using a single statement.</p>
602
drh9bccde32016-03-19 18:00:44 +0000603<a name='default_type'></a>
drhf2340fc2001-06-08 00:25:18 +0000604<h4>The <tt>%default_type</tt> directive</h4>
605
606<p>The %default_type directive specifies the datatype of non-terminal
607symbols that do no have their own datatype defined using a separate
drh9bccde32016-03-19 18:00:44 +0000608<a href='#ptype'>%type</a> directive.
609</p>
drhf2340fc2001-06-08 00:25:18 +0000610
drh9bccde32016-03-19 18:00:44 +0000611<a name='destructor'></a>
drh75897232000-05-29 14:26:00 +0000612<h4>The <tt>%destructor</tt> directive</h4>
613
614<p>The %destructor directive is used to specify a destructor for
615a non-terminal symbol.
drh9bccde32016-03-19 18:00:44 +0000616(See also the <a href='#token_destructor'>%token_destructor</a>
617directive which is used to specify a destructor for terminal symbols.)</p>
drh75897232000-05-29 14:26:00 +0000618
619<p>A non-terminal's destructor is called to dispose of the
620non-terminal's value whenever the non-terminal is popped from
621the stack. This includes all of the following circumstances:
622<ul>
623<li> When a rule reduces and the value of a non-terminal on
624 the right-hand side is not linked to C code.
625<li> When the stack is popped during error processing.
626<li> When the ParseFree() function runs.
627</ul>
628The destructor can do whatever it wants with the value of
629the non-terminal, but its design is to deallocate memory
630or other resources held by that non-terminal.</p>
631
632<p>Consider an example:
633<pre>
634 %type nt {void*}
635 %destructor nt { free($$); }
636 nt(A) ::= ID NUM. { A = malloc( 100 ); }
637</pre>
638This example is a bit contrived but it serves to illustrate how
639destructors work. The example shows a non-terminal named
drh9bccde32016-03-19 18:00:44 +0000640"nt" that holds values of type "void*". When the rule for
641an "nt" reduces, it sets the value of the non-terminal to
drh75897232000-05-29 14:26:00 +0000642space obtained from malloc(). Later, when the nt non-terminal
643is popped from the stack, the destructor will fire and call
644free() on this malloced space, thus avoiding a memory leak.
drh9bccde32016-03-19 18:00:44 +0000645(Note that the symbol "$$" in the destructor code is replaced
drh75897232000-05-29 14:26:00 +0000646by the value of the non-terminal.)</p>
647
648<p>It is important to note that the value of a non-terminal is passed
649to the destructor whenever the non-terminal is removed from the
650stack, unless the non-terminal is used in a C-code action. If
651the non-terminal is used by C-code, then it is assumed that the
drh9bccde32016-03-19 18:00:44 +0000652C-code will take care of destroying it.
653More commonly, the value is used to build some
drh75897232000-05-29 14:26:00 +0000654larger structure and we don't want to destroy it, which is why
655the destructor is not called in this circumstance.</p>
656
drh9bccde32016-03-19 18:00:44 +0000657<p>Destructors help avoid memory leaks by automatically freeing
658allocated objects when they go out of scope.
drh75897232000-05-29 14:26:00 +0000659To do the same using yacc or bison is much more difficult.</p>
660
drh45f31be2016-02-16 21:19:49 +0000661<a name="extraarg"></a>
drh75897232000-05-29 14:26:00 +0000662<h4>The <tt>%extra_argument</tt> directive</h4>
663
664The %extra_argument directive instructs Lemon to add a 4th parameter
665to the parameter list of the Parse() function it generates. Lemon
666doesn't do anything itself with this extra argument, but it does
667make the argument available to C-code action routines, destructors,
668and so forth. For example, if the grammar file contains:</p>
669
670<p><pre>
671 %extra_argument { MyStruct *pAbc }
672</pre></p>
673
674<p>Then the Parse() function generated will have an 4th parameter
drh9bccde32016-03-19 18:00:44 +0000675of type "MyStruct*" and all action routines will have access to
676a variable named "pAbc" that is the value of the 4th parameter
drh75897232000-05-29 14:26:00 +0000677in the most recent call to Parse().</p>
678
drh9bccde32016-03-19 18:00:44 +0000679<a name='pfallback'></a>
680<h4>The <tt>%fallback</tt> directive</h4>
681
682<p>The %fallback directive specifies an alternative meaning for one
683or more tokens. The alternative meaning is tried if the original token
684would have generated a syntax error.
685
686<p>The %fallback directive was added to support robust parsing of SQL
687syntax in <a href="https://www.sqlite.org/">SQLite</a>.
688The SQL language contains a large assortment of keywords, each of which
689appears as a different token to the language parser. SQL contains so
690many keywords, that it can be difficult for programmers to keep up with
691them all. Programmers will, therefore, sometimes mistakenly use an
692obscure language keyword for an identifier. The %fallback directive
693provides a mechanism to tell the parser: "If you are unable to parse
694this keyword, try treating it as an identifier instead."
695
696<p>The syntax of %fallback is as follows:
697
698<blockquote>
699<tt>%fallback</tt> <i>ID</i> <i>TOKEN...</i> <b>.</b>
700</blockquote>
701
702<p>In words, the %fallback directive is followed by a list of token names
703terminated by a period. The first token name is the fallback token - the
704token to which all the other tokens fall back to. The second and subsequent
705arguments are tokens which fall back to the token identified by the first
706argument.
707
708<a name='pifdef'></a>
709<h4>The <tt>%ifdef</tt>, <tt>%ifndef</tt>, and <tt>%endif</tt> directives.</h4>
710
711<p>The %ifdef, %ifndef, and %endif directives are similar to
712#ifdef, #ifndef, and #endif in the C-preprocessor, just not as general.
713Each of these directives must begin at the left margin. No whitespace
714is allowed between the "%" and the directive name.
715
716<p>Grammar text in between "%ifdef MACRO" and the next nested "%endif" is
717ignored unless the "-DMACRO" command-line option is used. Grammar text
718betwen "%ifndef MACRO" and the next nested "%endif" is included except when
719the "-DMACRO" command-line option is used.
720
721<p>Note that the argument to %ifdef and %ifndef must be a single
722preprocessor symbol name, not a general expression. There is no "%else"
723directive.
724
725
726<a name='pinclude'></a>
drh75897232000-05-29 14:26:00 +0000727<h4>The <tt>%include</tt> directive</h4>
728
729<p>The %include directive specifies C code that is included at the
730top of the generated parser. You can include any text you want --
drhf2340fc2001-06-08 00:25:18 +0000731the Lemon parser generator copies it blindly. If you have multiple
drh9bccde32016-03-19 18:00:44 +0000732%include directives in your grammar file, their values are concatenated
733so that all %include code ultimately appears near the top of the
734generated parser, in the same order as it appeared in the grammer.</p>
drh75897232000-05-29 14:26:00 +0000735
736<p>The %include directive is very handy for getting some extra #include
737preprocessor statements at the beginning of the generated parser.
738For example:</p>
739
740<p><pre>
741 %include {#include &lt;unistd.h&gt;}
742</pre></p>
743
744<p>This might be needed, for example, if some of the C actions in the
745grammar call functions that are prototyed in unistd.h.</p>
746
drh9bccde32016-03-19 18:00:44 +0000747<a name='pleft'></a>
drh75897232000-05-29 14:26:00 +0000748<h4>The <tt>%left</tt> directive</h4>
749
drh9bccde32016-03-19 18:00:44 +0000750The %left directive is used (along with the <a href='#pright'>%right</a> and
751<a href='#pnonassoc'>%nonassoc</a> directives) to declare precedences of
752terminal symbols. Every terminal symbol whose name appears after
753a %left directive but before the next period (".") is
drh75897232000-05-29 14:26:00 +0000754given the same left-associative precedence value. Subsequent
755%left directives have higher precedence. For example:</p>
756
757<p><pre>
758 %left AND.
759 %left OR.
760 %nonassoc EQ NE GT GE LT LE.
761 %left PLUS MINUS.
762 %left TIMES DIVIDE MOD.
763 %right EXP NOT.
764</pre></p>
765
766<p>Note the period that terminates each %left, %right or %nonassoc
767directive.</p>
768
769<p>LALR(1) grammars can get into a situation where they require
770a large amount of stack space if you make heavy use or right-associative
771operators. For this reason, it is recommended that you use %left
772rather than %right whenever possible.</p>
773
drh9bccde32016-03-19 18:00:44 +0000774<a name='pname'></a>
drh75897232000-05-29 14:26:00 +0000775<h4>The <tt>%name</tt> directive</h4>
776
777<p>By default, the functions generated by Lemon all begin with the
drh9bccde32016-03-19 18:00:44 +0000778five-character string "Parse". You can change this string to something
drh75897232000-05-29 14:26:00 +0000779different using the %name directive. For instance:</p>
780
781<p><pre>
782 %name Abcde
783</pre></p>
784
785<p>Putting this directive in the grammar file will cause Lemon to generate
786functions named
787<ul>
788<li> AbcdeAlloc(),
789<li> AbcdeFree(),
790<li> AbcdeTrace(), and
791<li> Abcde().
792</ul>
793The %name directive allows you to generator two or more different
794parsers and link them all into the same executable.
795</p>
796
drh9bccde32016-03-19 18:00:44 +0000797<a name='pnonassoc'></a>
drh75897232000-05-29 14:26:00 +0000798<h4>The <tt>%nonassoc</tt> directive</h4>
799
800<p>This directive is used to assign non-associative precedence to
drh9bccde32016-03-19 18:00:44 +0000801one or more terminal symbols. See the section on
802<a href='#precrules'>precedence rules</a>
803or on the <a href='#pleft'>%left</a> directive for additional information.</p>
drh75897232000-05-29 14:26:00 +0000804
drh9bccde32016-03-19 18:00:44 +0000805<a name='parse_accept'></a>
drh75897232000-05-29 14:26:00 +0000806<h4>The <tt>%parse_accept</tt> directive</h4>
807
808<p>The %parse_accept directive specifies a block of C code that is
drh9bccde32016-03-19 18:00:44 +0000809executed whenever the parser accepts its input string. To "accept"
drh75897232000-05-29 14:26:00 +0000810an input string means that the parser was able to process all tokens
811without error.</p>
812
813<p>For example:</p>
814
815<p><pre>
816 %parse_accept {
817 printf("parsing complete!\n");
818 }
819</pre></p>
820
drh9bccde32016-03-19 18:00:44 +0000821<a name='parse_failure'></a>
drh75897232000-05-29 14:26:00 +0000822<h4>The <tt>%parse_failure</tt> directive</h4>
823
824<p>The %parse_failure directive specifies a block of C code that
825is executed whenever the parser fails complete. This code is not
826executed until the parser has tried and failed to resolve an input
827error using is usual error recovery strategy. The routine is
828only invoked when parsing is unable to continue.</p>
829
830<p><pre>
831 %parse_failure {
832 fprintf(stderr,"Giving up. Parser is hopelessly lost...\n");
833 }
834</pre></p>
835
drh9bccde32016-03-19 18:00:44 +0000836<a name='pright'></a>
drh75897232000-05-29 14:26:00 +0000837<h4>The <tt>%right</tt> directive</h4>
838
839<p>This directive is used to assign right-associative precedence to
drh9bccde32016-03-19 18:00:44 +0000840one or more terminal symbols. See the section on
841<a href='#precrules'>precedence rules</a>
842or on the <a href='#pleft'>%left</a> directive for additional information.</p>
drh75897232000-05-29 14:26:00 +0000843
drh9bccde32016-03-19 18:00:44 +0000844<a name='stack_overflow'></a>
drh75897232000-05-29 14:26:00 +0000845<h4>The <tt>%stack_overflow</tt> directive</h4>
846
847<p>The %stack_overflow directive specifies a block of C code that
848is executed if the parser's internal stack ever overflows. Typically
849this just prints an error message. After a stack overflow, the parser
850will be unable to continue and must be reset.</p>
851
852<p><pre>
853 %stack_overflow {
854 fprintf(stderr,"Giving up. Parser stack overflow\n");
855 }
856</pre></p>
857
858<p>You can help prevent parser stack overflows by avoiding the use
859of right recursion and right-precedence operators in your grammar.
860Use left recursion and and left-precedence operators instead, to
861encourage rules to reduce sooner and keep the stack size down.
862For example, do rules like this:
863<pre>
864 list ::= list element. // left-recursion. Good!
865 list ::= .
866</pre>
867Not like this:
868<pre>
869 list ::= element list. // right-recursion. Bad!
870 list ::= .
871</pre>
872
drh9bccde32016-03-19 18:00:44 +0000873<a name='stack_size'></a>
drh75897232000-05-29 14:26:00 +0000874<h4>The <tt>%stack_size</tt> directive</h4>
875
876<p>If stack overflow is a problem and you can't resolve the trouble
877by using left-recursion, then you might want to increase the size
878of the parser's stack using this directive. Put an positive integer
879after the %stack_size directive and Lemon will generate a parse
880with a stack of the requested size. The default value is 100.</p>
881
882<p><pre>
883 %stack_size 2000
884</pre></p>
885
drh9bccde32016-03-19 18:00:44 +0000886<a name='start_symbol'></a>
drh75897232000-05-29 14:26:00 +0000887<h4>The <tt>%start_symbol</tt> directive</h4>
888
889<p>By default, the start-symbol for the grammar that Lemon generates
890is the first non-terminal that appears in the grammar file. But you
891can choose a different start-symbol using the %start_symbol directive.</p>
892
893<p><pre>
894 %start_symbol prog
895</pre></p>
896
drh9bccde32016-03-19 18:00:44 +0000897<a name='token_destructor'></a>
drh75897232000-05-29 14:26:00 +0000898<h4>The <tt>%token_destructor</tt> directive</h4>
899
900<p>The %destructor directive assigns a destructor to a non-terminal
901symbol. (See the description of the %destructor directive above.)
902This directive does the same thing for all terminal symbols.</p>
903
904<p>Unlike non-terminal symbols which may each have a different data type
905for their values, terminals all use the same data type (defined by
906the %token_type directive) and so they use a common destructor. Other
907than that, the token destructor works just like the non-terminal
908destructors.</p>
909
drh9bccde32016-03-19 18:00:44 +0000910<a name='token_prefix'></a>
drh75897232000-05-29 14:26:00 +0000911<h4>The <tt>%token_prefix</tt> directive</h4>
912
913<p>Lemon generates #defines that assign small integer constants
914to each terminal symbol in the grammar. If desired, Lemon will
915add a prefix specified by this directive
916to each of the #defines it generates.
917So if the default output of Lemon looked like this:
918<pre>
919 #define AND 1
920 #define MINUS 2
921 #define OR 3
922 #define PLUS 4
923</pre>
924You can insert a statement into the grammar like this:
925<pre>
926 %token_prefix TOKEN_
927</pre>
928to cause Lemon to produce these symbols instead:
929<pre>
930 #define TOKEN_AND 1
931 #define TOKEN_MINUS 2
932 #define TOKEN_OR 3
933 #define TOKEN_PLUS 4
934</pre>
935
drh9bccde32016-03-19 18:00:44 +0000936<a name='token_type'></a><a name='ptype'></a>
drh75897232000-05-29 14:26:00 +0000937<h4>The <tt>%token_type</tt> and <tt>%type</tt> directives</h4>
938
939<p>These directives are used to specify the data types for values
940on the parser's stack associated with terminal and non-terminal
941symbols. The values of all terminal symbols must be of the same
942type. This turns out to be the same data type as the 3rd parameter
943to the Parse() function generated by Lemon. Typically, you will
944make the value of a terminal symbol by a pointer to some kind of
945token structure. Like this:</p>
946
947<p><pre>
948 %token_type {Token*}
949</pre></p>
950
951<p>If the data type of terminals is not specified, the default value
drhdfe4e6b2016-10-08 13:34:08 +0000952is "void*".</p>
drh75897232000-05-29 14:26:00 +0000953
954<p>Non-terminal symbols can each have their own data types. Typically
955the data type of a non-terminal is a pointer to the root of a parse-tree
956structure that contains all information about that non-terminal.
957For example:</p>
958
959<p><pre>
960 %type expr {Expr*}
961</pre></p>
962
963<p>Each entry on the parser's stack is actually a union containing
964instances of all data types for every non-terminal and terminal symbol.
965Lemon will automatically use the correct element of this union depending
966on what the corresponding non-terminal or terminal symbol is. But
967the grammar designer should keep in mind that the size of the union
968will be the size of its largest element. So if you have a single
969non-terminal whose data type requires 1K of storage, then your 100
970entry parser stack will require 100K of heap space. If you are willing
971and able to pay that price, fine. You just need to know.</p>
972
drh9bccde32016-03-19 18:00:44 +0000973<a name='pwildcard'></a>
974<h4>The <tt>%wildcard</tt> directive</h4>
975
976<p>The %wildcard directive is followed by a single token name and a
977period. This directive specifies that the identified token should
978match any input token.
979
980<p>When the generated parser has the choice of matching an input against
981the wildcard token and some other token, the other token is always used.
982The wildcard token is only matched if there are no other alternatives.
983
drh75897232000-05-29 14:26:00 +0000984<h3>Error Processing</h3>
985
986<p>After extensive experimentation over several years, it has been
987discovered that the error recovery strategy used by yacc is about
988as good as it gets. And so that is what Lemon uses.</p>
989
990<p>When a Lemon-generated parser encounters a syntax error, it
991first invokes the code specified by the %syntax_error directive, if
992any. It then enters its error recovery strategy. The error recovery
993strategy is to begin popping the parsers stack until it enters a
994state where it is permitted to shift a special non-terminal symbol
drh9bccde32016-03-19 18:00:44 +0000995named "error". It then shifts this non-terminal and continues
drh75897232000-05-29 14:26:00 +0000996parsing. But the %syntax_error routine will not be called again
997until at least three new tokens have been successfully shifted.</p>
998
999<p>If the parser pops its stack until the stack is empty, and it still
1000is unable to shift the error symbol, then the %parse_failed routine
1001is invoked and the parser resets itself to its start state, ready
1002to begin parsing a new file. This is what will happen at the very
1003first syntax error, of course, if there are no instances of the
drh9bccde32016-03-19 18:00:44 +00001004"error" non-terminal in your grammar.</p>
drh75897232000-05-29 14:26:00 +00001005
1006</body>
1007</html>