Blame - doc/lemon.html - chromium.googlesource.com/chromium/deps/sqlite

drh	7589723	2000-05-29 14:26:00 +0000	1	<html>
				2	<head>
				3	<title>The Lemon Parser Generator</title>
				4	</head>
				5	<body bgcolor=white>
				6	<h1 align=center>The Lemon Parser Generator</h1>
				7
				8	<p>Lemon is an LALR(1) parser generator for C or C++.
				9	It does the same job as ``bison'' and ``yacc''.
				10	But lemon is not another bison or yacc clone. It
				11	uses a different grammar syntax which is designed to
				12	reduce the number of coding errors. Lemon also uses a more
				13	sophisticated parsing engine that is faster than yacc and
				14	bison and which is both reentrant and thread-safe.
				15	Furthermore, Lemon implements features that can be used
				16	to eliminate resource leaks, making it suitable for use
				17	in long-running programs such as graphical user interfaces
				18	or embedded controllers.</p>
				19
				20	<p>This document is an introduction to the Lemon
				21	parser generator.</p>
				22
				23	<h2>Theory of Operation</h2>
				24
				25	<p>The main goal of Lemon is to translate a context free grammar (CFG)
				26	for a particular language into C code that implements a parser for
				27	that language.
				28	The program has two inputs:
				29	<ul>
				30	<li>The grammar specification.
				31	<li>A parser template file.
				32	</ul>
				33	Typically, only the grammar specification is supplied by the programmer.
				34	Lemon comes with a default parser template which works fine for most
				35	applications. But the user is free to substitute a different parser
				36	template if desired.</p>
				37
				38	<p>Depending on command-line options, Lemon will generate between
				39	one and three output files:
				40	<ul>
				41	<li>C code to implement the parser.
				42	<li>A header file defining an integer ID for each terminal symbol.
				43	<li>An information file that describes the states of the generated parser
				44	automaton.
				45	</ul>
				46	By default, all three of these output files are generated.
				47	The header file is suppressed if the ``-m'' command-line option is
				48	used and the report file is omitted when ``-q'' is selected.</p>
				49
				50	<p>The grammar specification file uses a ``.y'' suffix, by convention.
				51	In the examples used in this document, we'll assume the name of the
				52	grammar file is ``gram.y''. A typical use of Lemon would be the
				53	following command:
				54	<pre>
				55	lemon gram.y
				56	</pre>
				57	This command will generate three output files named ``gram.c'',
				58	``gram.h'' and ``gram.out''.
				59	The first is C code to implement the parser. The second
				60	is the header file that defines numerical values for all
				61	terminal symbols, and the last is the report that explains
				62	the states used by the parser automaton.</p>
				63
				64	<h3>Command Line Options</h3>
				65
				66	<p>The behavior of Lemon can be modified using command-line options.
				67	You can obtain a list of the available command-line options together
				68	with a brief explanation of what each does by typing
				69	<pre>
				70	lemon -?
				71	</pre>
				72	As of this writing, the following command-line options are supported:
				73	<ul>
				74	<li><tt>-b</tt>
				75	<li><tt>-c</tt>
				76	<li><tt>-g</tt>
				77	<li><tt>-m</tt>
				78	<li><tt>-q</tt>
				79	<li><tt>-s</tt>
				80	<li><tt>-x</tt>
				81	</ul>
				82	The ``-b'' option reduces the amount of text in the report file by
				83	printing only the basis of each parser state, rather than the full
				84	configuration.
				85	The ``-c'' option suppresses action table compression. Using -c
				86	will make the parser a little larger and slower but it will detect
				87	syntax errors sooner.
				88	The ``-g'' option causes no output files to be generated at all.
				89	Instead, the input grammar file is printed on standard output but
				90	with all comments, actions and other extraneous text deleted. This
				91	is a useful way to get a quick summary of a grammar.
				92	The ``-m'' option causes the output C source file to be compatible
				93	with the ``makeheaders'' program.
				94	Makeheaders is a program that automatically generates header files
				95	from C source code. When the ``-m'' option is used, the header
				96	file is not output since the makeheaders program will take care
				97	of generating all header files automatically.
				98	The ``-q'' option suppresses the report file.
				99	Using ``-s'' causes a brief summary of parser statistics to be
				100	printed, like this:
				101	<pre>
				102	Parser statistics: 74 terminals, 70 nonterminals, 179 rules
				103	340 states, 2026 parser table entries, 0 conflicts
				104	</pre>
				105	Finally, the ``-x'' option causes Lemon to print its version number
				106	and copyright information
				107	and then stop without attempting to read the grammar or generate a parser.</p>
				108
				109	<h3>The Parser Interface</h3>
				110
				111	<p>Lemon doesn't generate a complete, working program. It only generates
				112	a few subroutines that implement a parser. This section describes
				113	the interface to those subroutines. It is up to the programmer to
				114	call these subroutines in an appropriate way in order to produce a
				115	complete system.</p>
				116
				117	<p>Before a program begins using a Lemon-generated parser, the program
				118	must first create the parser.
				119	A new parser is created as follows:
				120	<pre>
				121	void *pParser = ParseAlloc( malloc );
				122	</pre>
				123	The ParseAlloc() routine allocates and initializes a new parser and
				124	returns a pointer to it.
				125	The actual data structure used to represent a parser is opaque --
				126	its internal structure is not visible or usable by the calling routine.
				127	For this reason, the ParseAlloc() routine returns a pointer to void
				128	rather than a pointer to some particular structure.
				129	The sole argument to the ParseAlloc() routine is a pointer to the
				130	subroutine used to allocate memory. Typically this means ``malloc()''.</p>
				131
				132	<p>After a program is finished using a parser, it can reclaim all
				133	memory allocated by that parser by calling
				134	<pre>
				135	ParseFree(pParser, free);
				136	</pre>
				137	The first argument is the same pointer returned by ParseAlloc(). The
				138	second argument is a pointer to the function used to release bulk
				139	memory back to the system.</p>
				140
				141	<p>After a parser has been allocated using ParseAlloc(), the programmer
				142	must supply the parser with a sequence of tokens (terminal symbols) to
				143	be parsed. This is accomplished by calling the following function
				144	once for each token:
				145	<pre>
				146	Parse(pParser, hTokenID, sTokenData, pArg);
				147	</pre>
				148	The first argument to the Parse() routine is the pointer returned by
				149	ParseAlloc().
				150	The second argument is a small positive integer that tells the parser the
				151	type of the next token in the data stream.
				152	There is one token type for each terminal symbol in the grammar.
				153	The gram.h file generated by Lemon contains #define statements that
				154	map symbolic terminal symbol names into appropriate integer values.
				155	(A value of 0 for the second argument is a special flag to the
				156	parser to indicate that the end of input has been reached.)
				157	The third argument is the value of the given token. By default,
				158	the type of the third argument is integer, but the grammar will
				159	usually redefine this type to be some kind of structure.
				160	Typically the second argument will be a broad category of tokens
				161	such as ``identifier'' or ``number'' and the third argument will
				162	be the name of the identifier or the value of the number.</p>
				163
				164	<p>The Parse() function may have either three or four arguments,
				165	depending on the grammar. If the grammar specification file requests
				166	it, the Parse() function will have a fourth parameter that can be
				167	of any type chosen by the programmer. The parser doesn't do anything
				168	with this argument except to pass it through to action routines.
				169	This is a convenient mechanism for passing state information down
				170	to the action routines without having to use global variables.</p>
				171
				172	<p>A typical use of a Lemon parser might look something like the
				173	following:
				174	<pre>
				175	01 ParseTree *ParseFile(const char *zFilename){
				176	02    Tokenizer *pTokenizer;
				177	03    void *pParser;
				178	04    Token sToken;
				179	05    int hTokenId;
				180	06    ParserState sState;
				181	07
				182	08    pTokenizer = TokenizerCreate(zFilename);
				183	09    pParser = ParseAlloc( malloc );
				184	10    InitParserState(&sState);
				185	11    while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){
				186	12       Parse(pParser, hTokenId, sToken, &sState);
				187	13    }
				188	14    Parse(pParser, 0, sToken, &sState);
				189	15    ParseFree(pParser, free );
				190	16    TokenizerFree(pTokenizer);
				191	17    return sState.treeRoot;
				192	18 }
				193	</pre>
				194	This example shows a user-written routine that parses a file of
				195	text and returns a pointer to the parse tree.
				196	(We've omitted all error-handling from this example to keep it
				197	simple.)
				198	We assume the existence of some kind of tokenizer which is created
				199	using TokenizerCreate() on line 8 and deleted by TokenizerFree()
				200	on line 16. The GetNextToken() function on line 11 retrieves the
				201	next token from the input file and puts its type in the
				202	integer variable hTokenId. The sToken variable is assumed to be
				203	some kind of structure that contains details about each token,
				204	such as its complete text, what line it occurs on, etc. </p>
				205
				206	<p>This example also assumes the existence of a structure of type
				207	ParserState that holds state information about a particular parse.
				208	An instance of such a structure is created on line 6 and initialized
				209	on line 10. A pointer to this structure is passed into the Parse()
				210	routine as the optional 4th argument.
				211	The action routine specified by the grammar for the parser can use
				212	the ParserState structure to hold whatever information is useful and
				213	appropriate. In the example, we note that the treeRoot field of
				214	the ParserState structure is left pointing to the root of the parse
				215	tree.</p>
				216
				217	<p>The core of this example as it relates to Lemon is as follows:
				218	<pre>
				219	ParseFile(){
				220	pParser = ParseAlloc( malloc );
				221	while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){
				222	Parse(pParser, hTokenId, sToken);
				223	}
				224	Parse(pParser, 0, sToken);
				225	ParseFree(pParser, free );
				226	}
				227	</pre>
				228	Basically, what a program has to do to use a Lemon-generated parser
				229	is first create the parser, then send it lots of tokens obtained by
				230	tokenizing an input source. When the end of input is reached, the
				231	Parse() routine should be called one last time with a token type
				232	of 0. This step is necessary to inform the parser that the end of
				233	input has been reached. Finally, we reclaim memory used by the
				234	parser by calling ParseFree().</p>
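<p>As a rough illustration of this calling pattern, the following C sketch
drives a hand-written stand-in for the generated parser. The
<tt>ToyParser</tt> structure, the <tt>TK_NUM</tt>, <tt>TK_PLUS</tt> and
<tt>TK_ILLEGAL</tt> codes, the string tokenizer and the <tt>SumText()</tt>
driver are all assumptions made for this example. Real Lemon output is
table-driven and does not resemble the stub bodies shown here. Only the
shape of the calls (allocate, feed one token per call, send token type 0,
free) follows the text above.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Token codes. A real project would take these from the generated gram.h. */
#define TK_NUM     1
#define TK_PLUS    2
#define TK_ILLEGAL 3

typedef struct { int value; } Token;            /* 3rd argument to Parse() */
typedef struct { int sum; int nError; int done; } ParserState;  /* 4th */

/* ---- Stand-ins for the generated routines (NOT Lemon output) ---- */
typedef struct { int expectNum; } ToyParser;

void *ParseAlloc(void *(*xMalloc)(size_t)){
  ToyParser *p = xMalloc(sizeof(*p));
  if( p ) p->expectNum = 1;      /* Input must begin with a number */
  return p;
}

void ParseFree(void *pParser, void (*xFree)(void*)){
  if( pParser ) xFree(pParser);
}

/* Accepts NUM (PLUS NUM)* and accumulates the sum in pState. */
void Parse(void *pParser, int hTokenId, Token sToken, ParserState *pState){
  ToyParser *p = pParser;
  assert( p!=0 && pState!=0 );
  if( hTokenId==0 ){                       /* Token type 0: end of input */
    if( p->expectNum ) pState->nError++;
    pState->done = 1;
  }else if( hTokenId==TK_NUM && p->expectNum ){
    pState->sum += sToken.value;
    p->expectNum = 0;
  }else if( hTokenId==TK_PLUS && !p->expectNum ){
    p->expectNum = 1;
  }else{
    pState->nError++;                      /* Unexpected token */
  }
}

/* Hypothetical tokenizer: fetch the next token from *pz.
** Returns 0 when the text is exhausted, 1 otherwise. */
static int GetNextToken(const char **pz, int *pTokenId, Token *pToken){
  const char *z = *pz;
  while( isspace((unsigned char)*z) ) z++;
  if( *z==0 ){ *pz = z; return 0; }
  pToken->value = 0;
  if( isdigit((unsigned char)*z) ){
    while( isdigit((unsigned char)*z) ){
      pToken->value = pToken->value*10 + (*z++ - '0');
    }
    *pTokenId = TK_NUM;
  }else{
    *pTokenId = (*z=='+') ? TK_PLUS : TK_ILLEGAL;
    z++;
  }
  *pz = z;
  return 1;
}

/* The tokenizer loop calls the parser, as described above.
** Returns the number of errors seen, or -1 if allocation failed. */
int SumText(const char *zText, int *pSum){
  void *pParser = ParseAlloc( malloc );
  ParserState sState = { 0, 0, 0 };
  Token sToken = { 0 };
  int hTokenId;
  if( pParser==0 ) return -1;
  while( GetNextToken(&zText, &hTokenId, &sToken) ){
    Parse(pParser, hTokenId, sToken, &sState);
  }
  Parse(pParser, 0, sToken, &sState);      /* Signal end of input */
  ParseFree(pParser, free );
  assert( sState.done );
  *pSum = sState.sum;
  return sState.nError;
}
```

<p>In this sketch, <tt>SumText("1 + 2 + 39", &amp;sum)</tt> stores 42 in
<tt>sum</tt> and returns 0. Because all state lives in the object returned by
ParseAlloc() and in the caller's ParserState, two such parsers can be fed
tokens alternately without interfering with each other.</p>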
				235
				236	<p>There is one other interface routine that should be mentioned
				237	before we move on.
				238	The ParseTrace() function can be used to generate debugging output
				239	from the parser. A prototype for this routine is as follows:
				240	<pre>
				241	ParseTrace(FILE *stream, char *zPrefix);
				242	</pre>
				243	After this routine is called, a short (one-line) message is written
				244	to the designated output stream every time the parser changes states
				245	or calls an action routine. Each such message is prefaced using
				246	the text given by zPrefix. This debugging output can be turned off
				247	by calling ParseTrace() again with a first argument of NULL (0).</p>
				248
				249	<h3>Differences With YACC and BISON</h3>
				250
				251	<p>Programmers who have previously used the yacc or bison parser
				252	generator will notice several important differences between yacc and/or
				253	bison and Lemon.
				254	<ul>
				255	<li>In yacc and bison, the parser calls the tokenizer. In Lemon,
				256	the tokenizer calls the parser.
				257	<li>Lemon uses no global variables. Yacc and bison use global variables
				258	to pass information between the tokenizer and parser.
				259	<li>Lemon allows multiple parsers to be running simultaneously. Yacc
				260	and bison do not.
				261	</ul>
				262	These differences may cause some initial confusion for programmers
				263	with prior yacc and bison experience.
				264	But after years of experience using Lemon, I firmly
				265	believe that the Lemon way of doing things is better.</p>
				266
				267	<h2>Input File Syntax</h2>
				268
				269	<p>The main purpose of the grammar specification file for Lemon is
				270	to define the grammar for the parser. But the input file also
				271	specifies additional information Lemon requires to do its job.
				272	Most of the work in using Lemon is in writing an appropriate
				273	grammar file.</p>
				274
				275	<p>The grammar file for lemon is, for the most part, free format.
				276	It does not have sections or divisions like yacc or bison. Any
				277	declaration can occur at any point in the file.
				278	Lemon ignores whitespace (except where it is needed to separate
				279	tokens) and it honors the same commenting conventions as C and C++.</p>
				280
				281	<h3>Terminals and Nonterminals</h3>
				282
				283	<p>A terminal symbol (token) is any string of alphanumeric
				284	and underscore characters
				285	that begins with an upper case letter.
				286	A terminal can contain lower case letters after the first character,
				287	but the usual convention is to make terminals all upper case.
				288	A nonterminal, on the other hand, is any string of alphanumeric
				289	and underscore characters that begins with a lower case letter.
				290	Again, the usual convention is to make nonterminals use all lower
				291	case letters.</p>
				292
				293	<p>In Lemon, terminal and nonterminal symbols do not need to
				294	be declared or identified in a separate section of the grammar file.
				295	Lemon is able to generate a list of all terminals and nonterminals
				296	by examining the grammar rules, and it can always distinguish a
				297	terminal from a nonterminal by checking the case of the first
				298	character of the name.</p>
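<p>The naming rule can be sketched in a few lines of C. The
<tt>SymbolKind</tt> type and the <tt>ClassifySymbol()</tt> function are
hypothetical names invented for this illustration; they are not part of
Lemon, and Lemon's own implementation may differ in detail.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum { SYM_TERMINAL, SYM_NONTERMINAL, SYM_INVALID } SymbolKind;

/* Classify a grammar symbol by the case of its first character. */
SymbolKind ClassifySymbol(const char *zName){
  size_t i;
  assert( zName!=NULL );
  /* The name must start with a letter. This also rejects "". */
  if( !isalpha((unsigned char)zName[0]) ) return SYM_INVALID;
  /* Every later character must be alphanumeric or an underscore. */
  for(i=1; zName[i]; i++){
    unsigned char c = (unsigned char)zName[i];
    if( !isalnum(c) && c!='_' ) return SYM_INVALID;
  }
  /* Upper case first letter: terminal. Lower case: nonterminal. */
  return isupper((unsigned char)zName[0]) ? SYM_TERMINAL : SYM_NONTERMINAL;
}
```

<p>With this sketch, ``PLUS'' and ``Value_2'' classify as terminals,
``expr'' as a nonterminal, and a yacc-style name such as ``$'' is
rejected.</p>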
				299
				300	<p>Yacc and bison allow terminal symbols to have either alphanumeric
				301	names or to be individual characters included in single quotes, like
				302	this: ')' or '$'. Lemon does not allow this alternative form for
				303	terminal symbols. With Lemon, all symbols, terminals and nonterminals,
				304	must have alphanumeric names.</p>
				305
				306	<h3>Grammar Rules</h3>
				307
				308	<p>The main component of a Lemon grammar file is a sequence of grammar
				309	rules.
				310	Each grammar rule consists of a nonterminal symbol followed by
				311	the special symbol ``::='' and then a list of terminals and/or nonterminals.
				312	The rule is terminated by a period.
				313	The list of terminals and nonterminals on the right-hand side of the
				314	rule can be empty.
				315	Rules can occur in any order, except that the left-hand side of the
				316	first rule is assumed to be the start symbol for the grammar (unless
				317	specified otherwise using the <tt>%start_symbol</tt> directive described below).
				318	A typical sequence of grammar rules might look something like this:
				319	<pre>
				320	expr ::= expr PLUS expr.
				321	expr ::= expr TIMES expr.
				322	expr ::= LPAREN expr RPAREN.
				323	expr ::= VALUE.
				324	</pre>
				325	</p>
				326
				327	<p>There is one non-terminal in this example, ``expr'', and five
				328	terminal symbols or tokens: ``PLUS'', ``TIMES'', ``LPAREN'',
				329	``RPAREN'' and ``VALUE''.</p>
				330
				331	<p>Like yacc and bison, Lemon allows the grammar to specify a block
				332	of C code that will be executed whenever a grammar rule is reduced
				333	by the parser.
				334	In Lemon, this action is specified by putting the C code (contained
				335	within curly braces <tt>{...}</tt>) immediately after the
				336	period that closes the rule.
				337	For example:
				338	<pre>
				339	expr ::= expr PLUS expr. { printf("Doing an addition...\n"); }
				340	</pre>
				341	</p>
				342
				343	<p>In order to be useful, grammar actions must normally be linked to
				344	their associated grammar rules.
				345	In yacc and bison, this is accomplished by embedding a ``$$'' in the
				346	action to stand for the value of the left-hand side of the rule and
				347	symbols ``$1'', ``$2'', and so forth to stand for the value of
				348	the terminal or nonterminal at position 1, 2 and so forth on the
				349	right-hand side of the rule.
				350	This idea is very powerful, but it is also very error-prone. The
				351	single most common source of errors in a yacc or bison grammar is
				352	to miscount the number of symbols on the right-hand side of a grammar
				353	rule and say ``$7'' when you really mean ``$8''.</p>
				354
				355	<p>Lemon avoids the need to count grammar symbols by assigning symbolic
				356	names to each symbol in a grammar rule and then using those symbolic
				357	names in the action.
				358	In yacc or bison, one would write this:
				359	<pre>
				360	expr : expr PLUS expr { $$ = $1 + $3; };
				361	</pre>
				362	But in Lemon, the same rule becomes the following:
				363	<pre>
				364	expr(A) ::= expr(B) PLUS expr(C). { A = B+C; }
				365	</pre>
				366	In the Lemon rule, any symbol in parentheses after a grammar rule
				367	symbol becomes a place holder for that symbol in the grammar rule.
				368	This place holder can then be used in the associated C action to
				369	stand for the value of that symbol.</p>
				370
				371	<p>The Lemon notation for linking a grammar rule with its reduce
				372	action is superior to yacc/bison on several counts.
				373	First, as mentioned above, the Lemon method avoids the need to
				374	count grammar symbols.
				375	Secondly, if a terminal or nonterminal in a Lemon grammar rule
				376	includes a linking symbol in parentheses but that linking symbol
				377	is not actually used in the reduce action, then an error message
				378	is generated.
				379	For example, the rule
				380	<pre>
				381	expr(A) ::= expr(B) PLUS expr(C). { A = B; }
				382	</pre>
				383	will generate an error because the linking symbol ``C'' is used
				384	in the grammar rule but not in the reduce action.</p>
				385
				386	<p>The Lemon notation for linking grammar rules to reduce actions
				387	also facilitates the use of destructors for reclaiming memory
				388	allocated by the values of terminals and nonterminals on the
				389	right-hand side of a rule.</p>
				390
				391	<h3>Precedence Rules</h3>
				392
				393	<p>Lemon resolves parsing ambiguities in exactly the same way as
				394	yacc and bison. A shift-reduce conflict is resolved in favor
				395	of the shift, and a reduce-reduce conflict is resolved by reducing
				396	whichever rule comes first in the grammar file.</p>
				397
				398	<p>Just like in
				399	yacc and bison, Lemon allows a measure of control
				400	over the resolution of parsing conflicts using precedence rules.
				401	A precedence value can be assigned to any terminal symbol
				402	using the %left, %right or %nonassoc directives. Terminal symbols
				403	mentioned in earlier directives have a lower precedence than
				404	terminal symbols mentioned in later directives. For example:</p>
				405
				406	<p><pre>
				407	%left AND.
				408	%left OR.
				409	%nonassoc EQ NE GT GE LT LE.
				410	%left PLUS MINUS.
				411	%left TIMES DIVIDE MOD.
				412	%right EXP NOT.
				413	</pre></p>
				414
				415	<p>In the preceding sequence of directives, the AND operator is
				416	defined to have the lowest precedence. The OR operator is one
				417	precedence level higher. And so forth. Hence, the grammar would
				418	attempt to group the ambiguous expression
				419	<pre>
				420	a AND b OR c
				421	</pre>
				422	like this
				423	<pre>
				424	a AND (b OR c).
				425	</pre>
				426	The associativity (left, right or nonassoc) is used to determine
				427	the grouping when the precedence is the same. AND is left-associative
				428	in our example, so
				429	<pre>
				430	a AND b AND c
				431	</pre>
				432	is parsed like this
				433	<pre>
				434	(a AND b) AND c.
				435	</pre>
				436	The EXP operator is right-associative, though, so
				437	<pre>
				438	a EXP b EXP c
				439	</pre>
				440	is parsed like this
				441	<pre>
				442	a EXP (b EXP c).
				443	</pre>
				444	The nonassoc precedence is used for non-associative operators.
				445	So
				446	<pre>
				447	a EQ b EQ c
				448	</pre>
				449	is an error.</p>
				450
				451	<p>The precedence of terminal symbols is transferred to rules as follows:
				452	The precedence of a grammar rule is equal to the precedence of the
				453	left-most terminal symbol in the rule for which a precedence is
				454	defined. This is normally what you want, but in those cases where
				455	you want the precedence of a grammar rule to be something different,
				456	you can specify an alternative precedence symbol by putting the
				457	symbol in square brackets after the period at the end of the rule and
				458	before any C-code. For example:</p>
				459
				460	<p><pre>
				461	expr ::= MINUS expr. [NOT]
				462	</pre></p>
				463
				464	<p>This rule has a precedence equal to that of the NOT symbol, not the
				465	MINUS symbol as would have been the case by default.</p>
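<p>The way a rule obtains its precedence can be sketched as a small C
lookup. The <tt>TermPrec</tt> table, the <tt>TerminalPrecedence()</tt>
helper and the <tt>RulePrecedence()</tt> function are hypothetical names
made up for this illustration, and the numeric levels are simply assumed
to grow with each later %left, %right or %nonassoc directive. This is a
reading of the description above, not Lemon's actual code.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* One row of a hypothetical precedence table. */
typedef struct {
  const char *zName;   /* Terminal symbol name */
  int prec;            /* Precedence level; larger binds tighter */
} TermPrec;

/* Return the precedence of a terminal, or -1 if none was declared. */
static int TerminalPrecedence(const TermPrec *aPrec, int nPrec,
                              const char *zSym){
  int i;
  for(i=0; i<nPrec; i++){
    if( strcmp(aPrec[i].zName, zSym)==0 ) return aPrec[i].prec;
  }
  return -1;
}

/* Return the precedence of a rule, or -1 if it has none. */
int RulePrecedence(
  const TermPrec *aPrec, int nPrec,     /* Precedence table */
  const char *const *azRhs, int nRhs,   /* Right-hand side symbols */
  const char *zOverride                 /* Symbol given in [...], or NULL */
){
  int i;
  assert( nRhs==0 || azRhs!=0 );
  /* An explicit [SYMBOL] after the rule overrides the default. */
  if( zOverride ) return TerminalPrecedence(aPrec, nPrec, zOverride);
  /* Otherwise use the left-most terminal that has a precedence. */
  for(i=0; i<nRhs; i++){
    if( isupper((unsigned char)azRhs[i][0]) ){
      int prec = TerminalPrecedence(aPrec, nPrec, azRhs[i]);
      if( prec>=0 ) return prec;
    }
  }
  return -1;
}
```

<p>Given a table in which MINUS has level 4 and NOT has level 6, the
right-hand side ``MINUS expr'' yields 4 by default and 6 when ``NOT'' is
passed as the override symbol.</p>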
				466
				467	<p>With the knowledge of how precedence is assigned to terminal
				468	symbols and individual
				469	grammar rules, we can now explain precisely how parsing conflicts
				470	are resolved in Lemon. Shift-reduce conflicts are resolved
				471	as follows:
				472	<ul>
				473	<li> If either the token to be shifted or the rule to be reduced
				474	lacks precedence information, then resolve in favor of the
				475	shift, but report a parsing conflict.
				476	<li> If the precedence of the token to be shifted is greater than
				477	the precedence of the rule to reduce, then resolve in favor
				478	of the shift. No parsing conflict is reported.
				479	<li> If the precedence of the token to be shifted is less than the
				480	precedence of the rule to reduce, then resolve in favor of the
				481	reduce action. No parsing conflict is reported.
				482	<li> If the precedences are the same and the shift token is
				483	right-associative, then resolve in favor of the shift.
				484	No parsing conflict is reported.
				485	<li> If the precedences are the same and the shift token is
				486	left-associative, then resolve in favor of the reduce.
				487	No parsing conflict is reported.
				488	<li> Otherwise, resolve the conflict by doing the shift and
				489	report the parsing conflict.
				490	</ul>
				491	Reduce-reduce conflicts are resolved this way:
				492	<ul>
				493	<li> If either reduce rule
				494	lacks precedence information, then resolve in favor of the
				495	rule that appears first in the grammar and report a parsing
				496	conflict.
				497	<li> If both rules have precedence and the precedence is different
				498	then resolve the dispute in favor of the rule with the highest
				499	precedence and do not report a conflict.
				500	<li> Otherwise, resolve the conflict by reducing by the rule that
				501	appears first in the grammar and report a parsing conflict.
				502	</ul>
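<p>The two lists above translate almost line for line into C. The sketch
below is only a transcription of those lists: the type and function names
are invented for this example, a negative precedence is assumed to mean
``no precedence information'', and the actual Lemon source may treat some
cases (for instance non-associative tokens of equal precedence) differently
from the final ``otherwise'' bullet.</p>

```c
#include <assert.h>

typedef enum { ASSOC_LEFT, ASSOC_RIGHT, ASSOC_NONE } Assoc;
typedef enum { DO_SHIFT, DO_REDUCE } Action;

typedef struct {
  Action action;   /* Action chosen by the resolution */
  int conflict;    /* True if a parsing conflict is reported */
} Resolution;

/* Shift-reduce resolution. A precedence below zero means "none". */
Resolution ResolveShiftReduce(int tokenPrec, Assoc tokenAssoc, int rulePrec){
  Resolution r = { DO_SHIFT, 1 };        /* Default: shift and report */
  if( tokenPrec<0 || rulePrec<0 ) return r;
  if( tokenPrec>rulePrec ){
    r.conflict = 0;                      /* Shift, no conflict */
  }else if( tokenPrec<rulePrec ){
    r.action = DO_REDUCE;                /* Reduce, no conflict */
    r.conflict = 0;
  }else{
    switch( tokenAssoc ){                /* Equal precedence */
      case ASSOC_RIGHT: r.conflict = 0; break;
      case ASSOC_LEFT:  r.action = DO_REDUCE; r.conflict = 0; break;
      case ASSOC_NONE:  break;           /* "Otherwise": shift and report */
      default: assert( !"unknown associativity" ); break;
    }
  }
  return r;
}

typedef struct {
  int winner;      /* 1 = rule appearing first in the grammar, 2 = the other */
  int conflict;    /* True if a parsing conflict is reported */
} ReduceChoice;

/* Reduce-reduce resolution between the earlier rule (prec1) and the
** later rule (prec2). */
ReduceChoice ResolveReduceReduce(int prec1, int prec2){
  ReduceChoice r = { 1, 1 };             /* Default: first rule, report */
  if( prec1<0 || prec2<0 ) return r;
  if( prec1!=prec2 ){
    r.winner = prec1>prec2 ? 1 : 2;      /* Higher precedence wins */
    r.conflict = 0;
  }
  return r;
}
```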
				503
				504	<h3>Special Directives</h3>
				505
				506	<p>The input grammar to Lemon consists of grammar rules and special
				507	directives. We've described all the grammar rules, so now we'll
				508	talk about the special directives.</p>
				509
				510	<p>Directives in lemon can occur in any order. You can put them before
				511	the grammar rules, or after the grammar rules, or in the midst of the
				512	grammar rules. It doesn't matter. The relative order of
				513	directives used to assign precedence to terminals is important, but
				514	other than that, the order of directives in Lemon is arbitrary.</p>
				515
				516	<p>Lemon supports the following special directives:
				517	<ul>
				518	<li><tt>%destructor</tt>
				519	<li><tt>%extra_argument</tt>
				520	<li><tt>%include</tt>
				521	<li><tt>%left</tt>
				522	<li><tt>%name</tt>
				523	<li><tt>%nonassoc</tt>
				524	<li><tt>%parse_accept</tt>
				525	<li><tt>%parse_failure </tt>
				526	<li><tt>%right</tt>
				527	<li><tt>%stack_overflow</tt>
				528	<li><tt>%stack_size</tt>
				529	<li><tt>%start_symbol</tt>
				530	<li><tt>%syntax_error</tt>
				531	<li><tt>%token_destructor</tt>
				532	<li><tt>%token_prefix</tt>
				533	<li><tt>%token_type</tt>
				534	<li><tt>%type</tt>
				535	</ul>
				536	Each of these directives will be described separately in the
				537	following sections:</p>
				538
				539	<h4>The <tt>%destructor</tt> directive</h4>
				540
				541	<p>The %destructor directive is used to specify a destructor for
				542	a non-terminal symbol.
				543	(See also the %token_destructor directive which is used to
				544	specify a destructor for terminal symbols.)</p>
				545
				546	<p>A non-terminal's destructor is called to dispose of the
				547	non-terminal's value whenever the non-terminal is popped from
				548	the stack. This includes all of the following circumstances:
				549	<ul>
				550	<li> When a rule reduces and the value of a non-terminal on
				551	the right-hand side is not linked to C code.
				552	<li> When the stack is popped during error processing.
				553	<li> When the ParseFree() function runs.
				554	</ul>
				555	The destructor can do whatever it wants with the value of
				556	the non-terminal, but its design is to deallocate memory
				557	or other resources held by that non-terminal.</p>
				558
				559	<p>Consider an example:
				560	<pre>
				561	%type nt {void*}
				562	%destructor nt { free($$); }
				563	nt(A) ::= ID NUM. { A = malloc( 100 ); }
				564	</pre>
				565	This example is a bit contrived but it serves to illustrate how
				566	destructors work. The example shows a non-terminal named
				567	``nt'' that holds values of type ``void*''. When the rule for
				568	an ``nt'' reduces, it sets the value of the non-terminal to
				569	space obtained from malloc(). Later, when the nt non-terminal
				570	is popped from the stack, the destructor will fire and call
				571	free() on this malloced space, thus avoiding a memory leak.
				572	(Note that the symbol ``$$'' in the destructor code is replaced
				573	by the value of the non-terminal.)</p>
				574
				575	<p>It is important to note that the value of a non-terminal is passed
				576	to the destructor whenever the non-terminal is removed from the
				577	stack, unless the non-terminal is used in a C-code action. If
				578	the non-terminal is used by C-code, then it is assumed that the
				579	C-code will take care of destroying it if it should really
				580	be destroyed. More commonly, the value is used to build some
				581	larger structure and we don't want to destroy it, which is why
				582	the destructor is not called in this circumstance.</p>
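<p>The ownership rule described above can be modelled with a small value
stack in C. Everything in this sketch (<tt>ValueStack</tt>,
<tt>StackPush()</tt>, <tt>StackPop()</tt>, <tt>StackClear()</tt> and the
instrumented <tt>CountingFree()</tt> destructor) is a hypothetical model
written for this illustration; it is not how the generated parser is
actually organized. The <tt>isLinked</tt> flag stands for ``the value is
used by a C-code action''.</p>

```c
#include <assert.h>
#include <stdlib.h>

#define STACK_MAX 16

typedef struct {
  void *pValue;                  /* Value of the nonterminal */
  void (*xDestructor)(void*);    /* Its destructor, or NULL */
} StackEntry;

typedef struct {
  StackEntry a[STACK_MAX];
  int n;                         /* Number of entries in use */
} ValueStack;

/* Instrumented destructor so that the effect can be observed. */
int nDestroyed = 0;
void CountingFree(void *p){
  free(p);
  nDestroyed++;
}

/* Push a value. Returns 0 if the stack is full. */
int StackPush(ValueStack *p, void *pValue, void (*xDestructor)(void*)){
  if( p->n>=STACK_MAX ) return 0;
  p->a[p->n].pValue = pValue;
  p->a[p->n].xDestructor = xDestructor;
  p->n++;
  return 1;
}

/* Pop the top value. If the value is linked to C code, ownership passes
** to the caller and the destructor is NOT run. Otherwise the destructor
** disposes of the value and NULL is returned. */
void *StackPop(ValueStack *p, int isLinked){
  StackEntry *pEntry;
  assert( p->n>0 );
  pEntry = &p->a[--p->n];
  if( isLinked ) return pEntry->pValue;
  if( pEntry->xDestructor ) pEntry->xDestructor(pEntry->pValue);
  return 0;
}

/* Analogue of ParseFree(): destroy whatever is still on the stack. */
void StackClear(ValueStack *p){
  while( p->n>0 ) (void)StackPop(p, 0);
}
```

<p>In this model, popping with <tt>isLinked</tt> set leaves
<tt>nDestroyed</tt> unchanged and hands the pointer to the caller, while an
unlinked pop or a call to <tt>StackClear()</tt> runs the destructor once
per remaining value.</p>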
				583
				584	<p>By appropriate use of destructors, it is possible to
				585	build a parser using Lemon that can be used within a long-running
				586	program, such as a GUI, that will not leak memory or other resources.
				587	To do the same using yacc or bison is much more difficult.</p>
				588
				589	<h4>The <tt>%extra_argument</tt> directive</h4>
				590
				591	<p>The %extra_argument directive instructs Lemon to add a 4th parameter
				592	to the parameter list of the Parse() function it generates. Lemon
				593	doesn't do anything itself with this extra argument, but it does
				594	make the argument available to C-code action routines, destructors,
				595	and so forth. For example, if the grammar file contains:</p>
				596
				597	<p><pre>
				598	%extra_argument { MyStruct *pAbc }
				599	</pre></p>
				600
				601	<p>Then the Parse() function generated will have a 4th parameter
				602	of type ``MyStruct*'' and all action routines will have access to
				603	a variable named ``pAbc'' that is the value of the 4th parameter
				604	in the most recent call to Parse().</p>
				605
				606	<h4>The <tt>%include</tt> directive</h4>
				607
				608	<p>The %include directive specifies C code that is included at the
				609	top of the generated parser. You can include any text you want --
				610	the Lemon parser generator copies it blindly. If you have multiple
				611	%include directives in your grammar file, their values are concatenated
				612	before being put at the beginning of the generated parser.</p>
				613
				614	<p>The %include directive is very handy for getting some extra #include
				615	preprocessor statements at the beginning of the generated parser.
				616	For example:</p>
				617
				618	<p><pre>
				619	%include {#include &lt;unistd.h&gt;}
				620	</pre></p>
				621
				622	<p>This might be needed, for example, if some of the C actions in the
				623	grammar call functions that are prototyped in unistd.h.</p>
				624
				625	<h4>The <tt>%left</tt> directive</h4>
				626
				627	<p>The %left directive is used (along with the %right and
				628	%nonassoc directives) to declare precedences of terminal
				629	symbols. Every terminal symbol whose name appears after
				630	a %left directive but before the next period (``.'') is
				631	given the same left-associative precedence value. Subsequent
				632	%left directives have higher precedence. For example:</p>
				633
				634	<p><pre>
				635	%left AND.
				636	%left OR.
				637	%nonassoc EQ NE GT GE LT LE.
				638	%left PLUS MINUS.
				639	%left TIMES DIVIDE MOD.
				640	%right EXP NOT.
				641	</pre></p>
				642
				643	<p>Note the period that terminates each %left, %right or %nonassoc
				644	directive.</p>
				645
				646	<p>LALR(1) grammars can get into a situation where they require
				647	a large amount of stack space if you make heavy use of right-associative
				648	operators. For this reason, it is recommended that you use %left
				649	rather than %right whenever possible.</p>
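<p>The effect on stack depth can be estimated with a simple counting model.
The C sketch below does not run a real parser. It assumes a grammar with
the two rules ``expr ::= expr OP expr.'' and ``expr ::= X.'', counts only
grammar symbols on the stack, and applies the shift-reduce resolution
described under Precedence Rules. The function name
<tt>MaxStackDepth()</tt> is made up for this illustration.</p>

```c
#include <assert.h>

/* Estimate the deepest symbol stack reached while parsing
**     X OP X OP ... OP X        (nOperand operands)
** when OP is left-associative (isRightAssoc==0) or right-associative. */
int MaxStackDepth(int nOperand, int isRightAssoc){
  int depth = 0;       /* Symbols currently on the stack */
  int maxDepth = 0;    /* Largest depth seen so far */
  int i;
  assert( nOperand>=1 );
  for(i=0; i<nOperand; i++){
    if( i>0 ){
      /* The lookahead is OP and the stack may end in "expr OP expr".
      ** A left-associative OP reduces that to a single expr before
      ** shifting. A right-associative OP shifts right away. */
      if( !isRightAssoc && depth>=3 ) depth -= 2;
      depth++;                            /* Shift OP */
    }
    depth++;               /* Shift X, then reduce X to expr in place */
    if( depth>maxDepth ) maxDepth = depth;
  }
  /* At end of input the remaining reductions only shrink the stack. */
  return maxDepth;
}
```

<p>For 100 operands this model gives a maximum depth of 3 with a
left-associative operator and 199 with a right-associative one, which is
why long chains of right-associative operators can exhaust a fixed-size
parser stack.</p>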
				650
				651	<h4>The <tt>%name</tt> directive</h4>
				652
				653	<p>By default, the functions generated by Lemon all begin with the
				654	five-character string ``Parse''. You can change this string to something
				655	different using the %name directive. For instance:</p>
				656
				657	<p><pre>
				658	%name Abcde
				659	</pre></p>
				660
				661	<p>Putting this directive in the grammar file will cause Lemon to generate
				662	functions named
				663	<ul>
				664	<li> AbcdeAlloc(),
				665	<li> AbcdeFree(),
				666	<li> AbcdeTrace(), and
				667	<li> Abcde().
				668	</ul>
				669	The %name directive allows you to generate two or more different
				670	parsers and link them all into the same executable.
				671	</p>
				672
				673	<h4>The <tt>%nonassoc</tt> directive</h4>
				674
				675	<p>This directive is used to assign non-associative precedence to
				676	one or more terminal symbols. See the section on precedence rules
				677	or on the %left directive for additional information.</p>
				678
				679	<h4>The <tt>%parse_accept</tt> directive</h4>
				680
				681	<p>The %parse_accept directive specifies a block of C code that is
				682	executed whenever the parser accepts its input string. To ``accept''
				683	an input string means that the parser was able to process all tokens
				684	without error.</p>
				685
				686	<p>For example:</p>
				687
				688	<p><pre>
				689	%parse_accept {
				690	printf("parsing complete!\n");
				691	}
				692	</pre></p>
				693
				694
				695	<h4>The <tt>%parse_failure</tt> directive</h4>
				696
<p>The %parse_failure directive specifies a block of C code that
is executed whenever the parser fails completely. This code is not
executed until the parser has tried and failed to resolve an input
error using its usual error recovery strategy. The routine is
only invoked when parsing is unable to continue.</p>
				702
				703	<p><pre>
				704	%parse_failure {
				705	fprintf(stderr,"Giving up. Parser is hopelessly lost...\n");
				706	}
				707	</pre></p>
				708
				709	<h4>The <tt>%right</tt> directive</h4>
				710
				711	<p>This directive is used to assign right-associative precedence to
				712	one or more terminal symbols. See the section on precedence rules
				713	or on the %left directive for additional information.</p>
				714
				715	<h4>The <tt>%stack_overflow</tt> directive</h4>
				716
				717	<p>The %stack_overflow directive specifies a block of C code that
				718	is executed if the parser's internal stack ever overflows. Typically
				719	this just prints an error message. After a stack overflow, the parser
				720	will be unable to continue and must be reset.</p>
				721
				722	<p><pre>
				723	%stack_overflow {
				724	fprintf(stderr,"Giving up. Parser stack overflow\n");
				725	}
				726	</pre></p>
				727
<p>You can help prevent parser stack overflows by avoiding the use
of right recursion and right-precedence operators in your grammar.
Use left recursion and left-precedence operators instead, to
encourage rules to reduce sooner and keep the stack size down.
For example, write rules like this:
				733	<pre>
				734	list ::= list element. // left-recursion. Good!
				735	list ::= .
				736	</pre>
				737	Not like this:
				738	<pre>
				739	list ::= element list. // right-recursion. Bad!
				740	list ::= .
				741	</pre>
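<p>To see why the left-recursive form is preferred, here is a hand-worked
trace of the parser stack for a list of three elements, written e1 e2 e3.
The element names are hypothetical, and the parser's own end-of-input
handling is left out. With the left-recursive rules, each element is
reduced as soon as it has been shifted:</p>
<pre>
action                          stack afterwards
reduce list ::= .               list
shift e1                        list e1
reduce list ::= list element.   list
shift e2                        list e2
reduce list ::= list element.   list
shift e3                        list e3
reduce list ::= list element.   list
</pre>
<p>The stack never holds more than two symbols. With the right-recursive
rules, nothing can be reduced until the last element has been seen:</p>
<pre>
action                          stack afterwards
shift e1                        e1
shift e2                        e1 e2
shift e3                        e1 e2 e3
reduce list ::= .               e1 e2 e3 list
reduce list ::= element list.   e1 e2 list
reduce list ::= element list.   e1 list
reduce list ::= element list.   list
</pre>
<p>Here the stack grows to one more entry than the number of elements, so
a long enough list will overflow any fixed-size stack.</p>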
				742
				743	<h4>The <tt>%stack_size</tt> directive</h4>
				744
<p>If stack overflow is a problem and you can't resolve the trouble
by using left-recursion, then you might want to increase the size
of the parser's stack using this directive. Put a positive integer
after the %stack_size directive and Lemon will generate a parser
with a stack of the requested size. The default value is 100.</p>
				750
				751	<p><pre>
				752	%stack_size 2000
				753	</pre></p>
				754
				755	<h4>The <tt>%start_symbol</tt> directive</h4>
				756
				757	<p>By default, the start-symbol for the grammar that Lemon generates
				758	is the first non-terminal that appears in the grammar file. But you
				759	can choose a different start-symbol using the %start_symbol directive.</p>
				760
				761	<p><pre>
				762	%start_symbol prog
				763	</pre></p>
				764
				765	<h4>The <tt>%token_destructor</tt> directive</h4>
				766
<p>The %destructor directive assigns a destructor to a non-terminal
symbol. (See the description of the %destructor directive above.)
The %token_destructor directive does the same thing for all terminal
symbols.</p>
				770
				771	<p>Unlike non-terminal symbols which may each have a different data type
				772	for their values, terminals all use the same data type (defined by
				773	the %token_type directive) and so they use a common destructor. Other
				774	than that, the token destructor works just like the non-terminal
				775	destructors.</p>
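
<p>A minimal sketch, assuming tokens are pointers to a hypothetical
<tt>Token</tt> structure that the tokenizer allocates with malloc(), and
that &lt;stdlib.h&gt; has been included with the %include directive.
As with %destructor, the symbol ``$$'' stands for the value being
destroyed:</p>

<p><pre>
%token_type {Token*}
%token_destructor { free($$); }
</pre></p>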
				776
				777	<h4>The <tt>%token_prefix</tt> directive</h4>
				778
				779	<p>Lemon generates #defines that assign small integer constants
				780	to each terminal symbol in the grammar. If desired, Lemon will
				781	add a prefix specified by this directive
				782	to each of the #defines it generates.
				783	So if the default output of Lemon looked like this:
				784	<pre>
				785	#define AND 1
				786	#define MINUS 2
				787	#define OR 3
				788	#define PLUS 4
				789	</pre>
				790	You can insert a statement into the grammar like this:
				791	<pre>
				792	%token_prefix TOKEN_
				793	</pre>
				794	to cause Lemon to produce these symbols instead:
				795	<pre>
				796	#define TOKEN_AND 1
				797	#define TOKEN_MINUS 2
				798	#define TOKEN_OR 3
				799	#define TOKEN_PLUS 4
				800	</pre>
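<p>The prefix matters mostly on the C side, where short names such as AND
or PLUS could collide with other identifiers. The following sketch shows
how a tokenizer might use the prefixed names. The #defines stand in for
the header file Lemon would generate, and <tt>token_for_char()</tt> is a
hypothetical helper, not something Lemon provides.</p>
<pre>
/* Stand-ins for the #defines in the header file Lemon generates. */
#define TOKEN_AND    1
#define TOKEN_MINUS  2
#define TOKEN_OR     3
#define TOKEN_PLUS   4

/* Hypothetical helper: return the token code for a one-character
** operator, or -1 if the character is not one of these operators. */
static int token_for_char(char c){
  switch( c ){
    case '-':  return TOKEN_MINUS;
    case '|':  return TOKEN_OR;
    case '+':  return TOKEN_PLUS;
    default:   return -1;
  }
}
</pre>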
				801
				802	<h4>The <tt>%token_type</tt> and <tt>%type</tt> directives</h4>
				803
<p>These directives are used to specify the data types for values
on the parser's stack associated with terminal and non-terminal
symbols. The values of all terminal symbols must be of the same
type. This turns out to be the same data type as the third parameter
to the Parse() function generated by Lemon. Typically, you will
make the value of a terminal symbol a pointer to some kind of
token structure, like this:</p>
				811
				812	<p><pre>
				813	%token_type {Token*}
				814	</pre></p>
				815
<p>If the data type of terminals is not specified, the default type
is ``int''.</p>
				818
				819	<p>Non-terminal symbols can each have their own data types. Typically
				820	the data type of a non-terminal is a pointer to the root of a parse-tree
				821	structure that contains all information about that non-terminal.
				822	For example:</p>
				823
				824	<p><pre>
				825	%type expr {Expr*}
				826	</pre></p>
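
<p>As a sketch of how the declared types are used, the fragment below
shows rule actions that receive and produce values of those types. It is
a fragment of a larger grammar, and <tt>makeLeaf()</tt> and
<tt>makeBinary()</tt> are hypothetical helpers assumed to be declared
elsewhere, each returning an <tt>Expr*</tt>.</p>

<p><pre>
%token_type {Token*}
%type expr {Expr*}
%left PLUS.

expr(A) ::= INTEGER(T).            { A = makeLeaf(T); }       // T is a Token*
expr(A) ::= expr(B) PLUS expr(C).  { A = makeBinary(B, C); }  // A, B and C are Expr*
</pre></p>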
				827
				828	<p>Each entry on the parser's stack is actually a union containing
				829	instances of all data types for every non-terminal and terminal symbol.
				830	Lemon will automatically use the correct element of this union depending
				831	on what the corresponding non-terminal or terminal symbol is. But
				832	the grammar designer should keep in mind that the size of the union
				833	will be the size of its largest element. So if you have a single
				834	non-terminal whose data type requires 1K of storage, then your 100
				835	entry parser stack will require 100K of heap space. If you are willing
				836	and able to pay that price, fine. You just need to know.</p>
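
<p>The arithmetic can be checked with a simplified stand-in for that
union. The names <tt>StackValue</tt>, <tt>STACK_DEPTH</tt> and
<tt>stack_value_bytes()</tt> and the member names are assumptions for
this sketch. It is not the code Lemon emits, and a real stack entry also
records parser state alongside the value.</p>

<p><pre>
typedef struct Token Token;   /* hypothetical token structure */
typedef struct Expr Expr;     /* hypothetical parse-tree node */

/* Simplified stand-in for the value union on the parser's stack. */
typedef union {
  Token *yy0;        /* from %token_type {Token*} */
  Expr *yy1;         /* from %type expr {Expr*} */
  char big[1024];    /* a non-terminal whose data type needs 1K */
} StackValue;

enum { STACK_DEPTH = 100 };   /* the default %stack_size */

/* Bytes needed for the value part of a full stack. */
unsigned long stack_value_bytes(void){
  return (unsigned long)(sizeof(StackValue) * STACK_DEPTH);
}
</pre></p>

<p>Because a union is at least as large as its largest member, every
entry costs at least 1024 bytes here, even when it only holds a
pointer.</p>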
				837
				838	<h3>Error Processing</h3>
				839
				840	<p>After extensive experimentation over several years, it has been
				841	discovered that the error recovery strategy used by yacc is about
				842	as good as it gets. And so that is what Lemon uses.</p>
				843
<p>When a Lemon-generated parser encounters a syntax error, it
first invokes the code specified by the %syntax_error directive, if
any. It then enters its error recovery strategy. The error recovery
strategy is to begin popping the parser's stack until it enters a
state where it is permitted to shift a special non-terminal symbol
named ``error''. It then shifts this non-terminal and continues
parsing. But the %syntax_error routine will not be called again
until at least three new tokens have been successfully shifted.</p>
				852
<p>If the parser pops its stack until the stack is empty, and it still
is unable to shift the error symbol, then the %parse_failure routine
is invoked and the parser resets itself to its start state, ready
to begin parsing a new file. This is what will happen at the very
first syntax error, of course, if there are no instances of the
``error'' non-terminal in your grammar.</p>
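
<p>As an illustration, the grammar below uses the ``error'' symbol to
recover at statement boundaries. The <tt>program</tt>, <tt>stmt_list</tt>,
<tt>stmt</tt> and <tt>expr</tt> non-terminals and the SEMI and INTEGER
terminals are assumptions for this sketch, and the fprintf() call assumes
&lt;stdio.h&gt; has been included with the %include directive.</p>

<p><pre>
%syntax_error {
  fprintf(stderr,"Syntax error\n");
}

program ::= stmt_list.
stmt_list ::= stmt_list stmt.
stmt_list ::= .
stmt ::= expr SEMI.
stmt ::= error SEMI.   // resume parsing after the next semicolon
expr ::= INTEGER.
</pre></p>

<p>With a rule like this, a malformed statement should cause the parser
to pop back to a state where ``error'' can be shifted and then pick up
again once it reaches a SEMI, rather than invoking the %parse_failure
code.</p>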
				859
				860	</body>
				861	</html>