Blame - doc/lemon.html - chromium.googlesource.com/chromium/deps/sqlite

drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1	<html>
				2	<head>
				3	<title>The Lemon Parser Generator</title>
				4	</head>
				5	<body bgcolor=white>
				6	<h1 align=center>The Lemon Parser Generator</h1>
				7
				8	<p>Lemon is an LALR(1) parser generator for C or C++.
				9	It does the same job as ``bison'' and ``yacc''.
				10	But lemon is not another bison or yacc clone. It
				11	uses a different grammar syntax which is designed to
				12	reduce the number of coding errors. Lemon also uses a more
				13	sophisticated parsing engine that is faster than yacc and
				14	bison and which is both reentrant and thread-safe.
				15	Furthermore, Lemon implements features that can be used
				16	to eliminate resource leaks, making it suitable for use
				17	in long-running programs such as graphical user interfaces
				18	or embedded controllers.</p>
				19
				20	<p>This document is an introduction to the Lemon
				21	parser generator.</p>
				22
				23	<h2>Theory of Operation</h2>
				24
				25	<p>The main goal of Lemon is to translate a context free grammar (CFG)
				26	for a particular language into C code that implements a parser for
				27	that language.
				28	The program has two inputs:
				29	<ul>
				30	<li>The grammar specification.
				31	<li>A parser template file.
				32	</ul>
				33	Typically, only the grammar specification is supplied by the programmer.
				34	Lemon comes with a default parser template which works fine for most
				35	applications. But the user is free to substitute a different parser
				36	template if desired.</p>
				37
				38	<p>Depending on command-line options, Lemon will generate between
				39	one and three output files.
				40	<ul>
				41	<li>C code to implement the parser.
				42	<li>A header file defining an integer ID for each terminal symbol.
				43	<li>An information file that describes the states of the generated parser
				44	automaton.
				45	</ul>
				46	By default, all three of these output files are generated.
				47	The header file is suppressed if the ``-m'' command-line option is
				48	used and the report file is omitted when ``-q'' is selected.</p>
				49
				50	<p>The grammar specification file uses a ``.y'' suffix, by convention.
				51	In the examples used in this document, we'll assume the name of the
				52	grammar file is ``gram.y''. A typical use of Lemon would be the
				53	following command:
				54	<pre>
				55	lemon gram.y
				56	</pre>
				57	This command will generate three output files named ``gram.c'',
				58	``gram.h'' and ``gram.out''.
				59	The first is C code to implement the parser. The second
				60	is the header file that defines numerical values for all
				61	terminal symbols, and the last is the report that explains
				62	the states used by the parser automaton.</p>
				63
				64	<h3>Command Line Options</h3>
				65
				66	<p>The behavior of Lemon can be modified using command-line options.
				67	You can obtain a list of the available command-line options together
				68	with a brief explanation of what each does by typing
				69	<pre>
				70	lemon -?
				71	</pre>
				72	As of this writing, the following command-line options are supported:
				73	<ul>
				74	<li><tt>-b</tt>
				75	<li><tt>-c</tt>
				76	<li><tt>-g</tt>
				77	<li><tt>-m</tt>
				78	<li><tt>-q</tt>
				79	<li><tt>-s</tt>
				80	<li><tt>-x</tt>
				81	</ul>
				82	The ``-b'' option reduces the amount of text in the report file by
				83	printing only the basis of each parser state, rather than the full
				84	configuration.
				85	The ``-c'' option suppresses action table compression. Using -c
				86	will make the parser a little larger and slower but it will detect
				87	syntax errors sooner.
				88	The ``-g'' option causes no output files to be generated at all.
				89	Instead, the input grammar file is printed on standard output but
				90	with all comments, actions and other extraneous text deleted. This
				91	is a useful way to get a quick summary of a grammar.
				92	The ``-m'' option causes the output C source file to be compatible
				93	with the ``makeheaders'' program.
				94	Makeheaders is a program that automatically generates header files
				95	from C source code. When the ``-m'' option is used, the header
				96	file is not output since the makeheaders program will take care
				97	of generating all header files automatically.
				98	The ``-q'' option suppresses the report file.
				99	Using ``-s'' causes a brief summary of parser statistics to be
				100	printed, like this:
				101	<pre>
				102	Parser statistics: 74 terminals, 70 nonterminals, 179 rules
				103	340 states, 2026 parser table entries, 0 conflicts
				104	</pre>
				105	Finally, the ``-x'' option causes Lemon to print its version number
				106	and copyright information
				107	and then stop without attempting to read the grammar or generate a parser.</p>
				108
				109	<h3>The Parser Interface</h3>
				110
				111	<p>Lemon doesn't generate a complete, working program. It only generates
				112	a few subroutines that implement a parser. This section describes
				113	the interface to those subroutines. It is up to the programmer to
				114	call these subroutines in an appropriate way in order to produce a
				115	complete system.</p>
				116
				117	<p>Before a program begins using a Lemon-generated parser, the program
				118	must first create the parser.
				119	A new parser is created as follows:
				120	<pre>
				121	void *pParser = ParseAlloc( malloc );
				122	</pre>
				123	The ParseAlloc() routine allocates and initializes a new parser and
				124	returns a pointer to it.
				125	The actual data structure used to represent a parser is opaque --
				126	its internal structure is not visible or usable by the calling routine.
				127	For this reason, the ParseAlloc() routine returns a pointer to void
				128	rather than a pointer to some particular structure.
				129	The sole argument to the ParseAlloc() routine is a pointer to the
				130	subroutine used to allocate memory. Typically this means ``malloc()''.</p>
				131
				132	<p>After a program is finished using a parser, it can reclaim all
				133	memory allocated by that parser by calling
				134	<pre>
				135	ParseFree(pParser, free);
				136	</pre>
				137	The first argument is the same pointer returned by ParseAlloc(). The
				138	second argument is a pointer to the function used to release bulk
				139	memory back to the system.</p>
				140
				141	<p>After a parser has been allocated using ParseAlloc(), the programmer
				142	must supply the parser with a sequence of tokens (terminal symbols) to
				143	be parsed. This is accomplished by calling the following function
				144	once for each token:
				145	<pre>
				146	Parse(pParser, hTokenID, sTokenData, pArg);
				147	</pre>
				148	The first argument to the Parse() routine is the pointer returned by
				149	ParseAlloc().
				150	The second argument is a small positive integer that tells the parser the
				151	type of the next token in the data stream.
				152	There is one token type for each terminal symbol in the grammar.
				153	The gram.h file generated by Lemon contains #define statements that
				154	map symbolic terminal symbol names into appropriate integer values.
				155	(A value of 0 for the second argument is a special flag to the
				156	parser to indicate that the end of input has been reached.)
				157	The third argument is the value of the given token. By default,
				158	the type of the third argument is integer, but the grammar will
				159	usually redefine this type to be some kind of structure.
				160	Typically the second argument will be a broad category of tokens
				161	such as ``identifier'' or ``number'' and the third argument will
				162	be the name of the identifier or the value of the number.</p>
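<p>As a rough illustration of how the second and third arguments relate,
the following sketch shows a hand-written tokenizer for arithmetic input.
The token codes (<tt>TK_PLUS</tt> and so on) and the <tt>next_token()</tt>
helper are hypothetical names invented for this example. In a real project
the codes would come from the header file that Lemon generates, and the
token value would have whatever type the grammar declares. Here the value
is a plain integer, matching the default.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical token codes. A real build would take these from gram.h. */
#define TK_PLUS   1
#define TK_TIMES  2
#define TK_LPAREN 3
#define TK_RPAREN 4
#define TK_VALUE  5

/*
** Scan one token starting at *pz and advance *pz past it.
** Returns the token code, 0 at end of input, or -1 for a character
** that is not part of the language.  *pValue receives the number for
** a TK_VALUE token and 0 for every other token.
*/
int next_token(const char **pz, int *pValue){
  const char *z;
  assert( pz!=0 && *pz!=0 && pValue!=0 );
  z = *pz;
  while( isspace((unsigned char)*z) ) z++;   /* skip leading whitespace */
  *pValue = 0;
  if( *z=='\0' ){
    *pz = z;
    return 0;                                /* end of input */
  }
  if( isdigit((unsigned char)*z) ){
    char *zEnd;
    *pValue = (int)strtol(z, &zEnd, 10);     /* token value: the number */
    *pz = zEnd;
    return TK_VALUE;                         /* token type: broad category */
  }
  *pz = z + 1;                               /* single-character tokens */
  switch( *z ){
    case '+': return TK_PLUS;
    case '*': return TK_TIMES;
    case '(': return TK_LPAREN;
    case ')': return TK_RPAREN;
    default:  return -1;
  }
}
```

<p>Each code and value pair returned by this helper would be handed to
Parse() as its second and third arguments, and a return value of 0
corresponds to the final end-of-input call.</p>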
				163
				164	<p>The Parse() function may have either three or four arguments,
				165	depending on the grammar. If the grammar specification file requests
				166	it, the Parse() function will have a fourth parameter that can be
				167	of any type chosen by the programmer. The parser doesn't do anything
				168	with this argument except to pass it through to action routines.
				169	This is a convenient mechanism for passing state information down
				170	to the action routines without having to use global variables.</p>
				171
				172	<p>A typical use of a Lemon parser might look something like the
				173	following:
				174	<pre>
				175	01 ParseTree *ParseFile(const char *zFilename){
				176	02    Tokenizer *pTokenizer;
				177	03    void *pParser;
				178	04    Token sToken;
				179	05    int hTokenId;
				180	06    ParserState sState;
				181	07
				182	08    pTokenizer = TokenizerCreate(zFilename);
				183	09    pParser = ParseAlloc( malloc );
				184	10    InitParserState(&sState);
				185	11    while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){
				186	12       Parse(pParser, hTokenId, sToken, &sState);
				187	13    }
				188	14    Parse(pParser, 0, sToken, &sState);
				189	15    ParseFree(pParser, free );
				190	16    TokenizerFree(pTokenizer);
				191	17    return sState.treeRoot;
				192	18 }
				193	</pre>
				194	This example shows a user-written routine that parses a file of
				195	text and returns a pointer to the parse tree.
				196	(We've omitted all error-handling from this example to keep it
				197	simple.)
				198	We assume the existence of some kind of tokenizer which is created
				199	using TokenizerCreate() on line 8 and deleted by TokenizerFree()
				200	on line 16. The GetNextToken() function on line 11 retrieves the
				201	next token from the input file and puts its type in the
				202	integer variable hTokenId. The sToken variable is assumed to be
				203	some kind of structure that contains details about each token,
				204	such as its complete text, what line it occurs on, etc. </p>
				205
				206	<p>This example also assumes the existence of a structure of type
				207	ParserState that holds state information about a particular parse.
				208	An instance of such a structure is created on line 6 and initialized
				209	on line 10. A pointer to this structure is passed into the Parse()
				210	routine as the optional 4th argument.
				211	The action routine specified by the grammar for the parser can use
				212	the ParserState structure to hold whatever information is useful and
				213	appropriate. In the example, we note that the treeRoot field of
				214	the ParserState structure is left pointing to the root of the parse
				215	tree.</p>
				216
				217	<p>The core of this example as it relates to Lemon is as follows:
				218	<pre>
				219	ParseFile(){
				220	   pParser = ParseAlloc( malloc );
				221	   while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){
				222	      Parse(pParser, hTokenId, sToken);
				223	   }
				224	   Parse(pParser, 0, sToken);
				225	   ParseFree(pParser, free );
				226	}
				227	</pre>
				228	Basically, what a program has to do to use a Lemon-generated parser
				229	is first create the parser, then send it lots of tokens obtained by
				230	tokenizing an input source. When the end of input is reached, the
				231	Parse() routine should be called one last time with a token type
				232	of 0. This step is necessary to inform the parser that the end of
				233	input has been reached. Finally, we reclaim memory used by the
				234	parser by calling ParseFree().</p>
				235
				236	<p>There is one other interface routine that should be mentioned
				237	before we move on.
				238	The ParseTrace() function can be used to generate debugging output
				239	from the parser. A prototype for this routine is as follows:
				240	<pre>
				241	ParseTrace(FILE *stream, char *zPrefix);
				242	</pre>
				243	After this routine is called, a short (one-line) message is written
				244	to the designated output stream every time the parser changes states
				245	or calls an action routine. Each such message is prefaced using
				246	the text given by zPrefix. This debugging output can be turned off
				247	by calling ParseTrace() again with a first argument of NULL (0).</p>
				248
				249	<h3>Differences With YACC and BISON</h3>
				250
				251	<p>Programmers who have previously used the yacc or bison parser
				252	generator will notice several important differences between yacc and/or
				253	bison and Lemon.
				254	<ul>
				255	<li>In yacc and bison, the parser calls the tokenizer. In Lemon,
				256	the tokenizer calls the parser.
				257	<li>Lemon uses no global variables. Yacc and bison use global variables
				258	to pass information between the tokenizer and parser.
				259	<li>Lemon allows multiple parsers to be running simultaneously. Yacc
				260	and bison do not.
				261	</ul>
				262	These differences may cause some initial confusion for programmers
				263	with prior yacc and bison experience.
				264	But after years of experience using Lemon, I firmly
				265	believe that the Lemon way of doing things is better.</p>
				266
				267	<h2>Input File Syntax</h2>
				268
				269	<p>The main purpose of the grammar specification file for Lemon is
				270	to define the grammar for the parser. But the input file also
				271	specifies additional information Lemon requires to do its job.
				272	Most of the work in using Lemon is in writing an appropriate
				273	grammar file.</p>
				274
				275	<p>The grammar file for lemon is, for the most part, free format.
				276	It does not have sections or divisions like yacc or bison. Any
				277	declaration can occur at any point in the file.
				278	Lemon ignores whitespace (except where it is needed to separate
				279	tokens) and it honors the same commenting conventions as C and C++.</p>
				280
				281	<h3>Terminals and Nonterminals</h3>
				282
				283	<p>A terminal symbol (token) is any string of alphanumeric
				284	and underscore characters
				285	that begins with an upper case letter.
				286	A terminal can contain lower class letters after the first character,
				287	but the usual convention is to make terminals all upper case.
				288	A nonterminal, on the other hand, is any string of alphanumeric
				289	and underscore characters that begins with a lower case letter.
				290	Again, the usual convention is to make nonterminals use all lower
				291	case letters.</p>
				292
				293	<p>In Lemon, terminal and nonterminal symbols do not need to
				294	be declared or identified in a separate section of the grammar file.
				295	Lemon is able to generate a list of all terminals and nonterminals
				296	by examining the grammar rules, and it can always distinguish a
				297	terminal from a nonterminal by checking the case of the first
				298	character of the name.</p>
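<p>The naming rule can be sketched in a few lines of C. The
<tt>classify_symbol()</tt> function and the <tt>SymbolKind</tt> type below
are hypothetical names made up for this illustration. They are not part of
Lemon, and Lemon's own implementation may differ in detail.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum SymbolKind { SYM_INVALID, SYM_TERMINAL, SYM_NONTERMINAL };

/*
** Classify a grammar symbol name using the rule described above:
** the name must consist of alphanumerics and underscores, and the case
** of the first character decides between terminal and nonterminal.
*/
enum SymbolKind classify_symbol(const char *zName){
  size_t i;
  assert( zName!=0 );
  if( zName[0]=='\0' ) return SYM_INVALID;
  for(i=0; zName[i]!='\0'; i++){
    unsigned char c = (unsigned char)zName[i];
    if( !isalnum(c) && c!='_' ) return SYM_INVALID;
  }
  if( isupper((unsigned char)zName[0]) ) return SYM_TERMINAL;
  if( islower((unsigned char)zName[0]) ) return SYM_NONTERMINAL;
  return SYM_INVALID;   /* first character is a digit or an underscore */
}
```

<p>With this helper, ``PLUS'' and ``Value_1'' are classified as terminals
and ``expr'' as a nonterminal, while a quoted character such as ')' is
rejected.</p>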
				299
				300	<p>Yacc and bison allow terminal symbols to have either alphanumeric
				301	names or to be individual characters included in single quotes, like
				302	this: ')' or '$'. Lemon does not allow this alternative form for
				303	terminal symbols. With Lemon, all symbols, terminals and nonterminals,
				304	must have alphanumeric names.</p>
				305
				306	<h3>Grammar Rules</h3>
				307
				308	<p>The main component of a Lemon grammar file is a sequence of grammar
				309	rules.
				310	Each grammar rule consists of a nonterminal symbol followed by
				311	the special symbol ``::='' and then a list of terminals and/or nonterminals.
				312	The rule is terminated by a period.
				313	The list of terminals and nonterminals on the right-hand side of the
				314	rule can be empty.
				315	Rules can occur in any order, except that the left-hand side of the
				316	first rule is assumed to be the start symbol for the grammar (unless
				317	specified otherwise using the <tt>%start_symbol</tt> directive described below.)
				318	A typical sequence of grammar rules might look something like this:
				319	<pre>
				320	expr ::= expr PLUS expr.
				321	expr ::= expr TIMES expr.
				322	expr ::= LPAREN expr RPAREN.
				323	expr ::= VALUE.
				324	</pre>
				325	</p>
				326
				327	<p>There is one non-terminal in this example, ``expr'', and five
				328	terminal symbols or tokens: ``PLUS'', ``TIMES'', ``LPAREN'',
				329	``RPAREN'' and ``VALUE''.</p>
				330
				331	<p>Like yacc and bison, Lemon allows the grammar to specify a block
				332	of C code that will be executed whenever a grammar rule is reduced
				333	by the parser.
				334	In Lemon, this action is specified by putting the C code (contained
				335	within curly braces <tt>{...}</tt>) immediately after the
				336	period that closes the rule.
				337	For example:
				338	<pre>
				339	expr ::= expr PLUS expr. { printf("Doing an addition...\n"); }
				340	</pre>
				341	</p>
				342
				343	<p>In order to be useful, grammar actions must normally be linked to
				344	their associated grammar rules.
				345	In yacc and bison, this is accomplished by embedding a ``$$'' in the
				346	action to stand for the value of the left-hand side of the rule and
				347	symbols ``$1'', ``$2'', and so forth to stand for the value of
				348	the terminal or nonterminal at position 1, 2 and so forth on the
				349	right-hand side of the rule.
				350	This idea is very powerful, but it is also very error-prone. The
				351	single most common source of errors in a yacc or bison grammar is
				352	to miscount the number of symbols on the right-hand side of a grammar
				353	rule and say ``$7'' when you really mean ``$8''.</p>
				354
				355	<p>Lemon avoids the need to count grammar symbols by assigning symbolic
				356	names to each symbol in a grammar rule and then using those symbolic
				357	names in the action.
				358	In yacc or bison, one would write this:
				359	<pre>
				360	expr : expr PLUS expr { $$ = $1 + $3; };
				361	</pre>
				362	But in Lemon, the same rule becomes the following:
				363	<pre>
				364	expr(A) ::= expr(B) PLUS expr(C). { A = B+C; }
				365	</pre>
				366	In the Lemon rule, any symbol in parentheses after a grammar rule
				367	symbol becomes a place holder for that symbol in the grammar rule.
				368	This place holder can then be used in the associated C action to
				369	stand for the value of that symbol.</p>
				370
				371	<p>The Lemon notation for linking a grammar rule with its reduce
				372	action is superior to yacc/bison on several counts.
				373	First, as mentioned above, the Lemon method avoids the need to
				374	count grammar symbols.
				375	Secondly, if a terminal or nonterminal in a Lemon grammar rule
				376	includes a linking symbol in parentheses but that linking symbol
				377	is not actually used in the reduce action, then an error message
				378	is generated.
				379	For example, the rule
				380	<pre>
				381	expr(A) ::= expr(B) PLUS expr(C). { A = B; }
				382	</pre>
				383	will generate an error because the linking symbol ``C'' is used
				384	in the grammar rule but not in the reduce action.</p>
				385
				386	<p>The Lemon notation for linking grammar rules to reduce actions
				387	also facilitates the use of destructors for reclaiming memory
				388	allocated by the values of terminals and nonterminals on the
				389	right-hand side of a rule.</p>
				390
				391	<h3>Precedence Rules</h3>
				392
				393	<p>Lemon resolves parsing ambiguities in exactly the same way as
				394	yacc and bison. A shift-reduce conflict is resolved in favor
				395	of the shift, and a reduce-reduce conflict is resolved by reducing
				396	whichever rule comes first in the grammar file.</p>
				397
				398	<p>Just like in
				399	yacc and bison, Lemon allows a measure of control
				400	over the resolution of parsing conflicts using precedence rules.
				401	A precedence value can be assigned to any terminal symbol
				402	using the %left, %right or %nonassoc directives. Terminal symbols
				403	mentioned in earlier directives have a lower precedence than
				404	terminal symbols mentioned in later directives. For example:</p>
				405
				406	<p><pre>
				407	%left AND.
				408	%left OR.
				409	%nonassoc EQ NE GT GE LT LE.
				410	%left PLUS MINUS.
				411	%left TIMES DIVIDE MOD.
				412	%right EXP NOT.
				413	</pre></p>
				414
				415	<p>In the preceding sequence of directives, the AND operator is
				416	defined to have the lowest precedence. The OR operator is one
				417	precedence level higher. And so forth. Hence, the grammar would
				418	attempt to group the ambiguous expression
				419	<pre>
				420	a AND b OR c
				421	</pre>
				422	like this
				423	<pre>
				424	a AND (b OR c).
				425	</pre>
				426	The associativity (left, right or nonassoc) is used to determine
				427	the grouping when the precedence is the same. AND is left-associative
				428	in our example, so
				429	<pre>
				430	a AND b AND c
				431	</pre>
				432	is parsed like this
				433	<pre>
				434	(a AND b) AND c.
				435	</pre>
				436	The EXP operator is right-associative, though, so
				437	<pre>
				438	a EXP b EXP c
				439	</pre>
				440	is parsed like this
				441	<pre>
				442	a EXP (b EXP c).
				443	</pre>
				444	The nonassoc precedence is used for non-associative operators.
				445	So
				446	<pre>
				447	a EQ b EQ c
				448	</pre>
				449	is an error.</p>
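<p>To make the effect of associativity on grouping concrete, here is a
small stand-alone evaluator. It uses precedence climbing, a recursive
technique that is different from the LALR(1) tables Lemon generates, so it
only illustrates the grouping and says nothing about Lemon's internals.
The operator table, the single-digit operands and all function names are
assumptions made for this sketch.</p>

```c
#include <assert.h>

typedef enum { LEFT, RIGHT } Assoc;
typedef struct { char op; int prec; Assoc assoc; } OpInfo;

/* Hypothetical operator table: later entries bind more tightly. */
static const OpInfo aOp[] = {
  { '-', 1, LEFT  },
  { '*', 2, LEFT  },
  { '^', 3, RIGHT },   /* exponentiation, right-associative */
};

static const OpInfo *find_op(char c){
  unsigned i;
  for(i=0; i<sizeof(aOp)/sizeof(aOp[0]); i++){
    if( aOp[i].op==c ) return &aOp[i];
  }
  return 0;
}

static long ipow(long b, long e){
  long r = 1;
  while( e-- > 0 ) r *= b;
  return r;
}

static long apply(char op, long a, long b){
  switch( op ){
    case '-': return a - b;
    case '*': return a * b;
    default:  return ipow(a, b);
  }
}

/* Parse and evaluate operators whose precedence is at least minPrec.
** Operands are single decimal digits to keep the sketch short. */
static long parse_expr(const char **pz, int minPrec){
  long lhs = *(*pz)++ - '0';
  const OpInfo *p;
  while( (p = find_op(**pz))!=0 && p->prec>=minPrec ){
    long rhs;
    (*pz)++;
    /* A left-associative operator refuses an equal-precedence operator
    ** on its right; a right-associative operator accepts it. */
    rhs = parse_expr(pz, p->assoc==LEFT ? p->prec+1 : p->prec);
    lhs = apply(p->op, lhs, rhs);
  }
  return lhs;
}

long eval(const char *z){
  assert( z!=0 );
  return parse_expr(&z, 0);
}
```

<p>Under these assumptions, <tt>eval("9-4-3")</tt> groups as (9-4)-3 and
<tt>eval("2^3^2")</tt> groups as 2^(3^2), mirroring the AND and EXP
examples above.</p>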
				450
				451	<p>The precedence of terminals is transferred to rules as follows:
				452	The precedence of a grammar rule is equal to the precedence of the
				453	left-most terminal symbol in the rule for which a precedence is
				454	defined. This is normally what you want, but in those cases where
				455	you want the precedence of a grammar rule to be something different,
				456	you can specify an alternative precedence symbol by putting the
				457	symbol in square brackets after the period at the end of the rule and
				458	before any C-code. For example:</p>
				459
				460	<p><pre>
				461	expr ::= MINUS expr. [NOT]
				462	</pre></p>
				463
				464	<p>This rule has a precedence equal to that of the NOT symbol, not the
				465	MINUS symbol as would have been the case by default.</p>
				466
				467	<p>With the knowledge of how precedence is assigned to terminal
				468	symbols and individual
				469	grammar rules, we can now explain precisely how parsing conflicts
				470	are resolved in Lemon. Shift-reduce conflicts are resolved
				471	as follows:
				472	<ul>
				473	<li> If either the token to be shifted or the rule to be reduced
				474	lacks precedence information, then resolve in favor of the
				475	shift, but report a parsing conflict.
				476	<li> If the precedence of the token to be shifted is greater than
				477	the precedence of the rule to reduce, then resolve in favor
				478	of the shift. No parsing conflict is reported.
				479	<li> If the precedence of the token to be shifted is less than the
				480	precedence of the rule to reduce, then resolve in favor of the
				481	reduce action. No parsing conflict is reported.
				482	<li> If the precedences are the same and the shift token is
				483	right-associative, then resolve in favor of the shift.
				484	No parsing conflict is reported.
				485	<li> If the precedences are the same and the shift token is
				486	left-associative, then resolve in favor of the reduce.
				487	No parsing conflict is reported.
				488	<li> Otherwise, resolve the conflict by doing the shift and
				489	report the parsing conflict.
				490	</ul>
				491	Reduce-reduce conflicts are resolved this way:
				492	<ul>
				493	<li> If either reduce rule
				494	lacks precedence information, then resolve in favor of the
				495	rule that appears first in the grammar and report a parsing
				496	conflict.
				497	<li> If both rules have precedence and the precedence is different
				498	then resolve the dispute in favor of the rule with the highest
				499	precedence and do not report a conflict.
				500	<li> Otherwise, resolve the conflict by reducing by the rule that
				501	appears first in the grammar and report a parsing conflict.
				502	</ul>
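<p>The two lists above can be written down as a pair of small decision
functions. This is only a sketch of the rules as stated in this document.
The type names, the function names and the use of <tt>NO_PREC</tt> to mean
``no precedence assigned'' are assumptions for the example and do not
reflect Lemon's actual source code.</p>

```c
#include <assert.h>

#define NO_PREC (-1)   /* hypothetical marker: no precedence assigned */

typedef enum { ASSOC_LEFT, ASSOC_RIGHT, ASSOC_NONASSOC } Assoc;
typedef enum { ACT_SHIFT, ACT_REDUCE } Action;

typedef struct { Action action; int reported; } SrResult;
typedef struct { int useSecond; int reported; } RrResult;

/* Shift-reduce: tokenPrec and tokenAssoc describe the token to be
** shifted, rulePrec describes the rule to be reduced. */
SrResult resolve_shift_reduce(int tokenPrec, Assoc tokenAssoc, int rulePrec){
  SrResult r = { ACT_SHIFT, 1 };       /* default: shift and report */
  assert( tokenPrec>=NO_PREC && rulePrec>=NO_PREC );
  if( tokenPrec==NO_PREC || rulePrec==NO_PREC ) return r;
  if( tokenPrec>rulePrec ){ r.reported = 0; return r; }
  if( tokenPrec<rulePrec ){ r.action = ACT_REDUCE; r.reported = 0; return r; }
  if( tokenAssoc==ASSOC_RIGHT ){ r.reported = 0; return r; }
  if( tokenAssoc==ASSOC_LEFT ){ r.action = ACT_REDUCE; r.reported = 0; return r; }
  return r;                            /* equal precedence, nonassoc */
}

/* Reduce-reduce: firstPrec belongs to the rule that appears first in
** the grammar file, secondPrec to the rule that appears later. */
RrResult resolve_reduce_reduce(int firstPrec, int secondPrec){
  RrResult r = { 0, 1 };               /* default: first rule and report */
  assert( firstPrec>=NO_PREC && secondPrec>=NO_PREC );
  if( firstPrec==NO_PREC || secondPrec==NO_PREC ) return r;
  if( secondPrec>firstPrec ){ r.useSecond = 1; r.reported = 0; return r; }
  if( firstPrec>secondPrec ){ r.reported = 0; return r; }
  return r;                            /* equal precedence */
}
```

<p>For instance, if AND has precedence 1 and OR has precedence 2, both
left-associative, then an AND rule followed by an OR token resolves to a
shift, and an AND rule followed by another AND token resolves to a reduce,
in both cases without a reported conflict.</p>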
				503
				504	<h3>Special Directives</h3>
				505
				506	<p>The input grammar to Lemon consists of grammar rules and special
				507	directives. We've described all the grammar rules, so now we'll
				508	talk about the special directives.</p>
				509
				510	<p>Directives in lemon can occur in any order. You can put them before
				511	the grammar rules, or after the grammar rules, or in the midst of the
				512	grammar rules. It doesn't matter. The relative order of
				513	directives used to assign precedence to terminals is important, but
				514	other than that, the order of directives in Lemon is arbitrary.</p>
				515
				516	<p>Lemon supports the following special directives:
				517	<ul>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	518	<li><tt>%code</tt>
				519	<li><tt>%default_destructor</tt>
				520	<li><tt>%default_type</tt>
				521	<li><tt>%destructor</tt>
				522	<li><tt>%extra_argument</tt>
				523	<li><tt>%include</tt>
				524	<li><tt>%left</tt>
				525	<li><tt>%name</tt>
				526	<li><tt>%nonassoc</tt>
				527	<li><tt>%parse_accept</tt>
				528	<li><tt>%parse_failure</tt>
				529	<li><tt>%right</tt>
				530	<li><tt>%stack_overflow</tt>
				531	<li><tt>%stack_size</tt>
				532	<li><tt>%start_symbol</tt>
				533	<li><tt>%syntax_error</tt>
				534	<li><tt>%token_destructor</tt>
				535	<li><tt>%token_prefix</tt>
				536	<li><tt>%token_type</tt>
				537	<li><tt>%type</tt>
				538	</ul>
				539	Each of these directives will be described separately in the
				540	following sections:</p>
				541
				542	<h4>The <tt>%code</tt> directive</h4>
				543
				544	<p>The %code directive is used to specify additional C/C++ code that
				545	is added to the end of the main output file. This is similar to
				546	the %include directive except that %include is inserted at the
				547	beginning of the main output file.</p>
				548
				549	<p>%code is typically used to include some action routines or perhaps
				550	a tokenizer as part of the output file.</p>
				551
				552	<h4>The <tt>%default_destructor</tt> directive</h4>
				553
				554	<p>The %default_destructor directive specifies a destructor to
				555	use for non-terminals that do not have their own destructor
				556	specified by a separate %destructor directive. See the documentation
				557	on the %destructor directive below for additional information.</p>
				558
				559	<p>In some grammars, many different non-terminal symbols have the
				560	same datatype and hence the same destructor. This directive is
				561	a convenient way to specify the same destructor for all those
				562	non-terminals using a single statement.</p>
				563
				564	<h4>The <tt>%default_type</tt> directive</h4>
				565
				566	<p>The %default_type directive specifies the datatype of non-terminal
				567	symbols that do not have their own datatype defined using a separate
				568	%type directive. See the documentation on %type below for additional
				569	information.</p>
				570
				571	<h4>The <tt>%destructor</tt> directive</h4>
				572
				573	<p>The %destructor directive is used to specify a destructor for
				574	a non-terminal symbol.
				575	(See also the %token_destructor directive which is used to
				576	specify a destructor for terminal symbols.)</p>
				577
				578	<p>A non-terminal's destructor is called to dispose of the
				579	non-terminal's value whenever the non-terminal is popped from
				580	the stack. This includes all of the following circumstances:
				581	<ul>
				582	<li> When a rule reduces and the value of a non-terminal on
				583	the right-hand side is not linked to C code.
				584	<li> When the stack is popped during error processing.
				585	<li> When the ParseFree() function runs.
				586	</ul>
				587	The destructor can do whatever it wants with the value of
				588	the non-terminal, but its design is to deallocate memory
				589	or other resources held by that non-terminal.</p>
				590
				591	<p>Consider an example:
				592	<pre>
				593	%type nt {void*}
				594	%destructor nt { free($$); }
				595	nt(A) ::= ID NUM. { A = malloc( 100 ); }
				596	</pre>
				597	This example is a bit contrived but it serves to illustrate how
				598	destructors work. The example shows a non-terminal named
				599	``nt'' that holds values of type ``void*''. When the rule for
				600	an ``nt'' reduces, it sets the value of the non-terminal to
				601	space obtained from malloc(). Later, when the nt non-terminal
				602	is popped from the stack, the destructor will fire and call
				603	free() on this malloced space, thus avoiding a memory leak.
				604	(Note that the symbol ``$$'' in the destructor code is replaced
				605	by the value of the non-terminal.)</p>
				606
				607	<p>It is important to note that the value of a non-terminal is passed
				608	to the destructor whenever the non-terminal is removed from the
				609	stack, unless the non-terminal is used in a C-code action. If
				610	the non-terminal is used by C-code, then it is assumed that the
				611	C-code will take care of destroying it if it should really
				612	be destroyed. More commonly, the value is used to build some
				613	larger structure and we don't want to destroy it, which is why
				614	the destructor is not called in this circumstance.</p>
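<p>The ownership rule described above can be modelled with a tiny value
stack. This is a simplified model written for illustration, not code that
Lemon generates. The <tt>ValueStack</tt> type, the function names and the
counting destructor are all hypothetical.</p>

```c
#include <assert.h>
#include <stdlib.h>

#define STACK_MAX 8

typedef void (*Destructor)(void *);

/* Hypothetical model of a parser value stack with one destructor. */
typedef struct ValueStack {
  void *aValue[STACK_MAX];
  int nValue;
  Destructor xDestroy;
} ValueStack;

void stack_init(ValueStack *p, Destructor xDestroy){
  p->nValue = 0;
  p->xDestroy = xDestroy;
}

void stack_push(ValueStack *p, void *pValue){
  assert( p->nValue<STACK_MAX );
  p->aValue[p->nValue++] = pValue;
}

/* The value is used by action code: ownership passes to the caller
** and the destructor is NOT run. */
void *stack_pop_linked(ValueStack *p){
  assert( p->nValue>0 );
  return p->aValue[--p->nValue];
}

/* The value is not used by any action code: the destructor runs. */
void stack_pop_unlinked(ValueStack *p){
  assert( p->nValue>0 );
  p->xDestroy(p->aValue[--p->nValue]);
}

/* Everything left on the stack is destroyed, as happens during error
** recovery or when the parser itself is freed. */
void stack_clear(ValueStack *p){
  while( p->nValue>0 ) stack_pop_unlinked(p);
}

/* Example destructor that frees the value and counts its calls. */
int nDestroyed = 0;
void counting_free(void *pValue){
  free(pValue);
  nDestroyed++;
}
```

<p>In this model, a value taken with <tt>stack_pop_linked()</tt> must be
released or stored by the caller, just as an action routine becomes
responsible for any value it refers to.</p>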
				615
				616	<p>By appropriate use of destructors, it is possible to
				617	build a parser using Lemon that can be used within a long-running
				618	program, such as a GUI, that will not leak memory or other resources.
				619	To do the same using yacc or bison is much more difficult.</p>
				620
				621	<h4>The <tt>%extra_argument</tt> directive</h4>
				622
				623	<p>The %extra_argument directive instructs Lemon to add a 4th parameter
				624	to the parameter list of the Parse() function it generates. Lemon
				625	doesn't do anything itself with this extra argument, but it does
				626	make the argument available to C-code action routines, destructors,
				627	and so forth. For example, if the grammar file contains:</p>
				628
				629	<p><pre>
				630	%extra_argument { MyStruct *pAbc }
				631	</pre></p>
				632
				633	<p>Then the Parse() function generated will have a 4th parameter
				634	of type ``MyStruct*'' and all action routines will have access to
				635	a variable named ``pAbc'' that is the value of the 4th parameter
				636	in the most recent call to Parse().</p>
				637
				638	<h4>The <tt>%include</tt> directive</h4>
				639
				640	<p>The %include directive specifies C code that is included at the
				641	top of the generated parser. You can include any text you want --
				642	the Lemon parser generator copies it blindly. If you have multiple
				643	%include directives in your grammar file, the value of the last
				644	%include directive overwrites all the others.</p>
				645
				646	<p>The %include directive is very handy for getting some extra #include
				647	preprocessor statements at the beginning of the generated parser.
				648	For example:</p>
				649
				650	<p><pre>
				651	%include {#include &lt;unistd.h&gt;}
				652	</pre></p>
				653
				654	<p>This might be needed, for example, if some of the C actions in the
				655	grammar call functions that are prototyped in unistd.h.</p>
				656
				657	<h4>The <tt>%left</tt> directive</h4>
				658
				659	<p>The %left directive is used (along with the %right and
				660	%nonassoc directives) to declare precedences of terminal
				661	symbols. Every terminal symbol whose name appears after
				662	a %left directive but before the next period (``.'') is
				663	given the same left-associative precedence value. Subsequent
				664	%left directives have higher precedence. For example:</p>
				665
				666	<p><pre>
				667	%left AND.
				668	%left OR.
				669	%nonassoc EQ NE GT GE LT LE.
				670	%left PLUS MINUS.
				671	%left TIMES DIVIDE MOD.
				672	%right EXP NOT.
				673	</pre></p>
				674
				675	<p>Note the period that terminates each %left, %right or %nonassoc
				676	directive.</p>
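
<p>How declarations like these settle a shift/reduce conflict can be
sketched in C. This is a minimal model under stated assumptions, not
Lemon's internal code: the names <tt>Assoc</tt>, <tt>Action</tt>,
<tt>Prec</tt> and <tt>resolve()</tt> are hypothetical, and precedence
levels are assumed to be numbered in the order the directives appear
(AND is 1, OR is 2, and so on up to EXP and NOT at 6).</p>

```c
/* Hypothetical model of precedence-based conflict resolution.
** None of these names come from Lemon itself. */
typedef enum { ASSOC_LEFT, ASSOC_RIGHT, ASSOC_NONE } Assoc;
typedef enum { ACT_SHIFT, ACT_REDUCE, ACT_ERROR } Action;

typedef struct {
  int level;    /* 1 = first directive in the file; larger binds tighter */
  Assoc assoc;  /* associativity declared for that level */
} Prec;

/* Decide between reducing a rule (with precedence "rule") and shifting
** the lookahead token (with precedence "lookahead"). */
Action resolve(Prec rule, Prec lookahead){
  if( lookahead.level > rule.level ) return ACT_SHIFT;   /* token binds tighter */
  if( lookahead.level < rule.level ) return ACT_REDUCE;  /* rule binds tighter */
  switch( lookahead.assoc ){           /* same level: associativity decides */
    case ASSOC_LEFT:  return ACT_REDUCE;
    case ASSOC_RIGHT: return ACT_SHIFT;
    default:          return ACT_ERROR;  /* non-associative */
  }
}
```

<p>Under this model, a rule at the PLUS level followed by a TIMES token
shifts, a rule at the PLUS level followed by another PLUS reduces, and
two EQ operators in a row produce an error.</p>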
				677
<p>LALR(1) grammars can get into a situation where they require
a large amount of stack space if you make heavy use of right-associative
operators. For this reason, it is recommended that you use %left
rather than %right whenever possible.</p>
				682
				683	<h4>The <tt>%name</tt> directive</h4>
				684
				685	<p>By default, the functions generated by Lemon all begin with the
				686	five-character string ``Parse''. You can change this string to something
				687	different using the %name directive. For instance:</p>
				688
				689	<p><pre>
				690	%name Abcde
				691	</pre></p>
				692
				693	<p>Putting this directive in the grammar file will cause Lemon to generate
				694	functions named
				695	<ul>
				696	<li> AbcdeAlloc(),
				697	<li> AbcdeFree(),
				698	<li> AbcdeTrace(), and
				699	<li> Abcde().
				700	</ul>
The %name directive allows you to generate two or more different
parsers and link them all into the same executable.
				703	</p>
				704
				705	<h4>The <tt>%nonassoc</tt> directive</h4>
				706
				707	<p>This directive is used to assign non-associative precedence to
				708	one or more terminal symbols. See the section on precedence rules
				709	or on the %left directive for additional information.</p>
				710
				711	<h4>The <tt>%parse_accept</tt> directive</h4>
				712
				713	<p>The %parse_accept directive specifies a block of C code that is
				714	executed whenever the parser accepts its input string. To ``accept''
				715	an input string means that the parser was able to process all tokens
				716	without error.</p>
				717
				718	<p>For example:</p>
				719
				720	<p><pre>
				721	%parse_accept {
				722	printf("parsing complete!\n");
				723	}
				724	</pre></p>
				725
				726
				727	<h4>The <tt>%parse_failure</tt> directive</h4>
				728
<p>The %parse_failure directive specifies a block of C code that
is executed whenever the parser fails completely. This code is not
executed until the parser has tried and failed to resolve an input
error using its usual error recovery strategy. The routine is
only invoked when parsing is unable to continue.</p>
				734
				735	<p><pre>
				736	%parse_failure {
				737	fprintf(stderr,"Giving up. Parser is hopelessly lost...\n");
				738	}
				739	</pre></p>
				740
				741	<h4>The <tt>%right</tt> directive</h4>
				742
				743	<p>This directive is used to assign right-associative precedence to
				744	one or more terminal symbols. See the section on precedence rules
				745	or on the %left directive for additional information.</p>
				746
				747	<h4>The <tt>%stack_overflow</tt> directive</h4>
				748
				749	<p>The %stack_overflow directive specifies a block of C code that
				750	is executed if the parser's internal stack ever overflows. Typically
				751	this just prints an error message. After a stack overflow, the parser
				752	will be unable to continue and must be reset.</p>
				753
				754	<p><pre>
				755	%stack_overflow {
				756	fprintf(stderr,"Giving up. Parser stack overflow\n");
				757	}
				758	</pre></p>
				759
<p>You can help prevent parser stack overflows by avoiding the use
of right recursion and right-precedence operators in your grammar.
Use left recursion and left-precedence operators instead, to
encourage rules to reduce sooner and keep the stack size down.
For example, write rules like this:
				765	<pre>
				766	list ::= list element. // left-recursion. Good!
				767	list ::= .
				768	</pre>
				769	Not like this:
				770	<pre>
				771	list ::= element list. // right-recursion. Bad!
				772	list ::= .
				773	</pre>
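
<p>The difference between the two forms can be made concrete by
counting stack entries. The sketch below is a simplified model of a
shift-reduce stack, not code taken from Lemon: the function names are
hypothetical, and it counts only grammar symbols, ignoring any
bookkeeping entries a real parser keeps.</p>

```c
/* Peak number of symbols on the stack while parsing n elements with
**   list ::= list element.   list ::= .
** (hypothetical helper for illustration) */
int peak_depth_left(int n){
  int depth = 1;             /* reduce "list ::= ." pushes one list */
  int peak = 1;
  int i;
  for(i=0; i<n; i++){
    depth++;                 /* shift the next element */
    if( depth>peak ) peak = depth;
    depth -= 2;              /* pop "list element" ... */
    depth++;                 /* ... and push the reduced list */
  }
  return peak;
}

/* Same measurement for
**   list ::= element list.   list ::= .
** Nothing can be reduced until the last element has been shifted. */
int peak_depth_right(int n){
  int depth = 0;
  int peak = 0;
  int i;
  for(i=0; i<n; i++){
    depth++;                 /* shift every element first */
    if( depth>peak ) peak = depth;
  }
  depth++;                   /* reduce "list ::= ." pushes the empty list */
  if( depth>peak ) peak = depth;
  for(i=0; i<n; i++){
    depth -= 2;              /* pop "element list" ... */
    depth++;                 /* ... and push the reduced list */
  }
  return peak;
}
```

<p>In this model the left-recursive form never holds more than two
symbols regardless of input length, while the right-recursive form
holds one more symbol than there are elements, so a long enough list
exceeds any fixed stack size.</p>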
				774
				775	<h4>The <tt>%stack_size</tt> directive</h4>
				776
<p>If stack overflow is a problem and you can't resolve the trouble
by using left-recursion, then you might want to increase the size
of the parser's stack using this directive. Put a positive integer
after the %stack_size directive and Lemon will generate a parser
with a stack of the requested size. The default value is 100.</p>
				782
				783	<p><pre>
				784	%stack_size 2000
				785	</pre></p>
				786
				787	<h4>The <tt>%start_symbol</tt> directive</h4>
				788
				789	<p>By default, the start-symbol for the grammar that Lemon generates
				790	is the first non-terminal that appears in the grammar file. But you
				791	can choose a different start-symbol using the %start_symbol directive.</p>
				792
				793	<p><pre>
				794	%start_symbol prog
				795	</pre></p>
				796
				797	<h4>The <tt>%token_destructor</tt> directive</h4>
				798
<p>The %destructor directive assigns a destructor to a non-terminal
symbol. (See the description of the %destructor directive above.)
The %token_destructor directive does the same thing for all terminal symbols.</p>
				802
				803	<p>Unlike non-terminal symbols which may each have a different data type
				804	for their values, terminals all use the same data type (defined by
				805	the %token_type directive) and so they use a common destructor. Other
				806	than that, the token destructor works just like the non-terminal
				807	destructors.</p>
				808
				809	<h4>The <tt>%token_prefix</tt> directive</h4>
				810
				811	<p>Lemon generates #defines that assign small integer constants
				812	to each terminal symbol in the grammar. If desired, Lemon will
				813	add a prefix specified by this directive
				814	to each of the #defines it generates.
				815	So if the default output of Lemon looked like this:
				816	<pre>
				817	#define AND 1
				818	#define MINUS 2
				819	#define OR 3
				820	#define PLUS 4
				821	</pre>
				822	You can insert a statement into the grammar like this:
				823	<pre>
				824	%token_prefix TOKEN_
				825	</pre>
				826	to cause Lemon to produce these symbols instead:
				827	<pre>
				828	#define TOKEN_AND 1
				829	#define TOKEN_MINUS 2
				830	#define TOKEN_OR 3
				831	#define TOKEN_PLUS 4
				832	</pre>
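
<p>A tokenizer then refers to the prefixed names. The following is a
minimal sketch: the four #defines are copied from the listing above,
while the helper <tt>token_code()</tt>, the operator spellings it
accepts, and its use of -1 for unrecognized text are assumptions made
for illustration.</p>

```c
#include <string.h>

/* Token codes as Lemon would emit them with "%token_prefix TOKEN_". */
#define TOKEN_AND    1
#define TOKEN_MINUS  2
#define TOKEN_OR     3
#define TOKEN_PLUS   4

/* Hypothetical helper: map operator text to its token code.
** Returns -1 when the text is not one of the four operators. */
int token_code(const char *z){
  if( strcmp(z, "and")==0 ) return TOKEN_AND;
  if( strcmp(z, "-")==0 )   return TOKEN_MINUS;
  if( strcmp(z, "or")==0 )  return TOKEN_OR;
  if( strcmp(z, "+")==0 )   return TOKEN_PLUS;
  return -1;
}
```

<p>Short unprefixed names such as AND or MINUS are more likely to
collide with macros or identifiers from other headers, which is the
usual reason for choosing a prefix.</p>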
				833
				834	<h4>The <tt>%token_type</tt> and <tt>%type</tt> directives</h4>
				835
<p>These directives are used to specify the data types for values
on the parser's stack associated with terminal and non-terminal
symbols. The values of all terminal symbols must be of the same
type. This turns out to be the same data type as the 3rd parameter
to the Parse() function generated by Lemon. Typically, you will
make the value of a terminal symbol be a pointer to some kind of
token structure. Like this:</p>
				843
				844	<p><pre>
				845	%token_type {Token*}
				846	</pre></p>
				847
<p>If the data type of terminals is not specified, the default type
is ``int''.</p>
				850
				851	<p>Non-terminal symbols can each have their own data types. Typically
				852	the data type of a non-terminal is a pointer to the root of a parse-tree
				853	structure that contains all information about that non-terminal.
				854	For example:</p>
				855
				856	<p><pre>
				857	%type expr {Expr*}
				858	</pre></p>
				859
				860	<p>Each entry on the parser's stack is actually a union containing
				861	instances of all data types for every non-terminal and terminal symbol.
				862	Lemon will automatically use the correct element of this union depending
				863	on what the corresponding non-terminal or terminal symbol is. But
				864	the grammar designer should keep in mind that the size of the union
				865	will be the size of its largest element. So if you have a single
				866	non-terminal whose data type requires 1K of storage, then your 100
				867	entry parser stack will require 100K of heap space. If you are willing
				868	and able to pay that price, fine. You just need to know.</p>
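
<p>The cost can be estimated with <tt>sizeof</tt>. This is a rough
sketch under stated assumptions: the union <tt>StackValue</tt>, the 1K
type <tt>BigValue</tt> and the helper <tt>stack_bytes()</tt> are
hypothetical stand-ins, and a real parser stack entry holds more than
just the value union.</p>

```c
#include <stddef.h>

typedef struct Token Token;                   /* as in %token_type {Token*} */
typedef struct Expr Expr;                     /* as in %type expr {Expr*} */
typedef struct { char buf[1024]; } BigValue;  /* hypothetical 1K value type */

/* Stand-in for the union of all symbol value types. */
typedef union {
  Token *tok;
  Expr *expr;
  BigValue big;   /* the largest member sets the size of the whole union */
} StackValue;

/* Bytes needed for the values alone in a stack of nEntry entries. */
size_t stack_bytes(size_t nEntry){
  return nEntry * sizeof(StackValue);
}
```

<p>With the default 100-entry stack, <tt>stack_bytes(100)</tt> is at
least 100K here even if only one rarely used non-terminal needs the
large type. Storing a pointer to the large structure instead keeps
every entry pointer-sized.</p>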
				869
				870	<h3>Error Processing</h3>
				871
				872	<p>After extensive experimentation over several years, it has been
				873	discovered that the error recovery strategy used by yacc is about
				874	as good as it gets. And so that is what Lemon uses.</p>
				875
<p>When a Lemon-generated parser encounters a syntax error, it
first invokes the code specified by the %syntax_error directive, if
any. It then enters its error recovery strategy. The error recovery
strategy is to begin popping the parser's stack until it enters a
state where it is permitted to shift a special non-terminal symbol
named ``error''. It then shifts this non-terminal and continues
parsing. But the %syntax_error routine will not be called again
until at least three new tokens have been successfully shifted.</p>
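
<p>The three-token rule can be modelled with a small counter. The
sketch below illustrates the behaviour described above and is not
Lemon's actual implementation: the <tt>Recovery</tt> structure and its
functions are hypothetical, and restarting the count when an error is
suppressed is an assumption of this model.</p>

```c
/* Hypothetical model of error-report suppression during recovery. */
typedef struct {
  int quiet;      /* shifts still needed before errors are reported again */
  int nReported;  /* how many times the error handler has run */
} Recovery;

void recovery_init(Recovery *p){
  p->quiet = 0;
  p->nReported = 0;
}

/* Call after each token is successfully shifted. */
void on_shift(Recovery *p){
  if( p->quiet>0 ) p->quiet--;
}

/* Call when a syntax error is detected. */
void on_syntax_error(Recovery *p){
  if( p->quiet==0 ){
    p->nReported++;   /* stands in for running the %syntax_error code */
  }
  p->quiet = 3;       /* assumption: every error restarts the count */
}
```

<p>In this model, a second error that arrives after only one or two
shifted tokens is recovered from silently, which keeps a single
mistake in the input from producing a cascade of messages.</p>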
				884
<p>If the parser pops its stack until the stack is empty, and it still
is unable to shift the error symbol, then the %parse_failure routine
is invoked and the parser resets itself to its start state, ready
to begin parsing a new file. This is what will happen at the very
first syntax error, of course, if there are no instances of the
``error'' non-terminal in your grammar.</p>
				891
				892	</body>
				893	</html>