Blame - doc/lemon.html - chromium.googlesource.com/chromium/deps/sqlite

drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1	<html>
				2	<head>
				3	<title>The Lemon Parser Generator</title>
				4	</head>
				5	<body bgcolor=white>
				6	<h1 align=center>The Lemon Parser Generator</h1>
				7
				8	<p>Lemon is an LALR(1) parser generator for C or C++.
				9	It does the same job as ``bison'' and ``yacc''.
				10	But lemon is not another bison or yacc clone. It
				11	uses a different grammar syntax which is designed to
				12	reduce the number of coding errors. Lemon also uses a more
				13	sophisticated parsing engine that is faster than yacc and
				14	bison and which is both reentrant and thread-safe.
				15	Furthermore, Lemon implements features that can be used
				16	to eliminate resource leaks, making it suitable for use
				17	in long-running programs such as graphical user interfaces
				18	or embedded controllers.</p>
				19
				20	<p>This document is an introduction to the Lemon
				21	parser generator.</p>
				22
				23	<h2>Theory of Operation</h2>
				24
				25	<p>The main goal of Lemon is to translate a context free grammar (CFG)
				26	for a particular language into C code that implements a parser for
				27	that language.
				28	The program has two inputs:
				29	<ul>
				30	<li>The grammar specification.
				31	<li>A parser template file.
				32	</ul>
				33	Typically, only the grammar specification is supplied by the programmer.
				34	Lemon comes with a default parser template which works fine for most
				35	applications. But the user is free to substitute a different parser
				36	template if desired.</p>
				37
				38	<p>Depending on command-line options, Lemon will generate between
				39	one and three output files.
				40	<ul>
				41	<li>C code to implement the parser.
				42	<li>A header file defining an integer ID for each terminal symbol.
				43	<li>An information file that describes the states of the generated parser
				44	automaton.
				45	</ul>
				46	By default, all three of these output files are generated.
				47	The header file is suppressed if the ``-m'' command-line option is
				48	used and the report file is omitted when ``-q'' is selected.</p>
				49
				50	<p>The grammar specification file uses a ``.y'' suffix, by convention.
				51	In the examples used in this document, we'll assume the name of the
				52	grammar file is ``gram.y''. A typical use of Lemon would be the
				53	following command:
				54	<pre>
				55	lemon gram.y
				56	</pre>
				57	This command will generate three output files named ``gram.c'',
				58	``gram.h'' and ``gram.out''.
				59	The first is C code to implement the parser. The second
				60	is the header file that defines numerical values for all
				61	terminal symbols, and the last is the report that explains
				62	the states used by the parser automaton.</p>
				63
				64	<h3>Command Line Options</h3>
				65
				66	<p>The behavior of Lemon can be modified using command-line options.
				67	You can obtain a list of the available command-line options together
				68	with a brief explanation of what each does by typing
				69	<pre>
				70	lemon -?
				71	</pre>
				72	As of this writing, the following command-line options are supported:
				73	<ul>
				74	<li><tt>-b</tt>
				75	<li><tt>-c</tt>
				76	<li><tt>-g</tt>
				77	<li><tt>-m</tt>
				78	<li><tt>-q</tt>
				79	<li><tt>-s</tt>
				80	<li><tt>-x</tt>
				81	</ul>
				82	The ``-b'' option reduces the amount of text in the report file by
				83	printing only the basis of each parser state, rather than the full
				84	configuration.
				85	The ``-c'' option suppresses action table compression. Using -c
				86	will make the parser a little larger and slower but it will detect
				87	syntax errors sooner.
				88	The ``-g'' option causes no output files to be generated at all.
				89	Instead, the input grammar file is printed on standard output but
				90	with all comments, actions and other extraneous text deleted. This
				91	is a useful way to get a quick summary of a grammar.
				92	The ``-m'' option causes the output C source file to be compatible
				93	with the ``makeheaders'' program.
				94	Makeheaders is a program that automatically generates header files
				95	from C source code. When the ``-m'' option is used, the header
				96	file is not output since the makeheaders program will take care
				97	of generating all header files automatically.
				98	The ``-q'' option suppresses the report file.
				99	Using ``-s'' causes a brief summary of parser statistics to be
				100	printed, like this:
				101	<pre>
				102	Parser statistics: 74 terminals, 70 nonterminals, 179 rules
				103	340 states, 2026 parser table entries, 0 conflicts
				104	</pre>
				105	Finally, the ``-x'' option causes Lemon to print its version number
drh	b19a2bc	2001-09-16 00:13:26 +0000	[diff] [blame]	106	and then stop without attempting to read the grammar or generate a parser.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	107
				108	<h3>The Parser Interface</h3>
				109
				110	<p>Lemon doesn't generate a complete, working program. It only generates
				111	a few subroutines that implement a parser. This section describes
				112	the interface to those subroutines. It is up to the programmer to
				113	call these subroutines in an appropriate way in order to produce a
				114	complete system.</p>
				115
				116	<p>Before a program begins using a Lemon-generated parser, the program
				117	must first create the parser.
				118	A new parser is created as follows:
				119	<pre>
				120	void *pParser = ParseAlloc( malloc );
				121	</pre>
				122	The ParseAlloc() routine allocates and initializes a new parser and
				123	returns a pointer to it.
				124	The actual data structure used to represent a parser is opaque --
				125	its internal structure is not visible or usable by the calling routine.
				126	For this reason, the ParseAlloc() routine returns a pointer to void
				127	rather than a pointer to some particular structure.
				128	The sole argument to the ParseAlloc() routine is a pointer to the
				129	subroutine used to allocate memory. Typically this means ``malloc()''.</p>
				130
				131	<p>After a program is finished using a parser, it can reclaim all
				132	memory allocated by that parser by calling
				133	<pre>
				134	ParseFree(pParser, free);
				135	</pre>
				136	The first argument is the same pointer returned by ParseAlloc(). The
				137	second argument is a pointer to the function used to release bulk
				138	memory back to the system.</p>
				139
				140	<p>After a parser has been allocated using ParseAlloc(), the programmer
				141	must supply the parser with a sequence of tokens (terminal symbols) to
				142	be parsed. This is accomplished by calling the following function
				143	once for each token:
				144	<pre>
				145	Parse(pParser, hTokenID, sTokenData, pArg);
				146	</pre>
				147	The first argument to the Parse() routine is the pointer returned by
				148	ParseAlloc().
				149	The second argument is a small positive integer that tells the parser the
				150	type of the next token in the data stream.
				151	There is one token type for each terminal symbol in the grammar.
				152	The gram.h file generated by Lemon contains #define statements that
				153	map symbolic terminal symbol names into appropriate integer values.
				154	(A value of 0 for the second argument is a special flag to the
				155	parser to indicate that the end of input has been reached.)
				156	The third argument is the value of the given token. By default,
				157	the type of the third argument is integer, but the grammar will
				158	usually redefine this type to be some kind of structure.
				159	Typically the second argument will be a broad category of tokens
				160	such as ``identifier'' or ``number'' and the third argument will
				161	be the name of the identifier or the value of the number.</p>
				162
				163	<p>The Parse() function may have either three or four arguments,
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	164	depending on the grammar. If the grammar specification file requests
				165	it (via the <a href='#extraarg'><tt>extra_argument</tt> directive</a>),
				166	the Parse() function will have a fourth parameter that can be
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	167	of any type chosen by the programmer. The parser doesn't do anything
				168	with this argument except to pass it through to action routines.
				169	This is a convenient mechanism for passing state information down
				170	to the action routines without having to use global variables.</p>
				171
				172	<p>A typical use of a Lemon parser might look something like the
				173	following:
				174	<pre>
				175	01 ParseTree *ParseFile(const char *zFilename){
				176	02    Tokenizer *pTokenizer;
				177	03    void *pParser;
				178	04    Token sToken;
				179	05    int hTokenId;
				180	06    ParserState sState;
				181	07
				182	08    pTokenizer = TokenizerCreate(zFilename);
				183	09    pParser = ParseAlloc( malloc );
				184	10    InitParserState(&sState);
				185	11    while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){
				186	12       Parse(pParser, hTokenId, sToken, &sState);
				187	13    }
				188	14    Parse(pParser, 0, sToken, &sState);
				189	15    ParseFree(pParser, free );
				190	16    TokenizerFree(pTokenizer);
				191	17    return sState.treeRoot;
				192	18 }
				193	</pre>
				194	This example shows a user-written routine that parses a file of
				195	text and returns a pointer to the parse tree.
				196	(We've omitted all error-handling from this example to keep it
				197	simple.)
				198	We assume the existence of some kind of tokenizer which is created
				199	using TokenizerCreate() on line 8 and deleted by TokenizerFree()
				200	on line 16. The GetNextToken() function on line 11 retrieves the
				201	next token from the input file and puts its type in the
				202	integer variable hTokenId. The sToken variable is assumed to be
				203	some kind of structure that contains details about each token,
				204	such as its complete text, what line it occurs on, etc. </p>
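<p>The tokenizer itself is outside the scope of Lemon, but a minimal sketch may
help make the example concrete. The following is one hypothetical shape for
<tt>GetNextToken()</tt>, assuming a small expression grammar with the terminals
PLUS, TIMES, LPAREN, RPAREN and VALUE. The token codes, the <tt>Token</tt> and
<tt>Tokenizer</tt> structures, and the use of an in-memory string as input are
all assumptions made for illustration; in a real program the codes would come
from the header file that Lemon generates.</p>

```c
#include <ctype.h>
#include <stdlib.h>

/* Token codes. In a real build these come from the generated header
** (gram.h); the values below are placeholders chosen for this sketch. */
#define PLUS   1
#define TIMES  2
#define LPAREN 3
#define RPAREN 4
#define VALUE  5

/* Hypothetical token value: only VALUE tokens carry a number. */
typedef struct Token { long iValue; } Token;

/* Hypothetical tokenizer state: a cursor into a NUL-terminated string. */
typedef struct Tokenizer { const char *z; } Tokenizer;

/* Fetch the next token. Returns 1 and fills *pId and *pTok on success.
** Returns 0 at end of input or on an unrecognized character. */
int GetNextToken(Tokenizer *p, int *pId, Token *pTok){
  while( isspace((unsigned char)*p->z) ) p->z++;
  pTok->iValue = 0;
  switch( *p->z ){
    case 0:   return 0;
    case '+': *pId = PLUS;   p->z++; return 1;
    case '*': *pId = TIMES;  p->z++; return 1;
    case '(': *pId = LPAREN; p->z++; return 1;
    case ')': *pId = RPAREN; p->z++; return 1;
    default:
      if( isdigit((unsigned char)*p->z) ){
        char *zEnd;
        pTok->iValue = strtol(p->z, &zEnd, 10);  /* consume the digits */
        p->z = zEnd;
        *pId = VALUE;
        return 1;
      }
      return 0;
  }
}
```

<p>A function of this shape plugs directly into the <tt>while</tt> loop of the
example: each successful call yields one token type and one token value to hand
to <tt>Parse()</tt>.</p>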
				205
				206	<p>This example also assumes the existence of a structure of type
				207	ParserState that holds state information about a particular parse.
				208	An instance of such a structure is created on line 6 and initialized
				209	on line 10. A pointer to this structure is passed into the Parse()
				210	routine as the optional 4th argument.
				211	The action routine specified by the grammar for the parser can use
				212	the ParserState structure to hold whatever information is useful and
				213	appropriate. In the example, we note that the treeRoot field of
				214	the ParserState structure is left pointing to the root of the parse
				215	tree.</p>
				216
				217	<p>The core of this example as it relates to Lemon is as follows:
				218	<pre>
				219	ParseFile(){
				220	   pParser = ParseAlloc( malloc );
				221	   while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){
				222	      Parse(pParser, hTokenId, sToken);
				223	   }
				224	   Parse(pParser, 0, sToken);
				225	   ParseFree(pParser, free );
				226	}
				227	</pre>
				228	Basically, what a program has to do to use a Lemon-generated parser
				229	is first create the parser, then send it lots of tokens obtained by
				230	tokenizing an input source. When the end of input is reached, the
				231	Parse() routine should be called one last time with a token type
				232	of 0. This step is necessary to inform the parser that the end of
				233	input has been reached. Finally, we reclaim memory used by the
				234	parser by calling ParseFree().</p>
				235
				236	<p>There is one other interface routine that should be mentioned
				237	before we move on.
				238	The ParseTrace() function can be used to generate debugging output
				239	from the parser. A prototype for this routine is as follows:
				240	<pre>
				241	ParseTrace(FILE *stream, char *zPrefix);
				242	</pre>
				243	After this routine is called, a short (one-line) message is written
				244	to the designated output stream every time the parser changes states
				245	or calls an action routine. Each such message is prefaced using
				246	the text given by zPrefix. This debugging output can be turned off
				247	by calling ParseTrace() again with a first argument of NULL (0).</p>
				248
				249	<h3>Differences With YACC and BISON</h3>
				250
				251	<p>Programmers who have previously used the yacc or bison parser
				252	generator will notice several important differences between yacc and/or
				253	bison and Lemon.
				254	<ul>
				255	<li>In yacc and bison, the parser calls the tokenizer. In Lemon,
				256	the tokenizer calls the parser.
				257	<li>Lemon uses no global variables. Yacc and bison use global variables
				258	to pass information between the tokenizer and parser.
				259	<li>Lemon allows multiple parsers to be running simultaneously. Yacc
				260	and bison do not.
				261	</ul>
				262	These differences may cause some initial confusion for programmers
				263	with prior yacc and bison experience.
				264	But after years of experience using Lemon, I firmly
				265	believe that the Lemon way of doing things is better.</p>
				266
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	267	<p><i>Updated as of 2016-02-16:</i>
				268	The text above was written in the 1990s.
				269	We are told that Bison has lately been enhanced to support the
				270	tokenizer-calls-parser paradigm used by Lemon, and to obviate the
				271	need for global variables.</p>
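<p>To make the "tokenizer calls the parser" idea concrete, here is a small
hand-written sketch of a push-style recognizer for balanced parentheses. It is
not code generated by Lemon and it is far simpler than an LALR(1) engine; the
names <tt>MiniParser</tt>, <tt>MiniAlloc()</tt>, <tt>MiniParse()</tt> and
<tt>MiniFree()</tt> and the token codes are hypothetical. It only mirrors the
calling convention described above: all state lives in a caller-owned object,
tokens are pushed in one at a time, and token 0 marks the end of input.</p>

```c
#include <stdlib.h>

/* Hypothetical token codes; 0 marks end of input, as with Parse(). */
#define TK_EOF    0
#define TK_LPAREN 1
#define TK_RPAREN 2

/* All parser state lives in this structure; there are no globals. */
typedef struct MiniParser {
  int nOpen;     /* Number of currently unmatched TK_LPAREN tokens */
  int isError;   /* True once a syntax error has been seen */
  int isDone;    /* True once TK_EOF has been accepted */
} MiniParser;

/* Create a parser using the caller-supplied allocator. */
MiniParser *MiniAlloc(void *(*xMalloc)(size_t)){
  MiniParser *p = (MiniParser*)xMalloc(sizeof(*p));
  if( p ){ p->nOpen = 0; p->isError = 0; p->isDone = 0; }
  return p;
}

/* Destroy a parser using the caller-supplied deallocator. */
void MiniFree(MiniParser *p, void (*xFree)(void*)){
  if( p ) xFree(p);
}

/* Push one token into the parser. The caller drives the loop. */
void MiniParse(MiniParser *p, int tokenId){
  if( p->isError || p->isDone ) return;
  switch( tokenId ){
    case TK_LPAREN:
      p->nOpen++;
      break;
    case TK_RPAREN:
      if( p->nOpen==0 ) p->isError = 1; else p->nOpen--;
      break;
    case TK_EOF:
      if( p->nOpen!=0 ) p->isError = 1; else p->isDone = 1;
      break;
    default:
      p->isError = 1;
      break;
  }
}
```

<p>Because each call touches only the object passed to it, two such parsers can
be fed alternately from the same thread, or independently from different
threads, without interfering with one another.</p>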
				272
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	273	<h2>Input File Syntax</h2>
				274
				275	<p>The main purpose of the grammar specification file for Lemon is
				276	to define the grammar for the parser. But the input file also
				277	specifies additional information Lemon requires to do its job.
				278	Most of the work in using Lemon is in writing an appropriate
				279	grammar file.</p>
				280
				281	<p>The grammar file for lemon is, for the most part, free format.
				282	It does not have sections or divisions like yacc or bison. Any
				283	declaration can occur at any point in the file.
				284	Lemon ignores whitespace (except where it is needed to separate
				285	tokens) and it honors the same commenting conventions as C and C++.</p>
				286
				287	<h3>Terminals and Nonterminals</h3>
				288
				289	<p>A terminal symbol (token) is any string of alphanumeric
				290	and underscore characters
				291	that begins with an upper case letter.
drh	c8eee5e	2011-07-30 23:50:12 +0000	[diff] [blame]	292	A terminal can contain lowercase letters after the first character,
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	293	but the usual convention is to make terminals all upper case.
				294	A nonterminal, on the other hand, is any string of alphanumeric
				295	and underscore characters that begins with a lower case letter.
				296	Again, the usual convention is to make nonterminals use all lower
				297	case letters.</p>
				298
				299	<p>In Lemon, terminal and nonterminal symbols do not need to
				300	be declared or identified in a separate section of the grammar file.
				301	Lemon is able to generate a list of all terminals and nonterminals
				302	by examining the grammar rules, and it can always distinguish a
				303	terminal from a nonterminal by checking the case of the first
				304	character of the name.</p>
				305
				306	<p>Yacc and bison allow terminal symbols to have either alphanumeric
				307	names or to be individual characters included in single quotes, like
				308	this: ')' or '$'. Lemon does not allow this alternative form for
				309	terminal symbols. With Lemon, all symbols, terminals and nonterminals,
				310	must have alphanumeric names.</p>
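<p>The naming rule is simple enough to express in a few lines of C. The sketch
below is not taken from the Lemon sources; <tt>classify_symbol()</tt> and the
<tt>SymKind</tt> enumeration are hypothetical names used only to restate the
rule from the preceding paragraphs.</p>

```c
#include <ctype.h>

/* Kind of grammar symbol implied by its name. */
typedef enum { SYM_TERMINAL, SYM_NONTERMINAL, SYM_INVALID } SymKind;

/* Classify a symbol name by the case of its first character.
** Names must start with a letter and may contain only alphanumeric
** and underscore characters; anything else is reported as invalid. */
SymKind classify_symbol(const char *zName){
  const unsigned char *z = (const unsigned char*)zName;
  int i;
  if( z==0 || !isalpha(z[0]) ) return SYM_INVALID;
  for(i=1; z[i]; i++){
    if( !isalnum(z[i]) && z[i]!='_' ) return SYM_INVALID;
  }
  return isupper(z[0]) ? SYM_TERMINAL : SYM_NONTERMINAL;
}
```

<p>Under this rule <tt>PLUS</tt> and <tt>Plus_1</tt> are terminals,
<tt>expr</tt> is a nonterminal, and a quoted character such as <tt>')'</tt> is
not a legal symbol name at all.</p>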
				311
				312	<h3>Grammar Rules</h3>
				313
				314	<p>The main component of a Lemon grammar file is a sequence of grammar
				315	rules.
				316	Each grammar rule consists of a nonterminal symbol followed by
				317	the special symbol ``::='' and then a list of terminals and/or nonterminals.
				318	The rule is terminated by a period.
				319	The list of terminals and nonterminals on the right-hand side of the
				320	rule can be empty.
				321	Rules can occur in any order, except that the left-hand side of the
				322	first rule is assumed to be the start symbol for the grammar (unless
				323	specified otherwise using the <tt>%start_symbol</tt> directive described below).
				324	A typical sequence of grammar rules might look something like this:
				325	<pre>
				326	expr ::= expr PLUS expr.
				327	expr ::= expr TIMES expr.
				328	expr ::= LPAREN expr RPAREN.
				329	expr ::= VALUE.
				330	</pre>
				331	</p>
				332
				333	<p>There is one non-terminal in this example, ``expr'', and five
				334	terminal symbols or tokens: ``PLUS'', ``TIMES'', ``LPAREN'',
				335	``RPAREN'' and ``VALUE''.</p>
				336
				337	<p>Like yacc and bison, Lemon allows the grammar to specify a block
				338	of C code that will be executed whenever a grammar rule is reduced
				339	by the parser.
				340	In Lemon, this action is specified by putting the C code (contained
				341	within curly braces <tt>{...}</tt>) immediately after the
				342	period that closes the rule.
				343	For example:
				344	<pre>
				345	expr ::= expr PLUS expr. { printf("Doing an addition...\n"); }
				346	</pre>
				347	</p>
				348
				349	<p>In order to be useful, grammar actions must normally be linked to
				350	their associated grammar rules.
				351	In yacc and bison, this is accomplished by embedding a ``$$'' in the
				352	action to stand for the value of the left-hand side of the rule and
				353	symbols ``$1'', ``$2'', and so forth to stand for the value of
				354	the terminal or nonterminal at position 1, 2 and so forth on the
				355	right-hand side of the rule.
				356	This idea is very powerful, but it is also very error-prone. The
				357	single most common source of errors in a yacc or bison grammar is
				358	to miscount the number of symbols on the right-hand side of a grammar
				359	rule and say ``$7'' when you really mean ``$8''.</p>
				360
				361	<p>Lemon avoids the need to count grammar symbols by assigning symbolic
				362	names to each symbol in a grammar rule and then using those symbolic
				363	names in the action.
				364	In yacc or bison, one would write this:
				365	<pre>
				366	expr : expr PLUS expr { $$ = $1 + $3; };
				367	</pre>
				368	But in Lemon, the same rule becomes the following:
				369	<pre>
				370	expr(A) ::= expr(B) PLUS expr(C). { A = B+C; }
				371	</pre>
				372	In the Lemon rule, any symbol in parentheses after a grammar rule
				373	symbol becomes a place holder for that symbol in the grammar rule.
				374	This place holder can then be used in the associated C action to
				375	stand for the value of that symbol.</p>
				376
				377	<p>The Lemon notation for linking a grammar rule with its reduce
				378	action is superior to yacc/bison on several counts.
				379	First, as mentioned above, the Lemon method avoids the need to
				380	count grammar symbols.
				381	Secondly, if a terminal or nonterminal in a Lemon grammar rule
				382	includes a linking symbol in parentheses but that linking symbol
				383	is not actually used in the reduce action, then an error message
				384	is generated.
				385	For example, the rule
				386	<pre>
				387	expr(A) ::= expr(B) PLUS expr(C). { A = B; }
				388	</pre>
				389	will generate an error because the linking symbol ``C'' is used
				390	in the grammar rule but not in the reduce action.</p>
				391
				392	<p>The Lemon notation for linking grammar rules to reduce actions
				393	also facilitates the use of destructors for reclaiming memory
				394	allocated by the values of terminals and nonterminals on the
				395	right-hand side of a rule.</p>
				396
				397	<h3>Precedence Rules</h3>
				398
				399	<p>Lemon resolves parsing ambiguities in exactly the same way as
				400	yacc and bison. A shift-reduce conflict is resolved in favor
				401	of the shift, and a reduce-reduce conflict is resolved by reducing
				402	whichever rule comes first in the grammar file.</p>
				403
				404	<p>Just like in
				405	yacc and bison, Lemon allows a measure of control
				406	over the resolution of parsing conflicts using precedence rules.
				407	A precedence value can be assigned to any terminal symbol
				408	using the %left, %right or %nonassoc directives. Terminal symbols
				409	mentioned in earlier directives have a lower precedence than
				410	terminal symbols mentioned in later directives. For example:</p>
				411
				412	<p><pre>
				413	%left AND.
				414	%left OR.
				415	%nonassoc EQ NE GT GE LT LE.
				416	%left PLUS MINUS.
				417	%left TIMES DIVIDE MOD.
				418	%right EXP NOT.
				419	</pre></p>
				420
				421	<p>In the preceding sequence of directives, the AND operator is
				422	defined to have the lowest precedence. The OR operator is one
				423	precedence level higher. And so forth. Hence, the grammar would
				424	attempt to group the ambiguous expression
				425	<pre>
				426	a AND b OR c
				427	</pre>
				428	like this
				429	<pre>
				430	a AND (b OR c).
				431	</pre>
				432	The associativity (left, right or nonassoc) is used to determine
				433	the grouping when the precedence is the same. AND is left-associative
				434	in our example, so
				435	<pre>
				436	a AND b AND c
				437	</pre>
				438	is parsed like this
				439	<pre>
				440	(a AND b) AND c.
				441	</pre>
				442	The EXP operator is right-associative, though, so
				443	<pre>
				444	a EXP b EXP c
				445	</pre>
				446	is parsed like this
				447	<pre>
				448	a EXP (b EXP c).
				449	</pre>
				450	The nonassoc precedence is used for non-associative operators.
				451	So
				452	<pre>
				453	a EQ b EQ c
				454	</pre>
				455	is an error.</p>
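<p>The effect of precedence and associativity on grouping can be checked with
ordinary arithmetic. The sketch below evaluates expressions using precedence
climbing, which is a different technique from the LALR(1) tables that Lemon
builds; it is used here only because it makes the grouping rules easy to see.
The single characters <tt>+ - * / ^</tt> stand in for PLUS, MINUS, TIMES,
DIVIDE and EXP, and all function names are hypothetical.</p>

```c
#include <ctype.h>

/* Precedence levels mirroring part of the directive list above:
** PLUS/MINUS lowest, then TIMES/DIVIDE, then EXP. 0 = not an operator. */
static int op_prec(char c){
  switch( c ){
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    case '^':           return 3;
    default:            return 0;
  }
}

/* Only EXP ('^') is right-associative in this sketch. */
static int op_is_right(char c){ return c=='^'; }

/* Apply a binary operator. '^' is integer exponentiation. */
static long apply(char op, long a, long b){
  long r = 1;
  switch( op ){
    case '+': return a+b;
    case '-': return a-b;
    case '*': return a*b;
    case '/': return b ? a/b : 0;
    default:  while( b-- > 0 ) r *= a; return r;
  }
}

/* Read an unsigned decimal integer, skipping leading spaces. */
static long parse_primary(const char **pz){
  long v = 0;
  while( isspace((unsigned char)**pz) ) (*pz)++;
  while( isdigit((unsigned char)**pz) ){ v = v*10 + (**pz - '0'); (*pz)++; }
  return v;
}

/* Precedence climbing: only consume operators of at least minPrec. */
static long parse_expr(const char **pz, int minPrec){
  long lhs = parse_primary(pz);
  for(;;){
    char op;
    int prec;
    long rhs;
    while( isspace((unsigned char)**pz) ) (*pz)++;
    op = **pz;
    prec = op_prec(op);
    if( prec==0 || prec<minPrec ) return lhs;
    (*pz)++;
    /* Left-associative: the right operand may hold only tighter operators.
    ** Right-associative: it may also hold the same operator again. */
    rhs = parse_expr(pz, op_is_right(op) ? prec : prec+1);
    lhs = apply(op, lhs, rhs);
  }
}

/* Evaluate a whole expression string. */
long eval(const char *z){ return parse_expr(&z, 1); }
```

<p>With these rules <tt>eval("10 - 4 - 3")</tt> groups as (10 - 4) - 3 and
yields 3, while <tt>eval("2 ^ 3 ^ 2")</tt> groups as 2 ^ (3 ^ 2) and yields
512 rather than 64.</p>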
				456
				457	<p>The precedence of terminals is transferred to rules as follows:
				458	The precedence of a grammar rule is equal to the precedence of the
				459	left-most terminal symbol in the rule for which a precedence is
				460	defined. This is normally what you want, but in those cases where
				461	you want the precedence of a grammar rule to be something different,
				462	you can specify an alternative precedence symbol by putting the
				463	symbol in square brackets after the period at the end of the rule and
				464	before any C-code. For example:</p>
				465
				466	<p><pre>
				467	expr ::= MINUS expr. [NOT]
				468	</pre></p>
				469
				470	<p>This rule has a precedence equal to that of the NOT symbol, not the
				471	MINUS symbol as would have been the case by default.</p>
				472
				473	<p>With the knowledge of how precedence is assigned to terminal
				474	symbols and individual
				475	grammar rules, we can now explain precisely how parsing conflicts
				476	are resolved in Lemon. Shift-reduce conflicts are resolved
				477	as follows:
				478	<ul>
				479	<li> If either the token to be shifted or the rule to be reduced
				480	lacks precedence information, then resolve in favor of the
				481	shift, but report a parsing conflict.
				482	<li> If the precedence of the token to be shifted is greater than
				483	the precedence of the rule to reduce, then resolve in favor
				484	of the shift. No parsing conflict is reported.
				485	<li> If the precedence of the token to be shifted is less than the
				486	precedence of the rule to reduce, then resolve in favor of the
				487	reduce action. No parsing conflict is reported.
				488	<li> If the precedences are the same and the shift token is
				489	right-associative, then resolve in favor of the shift.
				490	No parsing conflict is reported.
mistachkin	d557843	2012-08-25 10:01:29 +0000	[diff] [blame]	491	<li> If the precedences are the same and the shift token is
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	492	left-associative, then resolve in favor of the reduce.
				493	No parsing conflict is reported.
				494	<li> Otherwise, resolve the conflict by doing the shift and
				495	report the parsing conflict.
				496	</ul>
				497	Reduce-reduce conflicts are resolved this way:
				498	<ul>
				499	<li> If either reduce rule
				500	lacks precedence information, then resolve in favor of the
				501	rule that appears first in the grammar and report a parsing
				502	conflict.
				503	<li> If both rules have precedence and the precedence is different
				504	then resolve the dispute in favor of the rule with the highest
				505	precedence and do not report a conflict.
				506	<li> Otherwise, resolve the conflict by reducing by the rule that
				507	appears first in the grammar and report a parsing conflict.
				508	</ul>
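<p>The two lists above amount to a pair of small decision procedures. The
following sketch transcribes them literally into C. It is not Lemon's internal
code: the type and function names are hypothetical, and the convention that a
precedence of 0 or less means "no precedence information" is an assumption of
this sketch.</p>

```c
/* Associativity of the token to be shifted. ASSOC_NONE is %nonassoc. */
typedef enum { ASSOC_NONE, ASSOC_LEFT, ASSOC_RIGHT } Assoc;

/* The action chosen for a shift-reduce conflict. */
typedef enum { DO_SHIFT, DO_REDUCE } Action;

typedef struct Resolution {
  Action action;        /* Which action wins */
  int reportConflict;   /* True if a parsing conflict is reported */
} Resolution;

/* Resolve a shift-reduce conflict following the first list, top to bottom.
** tokPrec and rulePrec are precedence levels; <=0 means "none". */
Resolution resolve_shift_reduce(int tokPrec, Assoc tokAssoc, int rulePrec){
  Resolution r;
  r.reportConflict = 0;
  if( tokPrec<=0 || rulePrec<=0 ){
    r.action = DO_SHIFT;            /* missing precedence information */
    r.reportConflict = 1;
  }else if( tokPrec>rulePrec ){
    r.action = DO_SHIFT;
  }else if( tokPrec<rulePrec ){
    r.action = DO_REDUCE;
  }else if( tokAssoc==ASSOC_RIGHT ){
    r.action = DO_SHIFT;
  }else if( tokAssoc==ASSOC_LEFT ){
    r.action = DO_REDUCE;
  }else{
    r.action = DO_SHIFT;            /* the "otherwise" case */
    r.reportConflict = 1;
  }
  return r;
}

/* Resolve a reduce-reduce conflict following the second list.
** Rule A appears before rule B in the grammar file. Returns 0 to
** reduce by A or 1 to reduce by B, and sets *pConflict accordingly. */
int resolve_reduce_reduce(int precA, int precB, int *pConflict){
  if( precA>0 && precB>0 && precA!=precB ){
    *pConflict = 0;
    return precB>precA;
  }
  *pConflict = 1;
  return 0;
}
```

<p>For instance, a token and a rule of equal precedence resolve to a reduce
when the token is left-associative and to a shift when it is right-associative,
with no conflict reported in either case.</p>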
				509
				510	<h3>Special Directives</h3>
				511
				512	<p>The input grammar to Lemon consists of grammar rules and special
				513	directives. We've described all the grammar rules, so now we'll
				514	talk about the special directives.</p>
				515
				516	<p>Directives in lemon can occur in any order. You can put them before
				517	the grammar rules, or after the grammar rules, or in the midst of the
				518	grammar rules. It doesn't matter. The relative order of
				519	directives used to assign precedence to terminals is important, but
				520	other than that, the order of directives in Lemon is arbitrary.</p>
				521
				522	<p>Lemon supports the following special directives:
				523	<ul>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	524	<li><tt>%code</tt>
				525	<li><tt>%default_destructor</tt>
				526	<li><tt>%default_type</tt>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	527	<li><tt>%destructor</tt>
				528	<li><tt>%extra_argument</tt>
				529	<li><tt>%include</tt>
				530	<li><tt>%left</tt>
				531	<li><tt>%name</tt>
				532	<li><tt>%nonassoc</tt>
				533	<li><tt>%parse_accept</tt>
				534	<li><tt>%parse_failure </tt>
				535	<li><tt>%right</tt>
				536	<li><tt>%stack_overflow</tt>
				537	<li><tt>%stack_size</tt>
				538	<li><tt>%start_symbol</tt>
				539	<li><tt>%syntax_error</tt>
				540	<li><tt>%token_destructor</tt>
				541	<li><tt>%token_prefix</tt>
				542	<li><tt>%token_type</tt>
				543	<li><tt>%type</tt>
				544	</ul>
				545	Each of these directives will be described separately in the
				546	following sections:</p>
				547
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	548	<h4>The <tt>%code</tt> directive</h4>
				549
				550	<p>The %code directive is used to specify additional C/C++ code that
				551	is added to the end of the main output file. This is similar to
				552	the %include directive except that %include is inserted at the
				553	beginning of the main output file.</p>
				554
				555	<p>%code is typically used to include some action routines or perhaps
				556	a tokenizer as part of the output file.</p>
				557
				558	<h4>The <tt>%default_destructor</tt> directive</h4>
				559
				560	<p>The %default_destructor directive specifies a destructor to
				561	use for non-terminals that do not have their own destructor
				562	specified by a separate %destructor directive. See the documentation
				563	on the %destructor directive below for additional information.</p>
				564
				565	<p>In some grammars, many different non-terminal symbols have the
				566	same datatype and hence the same destructor. This directive is
				567	a convenient way to specify the same destructor for all those
				568	non-terminals using a single statement.</p>
				569
				570	<h4>The <tt>%default_type</tt> directive</h4>
				571
				572	<p>The %default_type directive specifies the datatype of non-terminal
				573	symbols that do not have their own datatype defined using a separate
				574	%type directive. See the documentation on %type below for additional
				575	information.</p>
				576
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	577	<h4>The <tt>%destructor</tt> directive</h4>
				578
				579	<p>The %destructor directive is used to specify a destructor for
				580	a non-terminal symbol.
				581	(See also the %token_destructor directive which is used to
				582	specify a destructor for terminal symbols.)</p>
				583
				584	<p>A non-terminal's destructor is called to dispose of the
				585	non-terminal's value whenever the non-terminal is popped from
				586	the stack. This includes all of the following circumstances:
				587	<ul>
				588	<li> When a rule reduces and the value of a non-terminal on
				589	the right-hand side is not linked to C code.
				590	<li> When the stack is popped during error processing.
				591	<li> When the ParseFree() function runs.
				592	</ul>
				593	The destructor can do whatever it wants with the value of
				594	the non-terminal, but it is designed to deallocate memory
				595	or other resources held by that non-terminal.</p>
				596
				597	<p>Consider an example:
				598	<pre>
				599	%type nt {void*}
				600	%destructor nt { free($$); }
				601	nt(A) ::= ID NUM. { A = malloc( 100 ); }
				602	</pre>
				603	This example is a bit contrived but it serves to illustrate how
				604	destructors work. The example shows a non-terminal named
				605	``nt'' that holds values of type ``void*''. When the rule for
				606	an ``nt'' reduces, it sets the value of the non-terminal to
				607	space obtained from malloc(). Later, when the nt non-terminal
				608	is popped from the stack, the destructor will fire and call
				609	free() on this malloced space, thus avoiding a memory leak.
				610	(Note that the symbol ``$$'' in the destructor code is replaced
				611	by the value of the non-terminal.)</p>
				612
				613	<p>It is important to note that the value of a non-terminal is passed
				614	to the destructor whenever the non-terminal is removed from the
				615	stack, unless the non-terminal is used in a C-code action. If
				616	the non-terminal is used by C-code, then it is assumed that the
				617	C-code will take care of destroying it if it should really
				618	be destroyed. More commonly, the value is used to build some
				619	larger structure and we don't want to destroy it, which is why
				620	the destructor is not called in this circumstance.</p>
				621
				622	<p>By appropriate use of destructors, it is possible to
				623	build a parser using Lemon that can be used within a long-running
				624	program, such as a GUI, that will not leak memory or other resources.
				625	To do the same using yacc or bison is much more difficult.</p>
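<p>The ownership rule described above can be modeled with a small value stack.
The sketch below is a simplified illustration and not the code that Lemon
generates: <tt>ValueStack</tt> and its functions are hypothetical, and a single
destructor callback stands in for the per-symbol destructor code. A value that
is popped without being claimed by an action is destroyed, while a value that
an action takes over is handed back untouched.</p>

```c
#include <stdlib.h>

#define STACK_DEPTH 16

typedef struct ValueStack {
  void *aVal[STACK_DEPTH];    /* Values of the symbols on the stack */
  int n;                      /* Number of entries in use */
  void (*xDestroy)(void*);    /* Plays the role of the %destructor code */
  int nDestroyed;             /* Count of destructor calls, for the demo */
} ValueStack;

/* Prepare an empty stack that uses xDestroy to dispose of values. */
void StackInit(ValueStack *p, void (*xDestroy)(void*)){
  p->n = 0;
  p->xDestroy = xDestroy;
  p->nDestroyed = 0;
}

/* Push a value. Returns 0 if the stack is full, 1 on success. */
int StackPush(ValueStack *p, void *pVal){
  if( p->n>=STACK_DEPTH ) return 0;
  p->aVal[p->n++] = pVal;
  return 1;
}

/* Pop a value that no action has claimed: run the destructor on it.
** This models error recovery and unlinked right-hand side symbols. */
void StackPopAndDestroy(ValueStack *p){
  if( p->n==0 ) return;
  p->n--;
  p->xDestroy(p->aVal[p->n]);
  p->nDestroyed++;
}

/* Pop a value that an action is taking ownership of: no destructor. */
void *StackPopAndKeep(ValueStack *p){
  return p->n>0 ? p->aVal[--p->n] : 0;
}

/* Destroy everything still on the stack, as a final cleanup would. */
void StackClear(ValueStack *p){
  while( p->n>0 ) StackPopAndDestroy(p);
}
```

<p>In this model, pushing three allocated values, keeping one and then clearing
the stack results in exactly two destructor calls; the kept value remains the
responsibility of the code that claimed it.</p>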
				626
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	627	<a name="extraarg"></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	628	<h4>The <tt>%extra_argument</tt> directive</h4>
				629
				630	<p>The %extra_argument directive instructs Lemon to add a 4th parameter
				631	to the parameter list of the Parse() function it generates. Lemon
				632	doesn't do anything itself with this extra argument, but it does
				633	make the argument available to C-code action routines, destructors,
				634	and so forth. For example, if the grammar file contains:</p>
				635
				636	<p><pre>
				637	%extra_argument { MyStruct *pAbc }
				638	</pre></p>
				639
				640	<p>Then the Parse() function generated will have a 4th parameter
				641	of type ``MyStruct*'' and all action routines will have access to
				642	a variable named ``pAbc'' that is the value of the 4th parameter
				643	in the most recent call to Parse().</p>
				644
				645	<h4>The <tt>%include</tt> directive</h4>
				646
				647	<p>The %include directive specifies C code that is included at the
				648	top of the generated parser. You can include any text you want --
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	649	the Lemon parser generator copies it blindly. If you have multiple
				650	%include directives in your grammar file the value of the last
				651	%include directive overwrites all the others.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	652
				653	<p>The %include directive is very handy for getting some extra #include
				654	preprocessor statements at the beginning of the generated parser.
				655	For example:</p>
				656
				657	<p><pre>
				658	%include {#include &lt;unistd.h&gt;}
				659	</pre></p>
				660
				661	<p>This might be needed, for example, if some of the C actions in the
				662	grammar call functions that are prototyped in unistd.h.</p>
				663
				664	<h4>The <tt>%left</tt> directive</h4>
				665
				666	<p>The %left directive is used (along with the %right and
				667	%nonassoc directives) to declare precedences of terminal
				668	symbols. Every terminal symbol whose name appears after
				669	a %left directive but before the next period (``.'') is
				670	given the same left-associative precedence value. Subsequent
				671	%left directives have higher precedence. For example:</p>
				672
				673	<p><pre>
				674	%left AND.
				675	%left OR.
				676	%nonassoc EQ NE GT GE LT LE.
				677	%left PLUS MINUS.
				678	%left TIMES DIVIDE MOD.
				679	%right EXP NOT.
				680	</pre></p>
				681
				682	<p>Note the period that terminates each %left, %right or %nonassoc
				683	directive.</p>
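<p>As a sketch of how these declarations are used, suppose the grammar
also contains the following ambiguous rules. The non-terminal expr and
the terminal INTEGER are hypothetical names chosen for illustration.</p>

<p><pre>
expr ::= expr PLUS expr.
expr ::= expr MINUS expr.
expr ::= expr TIMES expr.
expr ::= expr EXP expr.
expr ::= INTEGER.
</pre></p>

<p>With the precedences declared above, ``a PLUS b TIMES c'' groups as
a PLUS (b TIMES c) because TIMES has higher precedence than PLUS.
``a MINUS b MINUS c'' groups as (a MINUS b) MINUS c because MINUS is
left-associative, and ``a EXP b EXP c'' groups as a EXP (b EXP c)
because EXP is right-associative.</p>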
				684
				685	<p>LALR(1) grammars can get into a situation where they require
				686	a large amount of stack space if you make heavy use of right-associative
				687	operators. For this reason, it is recommended that you use %left
				688	rather than %right whenever possible.</p>
				689
				690	<h4>The <tt>%name</tt> directive</h4>
				691
				692	<p>By default, the functions generated by Lemon all begin with the
				693	five-character string ``Parse''. You can change this string to something
				694	different using the %name directive. For instance:</p>
				695
				696	<p><pre>
				697	%name Abcde
				698	</pre></p>
				699
				700	<p>Putting this directive in the grammar file will cause Lemon to generate
				701	functions named
				702	<ul>
				703	<li> AbcdeAlloc(),
				704	<li> AbcdeFree(),
				705	<li> AbcdeTrace(), and
				706	<li> Abcde().
				707	</ul>
				708	The %name directive allows you to generate two or more different
				709	parsers and link them all into the same executable.
				710	</p>
				711
				712	<h4>The <tt>%nonassoc</tt> directive</h4>
				713
				714	<p>This directive is used to assign non-associative precedence to
				715	one or more terminal symbols. See the section on precedence rules
				716	or on the %left directive for additional information.</p>
				717
				718	<h4>The <tt>%parse_accept</tt> directive</h4>
				719
				720	<p>The %parse_accept directive specifies a block of C code that is
				721	executed whenever the parser accepts its input string. To ``accept''
				722	an input string means that the parser was able to process all tokens
				723	without error.</p>
				724
				725	<p>For example:</p>
				726
				727	<p><pre>
				728	%parse_accept {
				729	printf("parsing complete!\n");
				730	}
				731	</pre></p>
				732
				733
				734	<h4>The <tt>%parse_failure</tt> directive</h4>
				735
				736	<p>The %parse_failure directive specifies a block of C code that
				737	is executed whenever the parser fails completely. This code is not
				738	executed until the parser has tried and failed to resolve an input
				739	error using its usual error recovery strategy. The routine is
				740	only invoked when parsing is unable to continue.</p>
				741
				742	<p><pre>
				743	%parse_failure {
				744	fprintf(stderr,"Giving up. Parser is hopelessly lost...\n");
				745	}
				746	</pre></p>
				747
				748	<h4>The <tt>%right</tt> directive</h4>
				749
				750	<p>This directive is used to assign right-associative precedence to
				751	one or more terminal symbols. See the section on precedence rules
				752	or on the %left directive for additional information.</p>
				753
				754	<h4>The <tt>%stack_overflow</tt> directive</h4>
				755
				756	<p>The %stack_overflow directive specifies a block of C code that
				757	is executed if the parser's internal stack ever overflows. Typically
				758	this just prints an error message. After a stack overflow, the parser
				759	will be unable to continue and must be reset.</p>
				760
				761	<p><pre>
				762	%stack_overflow {
				763	fprintf(stderr,"Giving up. Parser stack overflow\n");
				764	}
				765	</pre></p>
				766
				767	<p>You can help prevent parser stack overflows by avoiding the use
				768	of right recursion and right-associative operators in your grammar.
				769	Use left recursion and left-associative operators instead, to
				770	encourage rules to reduce sooner and keep the stack size down.
				771	For example, write rules like this:
				772	<pre>
				773	list ::= list element. // left-recursion. Good!
				774	list ::= .
				775	</pre>
				776	Not like this:
				777	<pre>
				778	list ::= element list. // right-recursion. Bad!
				779	list ::= .
				780	</pre></p>
				781
				782	<h4>The <tt>%stack_size</tt> directive</h4>
				783
				784	<p>If stack overflow is a problem and you can't resolve the trouble
				785	by using left-recursion, then you might want to increase the size
				786	of the parser's stack using this directive. Put a positive integer
				787	after the %stack_size directive and Lemon will generate a parser
				788	with a stack of the requested size. The default value is 100.</p>
				789
				790	<p><pre>
				791	%stack_size 2000
				792	</pre></p>
				793
				794	<h4>The <tt>%start_symbol</tt> directive</h4>
				795
				796	<p>By default, the start-symbol for the grammar that Lemon generates
				797	is the first non-terminal that appears in the grammar file. But you
				798	can choose a different start-symbol using the %start_symbol directive.</p>
				799
				800	<p><pre>
				801	%start_symbol prog
				802	</pre></p>
				803
				804	<h4>The <tt>%token_destructor</tt> directive</h4>
				805
				806	<p>The %destructor directive assigns a destructor to a non-terminal
				807	symbol. (See the description of the %destructor directive above.)
				808	This directive does the same thing for all terminal symbols.</p>
				809
				810	<p>Unlike non-terminal symbols, which may each have a different data type
				811	for their values, terminals all use the same data type (defined by
				812	the %token_type directive) and so they use a common destructor. Other
				813	than that, the token destructor works just like the non-terminal
				814	destructors.</p>
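<p>For example, if tokens are pointers to heap-allocated structures, a
token destructor might look like the following sketch. It assumes that
each Token was obtained from malloc() and that the grammar file pulls in
stdlib.h using %include; the Token type itself is defined by the
application.</p>

<p><pre>
%token_type {Token*}
%token_destructor { free($$); }  /* $$ is the value of the token being destroyed */
</pre></p>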
				815
				816	<h4>The <tt>%token_prefix</tt> directive</h4>
				817
				818	<p>Lemon generates #defines that assign small integer constants
				819	to each terminal symbol in the grammar. If desired, Lemon will
				820	add a prefix specified by this directive
				821	to each of the #defines it generates.
				822	So if the default output of Lemon looked like this:
				823	<pre>
				824	#define AND 1
				825	#define MINUS 2
				826	#define OR 3
				827	#define PLUS 4
				828	</pre>
				829	you can insert a statement into the grammar like this:
				830	<pre>
				831	%token_prefix TOKEN_
				832	</pre>
				833	to cause Lemon to produce these symbols instead:
				834	<pre>
				835	#define TOKEN_AND 1
				836	#define TOKEN_MINUS 2
				837	#define TOKEN_OR 3
				838	#define TOKEN_PLUS 4
				839	</pre></p>
				840
				841	<h4>The <tt>%token_type</tt> and <tt>%type</tt> directives</h4>
				842
				843	<p>These directives are used to specify the data types for values
				844	on the parser's stack associated with terminal and non-terminal
				845	symbols. The values of all terminal symbols must be of the same
				846	type. This turns out to be the same data type as the 3rd parameter
				847	to the Parse() function generated by Lemon. Typically, you will
				848	make the value of a terminal symbol be a pointer to some kind of
				849	token structure. Like this:</p>
				850
				851	<p><pre>
				852	%token_type {Token*}
				853	</pre></p>
				854
				855	<p>If the data type of terminals is not specified, the default value
				856	is ``int''.</p>
				857
				858	<p>Non-terminal symbols can each have their own data types. Typically
				859	the data type of a non-terminal is a pointer to the root of a parse-tree
				860	structure that contains all information about that non-terminal.
				861	For example:</p>
				862
				863	<p><pre>
				864	%type expr {Expr*}
				865	</pre></p>
				866
				867	<p>Each entry on the parser's stack is actually a union containing
				868	instances of all data types for every non-terminal and terminal symbol.
				869	Lemon will automatically use the correct element of this union depending
				870	on what the corresponding non-terminal or terminal symbol is. But
				871	the grammar designer should keep in mind that the size of the union
				872	will be the size of its largest element. So if you have a single
				873	non-terminal whose data type requires 1K of storage, then your 100
				874	entry parser stack will require 100K of heap space. If you are willing
				875	and able to pay that price, fine. You just need to know.</p>
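<p>One way to keep the union small, sketched below, is to store a pointer
on the stack rather than the structure itself. BigThing is a hypothetical
1K structure and bigthing is a hypothetical non-terminal. The sketch
assumes the action code allocates each BigThing with malloc().</p>

<p><pre>
%type bigthing {BigThing*}          /* stack entry holds only a pointer */
%destructor bigthing { free($$); }  /* release the structure if the value is discarded */
</pre></p>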
				876
				877	<h3>Error Processing</h3>
				878
				879	<p>After extensive experimentation over several years, it has been
				880	discovered that the error recovery strategy used by yacc is about
				881	as good as it gets. And so that is what Lemon uses.</p>
				882
				883	<p>When a Lemon-generated parser encounters a syntax error, it
				884	first invokes the code specified by the %syntax_error directive, if
				885	any. It then enters its error recovery strategy. The error recovery
				886	strategy is to begin popping the parser's stack until it enters a
				887	state where it is permitted to shift a special non-terminal symbol
				888	named ``error''. It then shifts this non-terminal and continues
				889	parsing. But the %syntax_error routine will not be called again
				890	until at least three new tokens have been successfully shifted.</p>
				891
				892	<p>If the parser pops its stack until the stack is empty, and it still
				893	is unable to shift the error symbol, then the %parse_failure routine
				894	is invoked and the parser resets itself to its start state, ready
				895	to begin parsing a new file. This is what will happen at the very
				896	first syntax error, of course, if there are no instances of the
				897	``error'' non-terminal in your grammar.</p>
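
<p>A minimal sketch of a grammar that takes part in error recovery is
shown below. The non-terminals (stmtlist, stmt, expr) and the SEMI
terminal are hypothetical; only the ``error'' symbol and the
%syntax_error directive come from Lemon itself. The sketch also assumes
that stdio.h has been pulled in using %include.</p>

<p><pre>
%syntax_error {
  fprintf(stderr, "Syntax error\n");
}

stmtlist ::= stmtlist stmt.
stmtlist ::= .
stmt ::= expr SEMI.
stmt ::= error SEMI.   /* lets the parser shift ``error'' in place of a bad statement */
/* rules for expr omitted */
</pre></p>

<p>With a rule like the last one, a syntax error inside a statement need
not end the parse. The ``error'' symbol can be shifted in place of the
bad statement, and parsing can resume with the next statement.</p>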
				898
				899	</body>
				900	</html>