Blame - doc/lemon.html - chromium.googlesource.com/chromium/deps/sqlite

blob: f05c481d7960f2e74897454eab1731adc4cbd6e3 [file] [log] [blame]

drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1	<html>
				2	<head>
				3	<title>The Lemon Parser Generator</title>
				4	</head>
				5	<body bgcolor=white>
				6	<h1 align=center>The Lemon Parser Generator</h1>
				7
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	8	<p>Lemon is an LALR(1) parser generator for C.
				9	It does the same job as "bison" and "yacc".
				10	But lemon is not a bison or yacc clone. Lemon
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	11	uses a different grammar syntax which is designed to
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	12	reduce the number of coding errors. Lemon also uses a
				13	parsing engine that is faster than yacc and
				14	bison and which is both reentrant and threadsafe.
				15	(Update: Since the previous sentence was written, bison
				16	has also been updated so that it too can generate a
				17	reentrant and threadsafe parser.)
				18	Lemon also implements features that can be used
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	19	to eliminate resource leaks, making is suitable for use
				20	in long-running programs such as graphical user interfaces
				21	or embedded controllers.</p>
				22
				23	<p>This document is an introduction to the Lemon
				24	parser generator.</p>
				25
drh	c5e56b3	2017-06-01 01:53:19 +0000	[diff] [blame]	26	<h2>Security Note</h2>
				27
				28	<p>The language parser code created by Lemon is very robust and
				29	is well-suited for use in internet-facing applications that need to
				30	safely process maliciously crafted inputs.
				31
				32	<p>The "lemon.exe" command-line tool itself works great when given a valid
				33	input grammar file and almost always gives helpful
				34	error messages for malformed inputs. However, it is possible for
				35	a malicious user to craft a grammar file that will cause
				36	lemon.exe to crash.
				37	We do not see this as a problem, as lemon.exe is not intended to be used
				38	with hostile inputs.
				39	To summarize:</p>
				40
				41	<ul>
				42	<li>Parser code generated by lemon → Robust and secure
				43	<li>The "lemon.exe" command line tool itself → Not so much
				44	</ul>
				45
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	46	<h2>Theory of Operation</h2>
				47
				48	<p>The main goal of Lemon is to translate a context free grammar (CFG)
				49	for a particular language into C code that implements a parser for
				50	that language.
				51	The program has two inputs:
				52	<ul>
				53	<li>The grammar specification.
				54	<li>A parser template file.
				55	</ul>
				56	Typically, only the grammar specification is supplied by the programmer.
				57	Lemon comes with a default parser template which works fine for most
				58	applications. But the user is free to substitute a different parser
				59	template if desired.</p>
				60
				61	<p>Depending on command-line options, Lemon will generate between
				62	one and three files of outputs.
				63	<ul>
				64	<li>C code to implement the parser.
				65	<li>A header file defining an integer ID for each terminal symbol.
				66	<li>An information file that describes the states of the generated parser
				67	automaton.
				68	</ul>
				69	By default, all three of these output files are generated.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	70	The header file is suppressed if the "-m" command-line option is
				71	used and the report file is omitted when "-q" is selected.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	72
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	73	<p>The grammar specification file uses a ".y" suffix, by convention.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	74	In the examples used in this document, we'll assume the name of the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	75	grammar file is "gram.y". A typical use of Lemon would be the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	76	following command:
				77	<pre>
				78	lemon gram.y
				79	</pre>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	80	This command will generate three output files named "gram.c",
				81	"gram.h" and "gram.out".
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	82	The first is C code to implement the parser. The second
				83	is the header file that defines numerical values for all
				84	terminal symbols, and the last is the report that explains
				85	the states used by the parser automaton.</p>
				86
				87	<h3>Command Line Options</h3>
				88
				89	<p>The behavior of Lemon can be modified using command-line options.
				90	You can obtain a list of the available command-line options together
				91	with a brief explanation of what each does by typing
				92	<pre>
				93	lemon -?
				94	</pre>
				95	As of this writing, the following command-line options are supported:
				96	<ul>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	97	<li><b>-b</b>
				98	Show only the basis for each parser state in the report file.
				99	<li><b>-c</b>
				100	Do not compress the generated action tables.
				101	<li><b>-D<i>name</i></b>
				102	Define C preprocessor macro <i>name</i>. This macro is useable by
				103	"%ifdef" lines in the grammar file.
				104	<li><b>-g</b>
				105	Do not generate a parser. Instead write the input grammar to standard
				106	output with all comments, actions, and other extraneous text removed.
				107	<li><b>-l</b>
drh	dfe4e6b	2016-10-08 13:34:08 +0000	[diff] [blame]	108	Omit "#line" directives in the generated parser C code.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	109	<li><b>-m</b>
				110	Cause the output C source code to be compatible with the "makeheaders"
				111	program.
				112	<li><b>-p</b>
				113	Display all conflicts that are resolved by
				114	<a href='#precrules'>precedence rules</a>.
				115	<li><b>-q</b>
				116	Suppress generation of the report file.
				117	<li><b>-r</b>
				118	Do not sort or renumber the parser states as part of optimization.
				119	<li><b>-s</b>
				120	Show parser statistics before existing.
				121	<li><b>-T<i>file</i></b>
				122	Use <i>file</i> as the template for the generated C-code parser implementation.
				123	<li><b>-x</b>
				124	Print the Lemon version number.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	125	</ul>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	126
				127	<h3>The Parser Interface</h3>
				128
				129	<p>Lemon doesn't generate a complete, working program. It only generates
				130	a few subroutines that implement a parser. This section describes
				131	the interface to those subroutines. It is up to the programmer to
				132	call these subroutines in an appropriate way in order to produce a
				133	complete system.</p>
				134
				135	<p>Before a program begins using a Lemon-generated parser, the program
				136	must first create the parser.
				137	A new parser is created as follows:
				138	<pre>
				139	void *pParser = ParseAlloc( malloc );
				140	</pre>
				141	The ParseAlloc() routine allocates and initializes a new parser and
				142	returns a pointer to it.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	143	The actual data structure used to represent a parser is opaque —
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	144	its internal structure is not visible or usable by the calling routine.
				145	For this reason, the ParseAlloc() routine returns a pointer to void
				146	rather than a pointer to some particular structure.
				147	The sole argument to the ParseAlloc() routine is a pointer to the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	148	subroutine used to allocate memory. Typically this means malloc().</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	149
				150	<p>After a program is finished using a parser, it can reclaim all
				151	memory allocated by that parser by calling
				152	<pre>
				153	ParseFree(pParser, free);
				154	</pre>
				155	The first argument is the same pointer returned by ParseAlloc(). The
				156	second argument is a pointer to the function used to release bulk
				157	memory back to the system.</p>
				158
				159	<p>After a parser has been allocated using ParseAlloc(), the programmer
				160	must supply the parser with a sequence of tokens (terminal symbols) to
				161	be parsed. This is accomplished by calling the following function
				162	once for each token:
				163	<pre>
				164	Parse(pParser, hTokenID, sTokenData, pArg);
				165	</pre>
				166	The first argument to the Parse() routine is the pointer returned by
				167	ParseAlloc().
				168	The second argument is a small positive integer that tells the parse the
				169	type of the next token in the data stream.
				170	There is one token type for each terminal symbol in the grammar.
				171	The gram.h file generated by Lemon contains #define statements that
				172	map symbolic terminal symbol names into appropriate integer values.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	173	A value of 0 for the second argument is a special flag to the
				174	parser to indicate that the end of input has been reached.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	175	The third argument is the value of the given token. By default,
				176	the type of the third argument is integer, but the grammar will
				177	usually redefine this type to be some kind of structure.
				178	Typically the second argument will be a broad category of tokens
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	179	such as "identifier" or "number" and the third argument will
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	180	be the name of the identifier or the value of the number.</p>
				181
				182	<p>The Parse() function may have either three or four arguments,
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	183	depending on the grammar. If the grammar specification file requests
				184	it (via the <a href='#extraarg'><tt>extra_argument</tt> directive</a>),
				185	the Parse() function will have a fourth parameter that can be
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	186	of any type chosen by the programmer. The parser doesn't do anything
				187	with this argument except to pass it through to action routines.
				188	This is a convenient mechanism for passing state information down
				189	to the action routines without having to use global variables.</p>
				190
				191	<p>A typical use of a Lemon parser might look something like the
				192	following:
				193	<pre>
				194	01 ParseTree ParseFile(const char zFilename){
				195	02 Tokenizer *pTokenizer;
				196	03 void *pParser;
				197	04 Token sToken;
				198	05 int hTokenId;
				199	06 ParserState sState;
				200	07
				201	08 pTokenizer = TokenizerCreate(zFilename);
				202	09 pParser = ParseAlloc( malloc );
				203	10 InitParserState(&sState);
				204	11 while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){
				205	12 Parse(pParser, hTokenId, sToken, &sState);
				206	13 }
				207	14 Parse(pParser, 0, sToken, &sState);
				208	15 ParseFree(pParser, free );
				209	16 TokenizerFree(pTokenizer);
				210	17 return sState.treeRoot;
				211	18 }
				212	</pre>
				213	This example shows a user-written routine that parses a file of
				214	text and returns a pointer to the parse tree.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	215	(All error-handling code is omitted from this example to keep it
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	216	simple.)
				217	We assume the existence of some kind of tokenizer which is created
				218	using TokenizerCreate() on line 8 and deleted by TokenizerFree()
				219	on line 16. The GetNextToken() function on line 11 retrieves the
				220	next token from the input file and puts its type in the
				221	integer variable hTokenId. The sToken variable is assumed to be
				222	some kind of structure that contains details about each token,
				223	such as its complete text, what line it occurs on, etc. </p>
				224
				225	<p>This example also assumes the existence of structure of type
				226	ParserState that holds state information about a particular parse.
				227	An instance of such a structure is created on line 6 and initialized
				228	on line 10. A pointer to this structure is passed into the Parse()
				229	routine as the optional 4th argument.
				230	The action routine specified by the grammar for the parser can use
				231	the ParserState structure to hold whatever information is useful and
				232	appropriate. In the example, we note that the treeRoot field of
				233	the ParserState structure is left pointing to the root of the parse
				234	tree.</p>
				235
				236	<p>The core of this example as it relates to Lemon is as follows:
				237	<pre>
				238	ParseFile(){
				239	pParser = ParseAlloc( malloc );
				240	while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){
				241	Parse(pParser, hTokenId, sToken);
				242	}
				243	Parse(pParser, 0, sToken);
				244	ParseFree(pParser, free );
				245	}
				246	</pre>
				247	Basically, what a program has to do to use a Lemon-generated parser
				248	is first create the parser, then send it lots of tokens obtained by
				249	tokenizing an input source. When the end of input is reached, the
				250	Parse() routine should be called one last time with a token type
				251	of 0. This step is necessary to inform the parser that the end of
				252	input has been reached. Finally, we reclaim memory used by the
				253	parser by calling ParseFree().</p>
				254
				255	<p>There is one other interface routine that should be mentioned
				256	before we move on.
				257	The ParseTrace() function can be used to generate debugging output
				258	from the parser. A prototype for this routine is as follows:
				259	<pre>
				260	ParseTrace(FILE stream, char zPrefix);
				261	</pre>
				262	After this routine is called, a short (one-line) message is written
				263	to the designated output stream every time the parser changes states
				264	or calls an action routine. Each such message is prefaced using
				265	the text given by zPrefix. This debugging output can be turned off
				266	by calling ParseTrace() again with a first argument of NULL (0).</p>
				267
				268	<h3>Differences With YACC and BISON</h3>
				269
				270	<p>Programmers who have previously used the yacc or bison parser
				271	generator will notice several important differences between yacc and/or
				272	bison and Lemon.
				273	<ul>
				274	<li>In yacc and bison, the parser calls the tokenizer. In Lemon,
				275	the tokenizer calls the parser.
				276	<li>Lemon uses no global variables. Yacc and bison use global variables
				277	to pass information between the tokenizer and parser.
				278	<li>Lemon allows multiple parsers to be running simultaneously. Yacc
				279	and bison do not.
				280	</ul>
				281	These differences may cause some initial confusion for programmers
				282	with prior yacc and bison experience.
				283	But after years of experience using Lemon, I firmly
				284	believe that the Lemon way of doing things is better.</p>
				285
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	286	<p><i>Updated as of 2016-02-16:</i>
				287	The text above was written in the 1990s.
				288	We are told that Bison has lately been enhanced to support the
				289	tokenizer-calls-parser paradigm used by Lemon, and to obviate the
				290	need for global variables.</p>
				291
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	292	<h2>Input File Syntax</h2>
				293
				294	<p>The main purpose of the grammar specification file for Lemon is
				295	to define the grammar for the parser. But the input file also
				296	specifies additional information Lemon requires to do its job.
				297	Most of the work in using Lemon is in writing an appropriate
				298	grammar file.</p>
				299
				300	<p>The grammar file for lemon is, for the most part, free format.
				301	It does not have sections or divisions like yacc or bison. Any
				302	declaration can occur at any point in the file.
				303	Lemon ignores whitespace (except where it is needed to separate
				304	tokens) and it honors the same commenting conventions as C and C++.</p>
				305
				306	<h3>Terminals and Nonterminals</h3>
				307
				308	<p>A terminal symbol (token) is any string of alphanumeric
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	309	and/or underscore characters
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	310	that begins with an upper case letter.
drh	c8eee5e	2011-07-30 23:50:12 +0000	[diff] [blame]	311	A terminal can contain lowercase letters after the first character,
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	312	but the usual convention is to make terminals all upper case.
				313	A nonterminal, on the other hand, is any string of alphanumeric
				314	and underscore characters than begins with a lower case letter.
				315	Again, the usual convention is to make nonterminals use all lower
				316	case letters.</p>
				317
				318	<p>In Lemon, terminal and nonterminal symbols do not need to
				319	be declared or identified in a separate section of the grammar file.
				320	Lemon is able to generate a list of all terminals and nonterminals
				321	by examining the grammar rules, and it can always distinguish a
				322	terminal from a nonterminal by checking the case of the first
				323	character of the name.</p>
				324
				325	<p>Yacc and bison allow terminal symbols to have either alphanumeric
				326	names or to be individual characters included in single quotes, like
				327	this: ')' or '$'. Lemon does not allow this alternative form for
				328	terminal symbols. With Lemon, all symbols, terminals and nonterminals,
				329	must have alphanumeric names.</p>
				330
				331	<h3>Grammar Rules</h3>
				332
				333	<p>The main component of a Lemon grammar file is a sequence of grammar
				334	rules.
				335	Each grammar rule consists of a nonterminal symbol followed by
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	336	the special symbol "::=" and then a list of terminals and/or nonterminals.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	337	The rule is terminated by a period.
				338	The list of terminals and nonterminals on the right-hand side of the
				339	rule can be empty.
				340	Rules can occur in any order, except that the left-hand side of the
				341	first rule is assumed to be the start symbol for the grammar (unless
				342	specified otherwise using the <tt>%start</tt> directive described below.)
				343	A typical sequence of grammar rules might look something like this:
				344	<pre>
				345	expr ::= expr PLUS expr.
				346	expr ::= expr TIMES expr.
				347	expr ::= LPAREN expr RPAREN.
				348	expr ::= VALUE.
				349	</pre>
				350	</p>
				351
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	352	<p>There is one non-terminal in this example, "expr", and five
				353	terminal symbols or tokens: "PLUS", "TIMES", "LPAREN",
				354	"RPAREN" and "VALUE".</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	355
				356	<p>Like yacc and bison, Lemon allows the grammar to specify a block
				357	of C code that will be executed whenever a grammar rule is reduced
				358	by the parser.
				359	In Lemon, this action is specified by putting the C code (contained
				360	within curly braces <tt>{...}</tt>) immediately after the
				361	period that closes the rule.
				362	For example:
				363	<pre>
				364	expr ::= expr PLUS expr. { printf("Doing an addition...\n"); }
				365	</pre>
				366	</p>
				367
				368	<p>In order to be useful, grammar actions must normally be linked to
				369	their associated grammar rules.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	370	In yacc and bison, this is accomplished by embedding a "$$" in the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	371	action to stand for the value of the left-hand side of the rule and
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	372	symbols "$1", "$2", and so forth to stand for the value of
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	373	the terminal or nonterminal at position 1, 2 and so forth on the
				374	right-hand side of the rule.
				375	This idea is very powerful, but it is also very error-prone. The
				376	single most common source of errors in a yacc or bison grammar is
				377	to miscount the number of symbols on the right-hand side of a grammar
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	378	rule and say "$7" when you really mean "$8".</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	379
				380	<p>Lemon avoids the need to count grammar symbols by assigning symbolic
				381	names to each symbol in a grammar rule and then using those symbolic
				382	names in the action.
				383	In yacc or bison, one would write this:
				384	<pre>
				385	expr -> expr PLUS expr { $$ = $1 + $3; };
				386	</pre>
				387	But in Lemon, the same rule becomes the following:
				388	<pre>
				389	expr(A) ::= expr(B) PLUS expr(C). { A = B+C; }
				390	</pre>
				391	In the Lemon rule, any symbol in parentheses after a grammar rule
				392	symbol becomes a place holder for that symbol in the grammar rule.
				393	This place holder can then be used in the associated C action to
				394	stand for the value of that symbol.<p>
				395
				396	<p>The Lemon notation for linking a grammar rule with its reduce
				397	action is superior to yacc/bison on several counts.
				398	First, as mentioned above, the Lemon method avoids the need to
				399	count grammar symbols.
				400	Secondly, if a terminal or nonterminal in a Lemon grammar rule
				401	includes a linking symbol in parentheses but that linking symbol
				402	is not actually used in the reduce action, then an error message
				403	is generated.
				404	For example, the rule
				405	<pre>
				406	expr(A) ::= expr(B) PLUS expr(C). { A = B; }
				407	</pre>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	408	will generate an error because the linking symbol "C" is used
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	409	in the grammar rule but not in the reduce action.</p>
				410
				411	<p>The Lemon notation for linking grammar rules to reduce actions
				412	also facilitates the use of destructors for reclaiming memory
				413	allocated by the values of terminals and nonterminals on the
				414	right-hand side of a rule.</p>
				415
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	416	<a name='precrules'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	417	<h3>Precedence Rules</h3>
				418
				419	<p>Lemon resolves parsing ambiguities in exactly the same way as
				420	yacc and bison. A shift-reduce conflict is resolved in favor
				421	of the shift, and a reduce-reduce conflict is resolved by reducing
				422	whichever rule comes first in the grammar file.</p>
				423
				424	<p>Just like in
				425	yacc and bison, Lemon allows a measure of control
				426	over the resolution of paring conflicts using precedence rules.
				427	A precedence value can be assigned to any terminal symbol
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	428	using the
				429	<a href='#pleft'>%left</a>,
				430	<a href='#pright'>%right</a> or
				431	<a href='#pnonassoc'>%nonassoc</a> directives. Terminal symbols
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	432	mentioned in earlier directives have a lower precedence that
				433	terminal symbols mentioned in later directives. For example:</p>
				434
				435	<p><pre>
				436	%left AND.
				437	%left OR.
				438	%nonassoc EQ NE GT GE LT LE.
				439	%left PLUS MINUS.
				440	%left TIMES DIVIDE MOD.
				441	%right EXP NOT.
				442	</pre></p>
				443
				444	<p>In the preceding sequence of directives, the AND operator is
				445	defined to have the lowest precedence. The OR operator is one
				446	precedence level higher. And so forth. Hence, the grammar would
				447	attempt to group the ambiguous expression
				448	<pre>
				449	a AND b OR c
				450	</pre>
				451	like this
				452	<pre>
				453	a AND (b OR c).
				454	</pre>
				455	The associativity (left, right or nonassoc) is used to determine
				456	the grouping when the precedence is the same. AND is left-associative
				457	in our example, so
				458	<pre>
				459	a AND b AND c
				460	</pre>
				461	is parsed like this
				462	<pre>
				463	(a AND b) AND c.
				464	</pre>
				465	The EXP operator is right-associative, though, so
				466	<pre>
				467	a EXP b EXP c
				468	</pre>
				469	is parsed like this
				470	<pre>
				471	a EXP (b EXP c).
				472	</pre>
				473	The nonassoc precedence is used for non-associative operators.
				474	So
				475	<pre>
				476	a EQ b EQ c
				477	</pre>
				478	is an error.</p>
				479
				480	<p>The precedence of non-terminals is transferred to rules as follows:
				481	The precedence of a grammar rule is equal to the precedence of the
				482	left-most terminal symbol in the rule for which a precedence is
				483	defined. This is normally what you want, but in those cases where
				484	you want to precedence of a grammar rule to be something different,
				485	you can specify an alternative precedence symbol by putting the
				486	symbol in square braces after the period at the end of the rule and
				487	before any C-code. For example:</p>
				488
				489	<p><pre>
				490	expr = MINUS expr. [NOT]
				491	</pre></p>
				492
				493	<p>This rule has a precedence equal to that of the NOT symbol, not the
				494	MINUS symbol as would have been the case by default.</p>
				495
				496	<p>With the knowledge of how precedence is assigned to terminal
				497	symbols and individual
				498	grammar rules, we can now explain precisely how parsing conflicts
				499	are resolved in Lemon. Shift-reduce conflicts are resolved
				500	as follows:
				501	<ul>
				502	<li> If either the token to be shifted or the rule to be reduced
				503	lacks precedence information, then resolve in favor of the
				504	shift, but report a parsing conflict.
				505	<li> If the precedence of the token to be shifted is greater than
				506	the precedence of the rule to reduce, then resolve in favor
				507	of the shift. No parsing conflict is reported.
				508	<li> If the precedence of the token it be shifted is less than the
				509	precedence of the rule to reduce, then resolve in favor of the
				510	reduce action. No parsing conflict is reported.
				511	<li> If the precedences are the same and the shift token is
				512	right-associative, then resolve in favor of the shift.
				513	No parsing conflict is reported.
mistachkin	d557843	2012-08-25 10:01:29 +0000	[diff] [blame]	514	<li> If the precedences are the same the shift token is
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	515	left-associative, then resolve in favor of the reduce.
				516	No parsing conflict is reported.
				517	<li> Otherwise, resolve the conflict by doing the shift and
				518	report the parsing conflict.
				519	</ul>
				520	Reduce-reduce conflicts are resolved this way:
				521	<ul>
				522	<li> If either reduce rule
				523	lacks precedence information, then resolve in favor of the
				524	rule that appears first in the grammar and report a parsing
				525	conflict.
				526	<li> If both rules have precedence and the precedence is different
				527	then resolve the dispute in favor of the rule with the highest
				528	precedence and do not report a conflict.
				529	<li> Otherwise, resolve the conflict by reducing by the rule that
				530	appears first in the grammar and report a parsing conflict.
				531	</ul>
				532
				533	<h3>Special Directives</h3>
				534
				535	<p>The input grammar to Lemon consists of grammar rules and special
				536	directives. We've described all the grammar rules, so now we'll
				537	talk about the special directives.</p>
				538
				539	<p>Directives in lemon can occur in any order. You can put them before
				540	the grammar rules, or after the grammar rules, or in the mist of the
				541	grammar rules. It doesn't matter. The relative order of
				542	directives used to assign precedence to terminals is important, but
				543	other than that, the order of directives in Lemon is arbitrary.</p>
				544
				545	<p>Lemon supports the following special directives:
				546	<ul>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	547	<li><tt>%code</tt>
				548	<li><tt>%default_destructor</tt>
				549	<li><tt>%default_type</tt>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	550	<li><tt>%destructor</tt>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	551	<li><tt>%endif</tt>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	552	<li><tt>%extra_argument</tt>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	553	<li><tt>%fallback</tt>
				554	<li><tt>%ifdef</tt>
				555	<li><tt>%ifndef</tt>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	556	<li><tt>%include</tt>
				557	<li><tt>%left</tt>
				558	<li><tt>%name</tt>
				559	<li><tt>%nonassoc</tt>
				560	<li><tt>%parse_accept</tt>
				561	<li><tt>%parse_failure </tt>
				562	<li><tt>%right</tt>
				563	<li><tt>%stack_overflow</tt>
				564	<li><tt>%stack_size</tt>
				565	<li><tt>%start_symbol</tt>
				566	<li><tt>%syntax_error</tt>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	567	<li><tt>%token_class</tt>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	568	<li><tt>%token_destructor</tt>
				569	<li><tt>%token_prefix</tt>
				570	<li><tt>%token_type</tt>
				571	<li><tt>%type</tt>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	572	<li><tt>%wildcard</tt>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	573	</ul>
				574	Each of these directives will be described separately in the
				575	following sections:</p>
				576
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	577	<a name='pcode'></a>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	578	<h4>The <tt>%code</tt> directive</h4>
				579
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	580	<p>The %code directive is used to specify addition C code that
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	581	is added to the end of the main output file. This is similar to
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	582	the <a href='#pinclude'>%include</a> directive except that %include
				583	is inserted at the beginning of the main output file.</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	584
				585	<p>%code is typically used to include some action routines or perhaps
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	586	a tokenizer or even the "main()" function
				587	as part of the output file.</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	588
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	589	<a name='default_destructor'></a>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	590	<h4>The <tt>%default_destructor</tt> directive</h4>
				591
				592	<p>The %default_destructor directive specifies a destructor to
				593	use for non-terminals that do not have their own destructor
				594	specified by a separate %destructor directive. See the documentation
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	595	on the <a name='#destructor'>%destructor</a> directive below for
				596	additional information.</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	597
				598	<p>In some grammers, many different non-terminal symbols have the
				599	same datatype and hence the same destructor. This directive is
				600	a convenience way to specify the same destructor for all those
				601	non-terminals using a single statement.</p>
				602
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	603	<a name='default_type'></a>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	604	<h4>The <tt>%default_type</tt> directive</h4>
				605
				606	<p>The %default_type directive specifies the datatype of non-terminal
				607	symbols that do no have their own datatype defined using a separate
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	608	<a href='#ptype'>%type</a> directive.
				609	</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	610
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	611	<a name='destructor'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	612	<h4>The <tt>%destructor</tt> directive</h4>
				613
				614	<p>The %destructor directive is used to specify a destructor for
				615	a non-terminal symbol.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	616	(See also the <a href='#token_destructor'>%token_destructor</a>
				617	directive which is used to specify a destructor for terminal symbols.)</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	618
				619	<p>A non-terminal's destructor is called to dispose of the
				620	non-terminal's value whenever the non-terminal is popped from
				621	the stack. This includes all of the following circumstances:
				622	<ul>
				623	<li> When a rule reduces and the value of a non-terminal on
				624	the right-hand side is not linked to C code.
				625	<li> When the stack is popped during error processing.
				626	<li> When the ParseFree() function runs.
				627	</ul>
				628	The destructor can do whatever it wants with the value of
				629	the non-terminal, but its design is to deallocate memory
				630	or other resources held by that non-terminal.</p>
				631
				632	<p>Consider an example:
				633	<pre>
				634	%type nt {void*}
				635	%destructor nt { free($$); }
				636	nt(A) ::= ID NUM. { A = malloc( 100 ); }
				637	</pre>
				638	This example is a bit contrived but it serves to illustrate how
				639	destructors work. The example shows a non-terminal named
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	640	"nt" that holds values of type "void*". When the rule for
				641	an "nt" reduces, it sets the value of the non-terminal to
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	642	space obtained from malloc(). Later, when the nt non-terminal
				643	is popped from the stack, the destructor will fire and call
				644	free() on this malloced space, thus avoiding a memory leak.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	645	(Note that the symbol "$$" in the destructor code is replaced
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	646	by the value of the non-terminal.)</p>
				647
				648	<p>It is important to note that the value of a non-terminal is passed
				649	to the destructor whenever the non-terminal is removed from the
				650	stack, unless the non-terminal is used in a C-code action. If
				651	the non-terminal is used by C-code, then it is assumed that the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	652	C-code will take care of destroying it.
				653	More commonly, the value is used to build some
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	654	larger structure and we don't want to destroy it, which is why
				655	the destructor is not called in this circumstance.</p>
				656
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	657	<p>Destructors help avoid memory leaks by automatically freeing
				658	allocated objects when they go out of scope.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	659	To do the same using yacc or bison is much more difficult.</p>
				660
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	661	<a name="extraarg"></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	662	<h4>The <tt>%extra_argument</tt> directive</h4>
				663
				664	The %extra_argument directive instructs Lemon to add a 4th parameter
				665	to the parameter list of the Parse() function it generates. Lemon
				666	doesn't do anything itself with this extra argument, but it does
				667	make the argument available to C-code action routines, destructors,
				668	and so forth. For example, if the grammar file contains:</p>
				669
				670	<p><pre>
				671	%extra_argument { MyStruct *pAbc }
				672	</pre></p>
				673
				674	<p>Then the Parse() function generated will have an 4th parameter
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	675	of type "MyStruct*" and all action routines will have access to
				676	a variable named "pAbc" that is the value of the 4th parameter
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	677	in the most recent call to Parse().</p>
				678
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	679	<a name='pfallback'></a>
				680	<h4>The <tt>%fallback</tt> directive</h4>
				681
				682	<p>The %fallback directive specifies an alternative meaning for one
				683	or more tokens. The alternative meaning is tried if the original token
				684	would have generated a syntax error.
				685
				686	<p>The %fallback directive was added to support robust parsing of SQL
				687	syntax in <a href="https://www.sqlite.org/">SQLite</a>.
				688	The SQL language contains a large assortment of keywords, each of which
				689	appears as a different token to the language parser. SQL contains so
				690	many keywords, that it can be difficult for programmers to keep up with
				691	them all. Programmers will, therefore, sometimes mistakenly use an
				692	obscure language keyword for an identifier. The %fallback directive
				693	provides a mechanism to tell the parser: "If you are unable to parse
				694	this keyword, try treating it as an identifier instead."
				695
				696	<p>The syntax of %fallback is as follows:
				697
				698	<blockquote>
				699	<tt>%fallback</tt> <i>ID</i> <i>TOKEN...</i> <b>.</b>
				700	</blockquote>
				701
				702	<p>In words, the %fallback directive is followed by a list of token names
				703	terminated by a period. The first token name is the fallback token - the
				704	token to which all the other tokens fall back to. The second and subsequent
				705	arguments are tokens which fall back to the token identified by the first
				706	argument.
				707
				708	<a name='pifdef'></a>
				709	<h4>The <tt>%ifdef</tt>, <tt>%ifndef</tt>, and <tt>%endif</tt> directives.</h4>
				710
				711	<p>The %ifdef, %ifndef, and %endif directives are similar to
				712	#ifdef, #ifndef, and #endif in the C-preprocessor, just not as general.
				713	Each of these directives must begin at the left margin. No whitespace
				714	is allowed between the "%" and the directive name.
				715
				716	<p>Grammar text in between "%ifdef MACRO" and the next nested "%endif" is
				717	ignored unless the "-DMACRO" command-line option is used. Grammar text
				718	betwen "%ifndef MACRO" and the next nested "%endif" is included except when
				719	the "-DMACRO" command-line option is used.
				720
				721	<p>Note that the argument to %ifdef and %ifndef must be a single
				722	preprocessor symbol name, not a general expression. There is no "%else"
				723	directive.
				724
				725
				726	<a name='pinclude'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	727	<h4>The <tt>%include</tt> directive</h4>
				728
				729	<p>The %include directive specifies C code that is included at the
				730	top of the generated parser. You can include any text you want --
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	731	the Lemon parser generator copies it blindly. If you have multiple
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	732	%include directives in your grammar file, their values are concatenated
				733	so that all %include code ultimately appears near the top of the
				734	generated parser, in the same order as it appeared in the grammer.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	735
				736	<p>The %include directive is very handy for getting some extra #include
				737	preprocessor statements at the beginning of the generated parser.
				738	For example:</p>
				739
				740	<p><pre>
				741	%include {#include <unistd.h>}
				742	</pre></p>
				743
				744	<p>This might be needed, for example, if some of the C actions in the
				745	grammar call functions that are prototyed in unistd.h.</p>
				746
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	747	<a name='pleft'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	748	<h4>The <tt>%left</tt> directive</h4>
				749
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	750	The %left directive is used (along with the <a href='#pright'>%right</a> and
				751	<a href='#pnonassoc'>%nonassoc</a> directives) to declare precedences of
				752	terminal symbols. Every terminal symbol whose name appears after
				753	a %left directive but before the next period (".") is
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	754	given the same left-associative precedence value. Subsequent
				755	%left directives have higher precedence. For example:</p>
				756
				757	<p><pre>
				758	%left AND.
				759	%left OR.
				760	%nonassoc EQ NE GT GE LT LE.
				761	%left PLUS MINUS.
				762	%left TIMES DIVIDE MOD.
				763	%right EXP NOT.
				764	</pre></p>
				765
				766	<p>Note the period that terminates each %left, %right or %nonassoc
				767	directive.</p>
				768
				769	<p>LALR(1) grammars can get into a situation where they require
				770	a large amount of stack space if you make heavy use or right-associative
				771	operators. For this reason, it is recommended that you use %left
				772	rather than %right whenever possible.</p>
				773
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	774	<a name='pname'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	775	<h4>The <tt>%name</tt> directive</h4>
				776
				777	<p>By default, the functions generated by Lemon all begin with the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	778	five-character string "Parse". You can change this string to something
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	779	different using the %name directive. For instance:</p>
				780
				781	<p><pre>
				782	%name Abcde
				783	</pre></p>
				784
				785	<p>Putting this directive in the grammar file will cause Lemon to generate
				786	functions named
				787	<ul>
				788	<li> AbcdeAlloc(),
				789	<li> AbcdeFree(),
				790	<li> AbcdeTrace(), and
				791	<li> Abcde().
				792	</ul>
				793	The %name directive allows you to generator two or more different
				794	parsers and link them all into the same executable.
				795	</p>
				796
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	797	<a name='pnonassoc'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	798	<h4>The <tt>%nonassoc</tt> directive</h4>
				799
				800	<p>This directive is used to assign non-associative precedence to
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	801	one or more terminal symbols. See the section on
				802	<a href='#precrules'>precedence rules</a>
				803	or on the <a href='#pleft'>%left</a> directive for additional information.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	804
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	805	<a name='parse_accept'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	806	<h4>The <tt>%parse_accept</tt> directive</h4>
				807
				808	<p>The %parse_accept directive specifies a block of C code that is
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	809	executed whenever the parser accepts its input string. To "accept"
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	810	an input string means that the parser was able to process all tokens
				811	without error.</p>
				812
				813	<p>For example:</p>
				814
				815	<p><pre>
				816	%parse_accept {
				817	printf("parsing complete!\n");
				818	}
				819	</pre></p>
				820
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	821	<a name='parse_failure'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	822	<h4>The <tt>%parse_failure</tt> directive</h4>
				823
				824	<p>The %parse_failure directive specifies a block of C code that
				825	is executed whenever the parser fails complete. This code is not
				826	executed until the parser has tried and failed to resolve an input
				827	error using is usual error recovery strategy. The routine is
				828	only invoked when parsing is unable to continue.</p>
				829
				830	<p><pre>
				831	%parse_failure {
				832	fprintf(stderr,"Giving up. Parser is hopelessly lost...\n");
				833	}
				834	</pre></p>
				835
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	836	<a name='pright'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	837	<h4>The <tt>%right</tt> directive</h4>
				838
				839	<p>This directive is used to assign right-associative precedence to
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	840	one or more terminal symbols. See the section on
				841	<a href='#precrules'>precedence rules</a>
				842	or on the <a href='#pleft'>%left</a> directive for additional information.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	843
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	844	<a name='stack_overflow'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	845	<h4>The <tt>%stack_overflow</tt> directive</h4>
				846
				847	<p>The %stack_overflow directive specifies a block of C code that
				848	is executed if the parser's internal stack ever overflows. Typically
				849	this just prints an error message. After a stack overflow, the parser
				850	will be unable to continue and must be reset.</p>
				851
				852	<p><pre>
				853	%stack_overflow {
				854	fprintf(stderr,"Giving up. Parser stack overflow\n");
				855	}
				856	</pre></p>
				857
				858	<p>You can help prevent parser stack overflows by avoiding the use
				859	of right recursion and right-precedence operators in your grammar.
				860	Use left recursion and and left-precedence operators instead, to
				861	encourage rules to reduce sooner and keep the stack size down.
				862	For example, do rules like this:
				863	<pre>
				864	list ::= list element. // left-recursion. Good!
				865	list ::= .
				866	</pre>
				867	Not like this:
				868	<pre>
				869	list ::= element list. // right-recursion. Bad!
				870	list ::= .
				871	</pre>
				872
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	873	<a name='stack_size'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	874	<h4>The <tt>%stack_size</tt> directive</h4>
				875
				876	<p>If stack overflow is a problem and you can't resolve the trouble
				877	by using left-recursion, then you might want to increase the size
				878	of the parser's stack using this directive. Put an positive integer
				879	after the %stack_size directive and Lemon will generate a parse
				880	with a stack of the requested size. The default value is 100.</p>
				881
				882	<p><pre>
				883	%stack_size 2000
				884	</pre></p>
				885
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	886	<a name='start_symbol'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	887	<h4>The <tt>%start_symbol</tt> directive</h4>
				888
				889	<p>By default, the start-symbol for the grammar that Lemon generates
				890	is the first non-terminal that appears in the grammar file. But you
				891	can choose a different start-symbol using the %start_symbol directive.</p>
				892
				893	<p><pre>
				894	%start_symbol prog
				895	</pre></p>
				896
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	897	<a name='token_destructor'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	898	<h4>The <tt>%token_destructor</tt> directive</h4>
				899
				900	<p>The %destructor directive assigns a destructor to a non-terminal
				901	symbol. (See the description of the %destructor directive above.)
				902	This directive does the same thing for all terminal symbols.</p>
				903
				904	<p>Unlike non-terminal symbols which may each have a different data type
				905	for their values, terminals all use the same data type (defined by
				906	the %token_type directive) and so they use a common destructor. Other
				907	than that, the token destructor works just like the non-terminal
				908	destructors.</p>
				909
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	910	<a name='token_prefix'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	911	<h4>The <tt>%token_prefix</tt> directive</h4>
				912
				913	<p>Lemon generates #defines that assign small integer constants
				914	to each terminal symbol in the grammar. If desired, Lemon will
				915	add a prefix specified by this directive
				916	to each of the #defines it generates.
				917	So if the default output of Lemon looked like this:
				918	<pre>
				919	#define AND 1
				920	#define MINUS 2
				921	#define OR 3
				922	#define PLUS 4
				923	</pre>
				924	You can insert a statement into the grammar like this:
				925	<pre>
				926	%token_prefix TOKEN_
				927	</pre>
				928	to cause Lemon to produce these symbols instead:
				929	<pre>
				930	#define TOKEN_AND 1
				931	#define TOKEN_MINUS 2
				932	#define TOKEN_OR 3
				933	#define TOKEN_PLUS 4
				934	</pre>
				935
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	936	<a name='token_type'></a><a name='ptype'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	937	<h4>The <tt>%token_type</tt> and <tt>%type</tt> directives</h4>
				938
				939	<p>These directives are used to specify the data types for values
				940	on the parser's stack associated with terminal and non-terminal
				941	symbols. The values of all terminal symbols must be of the same
				942	type. This turns out to be the same data type as the 3rd parameter
				943	to the Parse() function generated by Lemon. Typically, you will
				944	make the value of a terminal symbol by a pointer to some kind of
				945	token structure. Like this:</p>
				946
				947	<p><pre>
				948	%token_type {Token*}
				949	</pre></p>
				950
				951	<p>If the data type of terminals is not specified, the default value
drh	dfe4e6b	2016-10-08 13:34:08 +0000	[diff] [blame]	952	is "void*".</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	953
				954	<p>Non-terminal symbols can each have their own data types. Typically
				955	the data type of a non-terminal is a pointer to the root of a parse-tree
				956	structure that contains all information about that non-terminal.
				957	For example:</p>
				958
				959	<p><pre>
				960	%type expr {Expr*}
				961	</pre></p>
				962
				963	<p>Each entry on the parser's stack is actually a union containing
				964	instances of all data types for every non-terminal and terminal symbol.
				965	Lemon will automatically use the correct element of this union depending
				966	on what the corresponding non-terminal or terminal symbol is. But
				967	the grammar designer should keep in mind that the size of the union
				968	will be the size of its largest element. So if you have a single
				969	non-terminal whose data type requires 1K of storage, then your 100
				970	entry parser stack will require 100K of heap space. If you are willing
				971	and able to pay that price, fine. You just need to know.</p>
				972
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	973	<a name='pwildcard'></a>
				974	<h4>The <tt>%wildcard</tt> directive</h4>
				975
				976	<p>The %wildcard directive is followed by a single token name and a
				977	period. This directive specifies that the identified token should
				978	match any input token.
				979
				980	<p>When the generated parser has the choice of matching an input against
				981	the wildcard token and some other token, the other token is always used.
				982	The wildcard token is only matched if there are no other alternatives.
				983
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	984	<h3>Error Processing</h3>
				985
				986	<p>After extensive experimentation over several years, it has been
				987	discovered that the error recovery strategy used by yacc is about
				988	as good as it gets. And so that is what Lemon uses.</p>
				989
				990	<p>When a Lemon-generated parser encounters a syntax error, it
				991	first invokes the code specified by the %syntax_error directive, if
				992	any. It then enters its error recovery strategy. The error recovery
				993	strategy is to begin popping the parsers stack until it enters a
				994	state where it is permitted to shift a special non-terminal symbol
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	995	named "error". It then shifts this non-terminal and continues
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	996	parsing. But the %syntax_error routine will not be called again
				997	until at least three new tokens have been successfully shifted.</p>
				998
				999	<p>If the parser pops its stack until the stack is empty, and it still
				1000	is unable to shift the error symbol, then the %parse_failed routine
				1001	is invoked and the parser resets itself to its start state, ready
				1002	to begin parsing a new file. This is what will happen at the very
				1003	first syntax error, of course, if there are no instances of the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	1004	"error" non-terminal in your grammar.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1005
				1006	</body>
				1007	</html>