Blame - doc/lemon.html - chromium.googlesource.com/chromium/deps/sqlite

drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1	<html>
				2	<head>
				3	<title>The Lemon Parser Generator</title>
				4	</head>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	5	<body bgcolor='white'>
				6	<h1 align='center'>The Lemon Parser Generator</h1>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	7
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	8	<p>Lemon is an LALR(1) parser generator for C.
				9	It does the same job as "bison" and "yacc".
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	10	But Lemon is not a bison or yacc clone. Lemon
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	11	uses a different grammar syntax which is designed to
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	12	reduce the number of coding errors. Lemon also uses a
				13	parsing engine that is faster than yacc and
				14	bison and which is both reentrant and threadsafe.
				15	(Update: Since the previous sentence was written, bison
				16	has also been updated so that it too can generate a
				17	reentrant and threadsafe parser.)
				18	Lemon also implements features that can be used
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	19	to eliminate resource leaks, making it suitable for use
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	20	in long-running programs such as graphical user interfaces
				21	or embedded controllers.</p>
				22
				23	<p>This document is an introduction to the Lemon
				24	parser generator.</p>
				25
drh	c5e56b3	2017-06-01 01:53:19 +0000	[diff] [blame]	26	<h2>Security Note</h2>
				27
				28	<p>The language parser code created by Lemon is very robust and
				29	is well-suited for use in internet-facing applications that need to
safely process maliciously crafted inputs.</p>
				31
				32	<p>The "lemon.exe" command-line tool itself works great when given a valid
				33	input grammar file and almost always gives helpful
				34	error messages for malformed inputs. However, it is possible for
				35	a malicious user to craft a grammar file that will cause
				36	lemon.exe to crash.
				37	We do not see this as a problem, as lemon.exe is not intended to be used
				38	with hostile inputs.
				39	To summarize:</p>
				40
				41	<ul>
				42	<li>Parser code generated by lemon → Robust and secure
				43	<li>The "lemon.exe" command line tool itself → Not so much
				44	</ul>
				45
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	46	<h2>Theory of Operation</h2>
				47
				48	<p>The main goal of Lemon is to translate a context free grammar (CFG)
				49	for a particular language into C code that implements a parser for
				50	that language.
				51	The program has two inputs:
				52	<ul>
				53	<li>The grammar specification.
				54	<li>A parser template file.
				55	</ul>
				56	Typically, only the grammar specification is supplied by the programmer.
				57	Lemon comes with a default parser template which works fine for most
				58	applications. But the user is free to substitute a different parser
				59	template if desired.</p>
				60
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	61	<p>Depending on command-line options, Lemon will generate up to
				62	three output files.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	63	<ul>
				64	<li>C code to implement the parser.
				65	<li>A header file defining an integer ID for each terminal symbol.
				66	<li>An information file that describes the states of the generated parser
				67	automaton.
				68	</ul>
				69	By default, all three of these output files are generated.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	70	The header file is suppressed if the "-m" command-line option is
				71	used and the report file is omitted when "-q" is selected.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	72
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	73	<p>The grammar specification file uses a ".y" suffix, by convention.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	74	In the examples used in this document, we'll assume the name of the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	75	grammar file is "gram.y". A typical use of Lemon would be the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	76	following command:
				77	<pre>
				78	lemon gram.y
				79	</pre>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	80	This command will generate three output files named "gram.c",
				81	"gram.h" and "gram.out".
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	82	The first is C code to implement the parser. The second
				83	is the header file that defines numerical values for all
				84	terminal symbols, and the last is the report that explains
				85	the states used by the parser automaton.</p>
				86
				87	<h3>Command Line Options</h3>
				88
				89	<p>The behavior of Lemon can be modified using command-line options.
				90	You can obtain a list of the available command-line options together
				91	with a brief explanation of what each does by typing
				92	<pre>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	93	lemon "-?"
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	94	</pre>
				95	As of this writing, the following command-line options are supported:
				96	<ul>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	97	<li><b>-b</b>
				98	Show only the basis for each parser state in the report file.
				99	<li><b>-c</b>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	100	Do not compress the generated action tables. The parser will be a
				101	little larger and slower, but it will detect syntax errors sooner.
drh	fb32c44	2018-04-21 13:51:42 +0000	[diff] [blame]	102	<li><b>-d</b><i>directory</i>
				103	Write all output files into <i>directory</i>. Normally, output files
				104	are written into the directory that contains the input grammar file.
<li><b>-D<i>name</i></b>
Define C preprocessor macro <i>name</i>. This macro is usable by
"<tt><a href='#pifdef'>%ifdef</a></tt>",
"<tt><a href='#pifdef'>%ifndef</a></tt>", and
"<tt><a href='#pifdef'>%if</a></tt>" lines
in the grammar file.
drh	5f0d37b	2020-07-03 18:07:22 +0000	[diff] [blame]	111	<li><b>-E</b>
				112	Run the "%if" preprocessor step only and print the revised grammar
				113	file.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	114	<li><b>-g</b>
				115	Do not generate a parser. Instead write the input grammar to standard
				116	output with all comments, actions, and other extraneous text removed.
				117	<li><b>-l</b>
drh	dfe4e6b	2016-10-08 13:34:08 +0000	[diff] [blame]	118	Omit "#line" directives in the generated parser C code.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	119	<li><b>-m</b>
				120	Cause the output C source code to be compatible with the "makeheaders"
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	121	program.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	122	<li><b>-p</b>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	123	Display all conflicts that are resolved by
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	124	<a href='#precrules'>precedence rules</a>.
				125	<li><b>-q</b>
				126	Suppress generation of the report file.
				127	<li><b>-r</b>
				128	Do not sort or renumber the parser states as part of optimization.
				129	<li><b>-s</b>
drh	ed5e668	2020-03-09 01:02:45 +0000	[diff] [blame]	130	Show parser statistics before exiting.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	131	<li><b>-T<i>file</i></b>
				132	Use <i>file</i> as the template for the generated C-code parser implementation.
				133	<li><b>-x</b>
				134	Print the Lemon version number.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	135	</ul>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	136
				137	<h3>The Parser Interface</h3>
				138
				139	<p>Lemon doesn't generate a complete, working program. It only generates
				140	a few subroutines that implement a parser. This section describes
				141	the interface to those subroutines. It is up to the programmer to
				142	call these subroutines in an appropriate way in order to produce a
				143	complete system.</p>
				144
				145	<p>Before a program begins using a Lemon-generated parser, the program
				146	must first create the parser.
				147	A new parser is created as follows:
				148	<pre>
				149	void *pParser = ParseAlloc( malloc );
				150	</pre>
				151	The ParseAlloc() routine allocates and initializes a new parser and
				152	returns a pointer to it.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	153	The actual data structure used to represent a parser is opaque —
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	154	its internal structure is not visible or usable by the calling routine.
				155	For this reason, the ParseAlloc() routine returns a pointer to void
				156	rather than a pointer to some particular structure.
				157	The sole argument to the ParseAlloc() routine is a pointer to the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	158	subroutine used to allocate memory. Typically this means malloc().</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	159
				160	<p>After a program is finished using a parser, it can reclaim all
				161	memory allocated by that parser by calling
				162	<pre>
				163	ParseFree(pParser, free);
				164	</pre>
				165	The first argument is the same pointer returned by ParseAlloc(). The
				166	second argument is a pointer to the function used to release bulk
				167	memory back to the system.</p>
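<p>Because ParseAlloc() and ParseFree() take the allocator and
deallocator as arguments, a program can substitute its own routines
for malloc() and free(). The following is a minimal sketch, not part of
Lemon: the names CountingMalloc(), CountingFree() and OutstandingBlocks()
are hypothetical, and the sketch assumes the allocator argument has the
same shape as malloc().</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical counter of blocks handed out but not yet released. */
static long nOutstanding = 0;

/* Same shape as malloc(), so it could be passed to ParseAlloc(). */
void *CountingMalloc(size_t n){
  void *p = malloc(n);
  if( p ) nOutstanding++;
  return p;
}

/* Same shape as free(), so it could be passed to ParseFree(). */
void CountingFree(void *p){
  if( p ){
    nOutstanding--;
    free(p);
  }
}

/* Number of blocks obtained through CountingMalloc() and still held. */
long OutstandingBlocks(void){
  return nOutstanding;
}

/* Intended use with a generated parser (not compiled here):
**
**    void *pParser = ParseAlloc( CountingMalloc );
**    ...
**    ParseFree(pParser, CountingFree);
*/
```

<p>With wrappers like these, a test harness could check that
OutstandingBlocks() is back to its starting value after ParseFree()
returns. Whether every allocation made by the parser goes through the
supplied routine depends on the parser template in use, so treat that
check as a starting point rather than a guarantee.</p>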
				168
				169	<p>After a parser has been allocated using ParseAlloc(), the programmer
				170	must supply the parser with a sequence of tokens (terminal symbols) to
				171	be parsed. This is accomplished by calling the following function
				172	once for each token:
				173	<pre>
				174	Parse(pParser, hTokenID, sTokenData, pArg);
				175	</pre>
				176	The first argument to the Parse() routine is the pointer returned by
				177	ParseAlloc().
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	178	The second argument is a small positive integer that tells the parser the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	179	type of the next token in the data stream.
				180	There is one token type for each terminal symbol in the grammar.
				181	The gram.h file generated by Lemon contains #define statements that
				182	map symbolic terminal symbol names into appropriate integer values.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	183	A value of 0 for the second argument is a special flag to the
				184	parser to indicate that the end of input has been reached.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	185	The third argument is the value of the given token. By default,
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	186	the type of the third argument is "void*", but the grammar will
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	187	usually redefine this type to be some kind of structure.
				188	Typically the second argument will be a broad category of tokens
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	189	such as "identifier" or "number" and the third argument will
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	190	be the name of the identifier or the value of the number.</p>
				191
				192	<p>The Parse() function may have either three or four arguments,
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	193	depending on the grammar. If the grammar specification file requests
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	194	it (via the <tt><a href='#extraarg'>%extra_argument</a></tt> directive),
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	195	the Parse() function will have a fourth parameter that can be
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	196	of any type chosen by the programmer. The parser doesn't do anything
				197	with this argument except to pass it through to action routines.
				198	This is a convenient mechanism for passing state information down
				199	to the action routines without having to use global variables.</p>
				200
				201	<p>A typical use of a Lemon parser might look something like the
				202	following:
				203	<pre>
   01 ParseTree *ParseFile(const char *zFilename){
   02    Tokenizer *pTokenizer;
   03    void *pParser;
   04    Token sToken;
   05    int hTokenId;
   06    ParserState sState;
   07
   08    pTokenizer = TokenizerCreate(zFilename);
   09    pParser = ParseAlloc( malloc );
   10    InitParserState(&sState);
   11    while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){
   12       Parse(pParser, hTokenId, sToken, &sState);
   13    }
   14    Parse(pParser, 0, sToken, &sState);
   15    ParseFree(pParser, free );
   16    TokenizerFree(pTokenizer);
   17    return sState.treeRoot;
   18 }
				222	</pre>
				223	This example shows a user-written routine that parses a file of
				224	text and returns a pointer to the parse tree.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	225	(All error-handling code is omitted from this example to keep it
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	226	simple.)
				227	We assume the existence of some kind of tokenizer which is created
				228	using TokenizerCreate() on line 8 and deleted by TokenizerFree()
				229	on line 16. The GetNextToken() function on line 11 retrieves the
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	230	next token from the input file and puts its type in the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	231	integer variable hTokenId. The sToken variable is assumed to be
				232	some kind of structure that contains details about each token,
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	233	such as its complete text, what line it occurs on, etc.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	234
<p>This example also assumes the existence of a structure of type
				236	ParserState that holds state information about a particular parse.
				237	An instance of such a structure is created on line 6 and initialized
				238	on line 10. A pointer to this structure is passed into the Parse()
				239	routine as the optional 4th argument.
				240	The action routine specified by the grammar for the parser can use
				241	the ParserState structure to hold whatever information is useful and
				242	appropriate. In the example, we note that the treeRoot field of
				243	the ParserState structure is left pointing to the root of the parse
				244	tree.</p>
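<p>The tokenizer that the example takes for granted could be sketched
as follows. This is only an illustration under stated assumptions: it
reads from an in-memory string rather than a file (so TokenizerInit()
replaces TokenizerCreate()), the Token and Tokenizer structures are
hypothetical, and the token codes are stand-ins for the values that
Lemon would write into gram.h.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-ins for the #defines that Lemon would write into gram.h.
** The real numeric values are chosen by Lemon, not by the programmer. */
#define PLUS    1
#define TIMES   2
#define LPAREN  3
#define RPAREN  4
#define VALUE   5

/* Hypothetical token payload: the token text and its length. */
typedef struct Token {
  const char *z;   /* First character of the token text */
  int n;           /* Number of characters in the token */
} Token;

/* Hypothetical tokenizer state: a cursor into a NUL-terminated string. */
typedef struct Tokenizer {
  const char *zCur;
} Tokenizer;

void TokenizerInit(Tokenizer *p, const char *zInput){
  p->zCur = zInput;
}

/* Fetch the next token. Returns 1 and fills *pId and *pTok on success.
** Returns 0 at end of input or on an unrecognized character; in the
** latter case p->zCur is left pointing at the offending character. */
int GetNextToken(Tokenizer *p, int *pId, Token *pTok){
  const char *z = p->zCur;
  while( isspace((unsigned char)*z) ) z++;
  if( *z==0 ){ p->zCur = z; return 0; }
  pTok->z = z;
  pTok->n = 1;
  switch( *z ){
    case '+': *pId = PLUS;   break;
    case '*': *pId = TIMES;  break;
    case '(': *pId = LPAREN; break;
    case ')': *pId = RPAREN; break;
    default:
      if( !isdigit((unsigned char)*z) ){ p->zCur = z; return 0; }
      /* Extend the token over the remaining digits of the number. */
      while( isdigit((unsigned char)z[pTok->n]) ) pTok->n++;
      *pId = VALUE;
      break;
  }
  p->zCur = z + pTok->n;
  return 1;
}
```

<p>The return convention (non-zero while tokens remain) matches the
way GetNextToken() is used as a loop condition in the example. Note
that 0 is never used as a token code here, since Parse() reserves 0
to mean end of input.</p>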
				245
				246	<p>The core of this example as it relates to Lemon is as follows:
				247	<pre>
   ParseFile(){
      pParser = ParseAlloc( malloc );
      while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){
         Parse(pParser, hTokenId, sToken);
      }
      Parse(pParser, 0, sToken);
      ParseFree(pParser, free );
   }
				256	</pre>
				257	Basically, what a program has to do to use a Lemon-generated parser
				258	is first create the parser, then send it lots of tokens obtained by
				259	tokenizing an input source. When the end of input is reached, the
				260	Parse() routine should be called one last time with a token type
				261	of 0. This step is necessary to inform the parser that the end of
				262	input has been reached. Finally, we reclaim memory used by the
				263	parser by calling ParseFree().</p>
				264
				265	<p>There is one other interface routine that should be mentioned
				266	before we move on.
				267	The ParseTrace() function can be used to generate debugging output
				268	from the parser. A prototype for this routine is as follows:
				269	<pre>
   ParseTrace(FILE *stream, char *zPrefix);
				271	</pre>
				272	After this routine is called, a short (one-line) message is written
				273	to the designated output stream every time the parser changes states
				274	or calls an action routine. Each such message is prefaced using
				275	the text given by zPrefix. This debugging output can be turned off
				276	by calling ParseTrace() again with a first argument of NULL (0).</p>
				277
				278	<h3>Differences With YACC and BISON</h3>
				279
				280	<p>Programmers who have previously used the yacc or bison parser
				281	generator will notice several important differences between yacc and/or
				282	bison and Lemon.
				283	<ul>
				284	<li>In yacc and bison, the parser calls the tokenizer. In Lemon,
				285	the tokenizer calls the parser.
				286	<li>Lemon uses no global variables. Yacc and bison use global variables
				287	to pass information between the tokenizer and parser.
				288	<li>Lemon allows multiple parsers to be running simultaneously. Yacc
				289	and bison do not.
				290	</ul>
				291	These differences may cause some initial confusion for programmers
				292	with prior yacc and bison experience.
				293	But after years of experience using Lemon, I firmly
				294	believe that the Lemon way of doing things is better.</p>
				295
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	296	<p><i>Updated as of 2016-02-16:</i>
				297	The text above was written in the 1990s.
				298	We are told that Bison has lately been enhanced to support the
				299	tokenizer-calls-parser paradigm used by Lemon, and to obviate the
				300	need for global variables.</p>
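<p>The interface style described in this section (the caller pushes
tokens into the parser, and all state lives in a parser object) can be
illustrated with a small hand-written stand-in. This is not code that
Lemon generates: ToyParser, ToyAlloc(), ToyParse() and ToyFree() are
hypothetical names, and the "grammar" is just VALUE (PLUS VALUE)* with
the values summed. The point is only the shape of the calls.</p>

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_VALUE 1   /* Hypothetical token codes; 0 means end of input */
#define TOY_PLUS  2

/* All parser state lives in this object. There are no globals. */
typedef struct ToyParser {
  int expectValue;  /* 1 if the next token must be TOY_VALUE */
  int isError;      /* Set once an unexpected token is seen */
  int isDone;       /* Set when end of input is accepted */
  long sum;         /* Running total of the VALUE tokens */
} ToyParser;

/* Allocate and initialize a parser using the caller's allocator. */
ToyParser *ToyAlloc(void *(*xMalloc)(size_t)){
  ToyParser *p = xMalloc(sizeof(*p));
  if( p ){
    p->expectValue = 1;
    p->isError = 0;
    p->isDone = 0;
    p->sum = 0;
  }
  return p;
}

/* Release a parser using the caller's deallocator. */
void ToyFree(ToyParser *p, void (*xFree)(void*)){
  if( p ) xFree(p);
}

/* The caller pushes one token per call. Token code 0 marks end of input.
** Tokens arriving after an error or after end of input are ignored. */
void ToyParse(ToyParser *p, int tokenId, long tokenValue){
  if( p->isError || p->isDone ) return;
  if( p->expectValue ){
    if( tokenId==TOY_VALUE ){
      p->sum += tokenValue;
      p->expectValue = 0;
    }else{
      p->isError = 1;
    }
  }else if( tokenId==TOY_PLUS ){
    p->expectValue = 1;
  }else if( tokenId==0 ){
    p->isDone = 1;
  }else{
    p->isError = 1;
  }
}
```

<p>Because nothing is global, two ToyParser objects can be fed
interleaved token streams without interfering with each other, which
is the property the list above attributes to Lemon-generated
parsers.</p>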
				301
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	302	<h2>Input File Syntax</h2>
				303
				304	<p>The main purpose of the grammar specification file for Lemon is
				305	to define the grammar for the parser. But the input file also
				306	specifies additional information Lemon requires to do its job.
				307	Most of the work in using Lemon is in writing an appropriate
				308	grammar file.</p>
				309
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	310	<p>The grammar file for Lemon is, for the most part, free format.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	311	It does not have sections or divisions like yacc or bison. Any
				312	declaration can occur at any point in the file.
				313	Lemon ignores whitespace (except where it is needed to separate
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	314	tokens), and it honors the same commenting conventions as C and C++.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	315
				316	<h3>Terminals and Nonterminals</h3>
				317
				318	<p>A terminal symbol (token) is any string of alphanumeric
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	319	and/or underscore characters
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	320	that begins with an uppercase letter.
drh	c8eee5e	2011-07-30 23:50:12 +0000	[diff] [blame]	321	A terminal can contain lowercase letters after the first character,
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	322	but the usual convention is to make terminals all uppercase.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	323	A nonterminal, on the other hand, is any string of alphanumeric
and underscore characters that begins with a lowercase letter.
				325	Again, the usual convention is to make nonterminals use all lowercase
				326	letters.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	327
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	328	<p>In Lemon, terminal and nonterminal symbols do not need to
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	329	be declared or identified in a separate section of the grammar file.
				330	Lemon is able to generate a list of all terminals and nonterminals
				331	by examining the grammar rules, and it can always distinguish a
				332	terminal from a nonterminal by checking the case of the first
				333	character of the name.</p>
				334
				335	<p>Yacc and bison allow terminal symbols to have either alphanumeric
				336	names or to be individual characters included in single quotes, like
				337	this: ')' or '$'. Lemon does not allow this alternative form for
				338	terminal symbols. With Lemon, all symbols, terminals and nonterminals,
				339	must have alphanumeric names.</p>
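<p>The naming rule can be written down as a short C function. This is
a sketch of the rule as stated in this section, not code taken from
Lemon itself; SymKind and ClassifySymbol() are hypothetical names.</p>

```c
#include <assert.h>
#include <ctype.h>

typedef enum { SYM_INVALID, SYM_TERMINAL, SYM_NONTERMINAL } SymKind;

/* Classify a symbol name by the case of its first character.
** The name must consist only of alphanumerics and underscores. */
SymKind ClassifySymbol(const char *zName){
  const unsigned char *z = (const unsigned char*)zName;
  SymKind kind;
  if( isupper(z[0]) ){
    kind = SYM_TERMINAL;
  }else if( islower(z[0]) ){
    kind = SYM_NONTERMINAL;
  }else{
    return SYM_INVALID;      /* Empty, or starts with a digit or punctuation */
  }
  for(z++; *z; z++){
    if( !isalnum(*z) && *z!='_' ) return SYM_INVALID;
  }
  return kind;
}
```

<p>For example, ClassifySymbol("PLUS") yields SYM_TERMINAL and
ClassifySymbol("expr") yields SYM_NONTERMINAL, while a yacc-style
quoted character such as "')'" is rejected.</p>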
				340
				341	<h3>Grammar Rules</h3>
				342
				343	<p>The main component of a Lemon grammar file is a sequence of grammar
				344	rules.
				345	Each grammar rule consists of a nonterminal symbol followed by
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	346	the special symbol "::=" and then a list of terminals and/or nonterminals.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	347	The rule is terminated by a period.
				348	The list of terminals and nonterminals on the right-hand side of the
				349	rule can be empty.
				350	Rules can occur in any order, except that the left-hand side of the
				351	first rule is assumed to be the start symbol for the grammar (unless
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	352	specified otherwise using the <tt><a href='#start_symbol'>%start_symbol</a></tt>
				353	directive described below.)
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	354	A typical sequence of grammar rules might look something like this:
				355	<pre>
				356	expr ::= expr PLUS expr.
				357	expr ::= expr TIMES expr.
				358	expr ::= LPAREN expr RPAREN.
				359	expr ::= VALUE.
				360	</pre>
				361	</p>
				362
<p>There is one nonterminal in this example, "expr", and five
				364	terminal symbols or tokens: "PLUS", "TIMES", "LPAREN",
				365	"RPAREN" and "VALUE".</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	366
				367	<p>Like yacc and bison, Lemon allows the grammar to specify a block
				368	of C code that will be executed whenever a grammar rule is reduced
				369	by the parser.
				370	In Lemon, this action is specified by putting the C code (contained
				371	within curly braces <tt>{...}</tt>) immediately after the
				372	period that closes the rule.
				373	For example:
				374	<pre>
				375	expr ::= expr PLUS expr. { printf("Doing an addition...\n"); }
				376	</pre>
				377	</p>
				378
				379	<p>In order to be useful, grammar actions must normally be linked to
				380	their associated grammar rules.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	381	In yacc and bison, this is accomplished by embedding a "$$" in the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	382	action to stand for the value of the left-hand side of the rule and
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	383	symbols "$1", "$2", and so forth to stand for the value of
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	384	the terminal or nonterminal at position 1, 2 and so forth on the
				385	right-hand side of the rule.
				386	This idea is very powerful, but it is also very error-prone. The
				387	single most common source of errors in a yacc or bison grammar is
				388	to miscount the number of symbols on the right-hand side of a grammar
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	389	rule and say "$7" when you really mean "$8".</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	390
				391	<p>Lemon avoids the need to count grammar symbols by assigning symbolic
				392	names to each symbol in a grammar rule and then using those symbolic
				393	names in the action.
				394	In yacc or bison, one would write this:
				395	<pre>
  expr : expr PLUS expr   { $$ = $1 + $3; };
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	397	</pre>
				398	But in Lemon, the same rule becomes the following:
				399	<pre>
				400	expr(A) ::= expr(B) PLUS expr(C). { A = B+C; }
				401	</pre>
				402	In the Lemon rule, any symbol in parentheses after a grammar rule
				403	symbol becomes a place holder for that symbol in the grammar rule.
				404	This place holder can then be used in the associated C action to
stand for the value of that symbol.</p>
				406
				407	<p>The Lemon notation for linking a grammar rule with its reduce
				408	action is superior to yacc/bison on several counts.
				409	First, as mentioned above, the Lemon method avoids the need to
				410	count grammar symbols.
				411	Secondly, if a terminal or nonterminal in a Lemon grammar rule
				412	includes a linking symbol in parentheses but that linking symbol
				413	is not actually used in the reduce action, then an error message
				414	is generated.
				415	For example, the rule
				416	<pre>
				417	expr(A) ::= expr(B) PLUS expr(C). { A = B; }
				418	</pre>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	419	will generate an error because the linking symbol "C" is used
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	420	in the grammar rule but not in the reduce action.</p>
				421
				422	<p>The Lemon notation for linking grammar rules to reduce actions
				423	also facilitates the use of destructors for reclaiming memory
				424	allocated by the values of terminals and nonterminals on the
				425	right-hand side of a rule.</p>
				426
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	427	<a name='precrules'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	428	<h3>Precedence Rules</h3>
				429
				430	<p>Lemon resolves parsing ambiguities in exactly the same way as
				431	yacc and bison. A shift-reduce conflict is resolved in favor
				432	of the shift, and a reduce-reduce conflict is resolved by reducing
				433	whichever rule comes first in the grammar file.</p>
				434
				435	<p>Just like in
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	436	yacc and bison, Lemon allows a measure of control
				437	over the resolution of parsing conflicts using precedence rules.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	438	A precedence value can be assigned to any terminal symbol
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	439	using the
				440	<tt><a href='#pleft'>%left</a></tt>,
				441	<tt><a href='#pright'>%right</a></tt> or
				442	<tt><a href='#pnonassoc'>%nonassoc</a></tt> directives. Terminal symbols
				443	mentioned in earlier directives have a lower precedence than
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	444	terminal symbols mentioned in later directives. For example:</p>
				445
				446	<p><pre>
				447	%left AND.
				448	%left OR.
				449	%nonassoc EQ NE GT GE LT LE.
				450	%left PLUS MINUS.
				451	%left TIMES DIVIDE MOD.
				452	%right EXP NOT.
				453	</pre></p>
				454
				455	<p>In the preceding sequence of directives, the AND operator is
				456	defined to have the lowest precedence. The OR operator is one
				457	precedence level higher. And so forth. Hence, the grammar would
				458	attempt to group the ambiguous expression
				459	<pre>
				460	a AND b OR c
				461	</pre>
				462	like this
				463	<pre>
				464	a AND (b OR c).
				465	</pre>
				466	The associativity (left, right or nonassoc) is used to determine
				467	the grouping when the precedence is the same. AND is left-associative
				468	in our example, so
				469	<pre>
				470	a AND b AND c
				471	</pre>
				472	is parsed like this
				473	<pre>
				474	(a AND b) AND c.
				475	</pre>
				476	The EXP operator is right-associative, though, so
				477	<pre>
				478	a EXP b EXP c
				479	</pre>
				480	is parsed like this
				481	<pre>
				482	a EXP (b EXP c).
				483	</pre>
				484	The nonassoc precedence is used for non-associative operators.
				485	So
				486	<pre>
				487	a EQ b EQ c
				488	</pre>
				489	is an error.</p>
				490
<p>The precedence of terminals is transferred to rules as follows:
				492	The precedence of a grammar rule is equal to the precedence of the
				493	left-most terminal symbol in the rule for which a precedence is
				494	defined. This is normally what you want, but in those cases where
drh	ed5e668	2020-03-09 01:02:45 +0000	[diff] [blame]	495	you want the precedence of a grammar rule to be something different,
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	496	you can specify an alternative precedence symbol by putting the
symbol in square brackets after the period at the end of the rule and
				498	before any C-code. For example:</p>
				499
				500	<p><pre>
   expr ::= MINUS expr.  [NOT]
				502	</pre></p>
				503
				504	<p>This rule has a precedence equal to that of the NOT symbol, not the
				505	MINUS symbol as would have been the case by default.</p>
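<p>The way a rule obtains its precedence can be sketched in C. The Sym
structure, the use of -1 to mean "no precedence declared", and the
function RulePrecedence() are assumptions made for this illustration;
they are not Lemon's internal data structures.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical symbol record. Nonterminals, and terminals that appear
** in no %left, %right or %nonassoc directive, carry prec = -1. */
typedef struct Sym {
  const char *zName;
  int prec;          /* Precedence level, or -1 if none was declared */
} Sym;

/* Return the precedence of a rule.
**
** If pPrecSym is not NULL it plays the role of the symbol written in
** square brackets after the rule, and its precedence is used.
** Otherwise the result is the precedence of the left-most right-hand
** side symbol that has one, or -1 if no such symbol exists. */
int RulePrecedence(const Sym *const *apRhs, int nRhs, const Sym *pPrecSym){
  int i;
  if( pPrecSym ) return pPrecSym->prec;
  for(i=0; i<nRhs; i++){
    if( apRhs[i]->prec>=0 ) return apRhs[i]->prec;
  }
  return -1;
}
```

<p>With MINUS at level 4 and NOT at level 6 (levels numbered in the
order of the directives in the earlier example), the rule
"expr ::= MINUS expr." gets precedence 4 by default and 6 when
"[NOT]" is appended.</p>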
				506
				507	<p>With the knowledge of how precedence is assigned to terminal
				508	symbols and individual
				509	grammar rules, we can now explain precisely how parsing conflicts
				510	are resolved in Lemon. Shift-reduce conflicts are resolved
				511	as follows:
				512	<ul>
				513	<li> If either the token to be shifted or the rule to be reduced
				514	lacks precedence information, then resolve in favor of the
				515	shift, but report a parsing conflict.
				516	<li> If the precedence of the token to be shifted is greater than
				517	the precedence of the rule to reduce, then resolve in favor
				518	of the shift. No parsing conflict is reported.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	519	<li> If the precedence of the token to be shifted is less than the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	520	precedence of the rule to reduce, then resolve in favor of the
				521	reduce action. No parsing conflict is reported.
				522	<li> If the precedences are the same and the shift token is
				523	right-associative, then resolve in favor of the shift.
				524	No parsing conflict is reported.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	525	<li> If the precedences are the same and the shift token is
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	526	left-associative, then resolve in favor of the reduce.
				527	No parsing conflict is reported.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	528	<li> Otherwise, resolve the conflict by doing the shift, and
				529	report a parsing conflict.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	530	</ul>
				531	Reduce-reduce conflicts are resolved this way:
				532	<ul>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	533	<li> If either reduce rule
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	534	lacks precedence information, then resolve in favor of the
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	535	rule that appears first in the grammar, and report a parsing
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	536	conflict.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	537	<li> If both rules have precedence and the precedence is different,
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	538	then resolve the dispute in favor of the rule with the highest
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	539	precedence, and do not report a conflict.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	540	<li> Otherwise, resolve the conflict by reducing by the rule that
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	541	appears first in the grammar, and report a parsing conflict.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	542	</ul>
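<p>The two lists above can be transcribed into C as a pair of decision
functions. This sketch follows the bullets literally and is not taken
from lemon.c, whose actual logic may differ in detail (for instance in
how non-associative tokens are treated). All type and function names
are hypothetical, and a precedence of -1 is assumed to mean "no
precedence information".</p>

```c
#include <assert.h>

typedef enum { ASSOC_NONE, ASSOC_LEFT, ASSOC_RIGHT } Assoc;
typedef enum { DO_SHIFT, DO_REDUCE } Choice;

typedef struct Resolution {
  Choice choice;    /* Which action wins */
  int isConflict;   /* 1 if a parsing conflict is reported */
} Resolution;

/* Shift-reduce: tokenPrec and tokenAssoc describe the token to be
** shifted, rulePrec the rule to be reduced. */
Resolution ResolveShiftReduce(int tokenPrec, Assoc tokenAssoc, int rulePrec){
  Resolution r = { DO_SHIFT, 0 };
  if( tokenPrec<0 || rulePrec<0 ){
    r.isConflict = 1;              /* Missing precedence: shift, report */
  }else if( tokenPrec>rulePrec ){
    r.choice = DO_SHIFT;           /* Token binds tighter: shift */
  }else if( tokenPrec<rulePrec ){
    r.choice = DO_REDUCE;          /* Rule binds tighter: reduce */
  }else if( tokenAssoc==ASSOC_RIGHT ){
    r.choice = DO_SHIFT;           /* Equal, right-associative: shift */
  }else if( tokenAssoc==ASSOC_LEFT ){
    r.choice = DO_REDUCE;          /* Equal, left-associative: reduce */
  }else{
    r.isConflict = 1;              /* "Otherwise": shift, report */
  }
  return r;
}

/* Reduce-reduce: precFirst belongs to the rule that appears first in
** the grammar. Returns 0 if the first rule wins, 1 if the second rule
** wins, and sets *pIsConflict to 1 when a conflict is reported. */
int ResolveReduceReduce(int precFirst, int precSecond, int *pIsConflict){
  if( precFirst>=0 && precSecond>=0 && precFirst!=precSecond ){
    *pIsConflict = 0;
    return precSecond>precFirst;
  }
  *pIsConflict = 1;
  return 0;
}
```

<p>Using the levels from the earlier directive example (PLUS at 4,
TIMES at 5), seeing PLUS after "expr PLUS expr" resolves to a reduce,
which yields the grouping (a PLUS b) PLUS c, while seeing TIMES in the
same position resolves to a shift, which yields a PLUS (b TIMES c).</p>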
				543
				544	<h3>Special Directives</h3>
				545
				546	<p>The input grammar to Lemon consists of grammar rules and special
				547	directives. We've described all the grammar rules, so now we'll
				548	talk about the special directives.</p>
				549
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	550	<p>Directives in Lemon can occur in any order. You can put them before
				551	the grammar rules, or after the grammar rules, or in the midst of the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	552	grammar rules. It doesn't matter. The relative order of
				553	directives used to assign precedence to terminals is important, but
				554	other than that, the order of directives in Lemon is arbitrary.</p>
				555
				556	<p>Lemon supports the following special directives:
				557	<ul>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	558	<li><tt><a href='#pcode'>%code</a></tt>
				559	<li><tt><a href='#default_destructor'>%default_destructor</a></tt>
				560	<li><tt><a href='#default_type'>%default_type</a></tt>
				561	<li><tt><a href='#destructor'>%destructor</a></tt>
drh	5f0d37b	2020-07-03 18:07:22 +0000	[diff] [blame]	562	<li><tt><a href='#pifdef'>%else</a></tt>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	563	<li><tt><a href='#pifdef'>%endif</a></tt>
				564	<li><tt><a href='#extraarg'>%extra_argument</a></tt>
<li><tt><a href='#extractx'>%extra_context</a></tt>
				565	<li><tt><a href='#pfallback'>%fallback</a></tt>
drh	5f0d37b	2020-07-03 18:07:22 +0000	[diff] [blame]	566	<li><tt><a href='#pifdef'>%if</a></tt>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	567	<li><tt><a href='#pifdef'>%ifdef</a></tt>
				568	<li><tt><a href='#pifdef'>%ifndef</a></tt>
				569	<li><tt><a href='#pinclude'>%include</a></tt>
				570	<li><tt><a href='#pleft'>%left</a></tt>
				571	<li><tt><a href='#pname'>%name</a></tt>
				572	<li><tt><a href='#pnonassoc'>%nonassoc</a></tt>
				573	<li><tt><a href='#parse_accept'>%parse_accept</a></tt>
				574	<li><tt><a href='#parse_failure'>%parse_failure</a></tt>
				575	<li><tt><a href='#pright'>%right</a></tt>
				576	<li><tt><a href='#stack_overflow'>%stack_overflow</a></tt>
				577	<li><tt><a href='#stack_size'>%stack_size</a></tt>
				578	<li><tt><a href='#start_symbol'>%start_symbol</a></tt>
				579	<li><tt><a href='#syntax_error'>%syntax_error</a></tt>
				580	<li><tt><a href='#token_class'>%token_class</a></tt>
				581	<li><tt><a href='#token_destructor'>%token_destructor</a></tt>
				582	<li><tt><a href='#token_prefix'>%token_prefix</a></tt>
				583	<li><tt><a href='#token_type'>%token_type</a></tt>
				584	<li><tt><a href='#ptype'>%type</a></tt>
				585	<li><tt><a href='#pwildcard'>%wildcard</a></tt>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	586	</ul>
				587	Each of these directives will be described separately in the
				588	following sections:</p>
				589
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	590	<a name='pcode'></a>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	591	<h4>The <tt>%code</tt> directive</h4>
				592
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	593	<p>The <tt>%code</tt> directive is used to specify additional C code that
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	594	is added to the end of the main output file. This is similar to
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	595	the <tt><a href='#pinclude'>%include</a></tt> directive except that
				596	<tt>%include</tt> is inserted at the beginning of the main output file.</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	597
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	598	<p><tt>%code</tt> is typically used to include some action routines or perhaps
				599	a tokenizer or even the "main()" function
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	600	as part of the output file.</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	601
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	602	<a name='default_destructor'></a>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	603	<h4>The <tt>%default_destructor</tt> directive</h4>
				604
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	605	<p>The <tt>%default_destructor</tt> directive specifies a destructor to
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	606	use for non-terminals that do not have their own destructor
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	607	specified by a separate <tt>%destructor</tt> directive. See the documentation
on the <tt><a href='#destructor'>%destructor</a></tt> directive below for
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	609	additional information.</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	610
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	611	<p>In some grammars, many different non-terminal symbols have the
				612	same data type and hence the same destructor. This directive is
				613	a convenient way to specify the same destructor for all those
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	614	non-terminals using a single statement.</p>
				615
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	616	<a name='default_type'></a>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	617	<h4>The <tt>%default_type</tt> directive</h4>
				618
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	619	<p>The <tt>%default_type</tt> directive specifies the data type of non-terminal
				620	symbols that do not have their own data type defined using a separate
				621	<tt><a href='#ptype'>%type</a></tt> directive.</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	622
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	623	<a name='destructor'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	624	<h4>The <tt>%destructor</tt> directive</h4>
				625
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	626	<p>The <tt>%destructor</tt> directive is used to specify a destructor for
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	627	a non-terminal symbol.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	628	(See also the <tt><a href='#token_destructor'>%token_destructor</a></tt>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	629	directive which is used to specify a destructor for terminal symbols.)</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	630
				631	<p>A non-terminal's destructor is called to dispose of the
				632	non-terminal's value whenever the non-terminal is popped from
				633	the stack. This includes all of the following circumstances:
				634	<ul>
				635	<li> When a rule reduces and the value of a non-terminal on
				636	the right-hand side is not linked to C code.
				637	<li> When the stack is popped during error processing.
				638	<li> When the ParseFree() function runs.
				639	</ul>
				640	The destructor can do whatever it wants with the value of
				641	the non-terminal, but its design is to deallocate memory
				642	or other resources held by that non-terminal.</p>
				643
				644	<p>Consider an example:
				645	<pre>
				646	%type nt {void*}
				647	%destructor nt { free($$); }
				648	nt(A) ::= ID NUM. { A = malloc( 100 ); }
				649	</pre>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	650	This example is a bit contrived, but it serves to illustrate how
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	651	destructors work. The example shows a non-terminal named
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	652	"nt" that holds values of type "void*". When the rule for
				653	an "nt" reduces, it sets the value of the non-terminal to
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	654	space obtained from malloc(). Later, when the nt non-terminal
				655	is popped from the stack, the destructor will fire and call
				656	free() on this malloced space, thus avoiding a memory leak.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	657	(Note that the symbol "$$" in the destructor code is replaced
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	658	by the value of the non-terminal.)</p>
				659
				660	<p>It is important to note that the value of a non-terminal is passed
				661	to the destructor whenever the non-terminal is removed from the
				662	stack, unless the non-terminal is used in a C-code action. If
				663	the non-terminal is used by C-code, then it is assumed that the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	664	C-code will take care of destroying it.
				665	More commonly, the value is used to build some
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	666	larger structure, and we don't want to destroy it, which is why
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	667	the destructor is not called in this circumstance.</p>
				668
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	669	<p>Destructors help avoid memory leaks by automatically freeing
				670	allocated objects when they go out of scope.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	671	To do the same using yacc or bison is much more difficult.</p>
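<p>The ownership rule described above can be modeled in a few lines of C.
This is a conceptual sketch only: <tt>StackEntry</tt>, <tt>pop_entry()</tt>,
and the counter are hypothetical names, and the code that Lemon actually
generates is organized differently.</p>

```c
#include <stdlib.h>

/* Hypothetical model of one parser stack entry. */
typedef struct StackEntry {
  void *value;                   /* value of the non-terminal */
  void (*destructor)(void*);     /* code from %destructor, or NULL */
} StackEntry;

static int n_destroyed = 0;      /* counts destructor calls (demo only) */

/* Plays the role of:   %destructor nt { free($$); }   */
static void nt_destructor(void *p){
  free(p);
  n_destroyed++;
}

/* Pop one entry.  If the value was linked to C code in the reducing
** rule, that code is assumed to own the value now, so the destructor
** is skipped.  Otherwise the destructor disposes of the value. */
static void pop_entry(StackEntry *e, int linked_to_c_code){
  if( !linked_to_c_code && e->destructor ){
    e->destructor(e->value);
  }
  e->value = NULL;
}
```

<p>Calling <tt>pop_entry(&amp;e, 0)</tt> frees the value, as happens during
error recovery or in ParseFree(). Calling <tt>pop_entry(&amp;e, 1)</tt>
leaves the value alone, so the action code must free it or store it in
some larger structure.</p>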
				672
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	673	<a name='extraarg'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	674	<h4>The <tt>%extra_argument</tt> directive</h4>
				675
<p>The <tt>%extra_argument</tt> directive instructs Lemon to add a 4th parameter
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	677	to the parameter list of the Parse() function it generates. Lemon
				678	doesn't do anything itself with this extra argument, but it does
				679	make the argument available to C-code action routines, destructors,
				680	and so forth. For example, if the grammar file contains:</p>
				681
				682	<p><pre>
				683	%extra_argument { MyStruct *pAbc }
				684	</pre></p>
				685
<p>Then the Parse() function generated will have a 4th parameter
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	687	of type "MyStruct*" and all action routines will have access to
				688	a variable named "pAbc" that is the value of the 4th parameter
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	689	in the most recent call to Parse().</p>
				690
<p>The <tt><a href='#extractx'>%extra_context</a></tt> directive works the same except that it
is passed in on the ParseAlloc() or ParseInit() routines instead of
on Parse().</p>
				694
				695	<a name='extractx'></a>
				696	<h4>The <tt>%extra_context</tt> directive</h4>
				697
<p>The <tt>%extra_context</tt> directive instructs Lemon to add a 2nd parameter
to the parameter list of the ParseAlloc() and ParseInit() functions. Lemon
doesn't do anything itself with this extra argument, but it does
store the value and make it available to C-code action routines, destructors,
and so forth. For example, if the grammar file contains:</p>
				703
				704	<p><pre>
				705	%extra_context { MyStruct *pAbc }
				706	</pre></p>
				707
<p>Then the ParseAlloc() and ParseInit() functions will have a 2nd parameter
drh	fb32c44	2018-04-21 13:51:42 +0000	[diff] [blame]	709	of type "MyStruct*" and all action routines will have access to
drh	ed5e668	2020-03-09 01:02:45 +0000	[diff] [blame]	710	a variable named "pAbc" that is the value of that 2nd parameter.</p>
drh	fb32c44	2018-04-21 13:51:42 +0000	[diff] [blame]	711
<p>The <tt><a href='#extraarg'>%extra_argument</a></tt> directive works the same except that it
is passed in on the Parse() routine instead of on ParseAlloc()/ParseInit().</p>
				714
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	715	<a name='pfallback'></a>
				716	<h4>The <tt>%fallback</tt> directive</h4>
				717
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	718	<p>The <tt>%fallback</tt> directive specifies an alternative meaning for one
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	719	or more tokens. The alternative meaning is tried if the original token
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	720	would have generated a syntax error.</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	721
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	722	<p>The <tt>%fallback</tt> directive was added to support robust parsing of SQL
				723	syntax in <a href='https://www.sqlite.org/'>SQLite</a>.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	724	The SQL language contains a large assortment of keywords, each of which
				725	appears as a different token to the language parser. SQL contains so
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	726	many keywords that it can be difficult for programmers to keep up with
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	727	them all. Programmers will, therefore, sometimes mistakenly use an
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	728	obscure language keyword for an identifier. The <tt>%fallback</tt> directive
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	729	provides a mechanism to tell the parser: "If you are unable to parse
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	730	this keyword, try treating it as an identifier instead."</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	731
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	732	<p>The syntax of <tt>%fallback</tt> is as follows:
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	733
				734	<blockquote>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	735	<tt>%fallback</tt> <i>ID</i> <i>TOKEN...</i> <b>.</b>
				736	</blockquote></p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	737
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	738	<p>In words, the <tt>%fallback</tt> directive is followed by a list of token
				739	names terminated by a period.
				740	The first token name is the fallback token — the
token to which all the other tokens fall back. The second and subsequent
				742	arguments are tokens which fall back to the token identified by the first
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	743	argument.</p>
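<p>As a rough model of what a fallback does at parse time, consider the
following C sketch. The token codes, the <tt>fallback[]</tt> table, and the
<tt>is_acceptable()</tt> stand-in for the parser's action table are all
hypothetical. The table corresponds to a directive such as
"<tt>%fallback ID ABORT KEY.</tt>".</p>

```c
/* Hypothetical token codes; a real parser takes these from the header
** file that Lemon generates. */
enum { TK_ID = 1, TK_SELECT = 2, TK_ABORT = 3, TK_KEY = 4, TK_MAX = 5 };

/* fallback[t] is the token that t falls back to, or 0 if it has none. */
static const int fallback[TK_MAX] = { 0, 0, 0, TK_ID, TK_ID };

/* Toy stand-in for the action table: in the current state only an
** identifier or SELECT can be parsed. */
static int is_acceptable(int token){
  return token==TK_ID || token==TK_SELECT;
}

/* Return the token code the parser ends up using, or 0 if the input
** is a syntax error even after trying the fallback. */
static int apply_fallback(int token){
  if( is_acceptable(token) ) return token;        /* original meaning works */
  if( token>0 && token<TK_MAX && fallback[token]
   && is_acceptable(fallback[token]) ){
    return fallback[token];                       /* alternative meaning */
  }
  return 0;                                       /* genuine syntax error */
}
```

<p>The point to notice is the order of the tests: the original token is
always tried first, and the fallback is consulted only when the original
would have been an error.</p>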
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	744
				745	<a name='pifdef'></a>
drh	5f0d37b	2020-07-03 18:07:22 +0000	[diff] [blame]	746	<h4>The <tt>%if</tt> directive and its friends</h4>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	747
drh	5f0d37b	2020-07-03 18:07:22 +0000	[diff] [blame]	748	<p>The <tt>%if</tt>, <tt>%ifdef</tt>, <tt>%ifndef</tt>, <tt>%else</tt>,
				749	and <tt>%endif</tt> directives
				750	are similar to #if, #ifdef, #ifndef, #else, and #endif in the C-preprocessor,
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	751	just not as general.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	752	Each of these directives must begin at the left margin. No whitespace
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	753	is allowed between the "%" and the directive name.</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	754
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	755	<p>Grammar text in between "<tt>%ifdef MACRO</tt>" and the next nested
				756	"<tt>%endif</tt>" is
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	757	ignored unless the "-DMACRO" command-line option is used. Grammar text
between "<tt>%ifndef MACRO</tt>" and the next nested "<tt>%endif</tt>" is
included except when the "-DMACRO" command-line option is used.</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	760
drh	5f0d37b	2020-07-03 18:07:22 +0000	[diff] [blame]	761	<p>The text in between "<tt>%if</tt> <i>CONDITIONAL</i>" and its
				762	corresponding <tt>%endif</tt> is included only if <i>CONDITIONAL</i>
is true. The <i>CONDITIONAL</i> is one or more macro names, optionally connected
using the "||" and "&&" binary operators, the "!" unary operator,
				765	and grouped using balanced parentheses. Each term is true if the
				766	corresponding macro exists, and false if it does not exist.</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	767
drh	5f0d37b	2020-07-03 18:07:22 +0000	[diff] [blame]	768	<p>An optional "<tt>%else</tt>" directive can occur anywhere in between a
				769	<tt>%ifdef</tt>, <tt>%ifndef</tt>, or <tt>%if</tt> directive and
				770	its corresponding <tt>%endif</tt>.</p>
				771
				772	<p>Note that the argument to <tt>%ifdef</tt> and <tt>%ifndef</tt> is
				773	intended to be a single preprocessor symbol name, not a general expression.
				774	Use the "<tt>%if</tt>" directive for general expressions.</p>
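<p>To make the meaning of a <tt>%if</tt> conditional concrete, here is a small
recursive-descent evaluator written for this document. It is not Lemon's
own code. The macro list and all function names are hypothetical, and the
sketch assumes C-like precedence ("!" binds tightest, then "&&", then
"||"), which the text above does not spell out.</p>

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical set of macros defined with -D on the command line. */
static const char *const defined_macros[] = { "DEBUG", "FEATURE_A", NULL };

static int is_defined(const char *name, size_t n){
  int i;
  for(i=0; defined_macros[i]; i++){
    if( strlen(defined_macros[i])==n
     && strncmp(defined_macros[i], name, n)==0 ) return 1;
  }
  return 0;
}

static int eval_or(const char **pz);

static void skip_space(const char **pz){
  while( isspace((unsigned char)**pz) ) (*pz)++;
}

/* primary := NAME | '!' primary | '(' or-expr ')' */
static int eval_primary(const char **pz){
  const char *z;
  int v;
  skip_space(pz);
  z = *pz;
  if( *z=='!' ){ *pz = z+1; return !eval_primary(pz); }
  if( *z=='(' ){
    *pz = z+1;
    v = eval_or(pz);
    skip_space(pz);
    if( **pz==')' ) (*pz)++;
    return v;
  }
  while( isalnum((unsigned char)*z) || *z=='_' ) z++;
  v = is_defined(*pz, (size_t)(z - *pz));  /* true if the macro exists */
  *pz = z;
  return v;
}

/* and-expr := primary { "&&" primary } */
static int eval_and(const char **pz){
  int v = eval_primary(pz);
  for(;;){
    int rhs;
    skip_space(pz);
    if( (*pz)[0]!='&' || (*pz)[1]!='&' ) return v;
    *pz += 2;
    rhs = eval_primary(pz);   /* always parse the right-hand side */
    v = v && rhs;
  }
}

/* or-expr := and-expr { "||" and-expr } */
static int eval_or(const char **pz){
  int v = eval_and(pz);
  for(;;){
    int rhs;
    skip_space(pz);
    if( (*pz)[0]!='|' || (*pz)[1]!='|' ) return v;
    *pz += 2;
    rhs = eval_and(pz);
    v = v || rhs;
  }
}

/* Evaluate the text that follows "%if". */
static int eval_condition(const char *z){
  return eval_or(&z);
}
```

<p>With only DEBUG and FEATURE_A defined, the conditional
"<tt>!(MISSING || OTHER) && DEBUG</tt>" is true, so the grammar text up to
the matching <tt>%endif</tt> would be included.</p>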
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	775
				776	<a name='pinclude'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	777	<h4>The <tt>%include</tt> directive</h4>
				778
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	779	<p>The <tt>%include</tt> directive specifies C code that is included at the
				780	top of the generated parser. You can include any text you want —
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	781	the Lemon parser generator copies it blindly. If you have multiple
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	782	<tt>%include</tt> directives in your grammar file, their values are concatenated
				783	so that all <tt>%include</tt> code ultimately appears near the top of the
				784	generated parser, in the same order as it appeared in the grammar.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	785
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	786	<p>The <tt>%include</tt> directive is very handy for getting some extra #include
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	787	preprocessor statements at the beginning of the generated parser.
				788	For example:</p>
				789
				790	<p><pre>
%include {#include &lt;unistd.h&gt;}
				792	</pre></p>
				793
				794	<p>This might be needed, for example, if some of the C actions in the
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	795	grammar call functions that are prototyped in unistd.h.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	796
drh	60ce5d3	2018-11-27 14:34:33 +0000	[diff] [blame]	797	<p>Use the <tt><a href="#pcode">%code</a></tt> directive to add code to
				798	the end of the generated parser.</p>
				799
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	800	<a name='pleft'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	801	<h4>The <tt>%left</tt> directive</h4>
				802
<p>The <tt>%left</tt> directive is used (along with the
				804	<tt><a href='#pright'>%right</a></tt> and
				805	<tt><a href='#pnonassoc'>%nonassoc</a></tt> directives) to declare
				806	precedences of terminal symbols.
				807	Every terminal symbol whose name appears after
				808	a <tt>%left</tt> directive but before the next period (".") is
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	809	given the same left-associative precedence value. Subsequent
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	810	<tt>%left</tt> directives have higher precedence. For example:</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	811
				812	<p><pre>
				813	%left AND.
				814	%left OR.
				815	%nonassoc EQ NE GT GE LT LE.
				816	%left PLUS MINUS.
				817	%left TIMES DIVIDE MOD.
				818	%right EXP NOT.
				819	</pre></p>
				820
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	821	<p>Note the period that terminates each <tt>%left</tt>,
				822	<tt>%right</tt> or <tt>%nonassoc</tt>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	823	directive.</p>
				824
				825	<p>LALR(1) grammars can get into a situation where they require
a large amount of stack space if you make heavy use of right-associative
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	827	operators. For this reason, it is recommended that you use <tt>%left</tt>
				828	rather than <tt>%right</tt> whenever possible.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	829
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	830	<a name='pname'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	831	<h4>The <tt>%name</tt> directive</h4>
				832
				833	<p>By default, the functions generated by Lemon all begin with the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	834	five-character string "Parse". You can change this string to something
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	835	different using the <tt>%name</tt> directive. For instance:</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	836
				837	<p><pre>
				838	%name Abcde
				839	</pre></p>
				840
				841	<p>Putting this directive in the grammar file will cause Lemon to generate
				842	functions named
				843	<ul>
				844	<li> AbcdeAlloc(),
				845	<li> AbcdeFree(),
				846	<li> AbcdeTrace(), and
				847	<li> Abcde().
				848	</ul>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	849	The <tt>%name</tt> directive allows you to generate two or more different
				850	parsers and link them all into the same executable.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	851
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	852	<a name='pnonassoc'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	853	<h4>The <tt>%nonassoc</tt> directive</h4>
				854
				855	<p>This directive is used to assign non-associative precedence to
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	856	one or more terminal symbols. See the section on
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	857	<a href='#precrules'>precedence rules</a>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	858	or on the <tt><a href='#pleft'>%left</a></tt> directive
				859	for additional information.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	860
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	861	<a name='parse_accept'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	862	<h4>The <tt>%parse_accept</tt> directive</h4>
				863
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	864	<p>The <tt>%parse_accept</tt> directive specifies a block of C code that is
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	865	executed whenever the parser accepts its input string. To "accept"
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	866	an input string means that the parser was able to process all tokens
				867	without error.</p>
				868
				869	<p>For example:</p>
				870
				871	<p><pre>
				872	%parse_accept {
				873	printf("parsing complete!\n");
				874	}
				875	</pre></p>
				876
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	877	<a name='parse_failure'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	878	<h4>The <tt>%parse_failure</tt> directive</h4>
				879
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	880	<p>The <tt>%parse_failure</tt> directive specifies a block of C code that
is executed whenever the parser fails completely. This code is not
executed until the parser has tried and failed to resolve an input
error using its usual error recovery strategy. The routine is
				884	only invoked when parsing is unable to continue.</p>
				885
				886	<p><pre>
				887	%parse_failure {
				888	fprintf(stderr,"Giving up. Parser is hopelessly lost...\n");
				889	}
				890	</pre></p>
				891
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	892	<a name='pright'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	893	<h4>The <tt>%right</tt> directive</h4>
				894
				895	<p>This directive is used to assign right-associative precedence to
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	896	one or more terminal symbols. See the section on
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	897	<a href='#precrules'>precedence rules</a>
or on the <tt><a href='#pleft'>%left</a></tt> directive for additional information.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	899
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	900	<a name='stack_overflow'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	901	<h4>The <tt>%stack_overflow</tt> directive</h4>
				902
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	903	<p>The <tt>%stack_overflow</tt> directive specifies a block of C code that
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	904	is executed if the parser's internal stack ever overflows. Typically
				905	this just prints an error message. After a stack overflow, the parser
				906	will be unable to continue and must be reset.</p>
				907
				908	<p><pre>
				909	%stack_overflow {
				910	fprintf(stderr,"Giving up. Parser stack overflow\n");
				911	}
				912	</pre></p>
				913
				914	<p>You can help prevent parser stack overflows by avoiding the use
				915	of right recursion and right-precedence operators in your grammar.
Use left recursion and left-precedence operators instead to
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	917	encourage rules to reduce sooner and keep the stack size down.
				918	For example, do rules like this:
				919	<pre>
				920	list ::= list element. // left-recursion. Good!
				921	list ::= .
				922	</pre>
				923	Not like this:
				924	<pre>
				925	list ::= element list. // right-recursion. Bad!
				926	list ::= .
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	927	</pre></p>
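<p>The difference between the two forms can be seen by counting how many
symbols sit on the stack while a list of <i>n</i> elements is parsed. The
C functions below are a hand-written simulation of the shift and reduce
steps for each pair of rules, not a real parser. They count grammar
symbols only, and the function names are made up for this illustration.</p>

```c
/* Peak number of symbols on the stack while parsing n elements with:
**     list ::= list element.
**     list ::= .
*/
static int peak_depth_left_recursive(int n){
  int depth = 0, peak = 0, i;
  depth++;                        /* reduce "list ::= ." pushes list */
  if( depth>peak ) peak = depth;
  for(i=0; i<n; i++){
    depth++;                      /* shift element */
    if( depth>peak ) peak = depth;
    depth -= 2;                   /* reduce: pop list and element... */
    depth++;                      /* ...and push the new list */
  }
  return peak;
}

/* Same input parsed with:
**     list ::= element list.
**     list ::= .
*/
static int peak_depth_right_recursive(int n){
  int depth = 0, peak = 0, i;
  for(i=0; i<n; i++){
    depth++;                      /* every element is shifted first */
  }
  depth++;                        /* reduce "list ::= ." pushes list */
  if( depth>peak ) peak = depth;
  for(i=0; i<n; i++){
    depth -= 2;                   /* pop element and list... */
    depth++;                      /* ...and push the combined list */
  }
  return peak;
}
```

<p>For a list of 1000 elements the left-recursive form never holds more
than 2 symbols, while the right-recursive form holds 1001 at its peak,
because nothing can reduce until the last element has been seen.</p>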
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	928
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	929	<a name='stack_size'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	930	<h4>The <tt>%stack_size</tt> directive</h4>
				931
				932	<p>If stack overflow is a problem and you can't resolve the trouble
				933	by using left-recursion, then you might want to increase the size
of the parser's stack using this directive. Put a positive integer
after the <tt>%stack_size</tt> directive and Lemon will generate a parser
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	936	with a stack of the requested size. The default value is 100.</p>
				937
				938	<p><pre>
				939	%stack_size 2000
				940	</pre></p>
				941
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	942	<a name='start_symbol'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	943	<h4>The <tt>%start_symbol</tt> directive</h4>
				944
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	945	<p>By default, the start symbol for the grammar that Lemon generates
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	946	is the first non-terminal that appears in the grammar file. But you
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	947	can choose a different start symbol using the
				948	<tt>%start_symbol</tt> directive.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	949
				950	<p><pre>
				951	%start_symbol prog
				952	</pre></p>
				953
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	954	<a name='syntax_error'></a>
				955	<h4>The <tt>%syntax_error</tt> directive</h4>
				956
				957	<p>See <a href='#error_processing'>Error Processing</a>.</p>
				958
				959	<a name='token_class'></a>
				960	<h4>The <tt>%token_class</tt> directive</h4>
				961
				962	<p>Undocumented. Appears to be related to the MULTITERMINAL concept.
				963	<a href='http://sqlite.org/src/fdiff?v1=796930d5fc2036c7&v2=624b24c5dc048e09&sbs=0'>Implementation</a>.</p>
				964
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	965	<a name='token_destructor'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	966	<h4>The <tt>%token_destructor</tt> directive</h4>
				967
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	968	<p>The <tt>%destructor</tt> directive assigns a destructor to a non-terminal
				969	symbol. (See the description of the
<tt><a href='#destructor'>%destructor</a></tt> directive above.)
				971	The <tt>%token_destructor</tt> directive does the same thing
				972	for all terminal symbols.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	973
				974	<p>Unlike non-terminal symbols which may each have a different data type
				975	for their values, terminals all use the same data type (defined by
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	976	the <tt><a href='#token_type'>%token_type</a></tt> directive)
				977	and so they use a common destructor.
				978	Other than that, the token destructor works just like the non-terminal
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	979	destructors.</p>
				980
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	981	<a name='token_prefix'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	982	<h4>The <tt>%token_prefix</tt> directive</h4>
				983
				984	<p>Lemon generates #defines that assign small integer constants
				985	to each terminal symbol in the grammar. If desired, Lemon will
				986	add a prefix specified by this directive
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	987	to each of the #defines it generates.</p>
				988
				989	<p>So if the default output of Lemon looked like this:
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	990	<pre>
				991	#define AND 1
				992	#define MINUS 2
				993	#define OR 3
				994	#define PLUS 4
				995	</pre>
				996	You can insert a statement into the grammar like this:
				997	<pre>
				998	%token_prefix TOKEN_
				999	</pre>
				1000	to cause Lemon to produce these symbols instead:
				1001	<pre>
				1002	#define TOKEN_AND 1
				1003	#define TOKEN_MINUS 2
				1004	#define TOKEN_OR 3
				1005	#define TOKEN_PLUS 4
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1006	</pre></p>
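<p>A tokenizer then refers to the prefixed names. The sketch below repeats
the four #defines from the example so that it stands alone. In a real
project they would come from the header file that Lemon generates. The
helper function <tt>operator_token()</tt> and the choice of operator
characters are hypothetical.</p>

```c
/* Token codes as they would appear after "%token_prefix TOKEN_"
** (values copied from the example above). */
#define TOKEN_AND    1
#define TOKEN_MINUS  2
#define TOKEN_OR     3
#define TOKEN_PLUS   4

/* Map a one-character operator to its token code.  Returns 0 if the
** character is not one of the four operators. */
static int operator_token(char c){
  switch( c ){
    case '&': return TOKEN_AND;
    case '-': return TOKEN_MINUS;
    case '|': return TOKEN_OR;
    case '+': return TOKEN_PLUS;
    default:  return 0;
  }
}
```

<p>The prefix is mainly useful for keeping short token names such as AND
or OR from colliding with other macros or identifiers in the program.</p>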
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1007
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	1008	<a name='token_type'></a><a name='ptype'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1009	<h4>The <tt>%token_type</tt> and <tt>%type</tt> directives</h4>
				1010
				1011	<p>These directives are used to specify the data types for values
				1012	on the parser's stack associated with terminal and non-terminal
				1013	symbols. The values of all terminal symbols must be of the same
				1014	type. This turns out to be the same data type as the 3rd parameter
				1015	to the Parse() function generated by Lemon. Typically, you will
drh	ed5e668	2020-03-09 01:02:45 +0000	[diff] [blame]	1016	make the value of a terminal symbol be a pointer to some kind of
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1017	token structure. Like this:</p>
				1018
				1019	<p><pre>
				1020	%token_type {Token*}
				1021	</pre></p>
				1022
<p>If the data type of terminals is not specified, the default type
is "void*".</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1025
				1026	<p>Non-terminal symbols can each have their own data types. Typically
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1027	the data type of a non-terminal is a pointer to the root of a parse tree
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1028	structure that contains all information about that non-terminal.
				1029	For example:</p>
				1030
				1031	<p><pre>
				1032	%type expr {Expr*}
				1033	</pre></p>
				1034
				1035	<p>Each entry on the parser's stack is actually a union containing
				1036	instances of all data types for every non-terminal and terminal symbol.
				1037	Lemon will automatically use the correct element of this union depending
				1038	on what the corresponding non-terminal or terminal symbol is. But
				1039	the grammar designer should keep in mind that the size of the union
				1040	will be the size of its largest element. So if you have a single
				1041	non-terminal whose data type requires 1K of storage, then your 100
				1042	entry parser stack will require 100K of heap space. If you are willing
				1043	and able to pay that price, fine. You just need to know.</p>
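<p>The size argument is easy to check in plain C. The union below is a
hypothetical model of one stack value, with one member for each data type
named in <tt>%token_type</tt> and <tt>%type</tt> directives. The
<tt>BigValue</tt> type and the "blob" non-terminal are invented for the
example, and the union that Lemon really generates has additional
members.</p>

```c
#include <stddef.h>

typedef struct Token Token;      /* opaque; only pointers are stored */
typedef struct Expr Expr;
typedef struct BigValue { char payload[1024]; } BigValue;  /* 1K type */

/* Hypothetical model of one value on the parser's stack. */
typedef union StackValue {
  Token   *token;    /* %token_type {Token*}   */
  Expr    *expr;     /* %type expr {Expr*}     */
  BigValue big;      /* %type blob {BigValue}  */
} StackValue;

/* Bytes needed for the values of a stack with n entries. */
static size_t stack_value_bytes(size_t n){
  return n * sizeof(StackValue);
}
```

<p>Every entry is as large as <tt>BigValue</tt> even when it holds only a
pointer, so <tt>stack_value_bytes(100)</tt> is at least 100K. Storing a
pointer to the large object instead keeps each entry pointer-sized.</p>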

<a name='pwildcard'></a>
<h4>The <tt>%wildcard</tt> directive</h4>

<p>The <tt>%wildcard</tt> directive is followed by a single token name and a
period. This directive specifies that the identified token should
match any input token.</p>

<p>When the generated parser has the choice of matching an input against
the wildcard token and some other token, the other token is always used.
The wildcard token is only matched if there are no alternatives.</p>
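
<p>The precedence rule above can be pictured with the following C sketch.
It is not the code that Lemon generates. The token codes, the
<tt>NO_ACTION</tt> marker, and <tt>find_action()</tt> are assumptions made
up for this example, and a parser state is modeled as a plain array of
actions indexed by token code.</p>

```c
/* Hypothetical token codes; TK_ANY plays the role of the wildcard. */
enum { TK_ID = 1, TK_PLUS, TK_STRING, TK_ANY, TK_COUNT };

#define NO_ACTION (-1)   /* Marks "this token is not accepted here" */

/* Look up the action for "token" in one state's action row.
** "token" must be in the range 1..TK_COUNT-1.  Pass wildcard==0 if the
** grammar has no wildcard token. */
int find_action(const int *actions, int token, int wildcard){
  /* An action for the specific token always wins. */
  if( actions[token] != NO_ACTION ) return actions[token];
  /* Only when nothing else matches is the wildcard consulted. */
  if( wildcard > 0 ) return actions[wildcard];
  return NO_ACTION;
}
```

<p>With a row that accepts <tt>TK_ID</tt> and <tt>TK_ANY</tt>, an input of
<tt>TK_ID</tt> yields the <tt>TK_ID</tt> action, while <tt>TK_STRING</tt>
falls through to the wildcard's action.</p>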

<a name='error_processing'></a>
<h3>Error Processing</h3>

<p>After extensive experimentation over several years, it has been
discovered that the error recovery strategy used by yacc is about
as good as it gets. And so that is what Lemon uses.</p>

<p>When a Lemon-generated parser encounters a syntax error, it
first invokes the code specified by the <tt>%syntax_error</tt> directive, if
any. It then enters its error recovery strategy. The error recovery
strategy is to begin popping the parser's stack until it enters a
state where it is permitted to shift a special non-terminal symbol
named "error". It then shifts this non-terminal and continues
parsing. The <tt>%syntax_error</tt> routine will not be called again
until at least three new tokens have been successfully shifted.</p>

<p>If the parser pops its stack until the stack is empty, and it
is still unable to shift the error symbol, then the
<tt><a href='#parse_failure'>%parse_failure</a></tt> routine
is invoked and the parser resets itself to its start state, ready
to begin parsing a new file. This is what will happen at the very
first syntax error, of course, if there are no instances of the
"error" non-terminal in your grammar.</p>
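
<p>The recovery steps described in this section can be sketched in C as
follows. This is a simplified model and not the code Lemon emits. The
<tt>Sketch</tt> structure, the state numbers, and the rule that only state 2
can shift "error" are all assumptions made for the example. The counters
<tt>reports</tt> and <tt>failures</tt> stand in for running the
<tt>%syntax_error</tt> and <tt>%parse_failure</tt> code.</p>

```c
#define MAX_DEPTH    16
#define START_STATE   0
#define ERROR_STATE   9   /* Hypothetical state entered by shifting "error" */

typedef struct Sketch {
  int stack[MAX_DEPTH];   /* State numbers; the bottom is at index 0 */
  int depth;              /* Number of entries on the stack */
  int errcnt;             /* Shifts still needed before errors are reported */
  int reports;            /* Times the %syntax_error stand-in ran */
  int failures;           /* Times the %parse_failure stand-in ran */
} Sketch;

/* Assumption for this example: only state 2 can shift "error". */
static int can_shift_error(int state){ return state == 2; }

/* A real parser would report a stack overflow instead of ignoring it. */
static void push(Sketch *p, int state){
  if( p->depth < MAX_DEPTH ) p->stack[p->depth++] = state;
}

/* Called for each token that is shifted successfully. */
void on_shift(Sketch *p, int state){
  push(p, state);
  if( p->errcnt > 0 ) p->errcnt--;
}

/* Called when the current token cannot be accepted. */
void on_syntax_error(Sketch *p){
  /* Report only if three tokens were shifted since the last report. */
  if( p->errcnt <= 0 ) p->reports++;
  p->errcnt = 3;
  /* Pop until a state that can shift "error" is on top. */
  while( p->depth > 0 && !can_shift_error(p->stack[p->depth-1]) ){
    p->depth--;
  }
  if( p->depth == 0 ){
    /* Stack emptied: parse failure, then reset to the start state. */
    p->failures++;
    p->errcnt = 0;
    push(p, START_STATE);
  }else{
    push(p, ERROR_STATE);   /* Shift "error" and carry on parsing */
  }
}
```

<p>Starting from a stack of states 0, 2, 5, 7, a syntax error pops 7 and 5,
stops at state 2, and shifts "error". A second error that arrives before
three more tokens have been shifted recovers in the same way but is not
reported again. A stack that contains no state able to shift "error" is
emptied and counted as a parse failure.</p>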

</body>
</html>