drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1	<html>
				2	<head>
				3	<title>The Lemon Parser Generator</title>
				4	</head>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	5	<body bgcolor='white'>
				6	<h1 align='center'>The Lemon Parser Generator</h1>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	7
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	8	<p>Lemon is an LALR(1) parser generator for C.
				9	It does the same job as "bison" and "yacc".
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	10	But Lemon is not a bison or yacc clone. Lemon
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	11	uses a different grammar syntax which is designed to
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	12	reduce the number of coding errors. Lemon also uses a
				13	parsing engine that is faster than yacc and
				14	bison and which is both reentrant and threadsafe.
				15	(Update: Since the previous sentence was written, bison
				16	has also been updated so that it too can generate a
				17	reentrant and threadsafe parser.)
				18	Lemon also implements features that can be used
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	19	to eliminate resource leaks, making it suitable for use
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	20	in long-running programs such as graphical user interfaces
				21	or embedded controllers.</p>
				22
				23	<p>This document is an introduction to the Lemon
				24	parser generator.</p>
				25
drh	c5e56b3	2017-06-01 01:53:19 +0000	[diff] [blame]	26	<h2>Security Note</h2>
				27
				28	<p>The language parser code created by Lemon is very robust and
				29	is well-suited for use in internet-facing applications that need to
				30	safely process maliciously crafted inputs.</p>
				31
				32	<p>The "lemon.exe" command-line tool itself works great when given a valid
				33	input grammar file and almost always gives helpful
				34	error messages for malformed inputs. However, it is possible for
				35	a malicious user to craft a grammar file that will cause
				36	lemon.exe to crash.
				37	We do not see this as a problem, as lemon.exe is not intended to be used
				38	with hostile inputs.
				39	To summarize:</p>
				40
				41	<ul>
				42	<li>Parser code generated by lemon → Robust and secure
				43	<li>The "lemon.exe" command line tool itself → Not so much
				44	</ul>
				45
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	46	<h2>Theory of Operation</h2>
				47
				48	<p>The main goal of Lemon is to translate a context free grammar (CFG)
				49	for a particular language into C code that implements a parser for
				50	that language.
				51	The program has two inputs:
				52	<ul>
				53	<li>The grammar specification.
				54	<li>A parser template file.
				55	</ul>
				56	Typically, only the grammar specification is supplied by the programmer.
				57	Lemon comes with a default parser template which works fine for most
				58	applications. But the user is free to substitute a different parser
				59	template if desired.</p>
				60
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	61	<p>Depending on command-line options, Lemon will generate up to
				62	three output files.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	63	<ul>
				64	<li>C code to implement the parser.
				65	<li>A header file defining an integer ID for each terminal symbol.
				66	<li>An information file that describes the states of the generated parser
				67	automaton.
				68	</ul>
				69	By default, all three of these output files are generated.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	70	The header file is suppressed if the "-m" command-line option is
				71	used and the report file is omitted when "-q" is selected.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	72
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	73	<p>The grammar specification file uses a ".y" suffix, by convention.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	74	In the examples used in this document, we'll assume the name of the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	75	grammar file is "gram.y". A typical use of Lemon would be the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	76	following command:
				77	<pre>
				78	lemon gram.y
				79	</pre>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	80	This command will generate three output files named "gram.c",
				81	"gram.h" and "gram.out".
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	82	The first is C code to implement the parser. The second
				83	is the header file that defines numerical values for all
				84	terminal symbols, and the last is the report that explains
				85	the states used by the parser automaton.</p>
				86
				87	<h3>Command Line Options</h3>
				88
				89	<p>The behavior of Lemon can be modified using command-line options.
				90	You can obtain a list of the available command-line options together
				91	with a brief explanation of what each does by typing
				92	<pre>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	93	lemon "-?"
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	94	</pre>
				95	As of this writing, the following command-line options are supported:
				96	<ul>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	97	<li><b>-b</b>
				98	Show only the basis for each parser state in the report file.
				99	<li><b>-c</b>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	100	Do not compress the generated action tables. The parser will be a
				101	little larger and slower, but it will detect syntax errors sooner.
drh	fb32c44	2018-04-21 13:51:42 +0000	[diff] [blame]	102	<li><b>-d<i>directory</i></b>
				103	Write all output files into <i>directory</i>. Normally, output files
				104	are written into the directory that contains the input grammar file.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	105	<li><b>-D<i>name</i></b>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	106	Define C preprocessor macro <i>name</i>. This macro is usable by
				107	"<tt><a href='#pifdef'>%ifdef</a></tt>" and
				108	"<tt><a href='#pifdef'>%ifndef</a></tt>" lines
				109	in the grammar file.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	110	<li><b>-g</b>
				111	Do not generate a parser. Instead write the input grammar to standard
				112	output with all comments, actions, and other extraneous text removed.
				113	<li><b>-l</b>
drh	dfe4e6b	2016-10-08 13:34:08 +0000	[diff] [blame]	114	Omit "#line" directives in the generated parser C code.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	115	<li><b>-m</b>
				116	Cause the output C source code to be compatible with the "makeheaders"
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	117	program.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	118	<li><b>-p</b>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	119	Display all conflicts that are resolved by
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	120	<a href='#precrules'>precedence rules</a>.
				121	<li><b>-q</b>
				122	Suppress generation of the report file.
				123	<li><b>-r</b>
				124	Do not sort or renumber the parser states as part of optimization.
				125	<li><b>-s</b>
drh	ed5e668	2020-03-09 01:02:45 +0000	[diff] [blame]	126	Show parser statistics before exiting.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	127	<li><b>-T<i>file</i></b>
				128	Use <i>file</i> as the template for the generated C-code parser implementation.
				129	<li><b>-x</b>
				130	Print the Lemon version number.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	131	</ul>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	132
				133	<h3>The Parser Interface</h3>
				134
				135	<p>Lemon doesn't generate a complete, working program. It only generates
				136	a few subroutines that implement a parser. This section describes
				137	the interface to those subroutines. It is up to the programmer to
				138	call these subroutines in an appropriate way in order to produce a
				139	complete system.</p>
				140
				141	<p>Before a program begins using a Lemon-generated parser, the program
				142	must first create the parser.
				143	A new parser is created as follows:
				144	<pre>
				145	void *pParser = ParseAlloc( malloc );
				146	</pre>
				147	The ParseAlloc() routine allocates and initializes a new parser and
				148	returns a pointer to it.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	149	The actual data structure used to represent a parser is opaque —
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	150	its internal structure is not visible or usable by the calling routine.
				151	For this reason, the ParseAlloc() routine returns a pointer to void
				152	rather than a pointer to some particular structure.
				153	The sole argument to the ParseAlloc() routine is a pointer to the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	154	subroutine used to allocate memory. Typically this means malloc().</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	155
				156	<p>After a program is finished using a parser, it can reclaim all
				157	memory allocated by that parser by calling
				158	<pre>
				159	ParseFree(pParser, free);
				160	</pre>
				161	The first argument is the same pointer returned by ParseAlloc(). The
				162	second argument is a pointer to the function used to release bulk
				163	memory back to the system.</p>
				164
				165	<p>After a parser has been allocated using ParseAlloc(), the programmer
				166	must supply the parser with a sequence of tokens (terminal symbols) to
				167	be parsed. This is accomplished by calling the following function
				168	once for each token:
				169	<pre>
				170	Parse(pParser, hTokenID, sTokenData, pArg);
				171	</pre>
				172	The first argument to the Parse() routine is the pointer returned by
				173	ParseAlloc().
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	174	The second argument is a small positive integer that tells the parser the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	175	type of the next token in the data stream.
				176	There is one token type for each terminal symbol in the grammar.
				177	The gram.h file generated by Lemon contains #define statements that
				178	map symbolic terminal symbol names into appropriate integer values.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	179	A value of 0 for the second argument is a special flag to the
				180	parser to indicate that the end of input has been reached.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	181	The third argument is the value of the given token. By default,
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	182	the type of the third argument is "void*", but the grammar will
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	183	usually redefine this type to be some kind of structure.
				184	Typically the second argument will be a broad category of tokens
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	185	such as "identifier" or "number" and the third argument will
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	186	be the name of the identifier or the value of the number.</p>
				187
				188	<p>The Parse() function may have either three or four arguments,
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	189	depending on the grammar. If the grammar specification file requests
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	190	it (via the <tt><a href='#extraarg'>%extra_argument</a></tt> directive),
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	191	the Parse() function will have a fourth parameter that can be
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	192	of any type chosen by the programmer. The parser doesn't do anything
				193	with this argument except to pass it through to action routines.
				194	This is a convenient mechanism for passing state information down
				195	to the action routines without having to use global variables.</p>
				196
				197	<p>A typical use of a Lemon parser might look something like the
				198	following:
				199	<pre>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	200	 1 ParseTree *ParseFile(const char *zFilename){
				201	 2   Tokenizer *pTokenizer;
				202	 3   void *pParser;
				203	 4   Token sToken;
				204	 5   int hTokenId;
				205	 6   ParserState sState;
				206	 7
				207	 8   pTokenizer = TokenizerCreate(zFilename);
				208	 9   pParser = ParseAlloc( malloc );
				209	10   InitParserState(&sState);
				210	11   while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){
				211	12     Parse(pParser, hTokenId, sToken, &sState);
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	212	13   }
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	213	14   Parse(pParser, 0, sToken, &sState);
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	214	15   ParseFree(pParser, free );
				215	16   TokenizerFree(pTokenizer);
				216	17   return sState.treeRoot;
				217	18 }
				218	</pre>
				219	This example shows a user-written routine that parses a file of
				220	text and returns a pointer to the parse tree.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	221	(All error-handling code is omitted from this example to keep it
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	222	simple.)
				223	We assume the existence of some kind of tokenizer which is created
				224	using TokenizerCreate() on line 8 and deleted by TokenizerFree()
				225	on line 16. The GetNextToken() function on line 11 retrieves the
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	226	next token from the input file and puts its type in the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	227	integer variable hTokenId. The sToken variable is assumed to be
				228	some kind of structure that contains details about each token,
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	229	such as its complete text, what line it occurs on, etc.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	230
				231	<p>This example also assumes the existence of a structure of type
				232	ParserState that holds state information about a particular parse.
				233	An instance of such a structure is created on line 6 and initialized
				234	on line 10. A pointer to this structure is passed into the Parse()
				235	routine as the optional 4th argument.
				236	The action routine specified by the grammar for the parser can use
				237	the ParserState structure to hold whatever information is useful and
				238	appropriate. In the example, we note that the treeRoot field of
				239	the ParserState structure is left pointing to the root of the parse
				240	tree.</p>
				241
				242	<p>The core of this example as it relates to Lemon is as follows:
				243	<pre>
				244	ParseFile(){
				245	pParser = ParseAlloc( malloc );
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	246	while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	247	Parse(pParser, hTokenId, sToken);
				248	}
				249	Parse(pParser, 0, sToken);
				250	ParseFree(pParser, free );
				251	}
				252	</pre>
				253	Basically, what a program has to do to use a Lemon-generated parser
				254	is first create the parser, then send it lots of tokens obtained by
				255	tokenizing an input source. When the end of input is reached, the
				256	Parse() routine should be called one last time with a token type
				257	of 0. This step is necessary to inform the parser that the end of
				258	input has been reached. Finally, we reclaim memory used by the
				259	parser by calling ParseFree().</p>
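<p>The call sequence described above can be tried out without a grammar
by substituting stand-ins for the generated routines. The following is
a minimal sketch under that assumption, not Lemon output: DemoParseAlloc(),
DemoParse() and DemoParseFree() are hypothetical stubs that only mimic the
calling convention (allocate, one call per token, a final call with a
token type of 0, then free), and the token codes TK_NUMBER and TK_PLUS are
invented for the illustration. A real program would instead call the
ParseAlloc(), Parse() and ParseFree() routines generated from its
grammar.</p>

```c
#include <stdlib.h>

/* Hypothetical token codes; a real program takes these from gram.h. */
enum { TK_NUMBER = 1, TK_PLUS = 2 };

/* State shared with the "actions". It plays the role of the optional
** fourth argument to Parse(). */
typedef struct DemoState {
  int sum;        /* Running total of the NUMBER token values */
  int nToken;     /* Number of tokens received so far */
  int done;       /* Set once the end-of-input marker (0) arrives */
} DemoState;

/* Stand-in for ParseAlloc(): the parser object is opaque to the caller. */
void *DemoParseAlloc(void *(*xMalloc)(size_t)){
  int *p = xMalloc(sizeof(int));
  if( p ) *p = 0;
  return p;
}

/* Stand-in for Parse(): called once per token, then once with type 0. */
void DemoParse(void *pParser, int hTokenId, int tokenValue,
               DemoState *pState){
  (void)pParser;                     /* The stub keeps no parser state */
  if( hTokenId==0 ){
    pState->done = 1;                /* End of input */
    return;
  }
  pState->nToken++;
  if( hTokenId==TK_NUMBER ) pState->sum += tokenValue;
}

/* Stand-in for ParseFree(). */
void DemoParseFree(void *pParser, void (*xFree)(void*)){
  xFree(pParser);
}

/* Feed the fixed token sequence "1 + 2 + 3", the way a tokenizer loop
** would, and return the resulting state. */
DemoState DemoRun(void){
  static const int aType[]  = { TK_NUMBER, TK_PLUS, TK_NUMBER,
                                TK_PLUS, TK_NUMBER };
  static const int aValue[] = { 1, 0, 2, 0, 3 };
  DemoState sState = { 0, 0, 0 };
  void *pParser = DemoParseAlloc(malloc);
  size_t i;
  if( pParser==0 ) return sState;
  for(i=0; i<sizeof(aType)/sizeof(aType[0]); i++){
    DemoParse(pParser, aType[i], aValue[i], &sState);
  }
  DemoParse(pParser, 0, 0, &sState); /* Tell the parser input has ended */
  DemoParseFree(pParser, free);
  return sState;
}
```

<p>Calling DemoRun() leaves sum at 6, nToken at 5 and done at 1. The point
of the sketch is the control flow: the caller owns the loop and pushes
tokens into the parser, and all per-parse state travels through the extra
argument rather than through global variables.</p>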
				260
				261	<p>There is one other interface routine that should be mentioned
				262	before we move on.
				263	The ParseTrace() function can be used to generate debugging output
				264	from the parser. A prototype for this routine is as follows:
				265	<pre>
				266	ParseTrace(FILE *stream, char *zPrefix);
				267	</pre>
				268	After this routine is called, a short (one-line) message is written
				269	to the designated output stream every time the parser changes states
				270	or calls an action routine. Each such message is prefaced using
				271	the text given by zPrefix. This debugging output can be turned off
				272	by calling ParseTrace() again with a first argument of NULL (0).</p>
				273
				274	<h3>Differences With YACC and BISON</h3>
				275
				276	<p>Programmers who have previously used the yacc or bison parser
				277	generator will notice several important differences between yacc and/or
				278	bison and Lemon.
				279	<ul>
				280	<li>In yacc and bison, the parser calls the tokenizer. In Lemon,
				281	the tokenizer calls the parser.
				282	<li>Lemon uses no global variables. Yacc and bison use global variables
				283	to pass information between the tokenizer and parser.
				284	<li>Lemon allows multiple parsers to be running simultaneously. Yacc
				285	and bison do not.
				286	</ul>
				287	These differences may cause some initial confusion for programmers
				288	with prior yacc and bison experience.
				289	But after years of experience using Lemon, I firmly
				290	believe that the Lemon way of doing things is better.</p>
				291
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	292	<p><i>Updated as of 2016-02-16:</i>
				293	The text above was written in the 1990s.
				294	We are told that Bison has lately been enhanced to support the
				295	tokenizer-calls-parser paradigm used by Lemon, and to obviate the
				296	need for global variables.</p>
				297
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	298	<h2>Input File Syntax</h2>
				299
				300	<p>The main purpose of the grammar specification file for Lemon is
				301	to define the grammar for the parser. But the input file also
				302	specifies additional information Lemon requires to do its job.
				303	Most of the work in using Lemon is in writing an appropriate
				304	grammar file.</p>
				305
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	306	<p>The grammar file for Lemon is, for the most part, free format.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	307	It does not have sections or divisions like yacc or bison. Any
				308	declaration can occur at any point in the file.
				309	Lemon ignores whitespace (except where it is needed to separate
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	310	tokens), and it honors the same commenting conventions as C and C++.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	311
				312	<h3>Terminals and Nonterminals</h3>
				313
				314	<p>A terminal symbol (token) is any string of alphanumeric
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	315	and/or underscore characters
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	316	that begins with an uppercase letter.
drh	c8eee5e	2011-07-30 23:50:12 +0000	[diff] [blame]	317	A terminal can contain lowercase letters after the first character,
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	318	but the usual convention is to make terminals all uppercase.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	319	A nonterminal, on the other hand, is any string of alphanumeric
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	320	and underscore characters that begins with a lowercase letter.
				321	Again, the usual convention is to make nonterminals use all lowercase
				322	letters.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	323
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	324	<p>In Lemon, terminal and nonterminal symbols do not need to
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	325	be declared or identified in a separate section of the grammar file.
				326	Lemon is able to generate a list of all terminals and nonterminals
				327	by examining the grammar rules, and it can always distinguish a
				328	terminal from a nonterminal by checking the case of the first
				329	character of the name.</p>
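<p>The naming rule can be written down as a small C function. This is a
sketch of the rule as stated in the text, not code taken from Lemon
itself; the names SymKind and ClassifySymbol() are hypothetical.</p>

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { SYM_TERMINAL, SYM_NONTERMINAL, SYM_INVALID } SymKind;

/* Classify a grammar symbol name: alphanumerics and underscores only,
** with the case of the first character deciding the kind.
** (Illustrative helper; not part of Lemon.) */
SymKind ClassifySymbol(const char *zName){
  const unsigned char *z = (const unsigned char*)zName;
  size_t i;
  if( z==0 || z[0]==0 ) return SYM_INVALID;
  for(i=0; z[i]; i++){
    if( !isalnum(z[i]) && z[i]!='_' ) return SYM_INVALID;
  }
  if( isupper(z[0]) ) return SYM_TERMINAL;     /* e.g. PLUS, Value */
  if( islower(z[0]) ) return SYM_NONTERMINAL;  /* e.g. expr, expr_list */
  return SYM_INVALID;                          /* Starts with digit or '_' */
}
```

<p>Under this rule "PLUS" and "Value" are terminals, "expr" and
"expr_list2" are nonterminals, and a quoted character such as ")" is
rejected.</p>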
				330
				331	<p>Yacc and bison allow terminal symbols to have either alphanumeric
				332	names or to be individual characters included in single quotes, like
				333	this: ')' or '$'. Lemon does not allow this alternative form for
				334	terminal symbols. With Lemon, all symbols, terminals and nonterminals,
				335	must have alphanumeric names.</p>
				336
				337	<h3>Grammar Rules</h3>
				338
				339	<p>The main component of a Lemon grammar file is a sequence of grammar
				340	rules.
				341	Each grammar rule consists of a nonterminal symbol followed by
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	342	the special symbol "::=" and then a list of terminals and/or nonterminals.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	343	The rule is terminated by a period.
				344	The list of terminals and nonterminals on the right-hand side of the
				345	rule can be empty.
				346	Rules can occur in any order, except that the left-hand side of the
				347	first rule is assumed to be the start symbol for the grammar (unless
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	348	specified otherwise using the <tt><a href='#start_symbol'>%start_symbol</a></tt>
				349	directive described below.)
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	350	A typical sequence of grammar rules might look something like this:
				351	<pre>
				352	expr ::= expr PLUS expr.
				353	expr ::= expr TIMES expr.
				354	expr ::= LPAREN expr RPAREN.
				355	expr ::= VALUE.
				356	</pre>
				357	</p>
				358
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	359	<p>There is one non-terminal in this example, "expr", and five
				360	terminal symbols or tokens: "PLUS", "TIMES", "LPAREN",
				361	"RPAREN" and "VALUE".</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	362
				363	<p>Like yacc and bison, Lemon allows the grammar to specify a block
				364	of C code that will be executed whenever a grammar rule is reduced
				365	by the parser.
				366	In Lemon, this action is specified by putting the C code (contained
				367	within curly braces <tt>{...}</tt>) immediately after the
				368	period that closes the rule.
				369	For example:
				370	<pre>
				371	expr ::= expr PLUS expr. { printf("Doing an addition...\n"); }
				372	</pre>
				373	</p>
				374
				375	<p>In order to be useful, grammar actions must normally be linked to
				376	their associated grammar rules.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	377	In yacc and bison, this is accomplished by embedding a "$$" in the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	378	action to stand for the value of the left-hand side of the rule and
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	379	symbols "$1", "$2", and so forth to stand for the value of
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	380	the terminal or nonterminal at position 1, 2 and so forth on the
				381	right-hand side of the rule.
				382	This idea is very powerful, but it is also very error-prone. The
				383	single most common source of errors in a yacc or bison grammar is
				384	to miscount the number of symbols on the right-hand side of a grammar
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	385	rule and say "$7" when you really mean "$8".</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	386
				387	<p>Lemon avoids the need to count grammar symbols by assigning symbolic
				388	names to each symbol in a grammar rule and then using those symbolic
				389	names in the action.
				390	In yacc or bison, one would write this:
				391	<pre>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	392	expr : expr PLUS expr { $$ = $1 + $3; };
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	393	</pre>
				394	But in Lemon, the same rule becomes the following:
				395	<pre>
				396	expr(A) ::= expr(B) PLUS expr(C). { A = B+C; }
				397	</pre>
				398	In the Lemon rule, any symbol in parentheses after a grammar rule
				399	symbol becomes a place holder for that symbol in the grammar rule.
				400	This place holder can then be used in the associated C action to
				401	stand for the value of that symbol.</p>
				402
				403	<p>The Lemon notation for linking a grammar rule with its reduce
				404	action is superior to yacc/bison on several counts.
				405	First, as mentioned above, the Lemon method avoids the need to
				406	count grammar symbols.
				407	Secondly, if a terminal or nonterminal in a Lemon grammar rule
				408	includes a linking symbol in parentheses but that linking symbol
				409	is not actually used in the reduce action, then an error message
				410	is generated.
				411	For example, the rule
				412	<pre>
				413	expr(A) ::= expr(B) PLUS expr(C). { A = B; }
				414	</pre>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	415	will generate an error because the linking symbol "C" is used
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	416	in the grammar rule but not in the reduce action.</p>
				417
				418	<p>The Lemon notation for linking grammar rules to reduce actions
				419	also facilitates the use of destructors for reclaiming memory
				420	allocated by the values of terminals and nonterminals on the
				421	right-hand side of a rule.</p>
				422
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	423	<a name='precrules'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	424	<h3>Precedence Rules</h3>
				425
				426	<p>Lemon resolves parsing ambiguities in exactly the same way as
				427	yacc and bison. A shift-reduce conflict is resolved in favor
				428	of the shift, and a reduce-reduce conflict is resolved by reducing
				429	whichever rule comes first in the grammar file.</p>
				430
				431	<p>Just like in
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	432	yacc and bison, Lemon allows a measure of control
				433	over the resolution of parsing conflicts using precedence rules.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	434	A precedence value can be assigned to any terminal symbol
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	435	using the
				436	<tt><a href='#pleft'>%left</a></tt>,
				437	<tt><a href='#pright'>%right</a></tt> or
				438	<tt><a href='#pnonassoc'>%nonassoc</a></tt> directives. Terminal symbols
				439	mentioned in earlier directives have a lower precedence than
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	440	terminal symbols mentioned in later directives. For example:</p>
				441
				442	<p><pre>
				443	%left AND.
				444	%left OR.
				445	%nonassoc EQ NE GT GE LT LE.
				446	%left PLUS MINUS.
				447	%left TIMES DIVIDE MOD.
				448	%right EXP NOT.
				449	</pre></p>
				450
				451	<p>In the preceding sequence of directives, the AND operator is
				452	defined to have the lowest precedence. The OR operator is one
				453	precedence level higher. And so forth. Hence, the grammar would
				454	attempt to group the ambiguous expression
				455	<pre>
				456	a AND b OR c
				457	</pre>
				458	like this
				459	<pre>
				460	a AND (b OR c).
				461	</pre>
				462	The associativity (left, right or nonassoc) is used to determine
				463	the grouping when the precedence is the same. AND is left-associative
				464	in our example, so
				465	<pre>
				466	a AND b AND c
				467	</pre>
				468	is parsed like this
				469	<pre>
				470	(a AND b) AND c.
				471	</pre>
				472	The EXP operator is right-associative, though, so
				473	<pre>
				474	a EXP b EXP c
				475	</pre>
				476	is parsed like this
				477	<pre>
				478	a EXP (b EXP c).
				479	</pre>
				480	The nonassoc precedence is used for non-associative operators.
				481	So
				482	<pre>
				483	a EQ b EQ c
				484	</pre>
				485	is an error.</p>
				486
				487	<p>Precedence is transferred from terminal symbols to rules as follows:
				488	The precedence of a grammar rule is equal to the precedence of the
				489	left-most terminal symbol in the rule for which a precedence is
				490	defined. This is normally what you want, but in those cases where
drh	ed5e668	2020-03-09 01:02:45 +0000	[diff] [blame]	491	you want the precedence of a grammar rule to be something different,
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	492	you can specify an alternative precedence symbol by putting the
				493	symbol in square brackets after the period at the end of the rule and
				494	before any C-code. For example:</p>
				495
				496	<p><pre>
				497	expr ::= MINUS expr. [NOT]
				498	</pre></p>
				499
				500	<p>This rule has a precedence equal to that of the NOT symbol, not the
				501	MINUS symbol as would have been the case by default.</p>
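<p>The way a rule acquires its precedence can be sketched in C. The code
below follows the description in the text and is not taken from Lemon;
the Symbol structure, the RulePrecedence() function, and the convention
that a negative value means "no precedence defined" are assumptions made
for the illustration.</p>

```c
#include <stddef.h>

/* Hypothetical symbol record. prec<0 means no precedence was assigned. */
typedef struct Symbol {
  const char *zName;  /* Symbol name, for readability only */
  int isTerminal;     /* True for terminals */
  int prec;           /* Precedence level, or -1 if undefined */
} Symbol;

/* Return the precedence of a rule whose right-hand side is aRhs[0..nRhs-1].
** An explicit [SYMBOL] override wins. Otherwise the rule takes the
** precedence of the left-most terminal that has one. Returns -1 if the
** rule has no precedence. */
int RulePrecedence(const Symbol *aRhs, int nRhs, const Symbol *pOverride){
  int i;
  if( pOverride ) return pOverride->prec;
  for(i=0; i<nRhs; i++){
    if( aRhs[i].isTerminal && aRhs[i].prec>=0 ) return aRhs[i].prec;
  }
  return -1;
}
```

<p>Numbering the levels of the earlier example from 1 (AND) to 6 (EXP and
NOT), the right-hand side "MINUS expr" yields level 4 by default and
level 6 when NOT is supplied as the override.</p>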
				502
				503	<p>With the knowledge of how precedence is assigned to terminal
				504	symbols and individual
				505	grammar rules, we can now explain precisely how parsing conflicts
				506	are resolved in Lemon. Shift-reduce conflicts are resolved
				507	as follows:
				508	<ul>
				509	<li> If either the token to be shifted or the rule to be reduced
				510	lacks precedence information, then resolve in favor of the
				511	shift, but report a parsing conflict.
				512	<li> If the precedence of the token to be shifted is greater than
				513	the precedence of the rule to reduce, then resolve in favor
				514	of the shift. No parsing conflict is reported.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	515	<li> If the precedence of the token to be shifted is less than the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	516	precedence of the rule to reduce, then resolve in favor of the
				517	reduce action. No parsing conflict is reported.
				518	<li> If the precedences are the same and the shift token is
				519	right-associative, then resolve in favor of the shift.
				520	No parsing conflict is reported.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	521	<li> If the precedences are the same and the shift token is
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	522	left-associative, then resolve in favor of the reduce.
				523	No parsing conflict is reported.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	524	<li> Otherwise, resolve the conflict by doing the shift, and
				525	report a parsing conflict.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	526	</ul>
				527	Reduce-reduce conflicts are resolved this way:
				528	<ul>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	529	<li> If either reduce rule
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	530	lacks precedence information, then resolve in favor of the
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	531	rule that appears first in the grammar, and report a parsing
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	532	conflict.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	533	<li> If both rules have precedence and the precedence is different,
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	534	then resolve the dispute in favor of the rule with the higher
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	535	precedence, and do not report a conflict.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	536	<li> Otherwise, resolve the conflict by reducing by the rule that
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	537	appears first in the grammar, and report a parsing conflict.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	538	</ul>
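<p>The two lists above translate almost line for line into C. The sketch
below encodes the rules exactly as they are written in this document; it
is not an excerpt from Lemon, and the type and function names (Assoc,
Action, Resolution, ResolveShiftReduce, ResolveReduceReduce) as well as
the use of a negative number for "no precedence" are assumptions.</p>

```c
typedef enum { ASSOC_LEFT, ASSOC_RIGHT, ASSOC_NONE } Assoc;
typedef enum { ACT_SHIFT, ACT_REDUCE } Action;

typedef struct Resolution {
  Action action;   /* Which action wins */
  int reported;    /* Nonzero if a parsing conflict is reported */
} Resolution;

/* Resolve a shift-reduce conflict. A negative precedence means that the
** token or rule has no precedence information. */
Resolution ResolveShiftReduce(int tokenPrec, Assoc tokenAssoc, int rulePrec){
  Resolution r;
  if( tokenPrec<0 || rulePrec<0 ){
    r.action = ACT_SHIFT;  r.reported = 1;   /* Missing precedence */
  }else if( tokenPrec>rulePrec ){
    r.action = ACT_SHIFT;  r.reported = 0;
  }else if( tokenPrec<rulePrec ){
    r.action = ACT_REDUCE; r.reported = 0;
  }else if( tokenAssoc==ASSOC_RIGHT ){
    r.action = ACT_SHIFT;  r.reported = 0;
  }else if( tokenAssoc==ASSOC_LEFT ){
    r.action = ACT_REDUCE; r.reported = 0;
  }else{
    r.action = ACT_SHIFT;  r.reported = 1;   /* The "otherwise" case */
  }
  return r;
}

/* Resolve a reduce-reduce conflict between two rules, where rule 0
** appears before rule 1 in the grammar. Returns the index of the winning
** rule and sets *pReported if a parsing conflict is reported. */
int ResolveReduceReduce(int prec0, int prec1, int *pReported){
  if( prec0<0 || prec1<0 || prec0==prec1 ){
    *pReported = 1;
    return 0;                                /* First rule wins */
  }
  *pReported = 0;
  return prec1>prec0 ? 1 : 0;                /* Higher precedence wins */
}
```

<p>For instance, with PLUS at level 4 and TIMES at level 5, seeing TIMES
after "expr PLUS expr" gives ResolveShiftReduce(5, ASSOC_LEFT, 4), which
shifts silently, while seeing PLUS after "expr TIMES expr" gives
ResolveShiftReduce(4, ASSOC_LEFT, 5), which reduces silently.</p>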
				539
				540	<h3>Special Directives</h3>
				541
				542	<p>The input grammar to Lemon consists of grammar rules and special
				543	directives. We've described all the grammar rules, so now we'll
				544	talk about the special directives.</p>
				545
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	546	<p>Directives in Lemon can occur in any order. You can put them before
				547	the grammar rules, or after the grammar rules, or in the midst of the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	548	grammar rules. It doesn't matter. The relative order of
				549	directives used to assign precedence to terminals is important, but
				550	other than that, the order of directives in Lemon is arbitrary.</p>
				551
				552	<p>Lemon supports the following special directives:
				553	<ul>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	554	<li><tt><a href='#pcode'>%code</a></tt>
				555	<li><tt><a href='#default_destructor'>%default_destructor</a></tt>
				556	<li><tt><a href='#default_type'>%default_type</a></tt>
				557	<li><tt><a href='#destructor'>%destructor</a></tt>
				558	<li><tt><a href='#pifdef'>%endif</a></tt>
				559	<li><tt><a href='#extraarg'>%extra_argument</a></tt>
					<li><tt><a href='#extractx'>%extra_context</a></tt>
				560	<li><tt><a href='#pfallback'>%fallback</a></tt>
				561	<li><tt><a href='#pifdef'>%ifdef</a></tt>
				562	<li><tt><a href='#pifdef'>%ifndef</a></tt>
				563	<li><tt><a href='#pinclude'>%include</a></tt>
				564	<li><tt><a href='#pleft'>%left</a></tt>
				565	<li><tt><a href='#pname'>%name</a></tt>
				566	<li><tt><a href='#pnonassoc'>%nonassoc</a></tt>
				567	<li><tt><a href='#parse_accept'>%parse_accept</a></tt>
				568	<li><tt><a href='#parse_failure'>%parse_failure</a></tt>
				569	<li><tt><a href='#pright'>%right</a></tt>
				570	<li><tt><a href='#stack_overflow'>%stack_overflow</a></tt>
				571	<li><tt><a href='#stack_size'>%stack_size</a></tt>
				572	<li><tt><a href='#start_symbol'>%start_symbol</a></tt>
				573	<li><tt><a href='#syntax_error'>%syntax_error</a></tt>
				574	<li><tt><a href='#token_class'>%token_class</a></tt>
				575	<li><tt><a href='#token_destructor'>%token_destructor</a></tt>
				576	<li><tt><a href='#token_prefix'>%token_prefix</a></tt>
				577	<li><tt><a href='#token_type'>%token_type</a></tt>
				578	<li><tt><a href='#ptype'>%type</a></tt>
				579	<li><tt><a href='#pwildcard'>%wildcard</a></tt>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	580	</ul>
				581	Each of these directives will be described separately in the
				582	following sections:</p>
				583
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	584	<a name='pcode'></a>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	585	<h4>The <tt>%code</tt> directive</h4>
				586
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	587	<p>The <tt>%code</tt> directive is used to specify additional C code that
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	588	is added to the end of the main output file. This is similar to
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	589	the <tt><a href='#pinclude'>%include</a></tt> directive except that
				590	<tt>%include</tt> is inserted at the beginning of the main output file.</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	591
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	592	<p><tt>%code</tt> is typically used to include some action routines or perhaps
				593	a tokenizer or even the "main()" function
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	594	as part of the output file.</p>
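<p>A minimal sketch of that usage follows. The token name HELLO and the
one-rule grammar are hypothetical, and the sketch assumes that the token
codes are visible at the end of the generated file (for example because the
generated header has been pulled in with <tt>%include</tt>).</p>

<pre>
%include { #include &lt;stdlib.h&gt; }

prog ::= HELLO.

%code {
  /* This text is copied to the end of the generated file, after the
  ** parser itself, so ParseAlloc(), Parse() and ParseFree() are
  ** already defined at this point. */
  int main(void){
    void *pParser = ParseAlloc(malloc);
    Parse(pParser, HELLO, 0);   /* HELLO: hypothetical token code */
    Parse(pParser, 0, 0);       /* token code 0 marks end of input */
    ParseFree(pParser, free);
    return 0;
  }
}
</pre>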
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	595
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	596	<a name='default_destructor'></a>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	597	<h4>The <tt>%default_destructor</tt> directive</h4>
				598
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	599	<p>The <tt>%default_destructor</tt> directive specifies a destructor to
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	600	use for non-terminals that do not have their own destructor
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	601	specified by a separate <tt>%destructor</tt> directive. See the documentation
				602	on the <tt><a href='#destructor'>%destructor</a></tt> directive below for
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	603	additional information.</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	604
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	605	<p>In some grammars, many different non-terminal symbols have the
				606	same data type and hence the same destructor. This directive is
				607	a convenient way to specify the same destructor for all those
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	608	non-terminals using a single statement.</p>
				609
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	610	<a name='default_type'></a>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	611	<h4>The <tt>%default_type</tt> directive</h4>
				612
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	613	<p>The <tt>%default_type</tt> directive specifies the data type of non-terminal
				614	symbols that do not have their own data type defined using a separate
				615	<tt><a href='#ptype'>%type</a></tt> directive.</p>
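<p>The two default directives are often used together. The sketch below
assumes a hypothetical <tt>Node</tt> structure and a hypothetical
<tt>node_free()</tt> helper supplied by the application; neither is part of
Lemon.</p>

<pre>
/* Every non-terminal without its own %type holds a Node*. */
%default_type {Node*}

/* Every non-terminal without its own %destructor is released this way. */
%default_destructor { node_free($$); }

expr(A) ::= term(B).   { A = B; }
term(A) ::= atom(B).   { A = B; }
</pre>

<p>Note that the default destructor applies to every non-terminal that lacks
its own <tt>%destructor</tt>, so a non-terminal given a different data type
with <tt>%type</tt> will usually need its own destructor as well.</p>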
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	616
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	617	<a name='destructor'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	618	<h4>The <tt>%destructor</tt> directive</h4>
				619
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	620	<p>The <tt>%destructor</tt> directive is used to specify a destructor for
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	621	a non-terminal symbol.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	622	(See also the <tt><a href='#token_destructor'>%token_destructor</a></tt>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	623	directive which is used to specify a destructor for terminal symbols.)</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	624
				625	<p>A non-terminal's destructor is called to dispose of the
				626	non-terminal's value whenever the non-terminal is popped from
				627	the stack. This includes all of the following circumstances:
				628	<ul>
				629	<li> When a rule reduces and the value of a non-terminal on
				630	the right-hand side is not linked to C code.
				631	<li> When the stack is popped during error processing.
				632	<li> When the ParseFree() function runs.
				633	</ul>
				634	The destructor can do whatever it wants with the value of
				635	the non-terminal, but its design is to deallocate memory
				636	or other resources held by that non-terminal.</p>
				637
				638	<p>Consider an example:
				639	<pre>
				640	%type nt {void*}
				641	%destructor nt { free($$); }
				642	nt(A) ::= ID NUM. { A = malloc( 100 ); }
				643	</pre>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	644	This example is a bit contrived, but it serves to illustrate how
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	645	destructors work. The example shows a non-terminal named
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	646	"nt" that holds values of type "void*". When the rule for
				647	an "nt" reduces, it sets the value of the non-terminal to
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	648	space obtained from malloc(). Later, when the nt non-terminal
				649	is popped from the stack, the destructor will fire and call
				650	free() on this malloced space, thus avoiding a memory leak.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	651	(Note that the symbol "$$" in the destructor code is replaced
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	652	by the value of the non-terminal.)</p>
				653
				654	<p>It is important to note that the value of a non-terminal is passed
				655	to the destructor whenever the non-terminal is removed from the
				656	stack, unless the non-terminal is used in a C-code action. If
				657	the non-terminal is used by C-code, then it is assumed that the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	658	C-code will take care of destroying it.
				659	More commonly, the value is used to build some
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	660	larger structure, and we don't want to destroy it, which is why
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	661	the destructor is not called in this circumstance.</p>
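<p>The distinction can be sketched as follows. The <tt>List</tt> type and the
<tt>list_append()</tt> and <tt>list_free()</tt> helpers are hypothetical
application code, as are the token names.</p>

<pre>
%type item {char*}
%destructor item { free($$); }
%type list {List*}
%destructor list { list_free($$); }

/* item(C) is labeled, so its value is handed to the action and the
** destructor is not called. The action now owns the string. */
list(A) ::= list(B) item(C).   { A = list_append(B, C); }

/* Here item has no label, so its value is not linked to C code and
** the destructor frees it when the rule reduces. */
stmt ::= SKIP item SEMI.
</pre>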
				662
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	663	<p>Destructors help avoid memory leaks by automatically freeing
				664	allocated objects when they go out of scope.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	665	To do the same using yacc or bison is much more difficult.</p>
				666
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	667	<a name='extraarg'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	668	<h4>The <tt>%extra_argument</tt> directive</h4>
				669
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	670	<p>The <tt>%extra_argument</tt> directive instructs Lemon to add a 4th parameter
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	671	to the parameter list of the Parse() function it generates. Lemon
				672	doesn't do anything itself with this extra argument, but it does
				673	make the argument available to C-code action routines, destructors,
				674	and so forth. For example, if the grammar file contains:</p>
				675
				676	<p><pre>
				677	%extra_argument { MyStruct *pAbc }
				678	</pre></p>
				679
				680	<p>Then the Parse() function generated will have a 4th parameter
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	681	of type "MyStruct*" and all action routines will have access to
				682	a variable named "pAbc" that is the value of the 4th parameter
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	683	in the most recent call to Parse().</p>
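<p>As an illustration, an action might use the extra argument to update
state owned by the caller. The field <tt>nStmt</tt> and the rule shown are
hypothetical.</p>

<pre>
%extra_argument { MyStruct *pAbc }

stmt ::= expr SEMI.   { pAbc-&gt;nStmt++; }
</pre>

<p>The caller would then supply the pointer on every call, along the lines
of <tt>Parse(pParser, tokenCode, tokenValue, pAbc)</tt>.</p>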
				684
drh	fb32c44	2018-04-21 13:51:42 +0000	[diff] [blame]	685	<p>The <tt>%extra_context</tt> directive works the same except that it
				686	is passed in on the ParseAlloc() or ParseInit() routines instead of
				687	on Parse().</p>
				688
				689	<a name='extractx'></a>
				690	<h4>The <tt>%extra_context</tt> directive</h4>
				691
drh	ed5e668	2020-03-09 01:02:45 +0000	[diff] [blame]	692	<p>The <tt>%extra_context</tt> directive instructs Lemon to add a 2nd parameter
drh	fb32c44	2018-04-21 13:51:42 +0000	[diff] [blame]	693	to the parameter list of the ParseAlloc() and ParseInit() functions. Lemon
				694	doesn't do anything itself with this extra argument, but it does
				695	store the value and make it available to C-code action routines, destructors,
				696	and so forth. For example, if the grammar file contains:</p>
				697
				698	<p><pre>
				699	%extra_context { MyStruct *pAbc }
				700	</pre></p>
				701
drh	ed5e668	2020-03-09 01:02:45 +0000	[diff] [blame]	702	<p>Then the ParseAlloc() and ParseInit() functions will have a 2nd parameter
drh	fb32c44	2018-04-21 13:51:42 +0000	[diff] [blame]	703	of type "MyStruct*" and all action routines will have access to
drh	ed5e668	2020-03-09 01:02:45 +0000	[diff] [blame]	704	a variable named "pAbc" that is the value of that 2nd parameter.</p>
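<p>A sketch of the same idea using <tt>%extra_context</tt> follows. As
before, the field <tt>nStmt</tt> and the rule are hypothetical.</p>

<pre>
%extra_context { MyStruct *pAbc }

stmt ::= expr SEMI.   { pAbc-&gt;nStmt++; }
</pre>

<p>In this case the pointer is supplied once, when the parser is created,
along the lines of <tt>ParseAlloc(malloc, pAbc)</tt>. Later calls to Parse()
take only the usual three arguments.</p>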
drh	fb32c44	2018-04-21 13:51:42 +0000	[diff] [blame]	705
				706	<p>The <tt>%extra_argument</tt> directive works the same except that it
				707	is passed in on the Parse() routine instead of on ParseAlloc()/ParseInit().</p>
				708
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	709	<a name='pfallback'></a>
				710	<h4>The <tt>%fallback</tt> directive</h4>
				711
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	712	<p>The <tt>%fallback</tt> directive specifies an alternative meaning for one
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	713	or more tokens. The alternative meaning is tried if the original token
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	714	would have generated a syntax error.</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	715
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	716	<p>The <tt>%fallback</tt> directive was added to support robust parsing of SQL
				717	syntax in <a href='https://www.sqlite.org/'>SQLite</a>.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	718	The SQL language contains a large assortment of keywords, each of which
				719	appears as a different token to the language parser. SQL contains so
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	720	many keywords that it can be difficult for programmers to keep up with
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	721	them all. Programmers will, therefore, sometimes mistakenly use an
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	722	obscure language keyword for an identifier. The <tt>%fallback</tt> directive
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	723	provides a mechanism to tell the parser: "If you are unable to parse
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	724	this keyword, try treating it as an identifier instead."</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	725
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	726	<p>The syntax of <tt>%fallback</tt> is as follows:
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	727
				728	<blockquote>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	729	<tt>%fallback</tt> <i>ID</i> <i>TOKEN...</i> <b>.</b>
				730	</blockquote></p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	731
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	732	<p>In words, the <tt>%fallback</tt> directive is followed by a list of token
				733	names terminated by a period.
				734	The first token name is the fallback token, that is, the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	735	token to which all the other tokens fall back. The second and subsequent
				736	arguments are tokens which fall back to the token identified by the first
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	737	argument.</p>
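<p>For example, a grammar might let a few keywords double as identifiers.
The token names below are modeled on SQL keywords and are only
illustrative.</p>

<pre>
%fallback ID ABORT AFTER ANALYZE.

column_name ::= ID.
</pre>

<p>With this declaration, if the tokenizer delivers AFTER at a point where
the grammar accepts only ID, the parser retries the token as ID instead of
raising a syntax error.</p>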
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	738
				739	<a name='pifdef'></a>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	740	<h4>The <tt>%ifdef</tt>, <tt>%ifndef</tt>, and <tt>%endif</tt> directives</h4>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	741
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	742	<p>The <tt>%ifdef</tt>, <tt>%ifndef</tt>, and <tt>%endif</tt> directives
				743	are similar to #ifdef, #ifndef, and #endif in the C-preprocessor,
				744	just not as general.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	745	Each of these directives must begin at the left margin. No whitespace
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	746	is allowed between the "%" and the directive name.</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	747
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	748	<p>Grammar text in between "<tt>%ifdef MACRO</tt>" and the next nested
				749	"<tt>%endif</tt>" is
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	750	ignored unless the "-DMACRO" command-line option is used. Grammar text
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	751	between "<tt>%ifndef MACRO</tt>" and the next nested "<tt>%endif</tt>" is
				752	included except when the "-DMACRO" command-line option is used.</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	753
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	754	<p>Note that the argument to <tt>%ifdef</tt> and <tt>%ifndef</tt> must
				755	be a single preprocessor symbol name, not a general expression.
				756	There is no "<tt>%else</tt>" directive.</p>
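<p>Because there is no <tt>%else</tt>, an either-or choice can be sketched
with a pair of complementary blocks. The macro name WITH_DUMP, the rules, and
the file name used below are hypothetical.</p>

<pre>
%ifdef WITH_DUMP
cmd ::= DUMP expr.     /* kept only when -DWITH_DUMP is given */
%endif

%ifndef WITH_DUMP
cmd ::= DUMP.          /* kept only when -DWITH_DUMP is absent */
%endif
</pre>

<p>Running <tt>lemon -DWITH_DUMP grammar.y</tt> would keep the first rule
and drop the second. Running Lemon without that option would do the
opposite.</p>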
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	757
				758
				759	<a name='pinclude'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	760	<h4>The <tt>%include</tt> directive</h4>
				761
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	762	<p>The <tt>%include</tt> directive specifies C code that is included at the
				763	top of the generated parser. You can include any text you want —
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	764	the Lemon parser generator copies it blindly. If you have multiple
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	765	<tt>%include</tt> directives in your grammar file, their values are concatenated
				766	so that all <tt>%include</tt> code ultimately appears near the top of the
				767	generated parser, in the same order as it appeared in the grammar.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	768
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	769	<p>The <tt>%include</tt> directive is very handy for getting some extra #include
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	770	preprocessor statements at the beginning of the generated parser.
				771	For example:</p>
				772
				773	<p><pre>
				774	%include {#include &lt;unistd.h&gt;}
				775	</pre></p>
				776
				777	<p>This might be needed, for example, if some of the C actions in the
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	778	grammar call functions that are prototyped in unistd.h.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	779
drh	60ce5d3	2018-11-27 14:34:33 +0000	[diff] [blame]	780	<p>Use the <tt><a href="#pcode">%code</a></tt> directive to add code to
				781	the end of the generated parser.</p>
				782
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	783	<a name='pleft'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	784	<h4>The <tt>%left</tt> directive</h4>
				785
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	786	<p>The <tt>%left</tt> directive is used (along with the
				787	<tt><a href='#pright'>%right</a></tt> and
				788	<tt><a href='#pnonassoc'>%nonassoc</a></tt> directives) to declare
				789	precedences of terminal symbols.
				790	Every terminal symbol whose name appears after
				791	a <tt>%left</tt> directive but before the next period (".") is
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	792	given the same left-associative precedence value. Subsequent
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	793	<tt>%left</tt> directives have higher precedence. For example:</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	794
				795	<p><pre>
				796	%left AND.
				797	%left OR.
				798	%nonassoc EQ NE GT GE LT LE.
				799	%left PLUS MINUS.
				800	%left TIMES DIVIDE MOD.
				801	%right EXP NOT.
				802	</pre></p>
				803
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	804	<p>Note the period that terminates each <tt>%left</tt>,
				805	<tt>%right</tt> or <tt>%nonassoc</tt>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	806	directive.</p>
				807
				808	<p>LALR(1) grammars can get into a situation where they require
				809	a large amount of stack space if you make heavy use of right-associative
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	810	operators. For this reason, it is recommended that you use <tt>%left</tt>
				811	rather than <tt>%right</tt> whenever possible.</p>
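<p>To show how these declarations interact with grammar rules, here is a
small sketch of an integer expression grammar. It assumes, purely for
illustration, that token values and expression values are both plain
<tt>int</tt>.</p>

<pre>
%token_type {int}
%type expr {int}

%left PLUS MINUS.
%left TIMES.

expr(A) ::= expr(B) PLUS expr(C).    { A = B + C; }
expr(A) ::= expr(B) MINUS expr(C).   { A = B - C; }
expr(A) ::= expr(B) TIMES expr(C).   { A = B * C; }
expr(A) ::= INTEGER(B).              { A = B; }
</pre>

<p>Since MINUS is left-associative, the input "1 - 2 - 3" groups as
"(1 - 2) - 3". Since TIMES is declared after PLUS, it binds more tightly, so
"1 + 2 * 3" groups as "1 + (2 * 3)".</p>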
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	812
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	813	<a name='pname'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	814	<h4>The <tt>%name</tt> directive</h4>
				815
				816	<p>By default, the functions generated by Lemon all begin with the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	817	five-character string "Parse". You can change this string to something
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	818	different using the <tt>%name</tt> directive. For instance:</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	819
				820	<p><pre>
				821	%name Abcde
				822	</pre></p>
				823
				824	<p>Putting this directive in the grammar file will cause Lemon to generate
				825	functions named
				826	<ul>
				827	<li> AbcdeAlloc(),
				828	<li> AbcdeFree(),
				829	<li> AbcdeTrace(), and
				830	<li> Abcde().
				831	</ul>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	832	The <tt>%name</tt> directive allows you to generate two or more different
				833	parsers and link them all into the same executable.</p>
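<p>A sketch of two grammars intended for the same executable might look like
this. The file names and prefixes are hypothetical, and the
<tt><a href='#token_prefix'>%token_prefix</a></tt> directive is used so that
the generated token #defines do not collide either.</p>

<pre>
/* calc.y */
%name CalcParse
%token_prefix CALC_

/* json.y */
%name JsonParse
%token_prefix JSON_
</pre>

<p>The first grammar would yield CalcParseAlloc(), CalcParseFree() and
CalcParse(), while the second would yield JsonParseAlloc(), JsonParseFree()
and JsonParse().</p>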
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	834
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	835	<a name='pnonassoc'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	836	<h4>The <tt>%nonassoc</tt> directive</h4>
				837
				838	<p>This directive is used to assign non-associative precedence to
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	839	one or more terminal symbols. See the section on
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	840	<a href='#precrules'>precedence rules</a>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	841	or on the <tt><a href='#pleft'>%left</a></tt> directive
				842	for additional information.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	843
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	844	<a name='parse_accept'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	845	<h4>The <tt>%parse_accept</tt> directive</h4>
				846
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	847	<p>The <tt>%parse_accept</tt> directive specifies a block of C code that is
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	848	executed whenever the parser accepts its input string. To "accept"
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	849	an input string means that the parser was able to process all tokens
				850	without error.</p>
				851
				852	<p>For example:</p>
				853
				854	<p><pre>
				855	%parse_accept {
				856	printf("parsing complete!\n");
				857	}
				858	</pre></p>
				859
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	860	<a name='parse_failure'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	861	<h4>The <tt>%parse_failure</tt> directive</h4>
				862
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	863	<p>The <tt>%parse_failure</tt> directive specifies a block of C code that
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	864	is executed whenever the parser fails completely. This code is not
				865	executed until the parser has tried and failed to resolve an input
				866	error using its usual error recovery strategy. The routine is
				867	only invoked when parsing is unable to continue.</p>
				868
				869	<p><pre>
				870	%parse_failure {
				871	fprintf(stderr,"Giving up. Parser is hopelessly lost...\n");
				872	}
				873	</pre></p>
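<p>Instead of printing, a program may prefer to record the outcome where the
caller can inspect it. The sketch below assumes that the extra argument is
visible inside these code blocks in the same way it is inside actions and
destructors. The <tt>ParseResult</tt> structure and its fields are
hypothetical.</p>

<pre>
%extra_argument { ParseResult *pResult }

%parse_accept  { pResult-&gt;accepted = 1; }
%parse_failure { pResult-&gt;failed = 1; }
</pre>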
				874
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	875	<a name='pright'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	876	<h4>The <tt>%right</tt> directive</h4>
				877
				878	<p>This directive is used to assign right-associative precedence to
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	879	one or more terminal symbols. See the section on
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	880	<a href='#precrules'>precedence rules</a>
				881	or on the <a href='#pleft'>%left</a> directive for additional information.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	882
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	883	<a name='stack_overflow'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	884	<h4>The <tt>%stack_overflow</tt> directive</h4>
				885
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	886	<p>The <tt>%stack_overflow</tt> directive specifies a block of C code that
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	887	is executed if the parser's internal stack ever overflows. Typically
				888	this just prints an error message. After a stack overflow, the parser
				889	will be unable to continue and must be reset.</p>
				890
				891	<p><pre>
				892	%stack_overflow {
				893	fprintf(stderr,"Giving up. Parser stack overflow\n");
				894	}
				895	</pre></p>
				896
				897	<p>You can help prevent parser stack overflows by avoiding the use
				898	of right recursion and right-precedence operators in your grammar.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	899	Use left recursion and left-precedence operators instead to
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	900	encourage rules to reduce sooner and keep the stack size down.
				901	For example, do rules like this:
				902	<pre>
				903	list ::= list element. // left-recursion. Good!
				904	list ::= .
				905	</pre>
				906	Not like this:
				907	<pre>
				908	list ::= element list. // right-recursion. Bad!
				909	list ::= .
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	910	</pre></p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	911
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	912	<a name='stack_size'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	913	<h4>The <tt>%stack_size</tt> directive</h4>
				914
				915	<p>If stack overflow is a problem and you can't resolve the trouble
				916	by using left-recursion, then you might want to increase the size
				917	of the parser's stack using this directive. Put a positive integer
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	918	after the <tt>%stack_size</tt> directive and Lemon will generate a parser
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	919	with a stack of the requested size. The default value is 100.</p>
				920
				921	<p><pre>
				922	%stack_size 2000
				923	</pre></p>
				924
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	925	<a name='start_symbol'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	926	<h4>The <tt>%start_symbol</tt> directive</h4>
				927
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	928	<p>By default, the start symbol for the grammar that Lemon generates
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	929	is the first non-terminal that appears in the grammar file. But you
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	930	can choose a different start symbol using the
				931	<tt>%start_symbol</tt> directive.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	932
				933	<p><pre>
				934	%start_symbol prog
				935	</pre></p>
				936
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	937	<a name='syntax_error'></a>
				938	<h4>The <tt>%syntax_error</tt> directive</h4>
				939
				940	<p>See <a href='#error_processing'>Error Processing</a>.</p>
				941
				942	<a name='token_class'></a>
				943	<h4>The <tt>%token_class</tt> directive</h4>
				944
				945	<p>Undocumented. Appears to be related to the MULTITERMINAL concept.
				946	<a href='http://sqlite.org/src/fdiff?v1=796930d5fc2036c7&v2=624b24c5dc048e09&sbs=0'>Implementation</a>.</p>
				947
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	948	<a name='token_destructor'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	949	<h4>The <tt>%token_destructor</tt> directive</h4>
				950
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	951	<p>The <tt>%destructor</tt> directive assigns a destructor to a non-terminal
				952	symbol. (See the description of the
				953	<tt><a href='#destructor'>%destructor</a></tt> directive above.)
				954	The <tt>%token_destructor</tt> directive does the same thing
				955	for all terminal symbols.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	956
				957	<p>Unlike non-terminal symbols which may each have a different data type
				958	for their values, terminals all use the same data type (defined by
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	959	the <tt><a href='#token_type'>%token_type</a></tt> directive)
				960	and so they use a common destructor.
				961	Other than that, the token destructor works just like the non-terminal
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	962	destructors.</p>
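<p>A minimal sketch, assuming the tokenizer allocates a hypothetical
<tt>Token</tt> structure for each token and that a hypothetical
<tt>token_free()</tt> helper releases it:</p>

<pre>
%token_type {Token*}
%token_destructor { token_free($$); }

/* STRING is labeled, so the action takes ownership of its value.
** PRINT and SEMI are not labeled, so their values go to the token
** destructor when the rule reduces. */
stmt ::= PRINT STRING(S) SEMI.   { emit_print(S); }
</pre>

<p>Here <tt>emit_print()</tt> is likewise a hypothetical application
function that is expected to dispose of the token it receives.</p>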
				963
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	964	<a name='token_prefix'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	965	<h4>The <tt>%token_prefix</tt> directive</h4>
				966
				967	<p>Lemon generates #defines that assign small integer constants
				968	to each terminal symbol in the grammar. If desired, Lemon will
				969	add a prefix specified by this directive
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	970	to each of the #defines it generates.</p>
				971
				972	<p>So if the default output of Lemon looked like this:
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	973	<pre>
				974	#define AND 1
				975	#define MINUS 2
				976	#define OR 3
				977	#define PLUS 4
				978	</pre>
				979	You can insert a statement into the grammar like this:
				980	<pre>
				981	%token_prefix TOKEN_
				982	</pre>
				983	to cause Lemon to produce these symbols instead:
				984	<pre>
				985	#define TOKEN_AND 1
				986	#define TOKEN_MINUS 2
				987	#define TOKEN_OR 3
				988	#define TOKEN_PLUS 4
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	989	</pre></p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	990
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	991	<a name='token_type'></a><a name='ptype'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	992	<h4>The <tt>%token_type</tt> and <tt>%type</tt> directives</h4>
				993
				994	<p>These directives are used to specify the data types for values
				995	on the parser's stack associated with terminal and non-terminal
				996	symbols. The values of all terminal symbols must be of the same
				997	type. This turns out to be the same data type as the 3rd parameter
				998	to the Parse() function generated by Lemon. Typically, you will
drh	ed5e668	2020-03-09 01:02:45 +0000	[diff] [blame]	999	make the value of a terminal symbol be a pointer to some kind of
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1000	token structure. Like this:</p>
				1001
				1002	<p><pre>
				1003	%token_type {Token*}
				1004	</pre></p>
				1005
				1006	<p>If the data type of terminals is not specified, the default type
drh	dfe4e6b	2016-10-08 13:34:08 +0000	[diff] [blame]	1007	is "void*".</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1008
				1009	<p>Non-terminal symbols can each have their own data types. Typically
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1010	the data type of a non-terminal is a pointer to the root of a parse tree
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1011	structure that contains all information about that non-terminal.
				1012	For example:</p>
				1013
				1014	<p><pre>
				1015	%type expr {Expr*}
				1016	</pre></p>
				1017
				1018	<p>Each entry on the parser's stack is actually a union containing
				1019	instances of all data types for every non-terminal and terminal symbol.
				1020	Lemon will automatically use the correct element of this union depending
				1021	on what the corresponding non-terminal or terminal symbol is. But
				1022	the grammar designer should keep in mind that the size of the union
				1023	will be the size of its largest element. So if you have a single
				1024	non-terminal whose data type requires 1K of storage, then your 100-entry
				1025	entry parser stack will require 100K of heap space. If you are willing
				1026	and able to pay that price, fine. You just need to know.</p>
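<p>One common way to keep stack entries small is to store large values
behind a pointer. The two alternatives below are a sketch using a
hypothetical <tt>BigBlock</tt> structure. Only one of them would appear in a
real grammar.</p>

<pre>
/* Alternative 1: stored by value. Every stack entry must be at
** least as large as a BigBlock. */
%type block {BigBlock}

/* Alternative 2: stored by pointer. A stack entry only needs room
** for the pointer, and a destructor releases the allocation. */
%type block {BigBlock*}
%destructor block { free($$); }
</pre>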
				1027
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	1028	<a name='pwildcard'></a>
				1029	<h4>The <tt>%wildcard</tt> directive</h4>
				1030
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1031	<p>The <tt>%wildcard</tt> directive is followed by a single token name and a
				1032	period. This directive specifies that the identified token should
				1033	match any input token.</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	1034
				1035	<p>When the generated parser has the choice of matching an input against
				1036	the wildcard token and some other token, the other token is always used.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1037	The wildcard token is only matched if there are no alternatives.</p>
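<p>For example, with hypothetical token and rule names:</p>

<pre>
%wildcard ANY.

arg ::= ID.     /* used whenever the input token is an ID */
arg ::= ANY.    /* used for any other input token */
</pre>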
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	1038
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1039	<a name='error_processing'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1040	<h3>Error Processing</h3>
				1041
				1042	<p>After extensive experimentation over several years, it has been
				1043	discovered that the error recovery strategy used by yacc is about
				1044	as good as it gets. And so that is what Lemon uses.</p>
				1045
				1046	<p>When a Lemon-generated parser encounters a syntax error, it
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1047	first invokes the code specified by the <tt>%syntax_error</tt> directive, if
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1048	any. It then enters its error recovery strategy. The error recovery
				1049	strategy is to begin popping the parser's stack until it enters a
				1050	state where it is permitted to shift a special non-terminal symbol
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	1051	named "error". It then shifts this non-terminal and continues
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1052	parsing. The <tt>%syntax_error</tt> routine will not be called again
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1053	until at least three new tokens have been successfully shifted.</p>
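<p>The following fragment is a sketch of how a grammar might take part in
this strategy. The rule and token names are hypothetical, and the message
text is only an example.</p>

<pre>
%include { #include &lt;stdio.h&gt; }

%syntax_error {
  fprintf(stderr, "syntax error\n");
}

stmt ::= expr SEMI.
stmt ::= error SEMI.   /* resynchronize at the next semicolon */
</pre>

<p>With a rule like the last one, the intent is that a malformed statement
is reported once, the stack is popped back to a state where a statement may
begin, and parsing resumes after the next SEMI.</p>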
				1054
				1055	<p>If the parser pops its stack until the stack is empty, and it still
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1056	is unable to shift the error symbol, then the
				1057	<tt><a href='#parse_failure'>%parse_failure</a></tt> routine
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1058	is invoked and the parser resets itself to its start state, ready
				1059	to begin parsing a new file. This is what will happen at the very
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1060	first syntax error, of course, if there are no instances of the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	1061	"error" non-terminal in your grammar.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1062
				1063	</body>
				1064	</html>