Blame - doc/lemon.html - chromium.googlesource.com/chromium/deps/sqlite


drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1	<html>
				2	<head>
				3	<title>The Lemon Parser Generator</title>
				4	</head>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	5	<body bgcolor='white'>
				6	<h1 align='center'>The Lemon Parser Generator</h1>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	7
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	8	<p>Lemon is an LALR(1) parser generator for C.
				9	It does the same job as "bison" and "yacc".
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	10	But Lemon is not a bison or yacc clone. Lemon
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	11	uses a different grammar syntax which is designed to
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	12	reduce the number of coding errors. Lemon also uses a
				13	parsing engine that is faster than yacc and
				14	bison and which is both reentrant and threadsafe.
				15	(Update: Since the previous sentence was written, bison
				16	has also been updated so that it too can generate a
				17	reentrant and threadsafe parser.)
				18	Lemon also implements features that can be used
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	19	to eliminate resource leaks, making it suitable for use
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	20	in long-running programs such as graphical user interfaces
				21	or embedded controllers.</p>
				22
				23	<p>This document is an introduction to the Lemon
				24	parser generator.</p>
				25
drh	c5e56b3	2017-06-01 01:53:19 +0000	[diff] [blame]	26	<h2>Security Note</h2>
				27
				28	<p>The language parser code created by Lemon is very robust and
				29	is well-suited for use in internet-facing applications that need to
safely process maliciously crafted inputs.</p>
				31
				32	<p>The "lemon.exe" command-line tool itself works great when given a valid
				33	input grammar file and almost always gives helpful
				34	error messages for malformed inputs. However, it is possible for
				35	a malicious user to craft a grammar file that will cause
				36	lemon.exe to crash.
				37	We do not see this as a problem, as lemon.exe is not intended to be used
				38	with hostile inputs.
				39	To summarize:</p>
				40
				41	<ul>
				42	<li>Parser code generated by lemon → Robust and secure
				43	<li>The "lemon.exe" command line tool itself → Not so much
				44	</ul>
				45
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	46	<h2>Theory of Operation</h2>
				47
<p>The main goal of Lemon is to translate a context-free grammar (CFG)
				49	for a particular language into C code that implements a parser for
				50	that language.
				51	The program has two inputs:
				52	<ul>
				53	<li>The grammar specification.
				54	<li>A parser template file.
				55	</ul>
				56	Typically, only the grammar specification is supplied by the programmer.
				57	Lemon comes with a default parser template which works fine for most
				58	applications. But the user is free to substitute a different parser
				59	template if desired.</p>
				60
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	61	<p>Depending on command-line options, Lemon will generate up to
				62	three output files.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	63	<ul>
				64	<li>C code to implement the parser.
				65	<li>A header file defining an integer ID for each terminal symbol.
				66	<li>An information file that describes the states of the generated parser
				67	automaton.
				68	</ul>
				69	By default, all three of these output files are generated.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	70	The header file is suppressed if the "-m" command-line option is
				71	used and the report file is omitted when "-q" is selected.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	72
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	73	<p>The grammar specification file uses a ".y" suffix, by convention.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	74	In the examples used in this document, we'll assume the name of the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	75	grammar file is "gram.y". A typical use of Lemon would be the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	76	following command:
				77	<pre>
				78	lemon gram.y
				79	</pre>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	80	This command will generate three output files named "gram.c",
				81	"gram.h" and "gram.out".
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	82	The first is C code to implement the parser. The second
				83	is the header file that defines numerical values for all
				84	terminal symbols, and the last is the report that explains
				85	the states used by the parser automaton.</p>
				86
				87	<h3>Command Line Options</h3>
				88
				89	<p>The behavior of Lemon can be modified using command-line options.
				90	You can obtain a list of the available command-line options together
				91	with a brief explanation of what each does by typing
				92	<pre>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	93	lemon "-?"
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	94	</pre>
				95	As of this writing, the following command-line options are supported:
				96	<ul>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	97	<li><b>-b</b>
				98	Show only the basis for each parser state in the report file.
				99	<li><b>-c</b>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	100	Do not compress the generated action tables. The parser will be a
				101	little larger and slower, but it will detect syntax errors sooner.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	102	<li><b>-D<i>name</i></b>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	103	Define C preprocessor macro <i>name</i>. This macro is usable by
				104	"<tt><a href='#pifdef'>%ifdef</a></tt>" and
				105	"<tt><a href='#pifdef'>%ifndef</a></tt>" lines
				106	in the grammar file.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	107	<li><b>-g</b>
				108	Do not generate a parser. Instead write the input grammar to standard
				109	output with all comments, actions, and other extraneous text removed.
				110	<li><b>-l</b>
drh	dfe4e6b	2016-10-08 13:34:08 +0000	[diff] [blame]	111	Omit "#line" directives in the generated parser C code.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	112	<li><b>-m</b>
				113	Cause the output C source code to be compatible with the "makeheaders"
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	114	program.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	115	<li><b>-p</b>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	116	Display all conflicts that are resolved by
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	117	<a href='#precrules'>precedence rules</a>.
				118	<li><b>-q</b>
				119	Suppress generation of the report file.
				120	<li><b>-r</b>
				121	Do not sort or renumber the parser states as part of optimization.
				122	<li><b>-s</b>
Show parser statistics before exiting.
				124	<li><b>-T<i>file</i></b>
				125	Use <i>file</i> as the template for the generated C-code parser implementation.
				126	<li><b>-x</b>
				127	Print the Lemon version number.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	128	</ul>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	129
				130	<h3>The Parser Interface</h3>
				131
				132	<p>Lemon doesn't generate a complete, working program. It only generates
				133	a few subroutines that implement a parser. This section describes
				134	the interface to those subroutines. It is up to the programmer to
				135	call these subroutines in an appropriate way in order to produce a
				136	complete system.</p>
				137
				138	<p>Before a program begins using a Lemon-generated parser, the program
				139	must first create the parser.
				140	A new parser is created as follows:
				141	<pre>
				142	void *pParser = ParseAlloc( malloc );
				143	</pre>
				144	The ParseAlloc() routine allocates and initializes a new parser and
				145	returns a pointer to it.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	146	The actual data structure used to represent a parser is opaque —
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	147	its internal structure is not visible or usable by the calling routine.
				148	For this reason, the ParseAlloc() routine returns a pointer to void
				149	rather than a pointer to some particular structure.
				150	The sole argument to the ParseAlloc() routine is a pointer to the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	151	subroutine used to allocate memory. Typically this means malloc().</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	152
				153	<p>After a program is finished using a parser, it can reclaim all
				154	memory allocated by that parser by calling
				155	<pre>
				156	ParseFree(pParser, free);
				157	</pre>
				158	The first argument is the same pointer returned by ParseAlloc(). The
				159	second argument is a pointer to the function used to release bulk
				160	memory back to the system.</p>
				161
				162	<p>After a parser has been allocated using ParseAlloc(), the programmer
				163	must supply the parser with a sequence of tokens (terminal symbols) to
				164	be parsed. This is accomplished by calling the following function
				165	once for each token:
				166	<pre>
				167	Parse(pParser, hTokenID, sTokenData, pArg);
				168	</pre>
				169	The first argument to the Parse() routine is the pointer returned by
				170	ParseAlloc().
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	171	The second argument is a small positive integer that tells the parser the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	172	type of the next token in the data stream.
				173	There is one token type for each terminal symbol in the grammar.
				174	The gram.h file generated by Lemon contains #define statements that
				175	map symbolic terminal symbol names into appropriate integer values.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	176	A value of 0 for the second argument is a special flag to the
				177	parser to indicate that the end of input has been reached.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	178	The third argument is the value of the given token. By default,
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	179	the type of the third argument is "void*", but the grammar will
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	180	usually redefine this type to be some kind of structure.
				181	Typically the second argument will be a broad category of tokens
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	182	such as "identifier" or "number" and the third argument will
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	183	be the name of the identifier or the value of the number.</p>
				184
				185	<p>The Parse() function may have either three or four arguments,
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	186	depending on the grammar. If the grammar specification file requests
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	187	it (via the <tt><a href='#extraarg'>%extra_argument</a></tt> directive),
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	188	the Parse() function will have a fourth parameter that can be
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	189	of any type chosen by the programmer. The parser doesn't do anything
				190	with this argument except to pass it through to action routines.
				191	This is a convenient mechanism for passing state information down
				192	to the action routines without having to use global variables.</p>
				193
				194	<p>A typical use of a Lemon parser might look something like the
				195	following:
				196	<pre>
 1 ParseTree *ParseFile(const char *zFilename){
 2    Tokenizer *pTokenizer;
 3    void *pParser;
 4    Token sToken;
 5    int hTokenId;
 6    ParserState sState;
 7
 8    pTokenizer = TokenizerCreate(zFilename);
 9    pParser = ParseAlloc( malloc );
10    InitParserState(&sState);
11    while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){
12       Parse(pParser, hTokenId, sToken, &sState);
13    }
14    Parse(pParser, 0, sToken, &sState);
15    ParseFree(pParser, free );
16    TokenizerFree(pTokenizer);
17    return sState.treeRoot;
18 }
				215	</pre>
				216	This example shows a user-written routine that parses a file of
				217	text and returns a pointer to the parse tree.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	218	(All error-handling code is omitted from this example to keep it
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	219	simple.)
				220	We assume the existence of some kind of tokenizer which is created
				221	using TokenizerCreate() on line 8 and deleted by TokenizerFree()
				222	on line 16. The GetNextToken() function on line 11 retrieves the
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	223	next token from the input file and puts its type in the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	224	integer variable hTokenId. The sToken variable is assumed to be
				225	some kind of structure that contains details about each token,
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	226	such as its complete text, what line it occurs on, etc.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	227
<p>This example also assumes the existence of a structure of type
				229	ParserState that holds state information about a particular parse.
				230	An instance of such a structure is created on line 6 and initialized
				231	on line 10. A pointer to this structure is passed into the Parse()
				232	routine as the optional 4th argument.
				233	The action routine specified by the grammar for the parser can use
				234	the ParserState structure to hold whatever information is useful and
				235	appropriate. In the example, we note that the treeRoot field of
				236	the ParserState structure is left pointing to the root of the parse
				237	tree.</p>
				238
				239	<p>The core of this example as it relates to Lemon is as follows:
				240	<pre>
ParseFile(){
   pParser = ParseAlloc( malloc );
   while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){
      Parse(pParser, hTokenId, sToken);
   }
   Parse(pParser, 0, sToken);
   ParseFree(pParser, free );
}
				249	</pre>
				250	Basically, what a program has to do to use a Lemon-generated parser
				251	is first create the parser, then send it lots of tokens obtained by
				252	tokenizing an input source. When the end of input is reached, the
				253	Parse() routine should be called one last time with a token type
				254	of 0. This step is necessary to inform the parser that the end of
				255	input has been reached. Finally, we reclaim memory used by the
				256	parser by calling ParseFree().</p>
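<p>The calling pattern described above can be tried out without a
generated parser. The following is a minimal sketch, not Lemon output:
the bodies of ParseAlloc(), Parse() and ParseFree() are stand-ins written
for this illustration, and the token codes, the <tt>MockParser</tt> and
<tt>ParserState</tt> structures, and the SumValues() driver are all
hypothetical. Only the order of the calls mirrors the interface
described in this section.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical token codes. A real gram.h generated by Lemon would
** supply these. Code 0 is reserved for end of input. */
#define TK_VALUE 1
#define TK_PLUS  2

/* Stand-in parser object. A real Lemon parser keeps its state opaque. */
typedef struct MockParser { int nToken; int atEnd; } MockParser;

/* Hypothetical state passed through as the optional fourth argument. */
typedef struct ParserState { int sum; } ParserState;

/* Mock of ParseAlloc(): obtain memory from the supplied allocator. */
void *ParseAlloc(void *(*xMalloc)(size_t)){
  MockParser *p = xMalloc(sizeof(*p));
  if( p ){ p->nToken = 0; p->atEnd = 0; }
  return p;
}

/* Mock of Parse(): accept one token per call. Instead of running a
** grammar, it just adds up the VALUE tokens it is given. */
void Parse(void *pParser, int hTokenId, int tokenValue, ParserState *pState){
  MockParser *p = pParser;
  assert( p!=0 && !p->atEnd );
  if( hTokenId==0 ){ p->atEnd = 1; return; }   /* end of input */
  p->nToken++;
  if( hTokenId==TK_VALUE ) pState->sum += tokenValue;
}

/* Mock of ParseFree(): release the parser with the supplied function. */
void ParseFree(void *pParser, void (*xFree)(void*)){
  xFree(pParser);
}

/* Driver: the "tokenizer" side pushes VALUE PLUS VALUE ... into the
** parser, then signals end of input with token type 0. */
int SumValues(const int *aValue, int nValue){
  ParserState sState = { 0 };
  void *pParser = ParseAlloc( malloc );
  int i;
  if( pParser==0 ) return -1;
  for(i=0; i<nValue; i++){
    if( i>0 ) Parse(pParser, TK_PLUS, 0, &sState);
    Parse(pParser, TK_VALUE, aValue[i], &sState);
  }
  Parse(pParser, 0, 0, &sState);
  ParseFree(pParser, free);
  return sState.sum;
}
```

<p>Because the driver owns the loop, the caller decides where tokens come
from and when parsing stops, and all per-parse state travels through the
fourth argument rather than through global variables.</p>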
				257
				258	<p>There is one other interface routine that should be mentioned
				259	before we move on.
				260	The ParseTrace() function can be used to generate debugging output
				261	from the parser. A prototype for this routine is as follows:
				262	<pre>
ParseTrace(FILE *stream, char *zPrefix);
				264	</pre>
				265	After this routine is called, a short (one-line) message is written
				266	to the designated output stream every time the parser changes states
				267	or calls an action routine. Each such message is prefaced using
				268	the text given by zPrefix. This debugging output can be turned off
				269	by calling ParseTrace() again with a first argument of NULL (0).</p>
				270
				271	<h3>Differences With YACC and BISON</h3>
				272
				273	<p>Programmers who have previously used the yacc or bison parser
				274	generator will notice several important differences between yacc and/or
				275	bison and Lemon.
				276	<ul>
				277	<li>In yacc and bison, the parser calls the tokenizer. In Lemon,
				278	the tokenizer calls the parser.
				279	<li>Lemon uses no global variables. Yacc and bison use global variables
				280	to pass information between the tokenizer and parser.
				281	<li>Lemon allows multiple parsers to be running simultaneously. Yacc
				282	and bison do not.
				283	</ul>
				284	These differences may cause some initial confusion for programmers
				285	with prior yacc and bison experience.
				286	But after years of experience using Lemon, I firmly
				287	believe that the Lemon way of doing things is better.</p>
				288
drh	45f31be	2016-02-16 21:19:49 +0000	[diff] [blame]	289	<p><i>Updated as of 2016-02-16:</i>
				290	The text above was written in the 1990s.
				291	We are told that Bison has lately been enhanced to support the
				292	tokenizer-calls-parser paradigm used by Lemon, and to obviate the
				293	need for global variables.</p>
				294
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	295	<h2>Input File Syntax</h2>
				296
				297	<p>The main purpose of the grammar specification file for Lemon is
				298	to define the grammar for the parser. But the input file also
				299	specifies additional information Lemon requires to do its job.
				300	Most of the work in using Lemon is in writing an appropriate
				301	grammar file.</p>
				302
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	303	<p>The grammar file for Lemon is, for the most part, free format.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	304	It does not have sections or divisions like yacc or bison. Any
				305	declaration can occur at any point in the file.
				306	Lemon ignores whitespace (except where it is needed to separate
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	307	tokens), and it honors the same commenting conventions as C and C++.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	308
				309	<h3>Terminals and Nonterminals</h3>
				310
				311	<p>A terminal symbol (token) is any string of alphanumeric
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	312	and/or underscore characters
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	313	that begins with an uppercase letter.
drh	c8eee5e	2011-07-30 23:50:12 +0000	[diff] [blame]	314	A terminal can contain lowercase letters after the first character,
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	315	but the usual convention is to make terminals all uppercase.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	316	A nonterminal, on the other hand, is any string of alphanumeric
and underscore characters that begins with a lowercase letter.
				318	Again, the usual convention is to make nonterminals use all lowercase
				319	letters.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	320
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	321	<p>In Lemon, terminal and nonterminal symbols do not need to
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	322	be declared or identified in a separate section of the grammar file.
				323	Lemon is able to generate a list of all terminals and nonterminals
				324	by examining the grammar rules, and it can always distinguish a
				325	terminal from a nonterminal by checking the case of the first
				326	character of the name.</p>
				327
				328	<p>Yacc and bison allow terminal symbols to have either alphanumeric
				329	names or to be individual characters included in single quotes, like
				330	this: ')' or '$'. Lemon does not allow this alternative form for
				331	terminal symbols. With Lemon, all symbols, terminals and nonterminals,
				332	must have alphanumeric names.</p>
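<p>The naming rule can be sketched as a small C function. This is only
an illustration of the rule as stated above: <tt>classify_symbol()</tt>
and the <tt>SymbolKind</tt> enumeration are hypothetical names, not part
of Lemon, and Lemon's own scanner may treat edge cases differently.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Symbol categories, following the naming rule described above. */
typedef enum { SYM_INVALID, SYM_TERMINAL, SYM_NONTERMINAL } SymbolKind;

/* Hypothetical helper: classify a symbol name by the case of its
** first character. Names containing anything other than letters,
** digits and underscores are rejected. */
SymbolKind classify_symbol(const char *zName){
  size_t i;
  assert( zName!=NULL );
  if( zName[0]=='\0' ) return SYM_INVALID;
  for(i=0; zName[i]; i++){
    unsigned char c = (unsigned char)zName[i];
    if( !isalnum(c) && c!='_' ) return SYM_INVALID;
  }
  if( isupper((unsigned char)zName[0]) ) return SYM_TERMINAL;
  if( islower((unsigned char)zName[0]) ) return SYM_NONTERMINAL;
  return SYM_INVALID;   /* begins with a digit or an underscore */
}
```

<p>With this sketch, <tt>PLUS</tt> and <tt>Value_1</tt> classify as
terminals, <tt>expr</tt> as a nonterminal, and a yacc-style quoted
character such as <tt>')'</tt> is rejected.</p>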
				333
				334	<h3>Grammar Rules</h3>
				335
				336	<p>The main component of a Lemon grammar file is a sequence of grammar
				337	rules.
				338	Each grammar rule consists of a nonterminal symbol followed by
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	339	the special symbol "::=" and then a list of terminals and/or nonterminals.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	340	The rule is terminated by a period.
				341	The list of terminals and nonterminals on the right-hand side of the
				342	rule can be empty.
				343	Rules can occur in any order, except that the left-hand side of the
				344	first rule is assumed to be the start symbol for the grammar (unless
specified otherwise using the <tt><a href='#start_symbol'>%start_symbol</a></tt>
directive described below).
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	347	A typical sequence of grammar rules might look something like this:
				348	<pre>
				349	expr ::= expr PLUS expr.
				350	expr ::= expr TIMES expr.
				351	expr ::= LPAREN expr RPAREN.
				352	expr ::= VALUE.
				353	</pre>
				354	</p>
				355
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	356	<p>There is one non-terminal in this example, "expr", and five
				357	terminal symbols or tokens: "PLUS", "TIMES", "LPAREN",
				358	"RPAREN" and "VALUE".</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	359
				360	<p>Like yacc and bison, Lemon allows the grammar to specify a block
				361	of C code that will be executed whenever a grammar rule is reduced
				362	by the parser.
				363	In Lemon, this action is specified by putting the C code (contained
				364	within curly braces <tt>{...}</tt>) immediately after the
				365	period that closes the rule.
				366	For example:
				367	<pre>
				368	expr ::= expr PLUS expr. { printf("Doing an addition...\n"); }
				369	</pre>
				370	</p>
				371
				372	<p>In order to be useful, grammar actions must normally be linked to
				373	their associated grammar rules.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	374	In yacc and bison, this is accomplished by embedding a "$$" in the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	375	action to stand for the value of the left-hand side of the rule and
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	376	symbols "$1", "$2", and so forth to stand for the value of
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	377	the terminal or nonterminal at position 1, 2 and so forth on the
				378	right-hand side of the rule.
				379	This idea is very powerful, but it is also very error-prone. The
				380	single most common source of errors in a yacc or bison grammar is
				381	to miscount the number of symbols on the right-hand side of a grammar
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	382	rule and say "$7" when you really mean "$8".</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	383
				384	<p>Lemon avoids the need to count grammar symbols by assigning symbolic
				385	names to each symbol in a grammar rule and then using those symbolic
				386	names in the action.
				387	In yacc or bison, one would write this:
				388	<pre>
expr : expr PLUS expr { $$ = $1 + $3; };
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	390	</pre>
				391	But in Lemon, the same rule becomes the following:
				392	<pre>
				393	expr(A) ::= expr(B) PLUS expr(C). { A = B+C; }
				394	</pre>
				395	In the Lemon rule, any symbol in parentheses after a grammar rule
				396	symbol becomes a place holder for that symbol in the grammar rule.
				397	This place holder can then be used in the associated C action to
stand for the value of that symbol.</p>
				399
				400	<p>The Lemon notation for linking a grammar rule with its reduce
				401	action is superior to yacc/bison on several counts.
				402	First, as mentioned above, the Lemon method avoids the need to
				403	count grammar symbols.
				404	Secondly, if a terminal or nonterminal in a Lemon grammar rule
				405	includes a linking symbol in parentheses but that linking symbol
				406	is not actually used in the reduce action, then an error message
				407	is generated.
				408	For example, the rule
				409	<pre>
				410	expr(A) ::= expr(B) PLUS expr(C). { A = B; }
				411	</pre>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	412	will generate an error because the linking symbol "C" is used
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	413	in the grammar rule but not in the reduce action.</p>
				414
				415	<p>The Lemon notation for linking grammar rules to reduce actions
				416	also facilitates the use of destructors for reclaiming memory
				417	allocated by the values of terminals and nonterminals on the
				418	right-hand side of a rule.</p>
				419
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	420	<a name='precrules'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	421	<h3>Precedence Rules</h3>
				422
				423	<p>Lemon resolves parsing ambiguities in exactly the same way as
				424	yacc and bison. A shift-reduce conflict is resolved in favor
				425	of the shift, and a reduce-reduce conflict is resolved by reducing
				426	whichever rule comes first in the grammar file.</p>
				427
				428	<p>Just like in
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	429	yacc and bison, Lemon allows a measure of control
				430	over the resolution of parsing conflicts using precedence rules.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	431	A precedence value can be assigned to any terminal symbol
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	432	using the
				433	<tt><a href='#pleft'>%left</a></tt>,
				434	<tt><a href='#pright'>%right</a></tt> or
				435	<tt><a href='#pnonassoc'>%nonassoc</a></tt> directives. Terminal symbols
				436	mentioned in earlier directives have a lower precedence than
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	437	terminal symbols mentioned in later directives. For example:</p>
				438
				439	<p><pre>
				440	%left AND.
				441	%left OR.
				442	%nonassoc EQ NE GT GE LT LE.
				443	%left PLUS MINUS.
				444	%left TIMES DIVIDE MOD.
				445	%right EXP NOT.
				446	</pre></p>
				447
				448	<p>In the preceding sequence of directives, the AND operator is
				449	defined to have the lowest precedence. The OR operator is one
				450	precedence level higher. And so forth. Hence, the grammar would
				451	attempt to group the ambiguous expression
				452	<pre>
				453	a AND b OR c
				454	</pre>
				455	like this
				456	<pre>
				457	a AND (b OR c).
				458	</pre>
				459	The associativity (left, right or nonassoc) is used to determine
				460	the grouping when the precedence is the same. AND is left-associative
				461	in our example, so
				462	<pre>
				463	a AND b AND c
				464	</pre>
				465	is parsed like this
				466	<pre>
				467	(a AND b) AND c.
				468	</pre>
				469	The EXP operator is right-associative, though, so
				470	<pre>
				471	a EXP b EXP c
				472	</pre>
				473	is parsed like this
				474	<pre>
				475	a EXP (b EXP c).
				476	</pre>
				477	The nonassoc precedence is used for non-associative operators.
				478	So
				479	<pre>
				480	a EQ b EQ c
				481	</pre>
				482	is an error.</p>
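<p>The groupings shown above can be reproduced with a short C program.
Note that this sketch uses precedence climbing, a hand-written
recursive technique, and not the LALR(1) tables that Lemon generates;
it only mirrors the grouping that results from the example directives.
The names <tt>group_expr()</tt>, <tt>Op</tt> and <tt>Cursor</tt> are
hypothetical, the input is assumed to be already split into tokens,
and the unary NOT operator is left out.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { ASSOC_LEFT, ASSOC_RIGHT, ASSOC_NONE } Assoc;
typedef struct Op { const char *zName; int prec; Assoc assoc; } Op;

/* Binary operators from the example directives, lowest precedence first. */
static const Op aOp[] = {
  {"AND",1,ASSOC_LEFT},   {"OR",2,ASSOC_LEFT},
  {"EQ",3,ASSOC_NONE},    {"NE",3,ASSOC_NONE},     {"GT",3,ASSOC_NONE},
  {"GE",3,ASSOC_NONE},    {"LT",3,ASSOC_NONE},     {"LE",3,ASSOC_NONE},
  {"PLUS",4,ASSOC_LEFT},  {"MINUS",4,ASSOC_LEFT},
  {"TIMES",5,ASSOC_LEFT}, {"DIVIDE",5,ASSOC_LEFT}, {"MOD",5,ASSOC_LEFT},
  {"EXP",6,ASSOC_RIGHT},
};

/* Position within the token array, plus an error flag. */
typedef struct Cursor { const char **az; int n; int i; int err; } Cursor;

/* Return the operator entry for token z, or 0 if z is an operand. */
static const Op *find_op(const char *z){
  size_t k;
  for(k=0; k<sizeof(aOp)/sizeof(aOp[0]); k++){
    if( strcmp(z, aOp[k].zName)==0 ) return &aOp[k];
  }
  return 0;
}

/* Parse an expression whose operators all have precedence >= minPrec
** and write the fully parenthesized text into zOut. */
static void parse_expr(Cursor *p, int minPrec, char *zOut, size_t nOut){
  char zLhs[128], zRhs[128], zTmp[128];
  zOut[0] = 0;
  if( p->i>=p->n || find_op(p->az[p->i]) ){ p->err = 1; return; }
  snprintf(zLhs, sizeof(zLhs), "%s", p->az[p->i++]);
  while( !p->err && p->i<p->n ){
    const Op *pOp = find_op(p->az[p->i]);
    if( pOp==0 ){ p->err = 1; break; }     /* two operands in a row */
    if( pOp->prec<minPrec ) break;         /* belongs to an outer level */
    p->i++;
    /* A right-associative operator lets its right operand absorb
    ** operators of the same level; the others require a higher level. */
    parse_expr(p, pOp->assoc==ASSOC_RIGHT ? pOp->prec : pOp->prec+1,
               zRhs, sizeof(zRhs));
    snprintf(zTmp, sizeof(zTmp), "(%s %s %s)", zLhs, pOp->zName, zRhs);
    snprintf(zLhs, sizeof(zLhs), "%s", zTmp);
    if( pOp->assoc==ASSOC_NONE && p->i<p->n ){
      const Op *pNext = find_op(p->az[p->i]);
      if( pNext && pNext->prec==pOp->prec ) p->err = 1;  /* a EQ b EQ c */
    }
  }
  snprintf(zOut, nOut, "%s", zLhs);
}

/* Group a token sequence. Returns 0 on success, or -1 on a syntax or
** associativity error. */
int group_expr(const char **azTok, int nTok, char *zOut, size_t nOut){
  Cursor c;
  assert( zOut!=0 && nOut>0 );
  c.az = azTok; c.n = nTok; c.i = 0; c.err = 0;
  parse_expr(&c, 1, zOut, nOut);
  return (c.err || c.i!=nTok) ? -1 : 0;
}
```

<p>Given the tokens <tt>a AND b OR c</tt>, group_expr() produces
<tt>(a AND (b OR c))</tt>, and given <tt>a EQ b EQ c</tt> it returns -1,
matching the cases described above.</p>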
				483
<p>The precedence of terminal symbols is transferred to rules as follows:
The precedence of a grammar rule is equal to the precedence of the
left-most terminal symbol in the rule for which a precedence is
defined. This is normally what you want, but in those cases where
you want the precedence of a grammar rule to be something different,
you can specify an alternative precedence symbol by putting the
symbol in square brackets after the period at the end of the rule and
before any C-code. For example:</p>
				492
				493	<p><pre>
expr ::= MINUS expr. [NOT]
				495	</pre></p>
				496
				497	<p>This rule has a precedence equal to that of the NOT symbol, not the
				498	MINUS symbol as would have been the case by default.</p>
				499
				500	<p>With the knowledge of how precedence is assigned to terminal
				501	symbols and individual
				502	grammar rules, we can now explain precisely how parsing conflicts
				503	are resolved in Lemon. Shift-reduce conflicts are resolved
				504	as follows:
				505	<ul>
				506	<li> If either the token to be shifted or the rule to be reduced
				507	lacks precedence information, then resolve in favor of the
				508	shift, but report a parsing conflict.
				509	<li> If the precedence of the token to be shifted is greater than
				510	the precedence of the rule to reduce, then resolve in favor
				511	of the shift. No parsing conflict is reported.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	512	<li> If the precedence of the token to be shifted is less than the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	513	precedence of the rule to reduce, then resolve in favor of the
				514	reduce action. No parsing conflict is reported.
				515	<li> If the precedences are the same and the shift token is
				516	right-associative, then resolve in favor of the shift.
				517	No parsing conflict is reported.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	518	<li> If the precedences are the same and the shift token is
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	519	left-associative, then resolve in favor of the reduce.
				520	No parsing conflict is reported.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	521	<li> Otherwise, resolve the conflict by doing the shift, and
				522	report a parsing conflict.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	523	</ul>
				524	Reduce-reduce conflicts are resolved this way:
				525	<ul>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	526	<li> If either reduce rule
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	527	lacks precedence information, then resolve in favor of the
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	528	rule that appears first in the grammar, and report a parsing
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	529	conflict.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	530	<li> If both rules have precedence and the precedence is different,
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	531	then resolve the dispute in favor of the rule with the highest
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	532	precedence, and do not report a conflict.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	533	<li> Otherwise, resolve the conflict by reducing by the rule that
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	534	appears first in the grammar, and report a parsing conflict.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	535	</ul>
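<p>The two lists above can be written down as a pair of C functions.
This is a sketch that follows the lists as written; it is not taken from
the Lemon source code. All names here (<tt>resolve_shift_reduce()</tt>,
<tt>resolve_reduce_reduce()</tt>, <tt>NO_PREC</tt> and the result
structures) are hypothetical, and precedence is assumed to be a
non-negative integer where a larger value means higher precedence.</p>

```c
#include <assert.h>

#define NO_PREC (-1)   /* marks a token or rule with no precedence */

typedef enum { AK_LEFT, AK_RIGHT, AK_NONASSOC } AssocKind;
typedef enum { DO_SHIFT, DO_REDUCE } Action;

/* Outcome of a shift-reduce conflict: the chosen action, and whether
** a parsing conflict is reported. */
typedef struct SRResult { Action action; int reported; } SRResult;

SRResult resolve_shift_reduce(int tokenPrec, AssocKind tokenAssoc,
                              int rulePrec){
  SRResult r = { DO_SHIFT, 1 };
  assert( tokenPrec>=NO_PREC && rulePrec>=NO_PREC );
  /* Missing precedence information: shift, and report a conflict. */
  if( tokenPrec==NO_PREC || rulePrec==NO_PREC ) return r;
  r.reported = 0;
  if( tokenPrec>rulePrec ){
    r.action = DO_SHIFT;
  }else if( tokenPrec<rulePrec ){
    r.action = DO_REDUCE;
  }else if( tokenAssoc==AK_RIGHT ){
    r.action = DO_SHIFT;          /* same precedence, right-associative */
  }else if( tokenAssoc==AK_LEFT ){
    r.action = DO_REDUCE;         /* same precedence, left-associative */
  }else{
    r.action = DO_SHIFT;          /* the "otherwise" case */
    r.reported = 1;
  }
  return r;
}

/* Outcome of a reduce-reduce conflict: winner is 0 for the rule that
** appears first in the grammar, or 1 for the rule that appears later. */
typedef struct RRResult { int winner; int reported; } RRResult;

RRResult resolve_reduce_reduce(int firstRulePrec, int laterRulePrec){
  RRResult r = { 0, 1 };
  assert( firstRulePrec>=NO_PREC && laterRulePrec>=NO_PREC );
  /* Missing precedence information: first rule wins, conflict reported. */
  if( firstRulePrec==NO_PREC || laterRulePrec==NO_PREC ) return r;
  if( firstRulePrec!=laterRulePrec ){
    r.winner = laterRulePrec>firstRulePrec ? 1 : 0;
    r.reported = 0;
  }
  /* Equal precedence: first rule wins, and the conflict is reported. */
  return r;
}
```

<p>For example, a left-associative token whose precedence equals that of
the rule yields a silent reduce, which is what makes
<tt>a AND b AND c</tt> group to the left.</p>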
				536
				537	<h3>Special Directives</h3>
				538
				539	<p>The input grammar to Lemon consists of grammar rules and special
				540	directives. We've described all the grammar rules, so now we'll
				541	talk about the special directives.</p>
				542
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	543	<p>Directives in Lemon can occur in any order. You can put them before
				544	the grammar rules, or after the grammar rules, or in the midst of the
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	545	grammar rules. It doesn't matter. The relative order of
				546	directives used to assign precedence to terminals is important, but
				547	other than that, the order of directives in Lemon is arbitrary.</p>
				548
				549	<p>Lemon supports the following special directives:
				550	<ul>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	551	<li><tt><a href='#pcode'>%code</a></tt>
				552	<li><tt><a href='#default_destructor'>%default_destructor</a></tt>
				553	<li><tt><a href='#default_type'>%default_type</a></tt>
				554	<li><tt><a href='#destructor'>%destructor</a></tt>
				555	<li><tt><a href='#pifdef'>%endif</a></tt>
				556	<li><tt><a href='#extraarg'>%extra_argument</a></tt>
				557	<li><tt><a href='#pfallback'>%fallback</a></tt>
				558	<li><tt><a href='#pifdef'>%ifdef</a></tt>
				559	<li><tt><a href='#pifdef'>%ifndef</a></tt>
				560	<li><tt><a href='#pinclude'>%include</a></tt>
				561	<li><tt><a href='#pleft'>%left</a></tt>
				562	<li><tt><a href='#pname'>%name</a></tt>
				563	<li><tt><a href='#pnonassoc'>%nonassoc</a></tt>
				564	<li><tt><a href='#parse_accept'>%parse_accept</a></tt>
				565	<li><tt><a href='#parse_failure'>%parse_failure</a></tt>
				566	<li><tt><a href='#pright'>%right</a></tt>
				567	<li><tt><a href='#stack_overflow'>%stack_overflow</a></tt>
				568	<li><tt><a href='#stack_size'>%stack_size</a></tt>
				569	<li><tt><a href='#start_symbol'>%start_symbol</a></tt>
				570	<li><tt><a href='#syntax_error'>%syntax_error</a></tt>
				571	<li><tt><a href='#token_class'>%token_class</a></tt>
				572	<li><tt><a href='#token_destructor'>%token_destructor</a></tt>
				573	<li><tt><a href='#token_prefix'>%token_prefix</a></tt>
				574	<li><tt><a href='#token_type'>%token_type</a></tt>
				575	<li><tt><a href='#ptype'>%type</a></tt>
				576	<li><tt><a href='#pwildcard'>%wildcard</a></tt>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	577	</ul>
				578	Each of these directives will be described separately in the
				579	following sections:</p>
				580
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	581	<a name='pcode'></a>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	582	<h4>The <tt>%code</tt> directive</h4>
				583
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	584	<p>The <tt>%code</tt> directive is used to specify additional C code that
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	585	is added to the end of the main output file. This is similar to
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	586	the <tt><a href='#pinclude'>%include</a></tt> directive except that
				587	<tt>%include</tt> is inserted at the beginning of the main output file.</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	588
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	589	<p><tt>%code</tt> is typically used to include some action routines or perhaps
				590	a tokenizer or even the "main()" function
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	591	as part of the output file.</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	592
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	593	<a name='default_destructor'></a>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	594	<h4>The <tt>%default_destructor</tt> directive</h4>
				595
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	596	<p>The <tt>%default_destructor</tt> directive specifies a destructor to
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	597	use for non-terminals that do not have their own destructor
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	598	specified by a separate <tt>%destructor</tt> directive. See the documentation
on the <tt><a href='#destructor'>%destructor</a></tt> directive below for
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	600	additional information.</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	601
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	602	<p>In some grammars, many different non-terminal symbols have the
				603	same data type and hence the same destructor. This directive is
				604	a convenient way to specify the same destructor for all those
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	605	non-terminals using a single statement.</p>
				606
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	607	<a name='default_type'></a>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	608	<h4>The <tt>%default_type</tt> directive</h4>
				609
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	610	<p>The <tt>%default_type</tt> directive specifies the data type of non-terminal
				611	symbols that do not have their own data type defined using a separate
				612	<tt><a href='#ptype'>%type</a></tt> directive.</p>
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	613
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	614	<a name='destructor'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	615	<h4>The <tt>%destructor</tt> directive</h4>
				616
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	617	<p>The <tt>%destructor</tt> directive is used to specify a destructor for
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	618	a non-terminal symbol.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	619	(See also the <tt><a href='#token_destructor'>%token_destructor</a></tt>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	620	directive which is used to specify a destructor for terminal symbols.)</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	621
				622	<p>A non-terminal's destructor is called to dispose of the
				623	non-terminal's value whenever the non-terminal is popped from
				624	the stack. This includes all of the following circumstances:
				625	<ul>
				626	<li> When a rule reduces and the value of a non-terminal on
				627	the right-hand side is not linked to C code.
				628	<li> When the stack is popped during error processing.
				629	<li> When the ParseFree() function runs.
				630	</ul>
The destructor can do whatever it wants with the value of
the non-terminal, but it is intended to deallocate memory
or other resources held by that non-terminal.</p>
				634
				635	<p>Consider an example:
				636	<pre>
				637	%type nt {void*}
				638	%destructor nt { free($$); }
				639	nt(A) ::= ID NUM. { A = malloc( 100 ); }
				640	</pre>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	641	This example is a bit contrived, but it serves to illustrate how
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	642	destructors work. The example shows a non-terminal named
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	643	"nt" that holds values of type "void*". When the rule for
				644	an "nt" reduces, it sets the value of the non-terminal to
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	645	space obtained from malloc(). Later, when the nt non-terminal
				646	is popped from the stack, the destructor will fire and call
				647	free() on this malloced space, thus avoiding a memory leak.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	648	(Note that the symbol "$$" in the destructor code is replaced
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	649	by the value of the non-terminal.)</p>
				650
				651	<p>It is important to note that the value of a non-terminal is passed
				652	to the destructor whenever the non-terminal is removed from the
				653	stack, unless the non-terminal is used in a C-code action. If
				654	the non-terminal is used by C-code, then it is assumed that the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	655	C-code will take care of destroying it.
				656	More commonly, the value is used to build some
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	657	larger structure, and we don't want to destroy it, which is why
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	658	the destructor is not called in this circumstance.</p>
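<p>The ownership rule described above can be modelled with a small C
sketch. It is a simplified stand-in, not the code Lemon generates: the
<tt>StackEntry</tt> type, the <tt>used_by_action</tt> flag, and the call
counter are hypothetical names introduced for illustration.</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stack entry; a real Lemon parser stores more per entry. */
typedef struct {
  void *value;   /* value of the "nt" non-terminal, obtained from malloc() */
} StackEntry;

static int destructor_calls = 0;  /* counts destructor runs, for illustration */

/* Plays the role of: %destructor nt { free($$); } */
static void nt_destructor(void *value) {
  free(value);
  destructor_calls++;
}

/* Pop an entry. If a C-code action used the value, the action owns it and
** the destructor is skipped; otherwise the destructor disposes of it.
** Returns the value when ownership passes to the caller, else NULL. */
static void *pop_entry(StackEntry *e, int used_by_action) {
  void *v = e->value;
  e->value = NULL;
  if (used_by_action) return v;  /* caller is now responsible for v */
  nt_destructor(v);
  return NULL;
}
```

<p>In this model, a value handed to an action is never freed by the
parser, so the action must either store it in a larger structure or free
it itself.</p>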
				659
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	660	<p>Destructors help avoid memory leaks by automatically freeing
				661	allocated objects when they go out of scope.
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	662	To do the same using yacc or bison is much more difficult.</p>
				663
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	664	<a name='extraarg'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	665	<h4>The <tt>%extra_argument</tt> directive</h4>
				666
<p>The <tt>%extra_argument</tt> directive instructs Lemon to add a 4th parameter
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	668	to the parameter list of the Parse() function it generates. Lemon
				669	doesn't do anything itself with this extra argument, but it does
				670	make the argument available to C-code action routines, destructors,
				671	and so forth. For example, if the grammar file contains:</p>
				672
				673	<p><pre>
				674	%extra_argument { MyStruct *pAbc }
				675	</pre></p>
				676
<p>Then the Parse() function generated will have a 4th parameter
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	678	of type "MyStruct*" and all action routines will have access to
				679	a variable named "pAbc" that is the value of the 4th parameter
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	680	in the most recent call to Parse().</p>
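<p>A driver might pass the same state object on every call, roughly as
sketched below. The <tt>Parse()</tt> function here is a hand-written
stand-in with the four-parameter shape described above, not Lemon output.
The <tt>tokens_seen</tt> field and <tt>feed_tokens()</tt> helper are
hypothetical, and the token value type is assumed to be the default
"void*".</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical application state passed through %extra_argument. */
typedef struct MyStruct {
  int tokens_seen;
} MyStruct;

/* Stand-in for the generated Parse() function, NOT Lemon output. It only
** mirrors the four-parameter shape: parser, token code, token value, and
** the extra argument. */
static void Parse(void *parser, int token_code, void *token_value,
                  MyStruct *pAbc) {
  (void)parser;
  (void)token_code;
  (void)token_value;
  pAbc->tokens_seen++;   /* an action routine could use pAbc like this */
}

/* Feed a zero-terminated array of token codes, passing the same state
** pointer on each call. */
static void feed_tokens(const int *codes, MyStruct *state) {
  for (size_t i = 0; codes[i] != 0; i++) {
    Parse(NULL, codes[i], NULL, state);
  }
  Parse(NULL, 0, NULL, state);  /* token code 0 signals end of input */
}
```

<p>Because the pointer is supplied on each call, the parser itself needs
no global variables to reach application state.</p>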
				681
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	682	<a name='pfallback'></a>
				683	<h4>The <tt>%fallback</tt> directive</h4>
				684
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	685	<p>The <tt>%fallback</tt> directive specifies an alternative meaning for one
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	686	or more tokens. The alternative meaning is tried if the original token
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	687	would have generated a syntax error.</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	688
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	689	<p>The <tt>%fallback</tt> directive was added to support robust parsing of SQL
				690	syntax in <a href='https://www.sqlite.org/'>SQLite</a>.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	691	The SQL language contains a large assortment of keywords, each of which
				692	appears as a different token to the language parser. SQL contains so
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	693	many keywords that it can be difficult for programmers to keep up with
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	694	them all. Programmers will, therefore, sometimes mistakenly use an
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	695	obscure language keyword for an identifier. The <tt>%fallback</tt> directive
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	696	provides a mechanism to tell the parser: "If you are unable to parse
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	697	this keyword, try treating it as an identifier instead."</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	698
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	699	<p>The syntax of <tt>%fallback</tt> is as follows:
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	700
				701	<blockquote>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	702	<tt>%fallback</tt> <i>ID</i> <i>TOKEN...</i> <b>.</b>
				703	</blockquote></p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	704
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	705	<p>In words, the <tt>%fallback</tt> directive is followed by a list of token
				706	names terminated by a period.
The first token name is the fallback token, that is, the
token to which all the other tokens fall back. The second and subsequent
arguments are tokens which fall back to the token identified by the first
argument.</p>
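<p>The fallback lookup can be pictured as a table indexed by token code.
The sketch below is a simplified model, not the generated parser: the
token codes, the <tt>fallback_of</tt> table, and the <tt>accepts</tt>
array are hypothetical, and the table corresponds to a directive such as
<tt>%fallback ID ABORT AFTER.</tt></p>

```c
#include <assert.h>

/* Hypothetical token codes for this sketch. */
enum { TK_ID = 1, TK_ABORT = 2, TK_AFTER = 3, TK_SELECT = 4, TK_COUNT = 5 };

/* fallback_of[t] is the token that t falls back to, or 0 if none.
** This mirrors: %fallback ID ABORT AFTER. */
static const int fallback_of[TK_COUNT] = { 0, 0, TK_ID, TK_ID, 0 };

/* Return the token code the parser should act on. accepts[t] is nonzero
** if the current state can handle token t. Returns 0 for a syntax error. */
static int resolve_token(int token, const int *accepts) {
  if (accepts[token]) return token;        /* original meaning works */
  int fb = fallback_of[token];
  if (fb != 0 && accepts[fb]) return fb;   /* try the fallback token */
  return 0;                                /* neither meaning is valid */
}
```

<p>The original token is always tried first, so a keyword keeps its
keyword meaning wherever the grammar allows it.</p>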
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	711
				712	<a name='pifdef'></a>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	713	<h4>The <tt>%ifdef</tt>, <tt>%ifndef</tt>, and <tt>%endif</tt> directives</h4>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	714
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	715	<p>The <tt>%ifdef</tt>, <tt>%ifndef</tt>, and <tt>%endif</tt> directives
				716	are similar to #ifdef, #ifndef, and #endif in the C-preprocessor,
				717	just not as general.
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	718	Each of these directives must begin at the left margin. No whitespace
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	719	is allowed between the "%" and the directive name.</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	720
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	721	<p>Grammar text in between "<tt>%ifdef MACRO</tt>" and the next nested
				722	"<tt>%endif</tt>" is
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	723	ignored unless the "-DMACRO" command-line option is used. Grammar text
between "<tt>%ifndef MACRO</tt>" and the next nested "<tt>%endif</tt>" is
				725	included except when the "-DMACRO" command-line option is used.</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	726
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	727	<p>Note that the argument to <tt>%ifdef</tt> and <tt>%ifndef</tt> must
				728	be a single preprocessor symbol name, not a general expression.
				729	There is no "<tt>%else</tt>" directive.</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	730
				731
				732	<a name='pinclude'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	733	<h4>The <tt>%include</tt> directive</h4>
				734
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	735	<p>The <tt>%include</tt> directive specifies C code that is included at the
				736	top of the generated parser. You can include any text you want —
drh	f2340fc	2001-06-08 00:25:18 +0000	[diff] [blame]	737	the Lemon parser generator copies it blindly. If you have multiple
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	738	<tt>%include</tt> directives in your grammar file, their values are concatenated
				739	so that all <tt>%include</tt> code ultimately appears near the top of the
				740	generated parser, in the same order as it appeared in the grammar.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	741
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	742	<p>The <tt>%include</tt> directive is very handy for getting some extra #include
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	743	preprocessor statements at the beginning of the generated parser.
				744	For example:</p>
				745
				746	<p><pre>
%include {#include &lt;unistd.h&gt;}
				748	</pre></p>
				749
				750	<p>This might be needed, for example, if some of the C actions in the
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	751	grammar call functions that are prototyped in unistd.h.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	752
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	753	<a name='pleft'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	754	<h4>The <tt>%left</tt> directive</h4>
				755
<p>The <tt>%left</tt> directive is used (along with the
				757	<tt><a href='#pright'>%right</a></tt> and
				758	<tt><a href='#pnonassoc'>%nonassoc</a></tt> directives) to declare
				759	precedences of terminal symbols.
				760	Every terminal symbol whose name appears after
				761	a <tt>%left</tt> directive but before the next period (".") is
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	762	given the same left-associative precedence value. Subsequent
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	763	<tt>%left</tt> directives have higher precedence. For example:</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	764
				765	<p><pre>
				766	%left AND.
				767	%left OR.
				768	%nonassoc EQ NE GT GE LT LE.
				769	%left PLUS MINUS.
				770	%left TIMES DIVIDE MOD.
				771	%right EXP NOT.
				772	</pre></p>
				773
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	774	<p>Note the period that terminates each <tt>%left</tt>,
				775	<tt>%right</tt> or <tt>%nonassoc</tt>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	776	directive.</p>
				777
				778	<p>LALR(1) grammars can get into a situation where they require
a large amount of stack space if you make heavy use of right-associative
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	780	operators. For this reason, it is recommended that you use <tt>%left</tt>
				781	rather than <tt>%right</tt> whenever possible.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	782
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	783	<a name='pname'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	784	<h4>The <tt>%name</tt> directive</h4>
				785
				786	<p>By default, the functions generated by Lemon all begin with the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	787	five-character string "Parse". You can change this string to something
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	788	different using the <tt>%name</tt> directive. For instance:</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	789
				790	<p><pre>
				791	%name Abcde
				792	</pre></p>
				793
				794	<p>Putting this directive in the grammar file will cause Lemon to generate
				795	functions named
				796	<ul>
				797	<li> AbcdeAlloc(),
				798	<li> AbcdeFree(),
				799	<li> AbcdeTrace(), and
				800	<li> Abcde().
				801	</ul>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	802	The <tt>%name</tt> directive allows you to generate two or more different
				803	parsers and link them all into the same executable.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	804
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	805	<a name='pnonassoc'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	806	<h4>The <tt>%nonassoc</tt> directive</h4>
				807
				808	<p>This directive is used to assign non-associative precedence to
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	809	one or more terminal symbols. See the section on
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	810	<a href='#precrules'>precedence rules</a>
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	811	or on the <tt><a href='#pleft'>%left</a></tt> directive
				812	for additional information.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	813
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	814	<a name='parse_accept'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	815	<h4>The <tt>%parse_accept</tt> directive</h4>
				816
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	817	<p>The <tt>%parse_accept</tt> directive specifies a block of C code that is
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	818	executed whenever the parser accepts its input string. To "accept"
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	819	an input string means that the parser was able to process all tokens
				820	without error.</p>
				821
				822	<p>For example:</p>
				823
				824	<p><pre>
				825	%parse_accept {
				826	printf("parsing complete!\n");
				827	}
				828	</pre></p>
				829
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	830	<a name='parse_failure'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	831	<h4>The <tt>%parse_failure</tt> directive</h4>
				832
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	833	<p>The <tt>%parse_failure</tt> directive specifies a block of C code that
is executed whenever the parser fails completely. This code is not
executed until the parser has tried and failed to resolve an input
error using its usual error recovery strategy. The routine is
				837	only invoked when parsing is unable to continue.</p>
				838
				839	<p><pre>
				840	%parse_failure {
				841	fprintf(stderr,"Giving up. Parser is hopelessly lost...\n");
				842	}
				843	</pre></p>
				844
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	845	<a name='pright'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	846	<h4>The <tt>%right</tt> directive</h4>
				847
				848	<p>This directive is used to assign right-associative precedence to
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	849	one or more terminal symbols. See the section on
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	850	<a href='#precrules'>precedence rules</a>
				851	or on the <a href='#pleft'>%left</a> directive for additional information.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	852
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	853	<a name='stack_overflow'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	854	<h4>The <tt>%stack_overflow</tt> directive</h4>
				855
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	856	<p>The <tt>%stack_overflow</tt> directive specifies a block of C code that
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	857	is executed if the parser's internal stack ever overflows. Typically
				858	this just prints an error message. After a stack overflow, the parser
				859	will be unable to continue and must be reset.</p>
				860
				861	<p><pre>
				862	%stack_overflow {
				863	fprintf(stderr,"Giving up. Parser stack overflow\n");
				864	}
				865	</pre></p>
				866
				867	<p>You can help prevent parser stack overflows by avoiding the use
				868	of right recursion and right-precedence operators in your grammar.
Use left recursion and left-precedence operators instead to
encourage rules to reduce sooner and keep the stack size down.
For example, write rules like this:
				872	<pre>
				873	list ::= list element. // left-recursion. Good!
				874	list ::= .
				875	</pre>
				876	Not like this:
				877	<pre>
				878	list ::= element list. // right-recursion. Bad!
				879	list ::= .
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	880	</pre></p>
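<p>The difference between the two forms can be estimated by counting
stack entries. The C sketch below is a simplified model that tracks only
the symbols of the <tt>list</tt> rules and ignores any other entries a
real parser keeps on its stack. The function names are hypothetical.</p>

```c
#include <assert.h>

/* Peak number of stacked symbols for:
**   list ::= list element.
**   list ::= .
*/
static int max_depth_left_recursion(int n_elements) {
  int depth = 0, peak = 0;
  depth++;                      /* reduce the empty rule: push "list" */
  if (depth > peak) peak = depth;
  for (int i = 0; i < n_elements; i++) {
    depth++;                    /* shift "element" */
    if (depth > peak) peak = depth;
    depth -= 2;                 /* reduce: pop "list element" ... */
    depth++;                    /* ... and push "list" */
  }
  return peak;
}

/* Peak number of stacked symbols for:
**   list ::= element list.
**   list ::= .
*/
static int max_depth_right_recursion(int n_elements) {
  int depth = 0, peak = 0;
  for (int i = 0; i < n_elements; i++) {
    depth++;                    /* every element is shifted before any reduce */
  }
  depth++;                      /* reduce the empty rule: push "list" */
  peak = depth;
  for (int i = 0; i < n_elements; i++) {
    depth -= 2;                 /* reduce: pop "element list" ... */
    depth++;                    /* ... and push "list" */
  }
  return peak;
}
```

<p>Under this model the left-recursive form never holds more than two
symbols, while the right-recursive form holds one more symbol than the
number of elements in the list.</p>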
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	881
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	882	<a name='stack_size'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	883	<h4>The <tt>%stack_size</tt> directive</h4>
				884
				885	<p>If stack overflow is a problem and you can't resolve the trouble
				886	by using left-recursion, then you might want to increase the size
of the parser's stack using this directive. Put a positive integer
after the <tt>%stack_size</tt> directive and Lemon will generate a parser
with a stack of the requested size. The default value is 100.</p>
				890
				891	<p><pre>
				892	%stack_size 2000
				893	</pre></p>
				894
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	895	<a name='start_symbol'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	896	<h4>The <tt>%start_symbol</tt> directive</h4>
				897
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	898	<p>By default, the start symbol for the grammar that Lemon generates
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	899	is the first non-terminal that appears in the grammar file. But you
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	900	can choose a different start symbol using the
				901	<tt>%start_symbol</tt> directive.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	902
				903	<p><pre>
				904	%start_symbol prog
				905	</pre></p>
				906
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	907	<a name='syntax_error'></a>
				908	<h4>The <tt>%syntax_error</tt> directive</h4>
				909
				910	<p>See <a href='#error_processing'>Error Processing</a>.</p>
				911
				912	<a name='token_class'></a>
				913	<h4>The <tt>%token_class</tt> directive</h4>
				914
				915	<p>Undocumented. Appears to be related to the MULTITERMINAL concept.
				916	<a href='http://sqlite.org/src/fdiff?v1=796930d5fc2036c7&v2=624b24c5dc048e09&sbs=0'>Implementation</a>.</p>
				917
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	918	<a name='token_destructor'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	919	<h4>The <tt>%token_destructor</tt> directive</h4>
				920
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	921	<p>The <tt>%destructor</tt> directive assigns a destructor to a non-terminal
				922	symbol. (See the description of the
<tt><a href='#destructor'>%destructor</a></tt> directive above.)
				924	The <tt>%token_destructor</tt> directive does the same thing
				925	for all terminal symbols.</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	926
				927	<p>Unlike non-terminal symbols which may each have a different data type
				928	for their values, terminals all use the same data type (defined by
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	929	the <tt><a href='#token_type'>%token_type</a></tt> directive)
				930	and so they use a common destructor.
				931	Other than that, the token destructor works just like the non-terminal
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	932	destructors.</p>
				933
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	934	<a name='token_prefix'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	935	<h4>The <tt>%token_prefix</tt> directive</h4>
				936
				937	<p>Lemon generates #defines that assign small integer constants
				938	to each terminal symbol in the grammar. If desired, Lemon will
				939	add a prefix specified by this directive
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	940	to each of the #defines it generates.</p>
				941
				942	<p>So if the default output of Lemon looked like this:
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	943	<pre>
				944	#define AND 1
				945	#define MINUS 2
				946	#define OR 3
				947	#define PLUS 4
				948	</pre>
				949	You can insert a statement into the grammar like this:
				950	<pre>
				951	%token_prefix TOKEN_
				952	</pre>
				953	to cause Lemon to produce these symbols instead:
				954	<pre>
				955	#define TOKEN_AND 1
				956	#define TOKEN_MINUS 2
				957	#define TOKEN_OR 3
				958	#define TOKEN_PLUS 4
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	959	</pre></p>
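<p>A tokenizer would then refer to the prefixed names. The sketch below
assumes the four <tt>TOKEN_</tt> symbols shown above. The
<tt>token_code_for()</tt> function and its character mapping are
hypothetical, and in a real program the #defines would come from the
header file Lemon generates rather than being written by hand.</p>

```c
#include <assert.h>

/* The prefixed symbols from the example above, repeated here so that the
** sketch is self-contained. */
#define TOKEN_AND   1
#define TOKEN_MINUS 2
#define TOKEN_OR    3
#define TOKEN_PLUS  4

/* Hypothetical one-character tokenizer step. Returns 0 for a character
** that is not one of the four operators. */
static int token_code_for(char c) {
  switch (c) {
    case '&': return TOKEN_AND;
    case '-': return TOKEN_MINUS;
    case '|': return TOKEN_OR;
    case '+': return TOKEN_PLUS;
    default:  return 0;
  }
}
```

<p>Short names such as AND or OR can clash with macros defined elsewhere
in a program, which is one reason to choose a prefix.</p>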
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	960
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	961	<a name='token_type'></a><a name='ptype'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	962	<h4>The <tt>%token_type</tt> and <tt>%type</tt> directives</h4>
				963
				964	<p>These directives are used to specify the data types for values
				965	on the parser's stack associated with terminal and non-terminal
				966	symbols. The values of all terminal symbols must be of the same
				967	type. This turns out to be the same data type as the 3rd parameter
				968	to the Parse() function generated by Lemon. Typically, you will
make the value of a terminal symbol a pointer to some kind of
				970	token structure. Like this:</p>
				971
				972	<p><pre>
				973	%token_type {Token*}
				974	</pre></p>
				975
<p>If the data type of terminals is not specified, the default type
drh	dfe4e6b	2016-10-08 13:34:08 +0000	[diff] [blame]	977	is "void*".</p>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	978
				979	<p>Non-terminal symbols can each have their own data types. Typically
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	980	the data type of a non-terminal is a pointer to the root of a parse tree
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	981	structure that contains all information about that non-terminal.
				982	For example:</p>
				983
				984	<p><pre>
				985	%type expr {Expr*}
				986	</pre></p>
				987
				988	<p>Each entry on the parser's stack is actually a union containing
				989	instances of all data types for every non-terminal and terminal symbol.
				990	Lemon will automatically use the correct element of this union depending
				991	on what the corresponding non-terminal or terminal symbol is. But
				992	the grammar designer should keep in mind that the size of the union
				993	will be the size of its largest element. So if you have a single
				994	non-terminal whose data type requires 1K of storage, then your 100
				995	entry parser stack will require 100K of heap space. If you are willing
				996	and able to pay that price, fine. You just need to know.</p>
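<p>The storage cost can be checked with <tt>sizeof</tt>. The sketch below
is not Lemon output: the <tt>StackValue</tt> union, the
<tt>BigValue</tt> type, and the <tt>big</tt> non-terminal are
hypothetical, chosen to match the 1K and 100-entry figures used
above.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical value types for this sketch. */
typedef struct Token Token;
typedef struct Expr Expr;
typedef struct { char bytes[1024]; } BigValue;   /* a 1K non-terminal value */

/* One stack entry holds a union with a member for each symbol data type. */
typedef union {
  Token *token;     /* as declared by: %token_type {Token*} */
  Expr *expr;       /* as declared by: %type expr {Expr*} */
  BigValue big;     /* hypothetical:   %type big {BigValue} */
} StackValue;

enum { STACK_ENTRIES = 100 };   /* the default stack size */

/* Bytes needed to hold the values of a full stack. */
static size_t stack_value_bytes(void) {
  return sizeof(StackValue) * STACK_ENTRIES;
}
```

<p>Storing a pointer to the large object instead of the object itself
keeps every stack entry at pointer size, at the cost of an allocation
that a destructor must later release.</p>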
				997
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	998	<a name='pwildcard'></a>
				999	<h4>The <tt>%wildcard</tt> directive</h4>
				1000
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1001	<p>The <tt>%wildcard</tt> directive is followed by a single token name and a
				1002	period. This directive specifies that the identified token should
				1003	match any input token.</p>
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	1004
				1005	<p>When the generated parser has the choice of matching an input against
				1006	the wildcard token and some other token, the other token is always used.
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1007	The wildcard token is only matched if there are no alternatives.</p>
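<p>The matching preference can be sketched as a two-step lookup. This is
a simplified model rather than the generated parser: the token codes and
the <tt>explicit_ok</tt> array are hypothetical, and <tt>TK_ANY</tt>
stands for the token named in a directive such as
<tt>%wildcard ANY.</tt></p>

```c
#include <assert.h>

/* Hypothetical token codes for this sketch. */
enum { TK_ID = 1, TK_COMMA = 2, TK_ANY = 3, TK_COUNT = 4 };

/* Return the token code to match against. explicit_ok[t] is nonzero when
** the current state has an action for token t. Returns 0 when nothing
** matches, which would be a syntax error. */
static int match_token(int token, const int *explicit_ok) {
  if (explicit_ok[token]) return token;    /* a specific match always wins */
  if (explicit_ok[TK_ANY]) return TK_ANY;  /* wildcard only as a last resort */
  return 0;
}
```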
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	1008
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1009	<a name='error_processing'></a>
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1010	<h3>Error Processing</h3>
				1011
				1012	<p>After extensive experimentation over several years, it has been
				1013	discovered that the error recovery strategy used by yacc is about
				1014	as good as it gets. And so that is what Lemon uses.</p>
				1015
				1016	<p>When a Lemon-generated parser encounters a syntax error, it
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1017	first invokes the code specified by the <tt>%syntax_error</tt> directive, if
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1018	any. It then enters its error recovery strategy. The error recovery
strategy is to begin popping the parser's stack until it enters a
				1020	state where it is permitted to shift a special non-terminal symbol
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	1021	named "error". It then shifts this non-terminal and continues
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1022	parsing. The <tt>%syntax_error</tt> routine will not be called again
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1023	until at least three new tokens have been successfully shifted.</p>
				1024
				1025	<p>If the parser pops its stack until the stack is empty, and it still
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1026	is unable to shift the error symbol, then the
				1027	<tt><a href='#parse_failure'>%parse_failure</a></tt> routine
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1028	is invoked and the parser resets itself to its start state, ready
				1029	to begin parsing a new file. This is what will happen at the very
drh	9a243e6	2017-09-20 09:09:34 +0000	[diff] [blame]	1030	first syntax error, of course, if there are no instances of the
drh	9bccde3	2016-03-19 18:00:44 +0000	[diff] [blame]	1031	"error" non-terminal in your grammar.</p>
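<p>The reporting behavior described in this section can be modelled with
a small counter. The sketch below is an assumption-laden simplification,
not the logic of the generated parser: the <tt>Recovery</tt> type, its
field names, and both functions are hypothetical, and the stack popping
itself is reduced to a single <tt>can_shift_error</tt> flag.</p>

```c
#include <assert.h>

/* Hypothetical recovery state; the field names are illustrative only. */
typedef struct {
  int shifts_needed;     /* shifts still required before reporting again */
  int syntax_reports;    /* times the %syntax_error code ran */
  int parse_failures;    /* times the %parse_failure code ran */
} Recovery;

/* Call when a syntax error is detected. can_shift_error is nonzero if
** popping the stack reached a state that can shift the "error" symbol. */
static void on_syntax_error(Recovery *r, int can_shift_error) {
  if (r->shifts_needed == 0) {
    r->syntax_reports++;          /* run the %syntax_error code */
  }
  if (!can_shift_error) {
    r->parse_failures++;          /* run the %parse_failure code */
    r->shifts_needed = 0;         /* parser resets to its start state */
    return;
  }
  r->shifts_needed = 3;           /* stay quiet until three tokens shift */
}

/* Call each time a new token is successfully shifted. */
static void on_token_shifted(Recovery *r) {
  if (r->shifts_needed > 0) r->shifts_needed--;
}
```

<p>Suppressing reports until three tokens have shifted keeps one bad
token from producing a cascade of error messages while the parser
resynchronizes.</p>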
drh	7589723	2000-05-29 14:26:00 +0000	[diff] [blame]	1032
				1033	</body>
				1034	</html>