Internals of the Netwide Assembler
==================================

The Netwide Assembler is intended to be a modular, re-usable x86
assembler, which can be embedded in other programs, for example as
the back end to a compiler.

The assembler is composed of modules. The interfaces between them
look like:

                  +--- preproc.c ----+
                  |                  |
                  +---- parser.c ----+
                  |        |         |
                  |     float.c      |
                  |                  |
                  +--- assemble.c ---+
                  |        |         |
        nasm.c ---+     insnsa.c     +--- nasmlib.c
                  |                  |
                  +--- listing.c ----+
                  |                  |
                  +---- labels.c ----+
                  |                  |
                  +--- outform.c ----+
                  |                  |
                  +----- *out.c -----+

In other words, each of `preproc.c', `parser.c', `assemble.c',
`labels.c', `listing.c', `outform.c' and each of the output format
modules `*out.c' are independent modules, which do not directly
inter-communicate except through the main program.

The Netwide Disassembler is not intended to be particularly
portable or reusable or anything, however. So I won't bother
documenting it here. :-)

nasmlib.c
---------

This is a library module; it contains simple library routines which
may be referenced by all other modules. Among these are a set of
wrappers around the standard `malloc' routines, which will report a
fatal error if they run out of memory, rather than returning NULL.

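As a rough sketch of the wrapper idea described above, an allocation
routine that treats exhaustion as fatal might look like the following.
The name `checked_malloc' is hypothetical and is not taken from
nasmlib.c; the real routines may differ in name, signature and in how
they report the error.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper, for illustration only. */
void *checked_malloc(size_t size)
{
    /* malloc(0) may legitimately return NULL, so never ask for 0. */
    void *p = malloc(size ? size : 1);

    if (p == NULL) {
        /* Out of memory is fatal: report it and stop, rather than
         * handing NULL back to the caller. */
        fprintf(stderr, "fatal: out of memory\n");
        exit(EXIT_FAILURE);
    }
    return p;
}
```

Callers can then use the result without testing it for NULL.
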
preproc.c
---------

This contains a macro preprocessor, which takes a file name as input
and returns a sequence of preprocessed source lines. The only symbol
exported from the module is `nasmpp', which is a data structure of
type `Preproc', declared in nasm.h. This structure contains pointers
to all the functions designed to be callable from outside the
module.

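The "one exported structure of function pointers" pattern can be
sketched as below. This is only an illustration: the type `ToyPreproc',
its member names and the canned source lines are all hypothetical, and
are not copied from the `Preproc' declaration in nasm.h.

```c
#include <stddef.h>

/* Hypothetical interface, for illustration only. */
typedef struct {
    void (*reset)(const char *filename);  /* start on a new input */
    const char *(*next_line)(void);       /* next line, NULL at end */
    void (*cleanup)(void);                /* release any state */
} ToyPreproc;

/* Everything from here to the final definition is private. */
static const char *toy_lines[] = { "mov ax,1", "ret", NULL };
static size_t toy_pos;

static void toy_reset(const char *filename)
{
    (void)filename;   /* a real preprocessor would open the file */
    toy_pos = 0;
}

static const char *toy_next_line(void)
{
    const char *line = toy_lines[toy_pos];

    if (line != NULL)
        toy_pos++;
    return line;
}

static void toy_cleanup(void)
{
    toy_pos = 0;
}

/* The only symbol with external linkage. */
ToyPreproc toypp = { toy_reset, toy_next_line, toy_cleanup };
```

The main program would call `toypp.reset' once, then `toypp.next_line'
until it returns NULL, without knowing anything else about the module.
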
parser.c
--------

This contains a source-line parser. It parses `canonical' assembly
source lines, containing some combination of the `label', `opcode',
`operand' and `comment' fields: it does not process directives or
macros. It exports two functions: `parse_line' and `cleanup_insn'.

`parse_line' is the main parser function: you pass it a source line
in ASCII text form, and it returns you an `insn' structure
containing all the details of the instruction on that line. The
parameters it requires are:

- The location (segment, offset) where the instruction on this line
  will eventually be placed. This is necessary in order to evaluate
  expressions containing the Here token, `$'.

- A function which can be called to retrieve the value of any
  symbols the source line references.

- Which pass the assembler is on: an undefined symbol only causes an
  error condition on pass two.

- The source line to be parsed.

- A structure to fill with the results of the parse.

- A function which can be called to report errors.

Some instructions (DB, DW, DD for example) can require an arbitrary
amount of storage, and so some of the members of the resulting
`insn' structure will be dynamically allocated. The other function
exported by `parser.c' is `cleanup_insn', which can be called to
deallocate any dynamic storage associated with the results of a
parse.

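The interplay of the symbol-lookup function, the pass number and the
error-reporting function listed above can be sketched as follows. All
names here (`toy_resolve', the two callback types and the demo
callbacks) are hypothetical; the real callback types are declared in
nasm.h and may take different arguments.

```c
#include <string.h>

/* Hypothetical callback types, for illustration only. */
typedef int (*toy_lookup_fn)(const char *name, long *value);
typedef void (*toy_error_fn)(const char *message);

/* Resolve `name'. Returns 1 if assembly can carry on, 0 on error.
 * An unknown symbol is tolerated on pass one (it may be defined
 * further down the file) and is reported only on pass two. */
int toy_resolve(int pass, const char *name, toy_lookup_fn lookup,
                toy_error_fn report, long *value)
{
    if (lookup(name, value))
        return 1;
    *value = 0;               /* placeholder for a forward reference */
    if (pass == 2) {
        report("symbol not defined");
        return 0;
    }
    return 1;
}

/* Demo callbacks: one known symbol, and an error counter. */
int toy_error_count;

int toy_lookup(const char *name, long *value)
{
    if (strcmp(name, "start") == 0) {
        *value = 0x100;
        return 1;
    }
    return 0;
}

void toy_report(const char *message)
{
    (void)message;
    toy_error_count++;
}
```
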
names.c
-------

This doesn't count as a module - it defines a few arrays which are
shared between NASM and NDISASM, so it's a separate file which is
#included by both parser.c and disasm.c.

float.c
-------

This is essentially a library module: it exports one function,
`float_const', which converts an ASCII representation of a
floating-point number into an x86-compatible binary representation,
without using any built-in floating-point arithmetic (so it will run
on any platform, portably). It calls nothing, and is called only by
`parser.c'. Note that the function `float_const' must be passed an
error reporting routine.

assemble.c
----------

This module contains the code generator: it translates `insn'
structures as returned from the parser module into actual generated
code which can be placed in an output file. It exports two
functions, `assemble' and `insn_size'.

`insn_size' is designed to be called on pass one of assembly: it
takes an `insn' structure as input, and returns the amount of space
that would be taken up if the instruction described in the structure
were to be converted to real machine code. `insn_size' must also be
told the location (as a segment/offset pair) where the instruction
would be assembled, the mode of assembly (16/32 bit default), and a
function it can call to report errors.

`assemble' is designed to be called on pass two: it takes all the
parameters that `insn_size' does, but has an extra parameter which
is an output driver. `assemble' actually converts the input
instruction into machine code, and outputs the machine code by means
of calling the `output' function of the driver.

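The division of labour between the two passes can be sketched with toy
types. Everything here is hypothetical: the real `insn' structure and
output driver carry far more information, and real instructions are
encoded by `assemble' rather than arriving pre-encoded as they do in
this sketch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for `insn' and the output driver. */
typedef struct {
    const unsigned char *bytes;   /* pre-encoded machine code */
    size_t len;
} ToyInsn;

typedef struct {
    void (*output)(const unsigned char *data, size_t len);
} ToyDriver;

/* Pass one: report how much space the instruction would occupy. */
size_t toy_insn_size(const ToyInsn *ins)
{
    return ins->len;
}

/* Pass two: hand the bytes to the driver's output function. */
size_t toy_assemble(const ToyInsn *ins, const ToyDriver *drv)
{
    drv->output(ins->bytes, ins->len);
    return ins->len;
}

/* Demo driver that collects output in a fixed buffer. */
unsigned char toy_buf[16];
size_t toy_used;

void toy_output(const unsigned char *data, size_t len)
{
    size_t i;

    for (i = 0; i < len && toy_used < sizeof toy_buf; i++)
        toy_buf[toy_used++] = data[i];
}
```

Summing `toy_insn_size' over a program on pass one gives the offsets
needed for labels; pass two then produces exactly that many bytes.
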
insnsa.c
--------

This is another library module: it exports one very big array of
instruction translations. It is generated automatically from the
insns.dat file by the insns.pl script.

labels.c
--------

This module contains a label manager. It exports six functions:

`init_labels' should be called before any other function in the
module. `cleanup_labels' may be called after all other use of the
module has finished, to deallocate storage.

`define_label' is called to define new labels: you pass it the name
of the label to be defined, and the (segment,offset) pair giving the
value of the label. It is also passed an error-reporting function,
and an output driver structure (so that it can call the output
driver's label-definition function). `define_label' mentally
prepends the name of the most recently defined non-local label to
any label beginning with a period.

`define_label_stub' is designed to be called in pass two, once all
the labels have already been defined: it does nothing except to
update the "most-recently-defined-non-local-label" status, so that
references to local labels in pass two will work correctly.

`declare_as_global' is used to declare that a label should be
global. It must be called _before_ the label in question is defined.

Finally, `lookup_label' attempts to translate a label name into a
(segment,offset) pair. It returns non-zero on success.

The label manager module is (theoretically :) restartable: after
calling `cleanup_labels', you can call `init_labels' again, and
start a new assembly with a new set of symbols.

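The local-label rule used by `define_label' (prepend the most recently
defined non-local label to any name beginning with a period) can be
sketched as below. The function `toy_full_label_name' and its fixed
name limit are hypothetical, and the sketch assumes that every name
starting with a period is local, exactly as the text above states.

```c
#include <stdio.h>

#define TOY_NAME_MAX 64

/* Name of the most recently defined non-local label. */
static char toy_prev_global[TOY_NAME_MAX] = "";

/* Write the full name for `name' into `out' (at most `outsize'
 * bytes, always terminated). A non-local name also becomes the new
 * "most recently defined non-local label". */
void toy_full_label_name(const char *name, char *out, size_t outsize)
{
    if (name[0] == '.') {
        snprintf(out, outsize, "%s%s", toy_prev_global, name);
    } else {
        snprintf(out, outsize, "%s", name);
        snprintf(toy_prev_global, sizeof toy_prev_global, "%s", name);
    }
}
```

Because the prefix is module state, pass two has to replay the same
sequence of non-local definitions for local references to resolve,
which is the job the text assigns to `define_label_stub'.
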
listing.c
---------

This file contains the listing file generator. The interface to the
module is through the one symbol it exports, `nasmlist', which is a
structure containing six function pointers. The calling semantics of
these functions aren't terribly well thought out, as yet, but they
work (just about) so they're going to get left alone for now...

outform.c
---------

This small module contains a set of routines to manage a list of
output formats, and select one given a keyword. It contains three
small routines: `ofmt_register' which registers an output driver as
part of the managed list, `ofmt_list' which lists the available
drivers on stdout, and `ofmt_find' which tries to find the driver
corresponding to a given name.

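A registry of this shape can be sketched in a few lines. The names
`ToyFormat', `toy_register', `toy_find' and `toy_list', the fixed
table size and the two structure members are all hypothetical; the
real routines operate on the output driver structure from nasm.h.

```c
#include <stdio.h>
#include <string.h>

#define TOY_MAX_FORMATS 16

/* Hypothetical driver record, for illustration only. */
typedef struct {
    const char *shortname;   /* keyword used to select the format */
    const char *fullname;    /* description shown when listing */
} ToyFormat;

static const ToyFormat *toy_formats[TOY_MAX_FORMATS];
static int toy_nformats;

/* Add a driver to the list. Returns 1 on success, 0 if full. */
int toy_register(const ToyFormat *fmt)
{
    if (toy_nformats >= TOY_MAX_FORMATS)
        return 0;
    toy_formats[toy_nformats++] = fmt;
    return 1;
}

/* Find a driver by keyword. Returns NULL if there is no match. */
const ToyFormat *toy_find(const char *name)
{
    int i;

    for (i = 0; i < toy_nformats; i++)
        if (strcmp(toy_formats[i]->shortname, name) == 0)
            return toy_formats[i];
    return NULL;
}

/* List the registered drivers on stdout. */
void toy_list(void)
{
    int i;

    for (i = 0; i < toy_nformats; i++)
        printf("%-8s %s\n", toy_formats[i]->shortname,
               toy_formats[i]->fullname);
}
```
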
The output modules
------------------

Each of the output modules, `outbin.o', `outelf.o' and so on,
exports only one symbol, which is an output driver data structure
containing pointers to all the functions needed to produce output
files of the appropriate type.

The exception to this is `outcoff.o', which exports _two_ output
driver structures, since COFF and Win32 object file formats are very
similar and most of the code is shared between them.

nasm.c
------

This is the main program: it calls all the functions in the above
modules, and puts them together to form a working assembler. We
hope. :-)

Segment Mechanism
-----------------

In NASM, the term `segment' is used to separate the different
sections/segments/groups of which an object file is composed.
Essentially, every address NASM is capable of understanding is
expressed as an offset from the beginning of some segment.

The defining property of a segment is that if two symbols are
declared in the same segment, then the distance between them is
fixed at assembly time. Hence every externally-declared variable
must be declared in its own segment, since none of the locations of
these are known, and so no distances may be computed at assembly
time.

The special segment value NO_SEG (-1) is used to denote an absolute
value, e.g. a constant whose value does not depend on relocation,
such as the _size_ of a data object.

Apart from NO_SEG, segment indices all have their least significant
bit clear, if they refer to actual in-memory segments. For each
segment of this type, there is an auxiliary segment value, defined
to be the same number but with the LSB set, which denotes the
segment-base value of that segment, for object formats which support
it (Microsoft .OBJ, for example).

Hence, if `textsym' is declared in a code segment with index 2, then
referencing `SEG textsym' would return an offset of zero from
segment index 3. Or, in object formats which don't understand such
references, it would return an error instead.

The next twist is SEG_ABS. Some symbols may be declared with a
segment value of SEG_ABS plus a 16-bit constant: this indicates that
they are far-absolute symbols, such as the BIOS keyboard buffer
under MS-DOS, which always resides at 0040h:001Eh. Far-absolutes are
handled with care in the parser, since they are supposed to evaluate
simply to their offset part within expressions, but applying SEG to
one should yield its segment part. A far-absolute should never find
its way _out_ of the parser, unless it is enclosed in a WRT clause,
in which case Microsoft 16-bit object formats will want to know
about it.

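The segment-index conventions above amount to a little bit
arithmetic, sketched here. The helper names are hypothetical. The
value -1 for NO_SEG comes from the text; the value used for SEG_ABS
(0x40000000) is an assumption made for this sketch and should be
checked against nasm.h.

```c
#define TOY_NO_SEG   (-1L)         /* absolute value, as in the text */
#define TOY_SEG_ABS  0x40000000L   /* assumed value, see note above */

/* True for an index naming an actual in-memory segment: not NO_SEG,
 * below the far-absolute range, and with the LSB clear. */
int toy_is_memory_segment(long seg)
{
    return seg >= 0 && seg < TOY_SEG_ABS && (seg & 1L) == 0;
}

/* Index denoting the segment base of `seg': same number, LSB set.
 * This is what `SEG textsym' refers to. */
long toy_segment_base_index(long seg)
{
    return seg | 1L;
}

/* True if `seg' is SEG_ABS plus a 16-bit constant. */
int toy_is_far_absolute(long seg)
{
    return seg >= TOY_SEG_ABS && seg <= TOY_SEG_ABS + 0xFFFFL;
}

/* Segment part of a far-absolute, e.g. 0040h for 0040h:001Eh. */
long toy_far_absolute_segment(long seg)
{
    return seg - TOY_SEG_ABS;
}
```

With these, the `textsym' example reads: segment index 2 is a memory
segment, and its segment base is index 3.
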
Porting Issues
--------------

We have tried to write NASM in portable ANSI C: we do not assume
little-endianness or any hardware characteristics (in order that
NASM should work as a cross-assembler for x86 platforms, even when
run on other, stranger machines).

Assumptions we _have_ made are:

- We assume that `short' is at least 16 bits, and `long' at least
  32. This really _shouldn't_ be a problem, since Kernighan and
  Ritchie tell us we are entitled to do so.

- We rely on having more than 6 characters of significance on
  externally linked symbols in the NASM sources. This may get fixed
  at some point. We haven't yet come across a linker brain-dead
  enough to get it wrong anyway.

- We assume that `fopen' using the mode "wb" can be used to write
  binary data files. This may be wrong on systems like VMS, with a
  strange file system. Though why you'd want to run NASM on VMS is
  beyond me anyway.

That's it. Subject to those caveats, NASM should be completely
portable. If not, we _really_ want to know about it.

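Two of the points above can be illustrated briefly: checking the
integer-width assumption at compile time, and producing little-endian
x86 data without depending on the byte order of the host. The
function name `toy_put_dword' is hypothetical; this is a sketch of
the technique, not a routine quoted from the NASM sources.

```c
#include <limits.h>

/* Refuse to compile if the width assumptions do not hold. */
#if USHRT_MAX < 0xFFFF || ULONG_MAX < 0xFFFFFFFF
#error "short must hold 16 bits and long must hold 32 bits"
#endif

/* Store the low 32 bits of `value' as four little-endian bytes.
 * Shifting and masking gives the same result on any host, whereas
 * copying the bytes of a `long' would depend on host byte order. */
void toy_put_dword(unsigned char *dst, unsigned long value)
{
    dst[0] = (unsigned char)(value & 0xFF);
    dst[1] = (unsigned char)((value >> 8) & 0xFF);
    dst[2] = (unsigned char)((value >> 16) & 0xFF);
    dst[3] = (unsigned char)((value >> 24) & 0xFF);
}
```
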
Porting Non-Issues
------------------

The following is _not_ a portability problem, although it looks like
one.

- When compiling with some versions of DJGPP, you may get errors
  such as `warning: ANSI C forbids braced-groups within
  expressions'. This isn't NASM's fault - the problem seems to be
  that DJGPP's definitions of the <ctype.h> macros include a
  GNU-specific C extension. So when compiling using -ansi and
  -pedantic, DJGPP complains about its own header files. It isn't a
  problem anyway, since it still generates correct code.