Internals of the Netwide Assembler
==================================

The Netwide Assembler is intended to be a modular, re-usable x86
assembler, which can be embedded in other programs, for example as
the back end to a compiler.

The assembler is composed of modules. The interfaces between them
look like:

                  +--- preproc.c ----+
                  |                  |
                  +---- parser.c ----+
                  |        |         |
                  |     float.c      |
                  |                  |
                  +--- assemble.c ---+
                  |        |         |
        nasm.c ---+     insnsa.c     +--- nasmlib.c
                  |                  |
                  +--- listing.c ----+
                  |                  |
                  +---- labels.c ----+
                  |                  |
                  +--- outform.c ----+
                  |                  |
                  +----- *out.c -----+

In other words, each of `preproc.c', `parser.c', `assemble.c',
`labels.c', `listing.c', `outform.c' and each of the output format
modules `*out.c' are independent modules, which do not directly
inter-communicate except through the main program.

The Netwide *Disassembler* is not intended to be particularly
portable or reusable or anything, however. So I won't bother
documenting it here. :-)

nasmlib.c
---------

This is a library module; it contains simple library routines which
may be referenced by all other modules. Among these are a set of
wrappers around the standard `malloc' routines, which will report a
fatal error if they run out of memory, rather than returning NULL.
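
A minimal sketch of such a wrapper, to show the idea only. The name
`checked_malloc' is hypothetical (this document does not give the real
routine names), and printing to stderr followed by `exit' is an
assumption about how the fatal error is reported.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: never returns NULL. On failure it reports a
 * fatal error and terminates, so callers need no error checks. */
static void *checked_malloc(size_t size)
{
    void *p;

    if (size == 0)
        size = 1;   /* malloc(0) may legitimately return NULL */

    p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "fatal: out of memory\n");
        exit(EXIT_FAILURE);
    }
    return p;
}
```

A caller can then write `buf = checked_malloc(n);' and use `buf'
immediately, which is the point of centralising the check.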

preproc.c
---------

This contains a macro preprocessor, which takes a file name as input
and returns a sequence of preprocessed source lines. The only symbol
exported from the module is `nasmpp', which is a data structure of
type `Preproc', declared in nasm.h. This structure contains pointers
to all the functions designed to be callable from outside the
module.

parser.c
--------

This contains a source-line parser. It parses `canonical' assembly
source lines, containing some combination of the `label', `opcode',
`operand' and `comment' fields: it does not process directives or
macros. It exports two functions: `parse_line' and `cleanup_insn'.

`parse_line' is the main parser function: you pass it a source line
in ASCII text form, and it returns you an `insn' structure
containing all the details of the instruction on that line. The
parameters it requires are:

- The location (segment, offset) where the instruction on this line
  will eventually be placed. This is necessary in order to evaluate
  expressions containing the Here token, `$'.

- A function which can be called to retrieve the value of any
  symbols the source line references.

- Which pass the assembler is on: an undefined symbol only causes an
  error condition on pass two.

- The source line to be parsed.

- A structure to fill with the results of the parse.

- A function which can be called to report errors.

Some instructions (DB, DW, DD for example) can require an arbitrary
amount of storage, and so some of the members of the resulting
`insn' structure will be dynamically allocated. The other function
exported by `parser.c' is `cleanup_insn', which can be called to
deallocate any dynamic storage associated with the results of a
parse.

names.c
-------

This doesn't count as a module - it defines a few arrays which are
shared between NASM and NDISASM, so it's a separate file which is
#included by both parser.c and disasm.c.

float.c
-------

This is essentially a library module: it exports one function,
`float_const', which converts an ASCII representation of a
floating-point number into an x86-compatible binary representation,
without using any built-in floating-point arithmetic (so it will run
on any platform, portably). It calls nothing, and is called only by
`parser.c'. Note that the function `float_const' must be passed an
error reporting routine.

assemble.c
----------

This module contains the code generator: it translates `insn'
structures as returned from the parser module into actual generated
code which can be placed in an output file. It exports two
functions, `assemble' and `insn_size'.

`insn_size' is designed to be called on pass one of assembly: it
takes an `insn' structure as input, and returns the amount of space
that would be taken up if the instruction described in the structure
were to be converted to real machine code. `insn_size' also needs
to be told the location (as a segment/offset pair) where the
instruction would be assembled, the mode of assembly (16/32 bit
default), and a function it can call to report errors.

`assemble' is designed to be called on pass two: it takes all the
parameters that `insn_size' does, but has an extra parameter which
is an output driver. `assemble' actually converts the input
instruction into machine code, and outputs the machine code by means
of calling the `output' function of the driver.
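
The division of labour between the two passes can be sketched with toy
stand-ins. Everything named `toy_...' below is hypothetical: the real
`insn' structure, the real driver and the real signatures of
`insn_size' and `assemble' are not reproduced here, and the assembly
mode and error-reporting parameters are left out for brevity.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-ins, NOT the real NASM types: the real `insn' structure and
 * output driver carry far more information than this. */
typedef struct {
    const unsigned char *bytes;   /* pre-encoded machine code, for the toy */
    size_t len;
} toy_insn;

typedef struct {
    void (*output)(long segment, long offset,
                   const unsigned char *data, size_t len);
} toy_driver;

/* Pass one: report how much space the instruction would occupy. */
static long toy_insn_size(long segment, long offset, const toy_insn *ins)
{
    (void)segment;
    (void)offset;                 /* a real sizer may depend on location */
    return (long)ins->len;
}

/* Pass two: hand the generated bytes to the driver's output function. */
static long toy_assemble(long segment, long offset, const toy_insn *ins,
                         const toy_driver *drv)
{
    drv->output(segment, offset, ins->bytes, ins->len);
    return (long)ins->len;
}

/* A trivial driver that captures output in a fixed buffer. */
static unsigned char toy_image[64];

static void toy_capture(long segment, long offset,
                        const unsigned char *data, size_t len)
{
    (void)segment;
    if (offset >= 0 && (size_t)offset + len <= sizeof toy_image)
        memcpy(toy_image + offset, data, len);
}

/* Run both passes over a program; returns the total size in bytes,
 * or -1 if the two passes disagree about it. */
static long toy_two_pass(const toy_insn *prog, size_t n,
                         const toy_driver *drv)
{
    long total = 0, offset = 0;
    size_t i;

    for (i = 0; i < n; i++)       /* pass one: sizes only */
        total += toy_insn_size(0, total, &prog[i]);

    for (i = 0; i < n; i++)       /* pass two: emit code */
        offset += toy_assemble(0, offset, &prog[i], drv);

    return offset == total ? total : -1L;
}
```

The point of sizing first is that every offset is known before any
code is emitted, so pass two can resolve forward references.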

insnsa.c
--------

This is another library module: it exports one very big array of
instruction translations. It is generated automatically from the
insns.dat file by the insns.pl script.

labels.c
--------

This module contains a label manager. It exports six functions:

`init_labels' should be called before any other function in the
module. `cleanup_labels' may be called after all other use of the
module has finished, to deallocate storage.

`define_label' is called to define new labels: you pass it the name
of the label to be defined, and the (segment,offset) pair giving the
value of the label. It is also passed an error-reporting function,
and an output driver structure (so that it can call the output
driver's label-definition function). `define_label' mentally
prepends the name of the most recently defined non-local label to
any label beginning with a period.
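
The local-label rule can be sketched as follows. This is an
illustration only: `expand_label', `prev_nonlocal' and the fixed buffer
size are hypothetical, and the sketch follows the rule exactly as
stated above (any leading period marks a local label).

```c
#include <stddef.h>
#include <string.h>

#define MAX_LABEL 256

/* Name of the most recently defined non-local label. */
static char prev_nonlocal[MAX_LABEL] = "";

/* Write the full name of `name' into `out'. A label beginning with a
 * period gets the previous non-local label prepended; any other label
 * is copied as-is and becomes the new "previous non-local label".
 * Returns 0 on success, -1 if the result does not fit. */
static int expand_label(const char *name, char *out, size_t outsize)
{
    if (name[0] == '.') {
        if (strlen(prev_nonlocal) + strlen(name) + 1 > outsize)
            return -1;
        strcpy(out, prev_nonlocal);
        strcat(out, name);
    } else {
        size_t need = strlen(name) + 1;
        if (need > outsize || need > sizeof prev_nonlocal)
            return -1;
        strcpy(out, name);
        strcpy(prev_nonlocal, name);
    }
    return 0;
}
```

With this rule, defining `start', then `.loop', then `end', then
`.loop' again yields two distinct symbols, `start.loop' and `end.loop'.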

`define_label_stub' is designed to be called in pass two, once all
the labels have already been defined: it does nothing except to
update the "most-recently-defined-non-local-label" status, so that
references to local labels in pass two will work correctly.

`declare_as_global' is used to declare that a label should be
global. It must be called _before_ the label in question is defined.

Finally, `lookup_label' attempts to translate a label name into a
(segment,offset) pair. It returns non-zero on success.

The label manager module is (theoretically :) restartable: after
calling `cleanup_labels', you can call `init_labels' again, and
start a new assembly with a new set of symbols.

listing.c
---------

This file contains the listing file generator. The interface to the
module is through the one symbol it exports, `nasmlist', which is a
structure containing six function pointers. The calling semantics of
these functions isn't terribly well thought out, as yet, but it
works (just about) so it's going to get left alone for now...

outform.c
---------

This small module contains a set of routines to manage a list of
output formats, and select one given a keyword. It contains three
small routines: `ofmt_register' which registers an output driver as
part of the managed list, `ofmt_list' which lists the available
drivers on stdout, and `ofmt_find' which tries to find the driver
corresponding to a given name.
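
A registry of this shape might look like the sketch below. The `toy_'
names, the two-field driver structure and the fixed-size table are all
assumptions made for illustration; the real signatures of
`ofmt_register', `ofmt_list' and `ofmt_find' are not given in this
document.

```c
#include <stdio.h>
#include <string.h>

#define MAX_FORMATS 16

/* Toy driver record: a real output driver also holds function
 * pointers for producing the output file. */
struct toy_ofmt {
    const char *fullname;    /* human-readable description */
    const char *shortname;   /* keyword used to select the format */
};

static const struct toy_ofmt *toy_formats[MAX_FORMATS];
static int toy_nformats = 0;

/* Add a driver to the managed list; returns 0, or -1 if it is full. */
static int toy_ofmt_register(const struct toy_ofmt *f)
{
    if (toy_nformats >= MAX_FORMATS)
        return -1;
    toy_formats[toy_nformats++] = f;
    return 0;
}

/* Find the driver whose keyword matches `name', or NULL if none. */
static const struct toy_ofmt *toy_ofmt_find(const char *name)
{
    int i;

    for (i = 0; i < toy_nformats; i++)
        if (strcmp(toy_formats[i]->shortname, name) == 0)
            return toy_formats[i];
    return NULL;
}

/* List the registered drivers on the given stream. */
static void toy_ofmt_list(FILE *fp)
{
    int i;

    for (i = 0; i < toy_nformats; i++)
        fprintf(fp, "  %-8s %s\n", toy_formats[i]->shortname,
                toy_formats[i]->fullname);
}
```

Because each driver registers itself by pointer, the main program only
needs the keyword from the command line to obtain the whole driver.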

The output modules
------------------

Each of the output modules, `outbin.o', `outelf.o' and so on,
exports only one symbol, which is an output driver data structure
containing pointers to all the functions needed to produce output
files of the appropriate type.

The exception to this is `outcoff.o', which exports _two_ output
driver structures, since COFF and Win32 object file formats are very
similar and most of the code is shared between them.

nasm.c
------

This is the main program: it calls all the functions in the above
modules, and puts them together to form a working assembler. We
hope. :-)

Segment Mechanism
-----------------

In NASM, the term `segment' is used to separate the different
sections/segments/groups of which an object file is composed.
Essentially, every address NASM is capable of understanding is
expressed as an offset from the beginning of some segment.

The defining property of a segment is that if two symbols are
declared in the same segment, then the distance between them is
fixed at assembly time. Hence every externally-declared variable
must be declared in its own segment, since none of the locations of
these are known, and so no distances may be computed at assembly
time.

The special segment value NO_SEG (-1) is used to denote an absolute
value, e.g. a constant whose value does not depend on relocation,
such as the _size_ of a data object.

Apart from NO_SEG, segment indices all have their least significant
bit clear, if they refer to actual in-memory segments. For each
segment of this type, there is an auxiliary segment value, defined
to be the same number but with the LSB set, which denotes the
segment-base value of that segment, for object formats which support
it (Microsoft .OBJ, for example).

Hence, if `textsym' is declared in a code segment with index 2, then
referencing `SEG textsym' would return zero offset from
segment-index 3. Or, in object formats which don't understand such
references, it would return an error instead.
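
The numbering convention above can be written down as a few helpers.
The helper names are hypothetical, and the sketch deliberately ignores
far-absolute values (SEG_ABS plus a constant), which are not in-memory
segments.

```c
#define NO_SEG (-1L)   /* absolute value: no segment at all */

/* An in-memory segment has its least significant bit clear.
 * NO_SEG is tested first, since -1 has every bit set. */
static int is_in_memory_segment(long seg)
{
    return seg != NO_SEG && (seg & 1L) == 0;
}

/* The auxiliary "segment base" index of an in-memory segment is the
 * same number with the LSB set. */
static long segment_base_of(long seg)
{
    return seg | 1L;
}
```

For the `textsym' example, `segment_base_of(2)' gives 3, the index
that `SEG textsym' refers to.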

The next twist is SEG_ABS. Some symbols may be declared with a
segment value of SEG_ABS plus a 16-bit constant: this indicates that
they are far-absolute symbols, such as the BIOS keyboard buffer
under MS-DOS, which always resides at 0040h:001Eh. Far-absolutes are
handled with care in the parser, since they are supposed to evaluate
simply to their offset part within expressions, but applying SEG to
one should yield its segment part. A far-absolute should never find
its way _out_ of the parser, unless it is enclosed in a WRT clause,
in which case Microsoft 16-bit object formats will want to know
about it.
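
A sketch of the far-absolute encoding, under a loudly stated
assumption: this document does not give the numeric value of SEG_ABS,
so the constant below (0x40000000) is a placeholder chosen to lie well
above any ordinary segment index. The helper names are hypothetical
too.

```c
/* ASSUMED value, for illustration only: any constant that cannot
 * collide with ordinary segment indices would serve. */
#define TOY_SEG_ABS 0x40000000L

/* Encode a 16-bit segment part as a far-absolute segment value. */
static long make_far_absolute(unsigned int segpart)
{
    return TOY_SEG_ABS + (long)(segpart & 0xFFFFu);
}

/* True if `seg' is TOY_SEG_ABS plus some 16-bit constant. */
static int is_far_absolute(long seg)
{
    return seg >= TOY_SEG_ABS && seg <= TOY_SEG_ABS + 0xFFFFL;
}

/* Recover the segment part, as applying SEG to the symbol would. */
static unsigned int far_absolute_segment(long seg)
{
    return (unsigned int)(seg - TOY_SEG_ABS);
}
```

The keyboard buffer at 0040h:001Eh would then carry the segment value
`make_far_absolute(0x0040)' and the offset 001Eh.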

Porting Issues
--------------

We have tried to write NASM in portable ANSI C: we do not assume
little-endianness or any hardware characteristics (in order that
NASM should work as a cross-assembler for x86 platforms, even when
run on other, stranger machines).
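
One common way to avoid depending on the host's byte order is to build
little-endian output one byte at a time with shifts and masks. The
sketch below illustrates that technique; the function names are
hypothetical and it is not claimed to be the code NASM itself uses.

```c
/* Store the low 16 bits of v, least significant byte first,
 * whatever the host's own byte order is. */
static void put_le16(unsigned char *p, unsigned int v)
{
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
}

/* Store the low 32 bits of v, least significant byte first. */
static void put_le32(unsigned char *p, unsigned long v)
{
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
    p[2] = (unsigned char)((v >> 16) & 0xFF);
    p[3] = (unsigned char)((v >> 24) & 0xFF);
}
```

Because the value is never reinterpreted through a pointer cast, the
same bytes come out on big-endian and little-endian hosts alike.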

Assumptions we _have_ made are:

- We assume that `short' is at least 16 bits, and `long' at least
  32. This really _shouldn't_ be a problem, since Kernighan and
  Ritchie tell us we are entitled to do so.

- We rely on having more than 6 characters of significance on
  externally linked symbols in the NASM sources. This may get fixed
  at some point. We haven't yet come across a linker brain-dead
  enough to get it wrong anyway.

- We assume that `fopen' using the mode "wb" can be used to write
  binary data files. This may be wrong on systems like VMS, with a
  strange file system. Though why you'd want to run NASM on VMS is
  beyond me anyway.

That's it. Subject to those caveats, NASM should be completely
portable. If not, we _really_ want to know about it.

Porting Non-Issues
------------------

The following is _not_ a portability problem, although it looks like
one.

- When compiling with some versions of DJGPP, you may get
  diagnostics such as `warning: ANSI C forbids braced-groups within
  expressions'. This isn't NASM's fault - the problem seems to be
  that DJGPP's definitions of the <ctype.h> macros include a
  GNU-specific C extension. So when compiling using -ansi and
  -pedantic, DJGPP complains about its own header files. It isn't a
  problem anyway, since it still generates correct code.