H. Peter Anvin | 9a633fa | 2002-04-30 21:08:11 +0000 | [diff] [blame] | 1 | Internals of the Netwide Assembler |
| 2 | ================================== |
| 3 | |
| 4 | The Netwide Assembler is intended to be a modular, reusable x86 |
| 5 | assembler, which can be embedded in other programs, for example as |
| 6 | the back end to a compiler. |
| 7 | |
| 8 | The assembler is composed of modules. The interfaces between them |
| 9 | look like: |
| 10 | |
| 11 |           +--- preproc.c ----+ |
| 12 |           |                  | |
| 13 |           +---- parser.c ----+ |
| 14 |           |        |         | |
| 15 |           |     float.c      | |
| 16 |           |                  | |
| 17 |           +--- assemble.c ---+ |
| 18 |           |        |         | |
| 19 | nasm.c ---+     insnsa.c     +--- nasmlib.c |
| 20 |           |                  | |
| 21 |           +--- listing.c ----+ |
| 22 |           |                  | |
| 23 |           +---- labels.c ----+ |
| 24 |           |                  | |
| 25 |           +--- outform.c ----+ |
| 26 |           |                  | |
| 27 |           +----- *out.c -----+ |
| 28 | |
| 29 | In other words, each of `preproc.c', `parser.c', `assemble.c', |
| 30 | `labels.c', `listing.c', `outform.c' and each of the output format |
| 31 | modules `*out.c' are independent modules, which do not directly |
| 32 | inter-communicate except through the main program. |
| 33 | |
| 34 | The Netwide *Disassembler* is not intended to be particularly |
| 35 | portable or reusable or anything, however. So I won't bother |
| 36 | documenting it here. :-) |
| 37 | |
| 38 | nasmlib.c |
| 39 | --------- |
| 40 | |
| 41 | This is a library module; it contains simple library routines which |
| 42 | may be referenced by all other modules. Among these are a set of |
| 43 | wrappers around the standard `malloc' routines, which will report a |
| 44 | fatal error if they run out of memory, rather than returning NULL. |
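
A minimal sketch of the kind of wrapper described above. The name
`checked_malloc' and the exact error message are assumptions made for
illustration; the text does not give the real names used in nasmlib.c.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: allocates memory or terminates the program.
 * Callers never have to test the result against NULL. */
static void *checked_malloc(size_t size)
{
    /* Ask for at least one byte, so that a NULL return always means
     * a genuine allocation failure rather than a zero-size request. */
    void *p = malloc(size ? size : 1);

    if (p == NULL) {
        fprintf(stderr, "fatal: out of memory\n");
        exit(EXIT_FAILURE);
    }
    return p;
}
```

The point of the design is that every other module can allocate
without carrying its own out-of-memory handling.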
| 45 | |
| 46 | preproc.c |
| 47 | --------- |
| 48 | |
| 49 | This contains a macro preprocessor, which takes a file name as input |
| 50 | and returns a sequence of preprocessed source lines. The only symbol |
| 51 | exported from the module is `nasmpp', which is a data structure of |
| 52 | type `Preproc', declared in nasm.h. This structure contains pointers |
| 53 | to all the functions designed to be callable from outside the |
| 54 | module. |
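
The "one exported structure of function pointers" pattern can be
sketched as follows. Everything here is hypothetical: the structure
`line_source', its members and the toy line splitter are illustrative
only, and are not the real `Preproc' declaration from nasm.h.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical interface: the only thing the module exports. */
struct line_source {
    void  (*reset)(const char *text);  /* start reading from text      */
    char *(*get_line)(void);           /* malloc'd line, NULL at end   */
    void  (*cleanup)(void);            /* forget any remaining input   */
};

static const char *cursor;             /* module-private state */

static void demo_reset(const char *text) { cursor = text; }

static char *demo_get_line(void)
{
    const char *end;
    size_t len;
    char *line;

    if (cursor == NULL || *cursor == '\0')
        return NULL;
    end = strchr(cursor, '\n');
    len = end ? (size_t)(end - cursor) : strlen(cursor);
    line = malloc(len + 1);
    if (line == NULL)
        return NULL;
    memcpy(line, cursor, len);
    line[len] = '\0';
    cursor += len + (end ? 1 : 0);     /* step past the newline, if any */
    return line;
}

static void demo_cleanup(void) { cursor = NULL; }

/* The single exported symbol; all functions above stay static. */
const struct line_source demo_source = {
    demo_reset, demo_get_line, demo_cleanup
};
```

A caller would use `demo_source.reset(...)', then call
`demo_source.get_line()' until it returns NULL, freeing each line.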
| 55 | |
| 56 | parser.c |
| 57 | -------- |
| 58 | |
| 59 | This contains a source-line parser. It parses `canonical' assembly |
| 60 | source lines, containing some combination of the `label', `opcode', |
| 61 | `operand' and `comment' fields: it does not process directives or |
| 62 | macros. It exports two functions: `parse_line' and `cleanup_insn'. |
| 63 | |
| 64 | `parse_line' is the main parser function: you pass it a source line |
| 65 | in ASCII text form, and it returns you an `insn' structure |
| 66 | containing all the details of the instruction on that line. The |
| 67 | parameters it requires are: |
| 68 | |
| 69 | - The location (segment, offset) where the instruction on this line |
| 70 | will eventually be placed. This is necessary in order to evaluate |
| 71 | expressions containing the Here token, `$'. |
| 72 | |
| 73 | - A function which can be called to retrieve the value of any |
| 74 | symbols the source line references. |
| 75 | |
| 76 | - Which pass the assembler is on: an undefined symbol only causes an |
| 77 | error condition on pass two. |
| 78 | |
| 79 | - The source line to be parsed. |
| 80 | |
| 81 | - A structure to fill with the results of the parse. |
| 82 | |
| 83 | - A function which can be called to report errors. |
| 84 | |
| 85 | Some instructions (DB, DW, DD for example) can require an arbitrary |
| 86 | amount of storage, and so some of the members of the resulting |
| 87 | `insn' structure will be dynamically allocated. The other function |
| 88 | exported by `parser.c' is `cleanup_insn', which can be called to |
| 89 | deallocate any dynamic storage associated with the results of a |
| 90 | parse. |
| 91 | |
| 92 | names.c |
| 93 | ------- |
| 94 | |
| 95 | This doesn't count as a module - it defines a few arrays which are |
| 96 | shared between NASM and NDISASM, so it's a separate file which is |
| 97 | #included by both parser.c and disasm.c. |
| 98 | |
| 99 | float.c |
| 100 | ------- |
| 101 | |
| 102 | This is essentially a library module: it exports one function, |
| 103 | `float_const', which converts an ASCII representation of a |
| 104 | floating-point number into an x86-compatible binary representation, |
| 105 | without using any built-in floating-point arithmetic (so it will run |
| 106 | on any platform, portably). It calls nothing, and is called only by |
| 107 | `parser.c'. Note that the function `float_const' must be passed an |
| 108 | error reporting routine. |
| 109 | |
| 110 | assemble.c |
| 111 | ---------- |
| 112 | |
| 113 | This module contains the code generator: it translates `insn' |
| 114 | structures as returned from the parser module into actual generated |
| 115 | code which can be placed in an output file. It exports two |
| 116 | functions, `assemble' and `insn_size'. |
| 117 | |
| 118 | `insn_size' is designed to be called on pass one of assembly: it |
| 119 | takes an `insn' structure as input, and returns the amount of space |
| 120 | that would be taken up if the instruction described in the structure |
| 121 | were to be converted to real machine code. `insn_size' also needs |
| 122 | to be told the location (as a segment/offset pair) where the |
| 123 | instruction would be assembled, the mode of assembly (16/32 bit |
| 124 | default), and a function it can call to report errors. |
| 125 | |
| 126 | `assemble' is designed to be called on pass two: it takes all the |
| 127 | parameters that `insn_size' does, but has an extra parameter which |
| 128 | is an output driver. `assemble' actually converts the input |
| 129 | instruction into machine code, and outputs the machine code by means |
| 130 | of calling the `output' function of the driver. |
| 131 | |
| 132 | insnsa.c |
| 133 | -------- |
| 134 | |
| 135 | This is another library module: it exports one very big array of |
| 136 | instruction translations. It is generated automatically from the |
| 137 | insns.dat file by the insns.pl script. |
| 138 | |
| 139 | labels.c |
| 140 | -------- |
| 141 | |
| 142 | This module contains a label manager. It exports six functions: |
| 143 | |
| 144 | `init_labels' should be called before any other function in the |
| 145 | module. `cleanup_labels' may be called after all other use of the |
| 146 | module has finished, to deallocate storage. |
| 147 | |
| 148 | `define_label' is called to define new labels: you pass it the name |
| 149 | of the label to be defined, and the (segment,offset) pair giving the |
| 150 | value of the label. It is also passed an error-reporting function, |
| 151 | and an output driver structure (so that it can call the output |
| 152 | driver's label-definition function). `define_label' mentally |
| 153 | prepends the name of the most recently defined non-local label to |
| 154 | any label beginning with a period. |
| 155 | |
| 156 | `define_label_stub' is designed to be called in pass two, once all |
| 157 | the labels have already been defined: it does nothing except to |
| 158 | update the "most-recently-defined-non-local-label" status, so that |
| 159 | references to local labels in pass two will work correctly. |
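
The local-label rule described above can be sketched like this. The
function `qualify_label' and the fixed buffer size are assumptions for
illustration, not the real labels.c code, and the sketch ignores any
special-case label forms the real label manager may handle.

```c
#include <stdio.h>
#include <string.h>

#define LABEL_BUF_LEN 128

/* The most recently defined non-local label (empty until one is seen). */
static char prev_nonlocal[LABEL_BUF_LEN] = "";

/* Hypothetical helper: writes the full name of `label' into `out'.
 * A label beginning with a period has the previous non-local label
 * prepended; any other label is used as-is and becomes the new
 * "most recently defined non-local label". */
static void qualify_label(const char *label, char *out, size_t outsz)
{
    if (label[0] == '.') {
        snprintf(out, outsz, "%s%s", prev_nonlocal, label);
    } else {
        snprintf(out, outsz, "%s", label);
        snprintf(prev_nonlocal, sizeof prev_nonlocal, "%s", label);
    }
}
```

This also shows why something like `define_label_stub' is needed in
pass two: the "previous non-local label" state has to be replayed in
source order, or `.loop' after `next' would resolve against the wrong
parent.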
| 160 | |
| 161 | `declare_as_global' is used to declare that a label should be |
| 162 | global. It must be called _before_ the label in question is defined. |
| 163 | |
| 164 | Finally, `lookup_label' attempts to translate a label name into a |
| 165 | (segment,offset) pair. It returns non-zero on success. |
| 166 | |
| 167 | The label manager module is (theoretically :) restartable: after |
| 168 | calling `cleanup_labels', you can call `init_labels' again, and |
| 169 | start a new assembly with a new set of symbols. |
| 170 | |
| 171 | listing.c |
| 172 | --------- |
| 173 | |
| 174 | This file contains the listing file generator. The interface to the |
| 175 | module is through the one symbol it exports, `nasmlist', which is a |
| 176 | structure containing six function pointers. The calling semantics of |
| 177 | these functions aren't terribly well thought out, as yet, but they |
| 178 | work (just about), so they're going to get left alone for now... |
| 179 | |
| 180 | outform.c |
| 181 | --------- |
| 182 | |
| 183 | This small module contains a set of routines to manage a list of |
| 184 | output formats, and select one given a keyword. It contains three |
| 185 | small routines: `ofmt_register' which registers an output driver as |
| 186 | part of the managed list, `ofmt_list' which lists the available |
| 187 | drivers on stdout, and `ofmt_find' which tries to find the driver |
| 188 | corresponding to a given name. |
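
A registry of this shape might look like the sketch below. The
structure `out_driver', the `registry_*' names and their signatures
are hypothetical; they only mirror the three roles described above
(register, list, find) and are not the real outform.c declarations.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_FORMATS 16

/* Hypothetical, cut-down driver record: just enough to look up. */
struct out_driver {
    const char *shortname;   /* keyword used to select the format */
    const char *fullname;    /* human-readable description        */
};

static const struct out_driver *drivers[MAX_FORMATS];
static int ndrivers = 0;

/* Add a driver to the managed list; returns 0 if the list is full. */
static int registry_add(const struct out_driver *d)
{
    if (ndrivers >= MAX_FORMATS)
        return 0;
    drivers[ndrivers++] = d;
    return 1;
}

/* Print every registered driver, one per line. */
static void registry_list(FILE *fp)
{
    int i;
    for (i = 0; i < ndrivers; i++)
        fprintf(fp, "%-8s %s\n", drivers[i]->shortname,
                drivers[i]->fullname);
}

/* Find a driver by keyword; returns NULL if there is no match. */
static const struct out_driver *registry_find(const char *name)
{
    int i;
    for (i = 0; i < ndrivers; i++)
        if (strcmp(drivers[i]->shortname, name) == 0)
            return drivers[i];
    return NULL;
}
```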
| 189 | |
| 190 | The output modules |
| 191 | ------------------ |
| 192 | |
| 193 | Each of the output modules, `outbin.o', `outelf.o' and so on, |
| 194 | exports only one symbol, which is an output driver data structure |
| 195 | containing pointers to all the functions needed to produce output |
| 196 | files of the appropriate type. |
| 197 | |
| 198 | The exception to this is `outcoff.o', which exports _two_ output |
| 199 | driver structures, since COFF and Win32 object file formats are very |
| 200 | similar and most of the code is shared between them. |
| 201 | |
| 202 | nasm.c |
| 203 | ------ |
| 204 | |
| 205 | This is the main program: it calls all the functions in the above |
| 206 | modules, and puts them together to form a working assembler. We |
| 207 | hope. :-) |
| 208 | |
| 209 | Segment Mechanism |
| 210 | ----------------- |
| 211 | |
| 212 | In NASM, the term `segment' is used to denote any of the different |
| 213 | sections/segments/groups of which an object file is composed. |
| 214 | Essentially, every address NASM is capable of understanding is |
| 215 | expressed as an offset from the beginning of some segment. |
| 216 | |
| 217 | The defining property of a segment is that if two symbols are |
| 218 | declared in the same segment, then the distance between them is |
| 219 | fixed at assembly time. Hence every externally-declared variable |
| 220 | must be declared in its own segment, since none of the locations of |
| 221 | these are known, and so no distances may be computed at assembly |
| 222 | time. |
| 223 | |
| 224 | The special segment value NO_SEG (-1) is used to denote an absolute |
| 225 | value, e.g. a constant whose value does not depend on relocation, |
| 226 | such as the _size_ of a data object. |
| 227 | |
| 228 | Apart from NO_SEG, segment indices all have their least significant |
| 229 | bit clear, if they refer to actual in-memory segments. For each |
| 230 | segment of this type, there is an auxiliary segment value, defined |
| 231 | to be the same number but with the LSB set, which denotes the |
| 232 | segment-base value of that segment, for object formats which support |
| 233 | it (Microsoft .OBJ, for example). |
| 234 | |
| 235 | Hence, if `textsym' is declared in a code segment with index 2, then |
| 236 | referencing `SEG textsym' would return zero offset from |
| 237 | segment-index 3. Or, in object formats which don't understand such |
| 238 | references, it would return an error instead. |
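
The even/odd convention can be captured in two small helpers. NO_SEG
and its value come from the text above; the helper names are
hypothetical and exist only for this sketch.

```c
#define NO_SEG (-1L)

/* True if `seg' refers to an actual in-memory segment: it is not
 * NO_SEG and its least significant bit is clear. */
static int is_real_segment(long seg)
{
    return seg != NO_SEG && (seg & 1L) == 0;
}

/* The auxiliary index denoting the segment-base value of `seg':
 * the same number with the least significant bit set. */
static long segment_base_index(long seg)
{
    return seg | 1L;
}
```

With these, the `textsym' example reads: `is_real_segment(2)' holds,
and `SEG textsym' refers to offset zero from
`segment_base_index(2)', which is 3.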
| 239 | |
| 240 | The next twist is SEG_ABS. Some symbols may be declared with a |
| 241 | segment value of SEG_ABS plus a 16-bit constant: this indicates that |
| 242 | they are far-absolute symbols, such as the BIOS keyboard buffer |
| 243 | under MS-DOS, which always resides at 0040h:001Eh. Far-absolutes are |
| 244 | handled with care in the parser, since they are supposed to evaluate |
| 245 | simply to their offset part within expressions, but applying SEG to |
| 246 | one should yield its segment part. A far-absolute should never find |
| 247 | its way _out_ of the parser, unless it is enclosed in a WRT clause, |
| 248 | in which case Microsoft 16-bit object formats will want to know |
| 249 | about it. |
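
A sketch of how a far-absolute segment value could be built and taken
apart. The numeric value given to SEG_ABS here is an assumption chosen
for illustration (check nasm.h for the real one), and the helper names
are hypothetical.

```c
#define NO_SEG  (-1L)
#define SEG_ABS 0x40000000L   /* assumed value, for illustration only */

/* Encode a 16-bit real-mode segment as a far-absolute segment value. */
static long make_far_absolute(unsigned int seg16)
{
    return SEG_ABS + (long)(seg16 & 0xFFFFu);
}

/* True if `seg' is SEG_ABS plus some 16-bit constant. */
static int is_far_absolute(long seg)
{
    return seg >= SEG_ABS && seg <= SEG_ABS + 0xFFFFL;
}

/* What applying SEG to a far-absolute symbol should yield. */
static long far_absolute_segment(long seg)
{
    return seg - SEG_ABS;
}
```

For the keyboard buffer example, the symbol would carry the segment
value `make_far_absolute(0x0040)' and the offset 0x001E: expressions
see only the offset, while SEG recovers 0x0040.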
| 250 | |
| 251 | Porting Issues |
| 252 | -------------- |
| 253 | |
| 254 | We have tried to write NASM in portable ANSI C: we do not assume |
| 255 | little-endianness or any hardware characteristics (in order that |
| 256 | NASM should work as a cross-assembler for x86 platforms, even when |
| 257 | run on other, stranger machines). |
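
As an illustration of what "not assuming little-endianness" means in
practice, the sketch below stores a 32-bit value in x86 byte order
using shifts and masks only, so the result is the same whatever the
byte order of the host. The function name `put_le32' is hypothetical
and is not taken from the NASM sources.

```c
/* Store the low 32 bits of `value' into buf[0..3], least significant
 * byte first, without relying on the host's own byte order. */
static void put_le32(unsigned char *buf, unsigned long value)
{
    buf[0] = (unsigned char)(value & 0xFF);
    buf[1] = (unsigned char)((value >> 8) & 0xFF);
    buf[2] = (unsigned char)((value >> 16) & 0xFF);
    buf[3] = (unsigned char)((value >> 24) & 0xFF);
}
```

The non-portable alternative, copying the bytes of a `long' straight
into the output, would only produce correct x86 code on a
little-endian host with a 32-bit `long'.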
| 258 | |
| 259 | Assumptions we _have_ made are: |
| 260 | |
| 261 | - We assume that `short' is at least 16 bits, and `long' at least |
| 262 | 32. This really _shouldn't_ be a problem, since Kernighan and |
| 263 | Ritchie tell us we are entitled to do so. |
| 264 | |
| 265 | - We rely on having more than 6 characters of significance on |
| 266 | externally linked symbols in the NASM sources. This may get fixed |
| 267 | at some point. We haven't yet come across a linker brain-dead |
| 268 | enough to get it wrong anyway. |
| 269 | |
| 270 | - We assume that `fopen' using the mode "wb" can be used to write |
| 271 | binary data files. This may be wrong on systems like VMS, with a |
| 272 | strange file system. Though why you'd want to run NASM on VMS is |
| 273 | beyond me anyway. |
| 274 | |
| 275 | That's it. Subject to those caveats, NASM should be completely |
| 276 | portable. If not, we _really_ want to know about it. |
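
The first assumption in the list above can be checked mechanically.
This is a sketch and not something the NASM sources are known to
contain: the preprocessor tests stop compilation on a non-conforming
compiler, and the hypothetical function `widths_ok' repeats the same
check at run time.

```c
#include <limits.h>

/* Compile-time check: refuse to build if the assumptions fail. */
#if USHRT_MAX < 0xFFFF
#error "short is narrower than 16 bits"
#endif
#if ULONG_MAX < 0xFFFFFFFFUL
#error "long is narrower than 32 bits"
#endif

/* Run-time form of the same check; returns non-zero if both hold. */
static int widths_ok(void)
{
    return sizeof(short) * CHAR_BIT >= 16
        && sizeof(long) * CHAR_BIT >= 32;
}
```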
| 277 | |
| 278 | Porting Non-Issues |
| 279 | ------------------ |
| 280 | |
| 281 | The following is _not_ a portability problem, although it looks like |
| 282 | one. |
| 283 | |
| 284 | - When compiling with some versions of DJGPP, you may get messages |
| 285 | such as `warning: ANSI C forbids braced-groups within |
| 286 | expressions'. This isn't NASM's fault - the problem seems to be |
| 287 | that DJGPP's definitions of the <ctype.h> macros include a |
| 288 | GNU-specific C extension. So when compiling using -ansi and |
| 289 | -pedantic, DJGPP complains about its own header files. It isn't a |
| 290 | problem anyway, since it still generates correct code. |