The memory API
==============

The memory API models the memory and I/O buses and controllers of a QEMU
machine. It attempts to allow modelling of:

- ordinary RAM
- memory-mapped I/O (MMIO)
- memory controllers that can dynamically reroute physical memory regions
  to different destinations

The memory model provides support for the following (a sketch of the
corresponding calls appears after the list):

- tracking RAM changes by the guest
- setting up coalesced memory for kvm
- setting up ioeventfd regions for kvm
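
As a hedged illustration of these three facilities, the calls might look
like this; the region and notifier variables are placeholders, not part of
any particular device:

    /* Track guest writes to the region (here for the VGA dirty bitmap). */
    memory_region_set_log(&mr, true, DIRTY_MEMORY_VGA);

    /* Let KVM coalesce writes to this MMIO range and deliver them later. */
    memory_region_add_coalescing(&mr, 0, 0x1000);

    /* Have a 4-byte write of 0x1 at offset 0x40 signal an eventfd instead
     * of invoking the MMIO callbacks. */
    memory_region_add_eventfd(&mr, 0x40, 4, true, 0x1, &notifier);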

Memory is modelled as an acyclic graph of MemoryRegion objects. Sinks
(leaves) are RAM and MMIO regions, while other nodes represent
buses, memory controllers, and memory regions that have been rerouted.

In addition to MemoryRegion objects, the memory API provides AddressSpace
objects for every root and possibly for intermediate MemoryRegions too.
These represent memory as seen from the CPU or a device's viewpoint.
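
For example, a device with its own view of memory (say, for DMA) might wrap
a root region in an AddressSpace like this (a sketch; the owner and names
are placeholders):

    MemoryRegion root;
    AddressSpace as;

    memory_region_init(&root, owner, "mydev-dma-root", UINT64_MAX);
    address_space_init(&as, &root, "mydev-dma");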

Types of regions
----------------

There are four types of memory regions (all represented by a single C type
MemoryRegion); a sketch creating one region of each type follows this list:

- RAM: a RAM region is simply a range of host memory that can be made available
  to the guest.

- MMIO: a range of guest memory that is implemented by host callbacks;
  each read or write causes a callback to be called on the host.

- container: a container simply includes other memory regions, each at
  a different offset. Containers are useful for grouping several regions
  into one unit. For example, a PCI BAR may be composed of a RAM region
  and an MMIO region.

  A container's subregions are usually non-overlapping. In some cases it is
  useful to have overlapping regions; for example a memory controller that
  can overlay a subregion of RAM with MMIO or ROM, or a PCI controller
  that does not prevent cards from claiming overlapping BARs.

- alias: a subsection of another region. Aliases allow a region to be
  split apart into discontiguous regions. Examples of uses are memory banks
  used when the guest address space is smaller than the amount of RAM
  addressed, or a memory controller that splits main memory to expose a "PCI
  hole". Aliases may point to any type of region, including other aliases,
  but an alias may not point back to itself, directly or indirectly.

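A hedged sketch of creating one region of each type; the owner, the sizes,
the mydev_ops callback table and the opaque pointer s are placeholders, and
exact signatures may differ slightly between QEMU versions:

    MemoryRegion ram, mmio, cont, alias;

    /* RAM: backed by host memory */
    memory_region_init_ram(&ram, owner, "mydev.ram", 0x10000, &error_fatal);

    /* MMIO: reads and writes invoke the callbacks in mydev_ops */
    memory_region_init_io(&mmio, owner, &mydev_ops, s, "mydev.mmio", 0x1000);

    /* container: groups the two regions above into one unit, e.g. a BAR */
    memory_region_init(&cont, owner, "mydev.bar0", 0x20000);
    memory_region_add_subregion(&cont, 0x00000, &ram);
    memory_region_add_subregion(&cont, 0x10000, &mmio);

    /* alias: a window covering the first 32K of the RAM region */
    memory_region_init_alias(&alias, owner, "mydev.ram-lo", &ram, 0, 0x8000);
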
It is valid to add subregions to a region which is not a pure container
(that is, to an MMIO, RAM or ROM region). This means that the region
will act like a container, except that any addresses within the container's
region which are not claimed by any subregion are handled by the
container itself (ie by its MMIO callbacks or RAM backing). However
it is generally possible to achieve the same effect with a pure container
one of whose subregions is a low priority "background" region covering
the whole address range; this is often clearer and is preferred.
Subregions cannot be added to an alias region.
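
The preferred pattern might be sketched as follows (a pure container with a
low priority background region; the names, the dev_ops table and the opaque
pointer s are placeholders):

    /* The background claims the whole range at the lowest priority... */
    memory_region_init(&cont, owner, "dev.bar", 0x10000);
    memory_region_init_io(&background, owner, &dev_ops, s, "dev.background",
                          0x10000);
    memory_region_add_subregion_overlap(&cont, 0, &background, -1);

    /* ...and higher priority subregions claim parts of it. */
    memory_region_add_subregion(&cont, 0x4000, &some_subregion);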

Region names
------------

Regions are assigned names by the constructor. For most regions these are
only used for debugging purposes, but RAM regions also use the name to identify
live migration sections. This means that RAM region names need to have ABI
stability.

Region lifecycle
----------------

A region is created by one of the memory_region_init*() functions and
attached to an object, which acts as its owner or parent. QEMU ensures
that the owner object remains alive as long as the region is visible to
the guest, or as long as the region is in use by a virtual CPU or another
device. For example, the owner object will not die between an
address_space_map operation and the corresponding address_space_unmap.
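
A sketch of the map/unmap pairing referred to above (the address space,
address and length are placeholders; note that newer QEMU versions add a
MemTxAttrs argument to address_space_map):

    hwaddr len = size;
    void *buf = address_space_map(as, addr, &len, true /* is_write */);
    if (buf) {
        /* While the mapping is live, the owner of the region backing
         * 'buf' is guaranteed to stay alive. */
        /* ... fill buf ... */
        address_space_unmap(as, buf, len, true, len /* access_len */);
    }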

After creation, a region can be added to an address space or a
container with memory_region_add_subregion(), and removed using
memory_region_del_subregion().

Various region attributes (read-only, dirty logging, coalesced mmio,
ioeventfd) can be changed during the region lifecycle. They take effect
as soon as the region is made visible. This can be immediately, later,
or never.

Destruction of a memory region happens automatically when the owner
object dies.

If however the memory region is part of a dynamically allocated data
structure, you should call object_unparent() to destroy the memory region
before the data structure is freed. For an example see VFIOMSIXInfo
and VFIOQuirk in hw/vfio/pci.c.

You must not destroy a memory region as long as it may be in use by a
device or CPU. To ensure this, as a general rule do not create or
destroy memory regions dynamically during a device's lifetime, and only
call object_unparent() in the memory region owner's instance_finalize
callback. The dynamically allocated data structure that contains the
memory region should then be freed in the instance_finalize callback
as well.
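
A sketch of that rule; the MyQuirk structure, the MYDEV() cast macro and the
field names are hypothetical, loosely modelled on the VFIOQuirk example
mentioned above:

    typedef struct MyQuirk {
        MemoryRegion mem;       /* region embedded in heap-allocated data */
    } MyQuirk;

    typedef struct MyDevState {
        /* ... */
        MyQuirk *quirk;         /* dynamically allocated */
    } MyDevState;

    static void mydev_instance_finalize(Object *obj)
    {
        MyDevState *s = MYDEV(obj);

        /* Only now is the region guaranteed to be unused; destroy it
         * before freeing the structure that embeds it. */
        object_unparent(OBJECT(&s->quirk->mem));
        g_free(s->quirk);
    }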

If you break this rule, the following situation can happen:

- the memory region's owner had a reference taken via memory_region_ref
  (for example by address_space_map)

- the region is unparented, and has no owner anymore

- when address_space_unmap is called, the reference to the memory region's
  owner is leaked.

There is an exception to the above rule: it is okay to call
object_unparent at any time for an alias or a container region. It is
therefore also okay to create or destroy alias and container regions
dynamically during a device's lifetime.

This exceptional usage is valid because aliases and containers only help
QEMU build the guest's memory map; they are never accessed directly.
memory_region_ref and memory_region_unref are never called on aliases
or containers, and the above situation then cannot happen. Exploiting
this exception is rarely necessary, and therefore it is discouraged,
but nevertheless it is used in a few places.

For regions that "have no owner" (NULL is passed at creation time), the
machine object is actually used as the owner. Since instance_finalize is
never called for the machine object, you must never call object_unparent
on regions that have no owner, unless they are aliases or containers.

Overlapping regions and priority
--------------------------------
Usually, regions may not overlap each other; a memory address decodes into
exactly one target. In some cases it is useful to allow regions to overlap,
and sometimes to control which of the overlapping regions is visible to the
guest. This is done with memory_region_add_subregion_overlap(), which
allows the region to overlap any other region in the same container, and
specifies a priority that allows the core to decide which of two regions at
the same address is visible (highest wins).
Priority values are signed, and the default value is zero. This means that
you can use memory_region_add_subregion_overlap() both to specify a region
that must sit 'above' any others (with a positive priority) and also a
background region that sits 'below' others (with a negative priority).

If the higher priority region in an overlap is a container or alias, then
the lower priority region will appear in any "holes" that the higher priority
region has left by not mapping subregions to that area of its address range.
(This applies recursively -- if the subregions are themselves containers or
aliases that leave holes then the lower priority region will appear in these
holes too.)

For example, suppose we have a container A of size 0x8000 with two subregions
B and C. B is a container mapped at 0x2000, size 0x4000, priority 2; C is
an MMIO region mapped at 0x0, size 0x6000, priority 1. B currently has two
of its own subregions: D of size 0x1000 at offset 0 and E of size 0x1000 at
offset 0x2000. As a diagram:

       0      1000   2000   3000   4000   5000   6000   7000   8000
       |------|------|------|------|------|------|------|------|
    A: [                                                       ]
    C: [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]
    B:               [                           ]
    D:               [DDDDDD]
    E:                             [EEEEEE]

The regions that will be seen within this address range then are:

    [CCCCCCCCCCCC][DDDDD][CCCCC][EEEEE][CCCCC]

Since B has higher priority than C, its subregions appear in the flat map
even where they overlap with C. In ranges where B has not mapped anything
C's region appears.

If B had provided its own MMIO operations (ie it was not a pure container)
then these would be used for any addresses in its range not handled by
D or E, and the result would be:

    [CCCCCCCCCCCC][DDDDD][BBBBB][EEEEE][BBBBB]

Priority values are local to a container, because the priorities of two
regions are only compared when they are both children of the same container.
This means that the device in charge of the container (typically modelling
a bus or a memory controller) can use them to manage the interaction of
its child regions without any side effects on other parts of the system.
In the example above, the priorities of D and E are unimportant because
they do not overlap each other. It is the relative priority of B and C
that causes D and E to appear on top of C: D and E's priorities are never
compared against the priority of C.
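
A sketch of how this example could be constructed with the API (the owner,
the *_ops callback tables and the opaque pointer s are placeholders):

    MemoryRegion a, b, c, d, e;

    memory_region_init(&a, owner, "A", 0x8000);       /* pure container */
    memory_region_init(&b, owner, "B", 0x4000);       /* pure container */
    memory_region_init_io(&c, owner, &c_ops, s, "C", 0x6000);
    memory_region_init_io(&d, owner, &d_ops, s, "D", 0x1000);
    memory_region_init_io(&e, owner, &e_ops, s, "E", 0x1000);

    memory_region_add_subregion(&b, 0x0000, &d);
    memory_region_add_subregion(&b, 0x2000, &e);

    memory_region_add_subregion_overlap(&a, 0x2000, &b, 2);  /* priority 2 */
    memory_region_add_subregion_overlap(&a, 0x0000, &c, 1);  /* priority 1 */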

Visibility
----------
The memory core uses the following rules to select a memory region when the
guest accesses an address; a simplified sketch of the algorithm in C follows
the list:

- all direct subregions of the root region are matched against the address, in
  descending priority order
- if the address lies outside the region offset/size, the subregion is
  discarded
- if the subregion is a leaf (RAM or MMIO), the search terminates, returning
  this leaf region
- if the subregion is a container, the same algorithm is used within the
  subregion (after the address is adjusted by the subregion offset)
- if the subregion is an alias, the search is continued at the alias target
  (after the address is adjusted by the subregion offset and alias offset)
- if a recursive search within a container or alias subregion does not
  find a match (because of a "hole" in the container's coverage of its
  address range), then if this is a container with its own MMIO or RAM
  backing the search terminates, returning the container itself. Otherwise
  we continue with the next subregion in priority order
- if none of the subregions match the address then the search terminates
  with no match found
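
The following is an illustrative sketch of these rules, not the actual QEMU
implementation; the Region type is a simplified stand-in for MemoryRegion:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct Region Region;
    struct Region {
        uint64_t offset;        /* offset within the parent container */
        uint64_t size;
        bool is_leaf;           /* RAM or MMIO */
        bool has_backing;       /* non-pure container: own RAM/MMIO backing */
        Region *alias_target;   /* non-NULL for aliases */
        uint64_t alias_offset;
        Region **subregions;    /* sorted by descending priority */
        size_t num_subregions;
    };

    /* Resolve 'addr' (relative to the start of 'r') to a terminating
     * region; NULL means "hole", i.e. no match within 'r'. */
    static Region *resolve(Region *r, uint64_t addr)
    {
        if (r->alias_target) {
            /* continue the search at the alias target */
            return resolve(r->alias_target, addr + r->alias_offset);
        }
        for (size_t i = 0; i < r->num_subregions; i++) {
            Region *sub = r->subregions[i];

            /* discard subregions that do not cover the address */
            if (addr < sub->offset || addr - sub->offset >= sub->size) {
                continue;
            }
            Region *found = resolve(sub, addr - sub->offset);
            if (found) {
                return found;
            }
            /* hole in a pure container/alias: try the next subregion */
        }
        /* a leaf terminates the search; so does a container with its
         * own RAM/MMIO backing */
        return (r->is_leaf || r->has_backing) ? r : NULL;
    }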

Example memory map
------------------

system_memory: container@0-2^48-1
 |
 +---- lomem: alias@0-0xdfffffff ---> #ram (0-0xdfffffff)
 |
 +---- himem: alias@0x100000000-0x11fffffff ---> #ram (0xe0000000-0xffffffff)
 |
 +---- vga-window: alias@0xa0000-0xbffff ---> #pci (0xa0000-0xbffff)
 |      (prio 1)
 |
 +---- pci-hole: alias@0xe0000000-0xffffffff ---> #pci (0xe0000000-0xffffffff)

pci (0-2^32-1)
 |
 +--- vga-area: container@0xa0000-0xbffff
 |    |
 |    +--- alias@0x00000-0x7fff ---> #vram (0x010000-0x017fff)
 |    |
 |    +--- alias@0x08000-0xffff ---> #vram (0x020000-0x027fff)
 |
 +---- vram: ram@0xe1000000-0xe1ffffff
 |
 +---- vga-mmio: mmio@0xe2000000-0xe200ffff

ram: ram@0x00000000-0xffffffff

This is a (simplified) PC memory map. The 4GB RAM block is mapped into the
system address space via two aliases: "lomem" is a 1:1 mapping of the first
3.5GB; "himem" maps the last 0.5GB at address 4GB. This leaves 0.5GB for the
so-called PCI hole, which allows a 32-bit PCI bus to exist in a system with
4GB of memory.

The memory controller diverts addresses in the range 640K-768K to the PCI
address space. This is modelled using the "vga-window" alias, mapped at a
higher priority so it obscures the RAM at the same addresses. The vga window
can be removed by programming the memory controller; this is modelled by
removing the alias and exposing the RAM underneath.

The pci address space is not a direct child of the system address space, since
we only want parts of it to be visible (we accomplish this using aliases).
It has three subregions: vga-area models the legacy vga window and is occupied
by two 32K memory banks pointing at two sections of the framebuffer; the vram
is mapped as a BAR at address 0xe1000000; and an additional BAR containing
MMIO registers is mapped after it.

Note that if the guest maps a BAR outside the PCI hole, it will not be
visible, as the pci-hole alias clips it to a 0.5GB range.
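
A sketch of how the system-level aliases above could be built (the variables
are placeholders; "pci" is the root region of the PCI address space and
"system_memory" the system root):

    memory_region_init_alias(&lomem, NULL, "lomem", ram, 0x0, 0xe0000000);
    memory_region_add_subregion(system_memory, 0x0, &lomem);

    memory_region_init_alias(&himem, NULL, "himem", ram,
                             0xe0000000, 0x20000000);
    memory_region_add_subregion(system_memory, 0x100000000ULL, &himem);

    memory_region_init_alias(&vga_window, NULL, "vga-window", pci,
                             0xa0000, 0x20000);
    /* priority 1: obscures the RAM mapped underneath by lomem */
    memory_region_add_subregion_overlap(system_memory, 0xa0000,
                                        &vga_window, 1);

    memory_region_init_alias(&pci_hole, NULL, "pci-hole", pci,
                             0xe0000000, 0x20000000);
    memory_region_add_subregion(system_memory, 0xe0000000, &pci_hole);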

MMIO Operations
---------------

MMIO regions are provided with ->read() and ->write() callbacks; in addition
various constraints can be supplied to control how these callbacks are called
(a sketch of a complete MemoryRegionOps follows the list):

- .valid.min_access_size, .valid.max_access_size define the access sizes
  (in bytes) which the device accepts; accesses outside this range will
  have device and bus specific behaviour (ignored, or machine check)
- .valid.unaligned specifies that the device being modelled supports
  unaligned accesses; if false, unaligned accesses invoke device and bus
  specific behaviour
- .impl.min_access_size, .impl.max_access_size define the access sizes
  (in bytes) supported by the *implementation*; other access sizes will be
  emulated using the ones available. For example a 4-byte write will be
  emulated using four 1-byte writes, if .impl.max_access_size = 1.
- .impl.unaligned specifies that the *implementation* supports unaligned
  accesses; if false, unaligned accesses will be emulated by two aligned
  accesses.
- .old_mmio can be used to ease porting from code using
  cpu_register_io_memory(). It should not be used in new code.
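
A hedged sketch of a device's MemoryRegionOps using these constraints (the
mydev_* names and the register layout are hypothetical):

    static uint64_t mydev_read(void *opaque, hwaddr addr, unsigned size)
    {
        MyDevState *s = opaque;

        switch (addr) {
        case 0x0:
            return s->reg0;
        default:
            return 0;           /* unimplemented registers read as zero */
        }
    }

    static void mydev_write(void *opaque, hwaddr addr,
                            uint64_t data, unsigned size)
    {
        MyDevState *s = opaque;

        if (addr == 0x0) {
            s->reg0 = data;
        }
    }

    static const MemoryRegionOps mydev_ops = {
        .read = mydev_read,
        .write = mydev_write,
        .endianness = DEVICE_NATIVE_ENDIAN,
        .valid = {
            .min_access_size = 1,   /* the device accepts 1- to 4-byte... */
            .max_access_size = 4,   /* ...accesses... */
            .unaligned = false,     /* ...but only naturally aligned ones */
        },
        .impl = {
            .min_access_size = 4,   /* the callbacks only implement 32-bit
                                     * accesses; smaller ones are emulated */
            .max_access_size = 4,
        },
    };

    /* Typically registered with something like:
     * memory_region_init_io(&s->iomem, OBJECT(s), &mydev_ops, s,
     *                       "mydev-mmio", 0x1000);
     */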