Making Invalid VM State Unrepresentable
How Kalahari narrows snapshot and restore correctness with typed topology, queue counts, memory geometry, VM state views, and architecture-specific exits.
VM snapshot bugs usually start before the snapshot.
The VM boots. The guest sees devices. The console works. Networking might even pass an integration test. Then the VM is frozen, restored, or spawned from a zygote, and a detail that used to be implicit becomes part of a durable contract.
Was that virtio-fs device built with one request queue or nine? Did the backend pool pre-register the same queue notification slots the guest will use after restore? Is a PMEM range host-owned layout metadata, or did we accidentally trust a guest-visible PFN superblock? Can an ARM backend ever receive an x86 port-I/O exit?
If those questions are answered by comments and conventions, restore has to rediscover the whole VM shape from loose bytes. Kalahari moved in the opposite direction: make invalid state harder to construct, and make the snapshot surface smaller.
The result is not one clever abstraction. It is a chain of ordinary Rust types that carry VM invariants forward: typed topology, validated queue counts, canonical memory geometry, checked RAM descriptors, device-state views, and architecture-gated vCPU exits.
Snapshots Amplify Loose State
A running VM can hide a surprising amount of ambiguity.
A device layout can be “close enough” while the guest is booting. A backend can assume the default queue count. A memory offset can be recomputed from another field. A generic exit enum can include variants that a target architecture should never produce.
Snapshot and restore remove that slack. A restored VM must match the original VM’s guest-visible hardware and host-owned bookkeeping. The same device slot order, queue notification layout, interrupt lines, durable device bytes, RAM descriptor, PMEM geometry, and architecture state must line up.
That is why “validate more during restore” is not enough by itself. Validation at the boundary is necessary, but if every internal API still accepts raw integers and untyped slots, the restore path becomes a second VM builder with less context and more ways to guess wrong.
Kalahari’s VM layer instead tries to answer the shape questions once, near construction, and then carry the answer as typed state.
Topology Is a Token
The central object is the device topology.
That topology is not just a list of device kinds. It records each device kind and the exact virtqueue count in MMIO device-slot order.
That detail matters because “same devices” is not the same as “same VM.” A network device with one queue pair and a network device with three queue pairs both have the same broad kind, but they require different queue notification slots and backend wake layout. The scheduler therefore keys backend pools by vCPU count plus the validated topology, and it rejects a pool whose durable layout differs from the VM’s layout.
The same discipline appears at the worker boundary. KVM and HVF workers receive an exact worker topology: device slots, GSIs, resample wake indexes, MMIO notify addresses, queue indexes, and wake indexes. Worker-side validation rejects duplicate GSIs, empty queue lists, out-of-range queue indexes, duplicate wake indexes, and duplicate notify slots.
That turns topology into a capability. If code has a validated layout, it can build IRQ lines, queue wake maps, worker pools, and diagnostic entries. If it only has a few raw device flags, it cannot pretend to have a VM.
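The capability pattern above can be sketched with a newtype whose only constructor runs the checks, so holding a value proves the layout was validated. This is a minimal illustration, not Kalahari's actual API; all names, fields, and the specific checks shown are assumptions.

```rust
// Hypothetical sketch: topology as a capability. Names and checks are
// illustrative, not Kalahari's real types.

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum DeviceKind { Console, Rng, Net, Fs, Pmem }

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DeviceSlot {
    pub kind: DeviceKind,
    pub queue_count: u16, // exact virtqueue count, part of the durable layout
}

/// Can only be obtained through `validate`, so holding one proves the
/// slot list is nonempty and every slot has a nonzero queue count.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ValidatedTopology(Vec<DeviceSlot>);

impl ValidatedTopology {
    pub fn validate(slots: Vec<DeviceSlot>) -> Result<Self, &'static str> {
        if slots.is_empty() {
            return Err("topology has no device slots");
        }
        if slots.iter().any(|s| s.queue_count == 0) {
            return Err("device slot with zero queues");
        }
        Ok(ValidatedTopology(slots))
    }

    pub fn slots(&self) -> &[DeviceSlot] { &self.0 }
}

/// A backend pool is keyed by vCPU count plus the validated topology,
/// so a pool built for one layout cannot serve a different VM shape.
pub fn pool_key(vcpus: u32, topo: &ValidatedTopology) -> (u32, ValidatedTopology) {
    (vcpus, topo.clone())
}
```

Because `ValidatedTopology` has no public constructor besides `validate`, any function that takes one never has to re-check the layout.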
Queue Counts Are Device State
Virtio queue counts look like small numbers. They are really part of the device ABI.
The console device exposes six queues. RNG and PMEM expose one. Network queues are derived from queue pairs. Virtio-fs has a hiprio queue plus a validated number of request queues. Those numbers determine which guest writes to QueueNotify are meaningful and which host wake bits exist.
Kalahari encodes that in layers.
The virtio-fs request queue count is validated before it reaches device configuration. Raw queue counts become durable, bounded values rather than loose integers. The layout code checks each count against the device kind, then writes both device kinds and exact queue counts into the VM state header.
Restore then reads the durable header back into the same layout shape. For network and virtio-fs, it also cross-checks the header count against durable device config bytes, such as maximum network queue pairs and the stored number of filesystem request queues.
That prevents a common class of restore bugs: accepting a header that says one topology while the device state describes another. The queue count is no longer an inferred property. It is durable topology metadata, validated against the device that will consume it.
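The queue-count discipline can be sketched as a bounded value type plus a restore-side cross-check. Everything here is an assumption for illustration: the bound, the type names, and the helper are invented, not Kalahari's real code.

```rust
// Illustrative sketch (not Kalahari's real types): a virtio-fs request
// queue count bounded at construction, then cross-checked against
// durable device config bytes at restore.

const MAX_FS_REQUEST_QUEUES: u16 = 64; // assumed bound for illustration

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FsRequestQueues(u16);

impl FsRequestQueues {
    pub fn new(count: u16) -> Result<Self, String> {
        if count == 0 || count > MAX_FS_REQUEST_QUEUES {
            return Err(format!("invalid fs request queue count: {count}"));
        }
        Ok(FsRequestQueues(count))
    }

    /// Total virtqueues: one hiprio queue plus the request queues.
    pub fn total_queues(self) -> u16 { 1 + self.0 }
}

/// Restore-side cross-check: the header's count must match the count
/// stored in the durable device config, and both must pass the bound.
pub fn check_restore(
    header_count: u16,
    device_config_count: u16,
) -> Result<FsRequestQueues, String> {
    if header_count != device_config_count {
        return Err(format!(
            "header says {header_count} request queues, \
             device config says {device_config_count}"
        ));
    }
    FsRequestQueues::new(header_count)
}
```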
Memory Needs Geometry
Guest memory is not just a byte length.
The state file contains host-only metadata, vCPU slots, irqchip state, device slots, device metadata, a ring buffer, PMEM metadata sections, a RAM descriptor section, and guest RAM. Some ranges are mapped into the guest. Some must never be guest-derived. Some need page alignment, while PMEM also needs guest NVDIMM section alignment.
The VM state header is computed from checked arithmetic, validated RAM size, page alignment, PMEM image counts, and PMEM geometry. Later, restore recomputes that canonical layout from the header and mapped length. If offsets, sizes, padding, PMEM arrays, or total size differ, the header is rejected.
That yields a useful invariant: the mapped bytes match the one layout the VMM knows how to interpret.
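The "compute once, recompute and compare" pattern can be sketched with checked arithmetic over a drastically simplified layout. The section names, sizes, and two-section shape are invented for illustration; the real layout has many more sections.

```rust
// Sketch of canonical-layout computation and restore-side recomputation.
// The two-section layout (metadata, then page-aligned RAM) is an
// assumption for illustration only.

const PAGE_SIZE: u64 = 4096;

fn align_up(v: u64, a: u64) -> Option<u64> {
    let rem = v % a;
    if rem == 0 { Some(v) } else { v.checked_add(a - rem) }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Layout {
    pub metadata_end: u64,
    pub ram_offset: u64,
    pub total_size: u64,
}

/// Computed from checked arithmetic; overflow or bad geometry rejects it.
pub fn compute_layout(metadata_len: u64, ram_len: u64) -> Option<Layout> {
    if ram_len == 0 || ram_len % PAGE_SIZE != 0 {
        return None; // RAM must be nonzero and page-aligned
    }
    let ram_offset = align_up(metadata_len, PAGE_SIZE)?;
    let total_size = ram_offset.checked_add(ram_len)?;
    Some(Layout { metadata_end: metadata_len, ram_offset, total_size })
}

/// Restore path: recompute the canonical layout from the header's inputs
/// and require an exact match, including against the mapped length.
pub fn check_header(header: Layout, mapped_len: u64) -> bool {
    let ram_len = match header.total_size.checked_sub(header.ram_offset) {
        Some(l) => l,
        None => return false, // corrupt header: offset past total size
    };
    match compute_layout(header.metadata_end, ram_len) {
        Some(canonical) => canonical == header && header.total_size == mapped_len,
        None => false,
    }
}
```

The point of the exact-equality check is that there is only one layout the VMM knows how to interpret, so any divergence in any field rejects the header.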
The RAM descriptor follows the same pattern. Kalahari only represents RAM sizes that are nonzero, block-aligned, and fit the descriptor bitmap. It checks the descriptor header, block size, block count, reserved fields, and section size before runtime code can borrow the initialized bitmap.
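A minimal sketch of descriptor-header checks like those listed above; the field names, limits, and power-of-two block-size rule are assumptions, not the real on-disk format.

```rust
// Hypothetical RAM-descriptor validation. Field names and constraints
// are invented for illustration.

#[derive(Debug, Clone, Copy)]
pub struct RamDescriptorHeader {
    pub block_size: u64,  // bytes tracked per bitmap bit
    pub block_count: u64, // number of blocks the bitmap covers
    pub reserved: u64,    // must be zero
}

pub fn validate_descriptor(
    h: &RamDescriptorHeader,
    ram_size: u64,
    bitmap_bits: u64,
) -> Result<(), &'static str> {
    if h.reserved != 0 {
        return Err("reserved field must be zero");
    }
    if ram_size == 0 || h.block_size == 0 || !h.block_size.is_power_of_two() {
        return Err("bad block size or empty RAM");
    }
    if ram_size % h.block_size != 0 {
        return Err("RAM size not block-aligned");
    }
    if h.block_count != ram_size / h.block_size {
        return Err("block count does not match RAM size");
    }
    if h.block_count > bitmap_bits {
        return Err("RAM does not fit the descriptor bitmap");
    }
    Ok(())
}
```

Only after every check passes would runtime code be allowed to borrow the initialized bitmap.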
PMEM is especially important. The header carries host-owned data offsets, total sizes, and image counts. The guest-visible PFN superblocks are initialized from that geometry, not treated as the source of truth for host mapping. That distinction keeps restore from trusting bytes that are part of the guest-visible device contract.
Memory state becomes validated geometry, not a pile of offsets.
Views Narrow the Surface
Once the header is validated, the rest of the code should not keep reparsing it.
The mapped VM state is exposed through a narrow view. It separates host-only metadata from guest-visible memory, keeps an immutable copy of the header that passed validation, and exposes guest memory through validated GPA mappings. Accessors derive offsets from that validated copy rather than from mutable live header bytes.
Device state gets a similar treatment. The raw state file has homogeneous device slots, but code does not get to reinterpret a slot as any device type it wants. Each persisted state layout is tied to a durable kind code, and callers only receive a typed view after the kind check has passed.
There is still unsafe code, because a VMM has to map memory and reinterpret POD state. The important point is where the unsafe surface lives. The broad restore path does not hand offsets to every device and hope they agree. It validates the layout, validates device metadata, rebuilds durable topology from the header, and then hands devices typed views.
The smaller the view, the fewer invariants every caller has to remember.
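The kind-checked view pattern can be sketched as a typed wrapper whose only constructor verifies the durable kind code. The kind codes and slot shape here are invented; the real persisted layouts are richer.

```rust
// Sketch: a persisted device slot can only be read through a typed view,
// and the view is handed out only after the durable kind code matches.
// Kind codes and payload shapes are illustrative assumptions.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u32)]
pub enum KindCode { Console = 1, Net = 2 }

/// A homogeneous raw slot as it appears in the state file.
pub struct RawSlot {
    pub kind: u32,
    pub bytes: Vec<u8>,
}

/// Typed view over network device state: constructing one requires
/// passing the kind check, so consumers never reinterpret freely.
pub struct NetStateView<'a> {
    pub bytes: &'a [u8],
}

impl<'a> NetStateView<'a> {
    pub fn from_slot(slot: &'a RawSlot) -> Result<Self, &'static str> {
        if slot.kind != KindCode::Net as u32 {
            return Err("slot is not a network device");
        }
        Ok(NetStateView { bytes: &slot.bytes })
    }
}
```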
Exits Belong to Architectures
VM state is not only memory and devices. It also includes the control protocol between vCPUs and the host.
Kalahari’s vCPU exit model is backend-neutral, but it is not architecture-blind. Port-I/O exits only exist on x86_64 builds, because ARM64 has no port I/O. ARM KVM exit decoding maps MMIO, system shutdown/reset, HLT, and unknown exits. x86 KVM exit decoding handles PIO and MMIO, and bounds-checks PIO data offsets before reading from the KVM run mapping.
The distinction matters during snapshotting. A clean shutdown is different from an unrecoverable exit. A known MMIO read has response semantics. An unknown exit does not. Treating all of those as loose integers would push architecture knowledge into every consumer.
By making the exit shape typed and architecture-gated, Kalahari avoids states like “ARM backend receives an x86 PIO response” at compile time, and it keeps fatal exit paths from being mistaken for snapshot-safe states.
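Architecture gating like this is ordinary `#[cfg]` on an enum variant: on a non-x86 build the port-I/O variant simply does not exist, so no ARM code path can name it. The variant names and the snapshot-safety rule below are illustrative assumptions, not Kalahari's real exit model.

```rust
// Sketch of architecture-gated vCPU exits. Variant names are invented;
// the point is that the port-I/O variant is absent on non-x86 builds.

#[derive(Debug, PartialEq, Eq)]
pub enum VcpuExit {
    Mmio { addr: u64, is_write: bool },
    Shutdown,
    Hlt,
    Unknown(u64),
    /// Port I/O only exists on x86_64 builds: ARM64 has no port I/O,
    /// so this variant cannot even be constructed there.
    #[cfg(target_arch = "x86_64")]
    Pio { port: u16, is_write: bool },
}

/// Illustrative rule: only a clean shutdown is a snapshot-safe stopping
/// point; MMIO/HLT mean the vCPU must resume, and unknown exits are fatal.
pub fn snapshot_safe(exit: &VcpuExit) -> bool {
    matches!(exit, VcpuExit::Shutdown)
}
```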
The Restore Surface Gets Smaller
Strong types do not remove validation. They decide where validation belongs.
The construction path validates user-facing config, device layout, queue counts, root filesystem selection, memory sizing, PMEM geometry, and worker topology. The durable state path validates the canonical header, RAM descriptor, PMEM sections, PSCI state, irqchip section, device metadata, and durable device layout. The runtime path uses typed device views, typed wake maps, and architecture-specific exits.
That changes what restore has to validate.
Without these invariants, restore has to accept a bag of bytes and ask every subsystem to defend itself. With them, restore validates a small set of boundaries and then works through typed views. Tests can focus on corrupt headers, mismatched queue counts, malformed descriptors, duplicate topology entries, and lifecycle transitions instead of trying to exercise every impossible combination after the VM is already running.
This is the systems-design payoff: representable invalid state leads to late bugs; typed construction moves those bugs to the edge; typed state views keep the snapshot and restore surface small.
Why Kalahari Cares
Kalahari exposes a much smaller API than the machinery underneath it:
const sandbox = await client.createSandbox();
const zygote = await sandbox.zygote();
const child = await zygote.spawn();

That API depends on the VM layer being strict. A zygote spawn can reuse a prepared VM quickly only if the template's hardware shape, memory layout, backend pool, and architecture state are already precise. Kalahari's spawn validation prevents hardware-affecting options from changing under a zygote. The VMM supplies the deeper guarantee: the state being spawned has a validated topology and a canonical layout.
Fast zygote spawning is valuable because users should not have to wait for every sandbox to boot from scratch. But speed is only useful if every child starts from a state the system can explain.
That is why Kalahari’s VM code keeps turning comments into types. Not for Rust aesthetics. For restore, spawn, and snapshot correctness, the best bug is the state that cannot be built.