Virtio Bugs Are Ownership Bugs
How Kalahari's virtio layer turns descriptor ownership into completion tokens, queue brands, deferred writable regions, and exact used-ring lengths.
Virtio looks like a ring protocol. In practice, it is an ownership protocol.
A guest publishes a descriptor chain in the available ring. The device consumes the head, reads request buffers, optionally writes response buffers, writes a used-ring entry, bumps used.idx, and maybe interrupts the guest.
available descriptor -> device-owned work -> used-ring completion -> guest-owned buffer
That sounds mechanical until the device work crosses a boundary. Network RX waits on a backend packet lease. Virtio-fs waits on a FUSE backend. Console and agent traffic are driven by host wakeups. A VM can stop while one of those operations is in flight. A queue can be reset and reused before an old async reply comes back.
At that point, the hard question is not “how do we write the ring?” It is “who owns the guest-visible side effect right now?”
For Kalahari, the answer lives in the virtio layer. A popped descriptor is treated as an ownership value. Devices do not publish raw descriptor heads casually. They move completion tokens through a narrow path, and those tokens carry enough identity to prevent stale or misrouted completions from mutating the guest.
The Completion Is the Capability
The basic invariant is simple:
Once a valid descriptor chain is accepted for device work, exactly one guest-visible completion path may own it.
Double completion corrupts the used ring. Missing completion leaves the guest waiting. Writing response bytes without the matching used-ring entry creates state the guest cannot observe reliably. Publishing an old async reply after a queue reset writes into the wrong queue instance.
Popping from the available ring starts the process by producing a raw descriptor-chain ownership value. That value cannot be pushed directly to the used ring. Device code must first validate it as one of the allowed chain shapes:
- readable-only
- writable-only
- split request/response
That conversion walks the descriptor chain and validates its shape. Readable-only queues cannot receive writable descriptors. Writable-only queues cannot receive readable descriptors. Split chains must have readable descriptors before writable descriptors. Malformed chains become queue violations instead of forged completions.
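The shape rules above can be sketched as a small classifier. This is an illustrative JavaScript model rather than Kalahari's code: `classifyChain`, the `{ writable, len }` descriptor objects, and the shape names are all assumptions made for the example.

```javascript
// Hypothetical model: a descriptor is { writable: boolean, len: number }.
// Returns the chain shape, or throws to signal a queue violation.
function classifyChain(descriptors, allowed) {
  if (descriptors.length === 0) {
    throw new Error('queue violation: empty chain');
  }
  let seenWritable = false;
  let readable = 0;
  let writable = 0;
  for (const d of descriptors) {
    if (d.writable) {
      seenWritable = true;
      writable += 1;
    } else {
      // A readable descriptor after a writable one breaks the split ordering.
      if (seenWritable) {
        throw new Error('queue violation: readable descriptor after writable');
      }
      readable += 1;
    }
  }
  const shape =
    writable === 0 ? 'readable-only' :
    readable === 0 ? 'writable-only' :
    'split';
  // The queue decides which shapes it accepts; anything else is a violation.
  if (!allowed.includes(shape)) {
    throw new Error(`queue violation: ${shape} chain not allowed on this queue`);
  }
  return shape;
}
```

In this model a malformed chain never produces a return value, so there is nothing a caller could mistake for a completable chain.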
Only validated chains can become completion tokens. Synchronous device code completes by moving one of those tokens into the queue’s publish path. The completion is a consumed value, not a reusable (head, len) pair.
The queue also checks balance. For ordinary synchronous devices, if a device returns after popping clean descriptor chains without publishing the matching used-ring entries, the transport is marked as needing reset. For async pop paths, every popped chain must be converted into exactly one deferred completion capability before the pop scope returns.
That is the first line of defense. A descriptor chain is either malformed, in which case the queue is reset, or it becomes a completion token that must move exactly once.
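A minimal sketch of the token-plus-balance idea, assuming a runtime check where a language with move semantics would use the type system. `CompletionToken`, `QueueScope`, and their methods are hypothetical names, not Kalahari's API.

```javascript
// Hypothetical model of a single-use completion token.
class CompletionToken {
  #head;
  #spent = false;
  constructor(head) {
    this.#head = head;
  }
  // Consume the token; a second call throws instead of yielding the head again.
  take() {
    if (this.#spent) throw new Error('completion token already consumed');
    this.#spent = true;
    return this.#head;
  }
}

// Hypothetical model of one synchronous queue-processing scope.
class QueueScope {
  constructor() {
    this.outstanding = 0;    // popped but not yet published
    this.used = [];          // stand-in for the used ring: { id, len }
    this.needsReset = false;
  }
  pop(head) {
    this.outstanding += 1;
    return new CompletionToken(head);
  }
  publish(token, len) {
    const id = token.take(); // throws on double completion, before any push
    this.used.push({ id, len });
    this.outstanding -= 1;
  }
  // Called when the device handler returns: unbalanced pops mark the
  // transport as needing reset.
  finish() {
    if (this.outstanding !== 0) this.needsReset = true;
    return !this.needsReset;
  }
}
```

The sketch covers both failure modes from the text: a second `publish` with the same token throws, and a handler that returns with a popped but unpublished chain leaves `needsReset` set.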
Queue Brands Stop Crossed Wires
Linear values are not enough by themselves. They also need identity.
Kalahari uses queue brands in two ways.
The compile-time brand is a fresh identity attached to one queue-processing scope. It is threaded through descriptor buffers, popped chains, prepared completions, and the async pop context. That brand prevents a completion or writable descriptor view produced from one queue view from escaping and being pushed through another queue view. Compile-time tests cover exactly those cases: completions cannot cross queue brands, and deferred completions cannot cross the pop-context brand.
The runtime identity captures the queue index and queue generation at pop time. Queue generation matters because a guest can reset a virtqueue and reuse the same queue slot, descriptor head, and ring addresses. A delayed completion that remembers only “queue 1, head 0” does not carry enough identity. It must also match the same generation of that queue.
When the deferred completion is later published, the virtio queue layer checks that the queue still exists, queue work is still enabled, and the captured identity still matches the current queue generation. If not, the result is discarded as stale before any guest response bytes are written.
pop on queue generation 7
await backend
guest resets queue to generation 8
old reply returns
completion is discarded before guest write
This distinction is important for snapshotting and for ordinary queue reset behavior. The same descriptor head can be valid twice in a VM’s lifetime. The queue brand and runtime identity decide which lifetime a completion belongs to.
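The runtime half of that check can be modeled in a few lines. This sketch assumes a generation counter that is bumped on every reset; `QueueSlot`, `popDeferred`, and `publishDeferred` are hypothetical names, and the compile-time brand has no JavaScript equivalent, so it is not shown.

```javascript
// Hypothetical model: a queue slot with a generation counter bumped on reset.
class QueueSlot {
  constructor(index) {
    this.index = index;
    this.generation = 0;
    this.enabled = true;
    this.used = [];          // stand-in for the used ring: { id, len }
  }
  reset() {
    this.generation += 1;
    this.used = [];
  }
  // Capture the runtime identity at pop time.
  popDeferred(head) {
    return Object.freeze({ queue: this.index, generation: this.generation, head });
  }
  // Returns 'published' or 'stale'. Stale results are dropped before any write.
  publishDeferred(completion, responseLen) {
    const live =
      this.enabled &&
      completion.queue === this.index &&
      completion.generation === this.generation;
    if (!live) return 'stale';
    this.used.push({ id: completion.head, len: responseLen });
    return 'published';
  }
}
```

A completion captured before `reset()` carries the old generation, so it is reported as stale even though the queue index and descriptor head are identical to a chain popped afterwards.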
Deferred Writable Regions Are Not Guest Pointers
The dangerous case is not a read-only request. It is a response that will be written later.
In a VM runtime, a “pointer” to the response destination is a guest physical address plus a length plus queue context. It is only safe under conditions that can change:
- the queue may have been reset
- the used ring may no longer be writable
- the response may be larger than the offered writable descriptors
- the device may be stopping for snapshot
- another queue instance may have reused the same descriptor head
Kalahari does not let async device code keep raw writable descriptor addresses. It converts writable descriptors into deferred writable regions: an opaque write plan that exposes capacity but not ambient guest-write authority. The async side can carry a deferred completion token, but it cannot write guest memory with it.
Publishing later goes back through the virtio queue layer. The queue layer checks the token’s queue identity, converts the response length into a checked used-ring length, opens a fresh queue view, reserves the next used-ring entry, validates the writable prefix, writes through an exact response writer, and then publishes the reserved completion.
The response writer is deliberately exact. If the response slice is longer or shorter than the checked used-ring length, the operation fails. If the response does not fit the captured writable capacity, it fails before writing. The only normal path is: owned response bytes in, exact guest write out, matching used-ring length published.
Deferred writable regions are not a convenience wrapper. They are how the VM runtime avoids storing ambient guest write authority inside arbitrary async work.
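One way to picture a deferred writable region is as an object that exposes its capacity and keeps its guest addresses private. The sketch below is an assumption-heavy model: guest memory is a plain `Uint8Array`, and `DeferredWritableRegion` and `writeExact` are hypothetical names. JavaScript cannot restrict who calls `writeExact`, so the model relies on async code never being handed the guest memory object.

```javascript
// Hypothetical model: the region exposes capacity only. The captured
// segments stay private to the class.
class DeferredWritableRegion {
  #segments; // [{ addr, len }] captured from the writable descriptors
  constructor(segments) {
    this.#segments = segments.map((s) => ({ addr: s.addr, len: s.len }));
  }
  get capacity() {
    return this.#segments.reduce((sum, s) => sum + s.len, 0);
  }
  // Meant to be called by the queue layer after its identity checks.
  // `checkedLen` is the used-ring length that will be published.
  writeExact(guestMemory, response, checkedLen) {
    if (response.length !== checkedLen) {
      throw new Error('response length does not match used-ring length');
    }
    if (checkedLen > this.capacity) {
      throw new Error('response exceeds writable capacity');
    }
    // All checks passed: scatter the response across the segments in order.
    let offset = 0;
    for (const s of this.#segments) {
      if (offset === response.length) break;
      const n = Math.min(s.len, response.length - offset);
      guestMemory.set(response.subarray(offset, offset + n), s.addr);
      offset += n;
    }
    return checkedLen;
  }
}
```

Both failure checks run before the first byte is copied, so a rejected response leaves the modeled guest memory untouched.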
Used-Ring Lengths Are Part of the Contract
The used ring does not just say “done.” Each used entry carries the descriptor head and the number of bytes written by the device.
That length is guest-visible truth. A wrong used-ring length can make the guest ignore valid response data, trust partial data as complete, or read bytes the device never meant to publish.
The virtio layer therefore treats used-ring lengths as checked values. Devices do not hand the queue a plain host integer at the end of an arbitrary write. The queue compares the requested byte count with the writable descriptor capacity and ensures it fits the virtio used-ring length field.
For response paths, the queue reserves the used-ring slot before writing guest buffers. It validates the scalar writes for the descriptor id, the used-ring length, and used.idx. The response is then written, and the queue publishes the used entry and advances used.idx with the ordering fences required by the virtio split-ring protocol.
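The check-reserve-write-publish order can be sketched as follows. This is a simplified model under stated assumptions: `UsedRing`, `checkedUsedLen`, and `completeWithResponse` are hypothetical names, the ring is an in-memory array rather than guest memory, and the memory fences a real device needs are only noted in a comment.

```javascript
const U32_MAX = 0xffffffff;

// Turn a plain byte count into a checked used-ring length, or throw.
function checkedUsedLen(byteCount, writableCapacity) {
  if (!Number.isInteger(byteCount) || byteCount < 0 || byteCount > U32_MAX) {
    throw new Error('length does not fit the 32-bit used-ring len field');
  }
  if (byteCount > writableCapacity) {
    throw new Error('length exceeds writable descriptor capacity');
  }
  return byteCount;
}

// Hypothetical split-ring model: `idx` is a free-running 16-bit counter and
// the slot is idx modulo the queue size.
class UsedRing {
  constructor(queueSize) {
    this.queueSize = queueSize;
    this.entries = new Array(queueSize).fill(null);
    this.idx = 0;
  }
  reserve() {
    return this.idx % this.queueSize;
  }
  publish(slot, id, len) {
    this.entries[slot] = { id, len };
    // A real device needs a write barrier here so the entry is visible
    // before the index update; this in-memory model has no equivalent.
    this.idx = (this.idx + 1) & 0xffff;
  }
}

function completeWithResponse(ring, head, response, capacity, writeResponse) {
  const len = checkedUsedLen(response.length, capacity); // check first
  const slot = ring.reserve();                           // then reserve
  writeResponse(response);                               // then write guest buffers
  ring.publish(slot, head, len);                         // then publish
}
```

Because the length check runs first, an oversized response fails before the write callback is invoked and before `idx` moves.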
The network device shows why this matters.
On TX, the networking device treats the chain as readable, requires the 12-byte modern virtio-net header, rejects unsupported offload headers, bounds the total packet size, and completes the chain with zero written bytes. Before it sends to the networking backend, it validates that the next completion can be published, so a backend side effect is not committed when the used ring is already invalid.
On RX, the order is even tighter. The networking backend hands the device a packet lease. The device pops a writable descriptor chain, computes the virtio-net header plus packet payload length, prepares that exact writable response, builds an owned response buffer, publishes it through the prepared completion path, and only then commits the backend lease. The used-ring length is exactly the header plus packet payload. If the guest offered too little writable capacity, the backend packet is not consumed as delivered.
That is the pattern: backend effects and guest effects are ordered around a checked completion token, not around informal control flow.
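The RX ordering can be modeled as a single function. The names here are assumptions for illustration: `lease` is an object with `packet` and `commit()`, and `chain` is an object with `capacity` and `publish(bytes)`. The sketch fills only the `num_buffers` field of the 12-byte header and leaves the rest zeroed.

```javascript
const VIRTIO_NET_HDR_LEN = 12; // modern virtio-net header size

// Hypothetical model of RX delivery ordered around the completion.
function deliverRx(lease, chain) {
  const total = VIRTIO_NET_HDR_LEN + lease.packet.length;
  if (total > chain.capacity) {
    // Too little writable space: leave the lease uncommitted.
    return { delivered: false, usedLen: 0 };
  }
  const frame = new Uint8Array(total); // owned response buffer, zero-filled
  frame[10] = 1;                       // num_buffers = 1 (little-endian u16 at offset 10)
  frame.set(lease.packet, VIRTIO_NET_HDR_LEN);
  chain.publish(frame);                // guest-visible effect first
  lease.commit();                      // backend effect only after publish
  return { delivered: true, usedLen: total };
}
```

If `publish` throws in this model, `commit` is never reached, which mirrors the rule that the backend packet is not consumed unless the guest-visible completion went out.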
FUSE Turns It Into a Stress Test
Virtio-fs is the hardest version of the problem because FUSE is both async and byte-oriented.
The virtio-fs device is intentionally thin. The real work happens in the VM device layer and the FUSE protocol dispatcher.
When a request queue is kicked, the device layer enters a branded pop context. Inside that context it converts each popped chain into a split request/response chain, copies the readable descriptors into a bounded owned request buffer, and creates a deferred completion token from the writable side. After that point, the async FUSE backend owns request bytes and a deferred completion token, but not a writable guest pointer.
FUSE parsing then ignores descriptor boundaries and operates on the canonical byte stream:
[FUSE input header][opcode-specific args][optional payload]
That detail matters. A lookup name split across descriptors is still one lookup name. A malformed payload is still malformed even if the descriptor shape was legal.
The parser bounds copied request bytes, data payloads, names, and batch-forget entries. It validates fixed-size requests, NUL-terminated names, two-name payloads, write sizes, xattr bodies, and backend reply sizes. Most malformed requests become an EINVAL FUSE reply. No-reply operations such as forget and interrupt requests preserve their no-reply semantics.
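To make the "canonical byte stream" point concrete, here is a small sketch that gathers descriptor contents into one bounded buffer and then parses a LOOKUP request from it. The 40-byte header size and the LOOKUP opcode value follow the Linux FUSE ABI; everything else (`gatherRequest`, `parseLookup`, the 4096-byte bound, and the rule that the header `len` must equal the gathered byte count) is an assumption made for the example.

```javascript
const FUSE_IN_HEADER_LEN = 40;  // size of struct fuse_in_header
const FUSE_LOOKUP = 1;
const MAX_REQUEST_BYTES = 4096; // hypothetical bound for this sketch

// Concatenate readable descriptor contents into one bounded owned buffer.
function gatherRequest(readableParts) {
  const total = readableParts.reduce((n, p) => n + p.length, 0);
  if (total > MAX_REQUEST_BYTES) throw new Error('request too large');
  const out = new Uint8Array(total);
  let offset = 0;
  for (const p of readableParts) {
    out.set(p, offset);
    offset += p.length;
  }
  return out;
}

// Parse a LOOKUP request. Returns { unique, nodeid, name } or { errno: 'EINVAL' }.
function parseLookup(bytes) {
  if (bytes.length < FUSE_IN_HEADER_LEN) return { errno: 'EINVAL' };
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const len = view.getUint32(0, true);
  const opcode = view.getUint32(4, true);
  if (len !== bytes.length || opcode !== FUSE_LOOKUP) return { errno: 'EINVAL' };
  const body = bytes.subarray(FUSE_IN_HEADER_LEN);
  // The name must be non-empty and the only NUL must be the final byte.
  const nul = body.indexOf(0);
  if (nul <= 0 || nul !== body.length - 1) return { errno: 'EINVAL' };
  return {
    unique: view.getBigUint64(8, true),
    nodeid: view.getBigUint64(16, true),
    name: new TextDecoder().decode(body.subarray(0, nul)),
  };
}
```

The parser never sees descriptor boundaries, so a name split across two descriptors parses the same as one that arrived contiguously.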
The backend returns an owned FUSE reply, not a guest write. The reply is encoded into a host-owned response frame with a FUSE output header and typed body fragments. Only after that does the device layer consume the stored completion token and ask the virtio queue layer to publish the response bytes.
The popped FUSE request stores its completion as a single-use value and takes it before publishing. A second attempt to publish the same descriptor has no token left. That is exactly-once completion enforced in the device path, not just in the queue core.
Snapshot Quiescence Needs the Same Proof
The ownership model pays off most visibly at snapshot boundaries.
A Kalahari zygote is only valid if no host task can still mutate the VM after the parent becomes the template. That includes virtio completions. It is not enough for vCPUs to stop. The VM runtime must also establish that device work has either completed, faulted, or been discarded in a state the snapshot code understands.
The VM run loop reflects that ordering. When a run is ending, it stops the VM, waits for vCPUs, awaits the device loop, awaits the virtio-fs worker, converts final drain failures into hard errors, checks IPC transport snapshot quiescence, and only then saves CPU and interrupt-controller state into the memory-backed VM state.
The device loop does a final drain of pending wake sources. Synchronous devices are drained with a bounded round count; exhaustion fails closed. The filesystem worker stops popping new descriptors but lets in-flight FUSE requests finish with their real replies. It does not synthesize fake completions just because shutdown began. If virtio-fs cannot quiesce within the timeout, the run fails instead of saving a questionable snapshot.
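The bounded drain for synchronous devices can be sketched as a loop that fails closed. `finalDrain` and its `drainOnce` callback, which returns `true` when a round did work, are hypothetical names for illustration.

```javascript
// Hypothetical model: run drain rounds until one finds nothing to do.
function finalDrain(drainOnce, maxRounds) {
  for (let round = 0; round < maxRounds; round += 1) {
    if (!drainOnce()) return { quiesced: true, rounds: round };
  }
  // Every allowed round still found work: fail closed rather than
  // proceed to a snapshot with unsettled device state.
  throw new Error('final drain did not quiesce within the round bound');
}
```

The design choice the sketch illustrates is that exhaustion is an error, not a warning: the caller cannot reach the snapshot step with a result that says "probably drained".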
After a healthy run, converting the sandbox into a zygote consumes the parent, closes live backend resources, releases mapped device state, and only then exposes the frozen template. That lifecycle matters because guest memory and device state live in shared mappings. A backend still holding a writable continuation is not compatible with a frozen parent.
Queue brands, completion tokens, deferred writable regions, and exact used-ring lengths are what make that validation local. A stale async FUSE reply cannot write through an old descriptor after queue reset. A sync device cannot quietly leak a popped descriptor. A response writer cannot publish a length different from the bytes it wrote. A snapshot cannot succeed while the VM runtime knows device work is still unsettled.
The User Sees the Absence of Bugs
A Kalahari user should be able to write ordinary code:
await sandbox.writeFile('/tmp/app.js', 'console.log(1)');
const result = await sandbox.run('node', { args: ['/tmp/app.js'] });
That can cross a container filesystem, virtio-fs, FUSE, network setup, console output, the guest IPC transport, and zygote machinery. The user should get a command result or a clear error. They should not get a hung process because a descriptor was consumed without completion. They should not get a child VM mutated by an async reply that belonged to its parent.
The way to make that boring is to make virtio ownership boring.
Every accepted descriptor has one owner. Every completion token moves once. Every deferred write is represented as a bounded, typed operation. Every used-ring length is checked against the writable capacity and the bytes actually produced. Every queue reset changes the identity that delayed completions must match.
Virtio is a protocol, but protocol correctness is not just parsing. It is ownership. In Kalahari’s virtio layer, the safest completion is the one that cannot be expressed twice.