Vsock Doesn't Survive Snapshots. A Shared-Memory Ring Does.
Why Kalahari's host-guest IPC is an SPSC ring buffer over shared memory instead of vsock, and what Firecracker and SmolVM say about where vsock state lives.
When you’re sketching out host↔guest IPC for a microVM, vsock is the obvious answer. It’s a standard virtio device, it’s in upstream Linux, every serious VMM supports it, and the userland API is just socket(AF_VSOCK, ...). You get a connection, you read and write, you’re done. For a fresh microVM that runs once and exits, that’s fine.
It stops being fine the moment you want to snapshot a running microVM and restore it later, possibly in a different host process. Kalahari does want to do that: zygote snapshots are how createSandbox() becomes 117 ms instead of “however long npm install takes.” So the IPC has to survive the snapshot boundary.
Vsock doesn’t: listening sockets survive restore, but established connections do not. This post explains why, what we use instead, and how others route around the same problem.
What Vsock Promises Across Restore
Firecracker is explicit about this in its own documentation. From snapshot-support.md:
Vsock connections that are open when the snapshot is taken are closed […] When the VM becomes active again, the vsock driver closes all existing connections.
The mechanism is a VIRTIO_VSOCK_EVENT_TRANSPORT_RESET event the VMM injects into the guest at restore time. The guest’s vsock driver sees the event and tears down every active connection, because the alternative is letting userland keep using a connection whose host-side endpoint has gone away.
Listening sockets survive. Established connections do not. So if you build a control protocol where the host opens a connection to the guest agent at sandbox creation and uses it for the lifetime of the sandbox, that protocol does not survive a snapshot/restore cycle. The guest agent has to re-accept, the host has to re-connect, both sides have to re-handshake whatever application protocol they were running, and any in-flight messages are gone.
You can build around that. You can make the protocol stateless above the transport, retry every operation, and treat reconnection as the steady state. People do. But this is paying complexity and latency on every snapshot to work around a transport that wasn’t designed for the lifecycle you actually have.
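Concretely, "retry every operation" ends up looking something like the minimal sketch below. It's written generically because the exact vsock binding varies; `connect` and `rpc` are stand-ins for the socket setup and whatever application protocol rides on top, not anyone's real API:

```rust
use std::io;
use std::thread::sleep;
use std::time::Duration;

// Reconnect-as-steady-state: every call pays for the possibility that
// a restore happened since the last one (fresh connect, fresh handshake).
fn call_with_reconnect<C, T>(
    mut connect: impl FnMut() -> io::Result<C>,
    mut rpc: impl FnMut(&mut C) -> io::Result<T>,
) -> io::Result<T> {
    const MAX_ATTEMPTS: u32 = 5;
    let mut last_err = io::Error::new(io::ErrorKind::Other, "no attempts made");
    for _ in 0..MAX_ATTEMPTS {
        match connect().and_then(|mut conn| rpc(&mut conn)) {
            Ok(resp) => return Ok(resp),
            // A restore closes established vsock connections under us,
            // so any transport error means "reconnect and retry".
            Err(e) => last_err = e,
        }
        sleep(Duration::from_millis(50));
    }
    Err(last_err)
}
```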
Where Vsock State Actually Lives
The reason vsock can’t keep connections is not a Firecracker policy choice. It’s where the state lives.
A vsock connection has state in three places:
- The guest kernel’s vsock socket table. The TCP-like state machine, sequence numbers, send/receive buffers, the struct vsock_sock that userland holds an FD into.
- The VMM’s virtio-vsock device backend. The host-side counterpart: queue state, connection tracking, and the UDS or AF_VSOCK file descriptor that bridges to the host process.
- The host process(es) that the connection terminates in. A running accept-loop, or the process that called connect.
To restore a connection cleanly, you’d need to restore all three sides consistently. The guest kernel’s table can be restored along with the rest of guest memory. The VMM’s backend state can be persisted to the snapshot file. But the host process is a separate concern: it might be a fresh process after restore, with no knowledge of the previous connection, no matching socket file descriptor, and possibly running on a different machine.
Even if you fix all of that, the protocol itself is the wrong shape. TCP-style sequence numbers exist to detect packet reordering and loss across an unreliable network. A snapshot/restore cycle is not packet loss; it’s a one-shot teleport. For this lifecycle, carrying flow-control state across a teleport just to throw it away on the other side is overhead with little payoff.
What We Use Instead
Kalahari’s host↔guest IPC is a single-producer/single-consumer (SPSC) ring buffer over a shared-memory region the VMM gives to the guest as a virtio-pmem device. The shared region holds two unidirectional rings (host→guest and guest→host) plus a small header with magic, version, and head/tail cursors. Each ring has exactly one producer and one consumer; bidirectional traffic is built by pairing two rings rather than making one multi-writer queue. The data lives in the shared region. The cursors live in the shared region. The framing lives in the shared region.
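As a rough picture of that layout (field names, widths, and ordering here are illustrative, not Kalahari's actual wire format):

```rust
use std::sync::atomic::AtomicU64;

// A minimal sketch of the shared region's header, assuming a
// fixed-offset #[repr(C)] layout both sides agree on.
#[repr(C)]
struct RingHeader {
    magic: u64,   // identifies the region as an IPC ring
    version: u32, // bumped on incompatible layout changes
    _pad: u32,
    // One cursor pair per unidirectional ring. The producer alone
    // writes `head`; the consumer alone writes `tail`. That
    // single-writer rule is what makes each ring SPSC.
    h2g_head: AtomicU64, // host→guest: next slot the host will fill
    h2g_tail: AtomicU64, // host→guest: next slot the guest will drain
    g2h_head: AtomicU64, // guest→host
    g2h_tail: AtomicU64,
}
// The two payload areas follow the header inside the same mapping, so
// a snapshot of guest memory captures header, cursors, and in-flight
// frames together.
```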
The doorbell, the mechanism that says “wake up and look at the ring,” is a separate one-byte vport on a virtio-console multiport device. A kick on the vport carries no state. It’s a level-trigger style notification: if you missed it, the next time you peek at the ring you’ll see whatever’s there. Idempotent.
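For concreteness, the kick can be a single-byte write to the port's device node. The path below is an assumption; virtio-console ports typically surface in the guest as /dev/vportNpM:

```rust
use std::fs::File;
use std::io::{self, Write};

// Path is an assumption standing in for whichever vport the VMM wires up.
fn open_doorbell() -> io::Result<File> {
    File::options().write(true).open("/dev/vport1p1")
}

// The byte's value is meaningless; only the wakeup counts. Duplicate or
// lost kicks are safe because the consumer re-checks the ring before
// blocking.
fn kick_doorbell(vport: &mut File) -> io::Result<()> {
    vport.write_all(&[1u8])
}
```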
That decomposition is the part that matters for snapshot/restore.
When the VMM snapshots the guest, it writes the guest’s memory pages to the snapshot file. The shared region is part of guest memory: in Kalahari, the virtio-pmem region used for IPC is deliberately included in the snapshot image rather than treated as an external host-only mapping. Its bytes, including the head/tail cursors and any in-flight payload, are saved verbatim. When the VMM restores, it maps those pages back into the new guest, and the ring header is byte-for-byte what it was. The guest agent doesn’t need to know the snapshot happened: it re-attaches to the same shared region (the GPA is in /proc/cmdline), validates the magic and version, and resumes where it left off.
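A sketch of that re-attach path, assuming a kalahari.ipc_gpa=0x... kernel parameter; the key name and the magic constant are invented for illustration:

```rust
use std::fs;
use std::io;

// Find the shared region's guest-physical address on the kernel cmdline.
fn ipc_region_gpa() -> io::Result<u64> {
    let cmdline = fs::read_to_string("/proc/cmdline")?;
    cmdline
        .split_whitespace()
        .find_map(|kv| kv.strip_prefix("kalahari.ipc_gpa="))
        .and_then(|v| u64::from_str_radix(v.trim_start_matches("0x"), 16).ok())
        .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "ipc_gpa not on cmdline"))
}

// Validate the header before touching either ring. The constant is
// illustrative ("KALAHARI" as ASCII bytes), not the real value.
const RING_MAGIC: u64 = 0x4b41_4c41_4841_5249;

fn validate_header(magic: u64, version: u32) -> bool {
    magic == RING_MAGIC && version == 1
}
```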
The doorbell is the one piece that is not stateful in this design, and that’s deliberate. After restore, both sides may see one missed kick or one duplicate kick. Neither matters, because the consumer’s loop is “check the ring; if empty, block on the doorbell.” A spurious wakeup is harmless. A missed wakeup is recovered the next time the producer kicks.
```rust
// Reader side, simplified:
loop {
    while let Some(frame) = reader.try_peek()? {
        handle(frame);
        reader.advance()?;
    }
    reader.wait_kick()?; // a missed kick is harmless;
                         // we'll see new data on the next peek anyway.
}
```

This is not novel. It is the standard SPSC-over-shmem pattern that virtio queues themselves use. The point is that we use it for the application-level transport, not just inside the virtio device model, so the application’s transport state survives anything that preserves the shared-memory region. Snapshot/restore preserves the shared-memory region. Forking a zygote preserves the shared-memory region. Migrating between processes on the same host preserves the shared-memory region.
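The producer side, equally simplified and with assumed names, shows the one ordering rule the pattern depends on: payload bytes are written before the head bump, and the bump is a Release store so a reader that sees the new head also sees the bytes:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch only: real shared-memory code would copy through raw pointers
// or atomics rather than take &mut to a concurrently-read buffer.
fn push_frame(ring: &mut [u8], head: &AtomicU64, payload: &[u8]) {
    let h = head.load(Ordering::Relaxed) as usize; // we are the only writer
    let off = h % ring.len();
    assert!(off + payload.len() <= ring.len(), "sketch: wrap-around elided");
    ring[off..off + payload.len()].copy_from_slice(payload);
    // Publish, then kick the doorbell (see kick_doorbell above).
    head.store((h + payload.len()) as u64, Ordering::Release);
}
```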
How Others Route Around It
CelestoAI’s SmolVM is a useful comparison because it ships a similar product (microVM-as-a-sandbox), targets a similar audience, and runs into the same problem.
SmolVM’s primary control channel is SSH over a forwarded TCP port. From its QEMU runtime:
```python
hostfwd_rules = [f"hostfwd=tcp:127.0.0.1:{ssh_port}-:22"]
```

The VMM exposes the guest’s port 22 on a host loopback port. Anything that wants to talk to the guest opens an SSH connection to 127.0.0.1:<port>. Vsock is supported as an opt-in device (see VsockConfig in their types.py), but the default control path goes through SSH on TCP.
That’s a perfectly reasonable choice, and it’s also a tell. SSH-over-TCP-on-loopback survives snapshot/restore the same way TCP-on-loopback survives any process-level disconnect: by reconnecting. The application protocol on top, SSH, has its own reconnect/retry semantics; the SDK opens a fresh session per call. The transport’s snapshot story is “don’t try; just reconnect.”
Different stack, same observation: connection-oriented host↔guest transports do not survive a snapshot/restore boundary, and the workable architectures either reconnect (SmolVM) or move the transport state into shared memory the VMM owns (Kalahari).
The Kalahari approach trades a bit of subtlety in the ring buffer’s adversarial-writer logic (a misbehaving peer must not be able to push the reader into a runaway loop, which is why try_peek snapshots the writer’s head once per call and bounds the skip distance) for a transport that just keeps working across snapshot, restore, fork, and zygote spawn. The SSH approach trades a reconnect on every restore for a familiar protocol everyone already understands.
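A sketch of that defensive read, with assumed names: the writer-controlled head is loaded exactly once per call and sanity-checked before the reader walks anywhere.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const RING_CAPACITY: u64 = 1 << 20; // illustrative

// `head` lives in shared memory and is writer-controlled, i.e.
// potentially adversarial after a guest compromise.
fn try_peek(head: &AtomicU64, tail: u64) -> Option<(u64, u64)> {
    // Single snapshot: the writer moving `head` mid-call can't turn
    // this into a loop that never observes "caught up".
    let head_snap = head.load(Ordering::Acquire);
    let available = head_snap.wrapping_sub(tail);
    if available == 0 {
        return None; // caught up; caller blocks on the doorbell
    }
    if available > RING_CAPACITY {
        // Writer claims more bytes than the ring can hold: treat the
        // ring as corrupt rather than chasing the cursor (real code
        // would surface an error here).
        return None;
    }
    // Real code would parse and bounds-check a frame header here.
    Some((tail % RING_CAPACITY, available))
}
```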
Why This Matters Beyond Snapshots
Snapshot/restore is the case that breaks vsock most visibly, but the same design constraint (keep transport state in bytes that can be preserved and re-attached) pays off elsewhere too.
Zygote children. When Kalahari forks a child microVM from a zygote, the child sees the same shared-memory region as the parent (until it diverges). Connection state in shared memory is forkable. Connection state in kernel socket tables is not, at least not as cheaply.
VMM crash recovery. If the VMM process exits and restarts, a vsock connection is dead by definition: the host-side endpoint went away. A shared-memory ring whose backing pages are preserved (for example via memfd or a host-side mmap) can be re-attached by the new VMM process as long as the bytes are still there; a sketch follows this list.
Cross-process IPC inside the VMM. The same primitive is used between Kalahari’s hypervisor worker process and the VMM main process. There is no vsock involved at all in that path, and the worker can be restarted without losing the IPC ring state.
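Returning to the crash-recovery case: the host-side backing can be as simple as a memfd created once and parked with a supervisor, so any later VMM process that receives the fd can mmap the same pages. A sketch assuming the libc crate; Kalahari's actual backing may differ:

```rust
use std::ffi::CString;
use std::io;
use std::os::fd::{FromRawFd, OwnedFd};

// Anonymous memory that lives as long as someone holds the fd, not as
// long as any particular process. The name is for debugging only.
fn create_ipc_backing(len: usize) -> io::Result<OwnedFd> {
    let name = CString::new("kalahari-ipc-ring").unwrap();
    // SAFETY: memfd_create takes a NUL-terminated name and flags.
    let fd = unsafe { libc::memfd_create(name.as_ptr(), libc::MFD_CLOEXEC) };
    if fd < 0 {
        return Err(io::Error::last_os_error());
    }
    let owned = unsafe { OwnedFd::from_raw_fd(fd) };
    // Size the region; the ring header and payloads live inside it.
    if unsafe { libc::ftruncate(fd, len as libc::off_t) } != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(owned)
}
```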
In other words, the choice isn’t “ringbuf for snapshot, vsock for everything else.” The choice is “ringbuf for everything, because the snapshot case is the load-bearing one and you don’t want two transports.”
What We’re Not Saying
We’re not saying vsock is bad. For a one-shot microVM with no snapshot/restore lifecycle, vsock is fine, and the userland API (socket(AF_VSOCK, ...)) is friendlier than mmap plus a custom protocol. If your sandboxes don’t outlive a single boot, you should probably just use vsock.
We’re saying that the transport you pick has to match the lifecycle you have. Kalahari’s lifecycle includes “snapshot a prepared sandbox, then spawn N children from that snapshot.” A transport that doesn’t survive that lifecycle is a transport you’ll spend the rest of the project working around.