Toward Zero-Copy OCI Layers
How Kalahari turns OCI image layers into page-aligned EROFS images, hands them to the guest as virtio-pmem with DAX, and lets overlayfs do the stacking.
Container images are layered, but most runtimes don’t keep them that way at the storage edge. They unpack tarballs into one shared graph driver, walk those layers with an overlay implementation that lives in the host kernel, and bind-mount the result into the workload.
Kalahari takes a different approach. Each OCI layer becomes its own EROFS image on the host, addressed by the SHA256 of the original tar layer. Those images are exposed to the guest as virtio-pmem-backed ranges with DAX. The guest stacks them into the workload’s root filesystem with ordinary overlayfs.
This post is about why “one EROFS per layer” is a bigger choice than it sounds, and what falls out of it.
What’s Wrong with Flattening
The simplest path is to flatten an OCI image into a single root filesystem. Pull each layer in order, apply each on top of the last, write out one tree. That tree is what the workload mounts.
Flattening sounds like a one-line story until you ask:
- A second image shares 90% of its layers with the first one. Do you re-fetch and re-flatten every time?
- A layer hash you’ve already converted appears in a third image. Do you remember the conversion?
- A multi-gigabyte base image needs to land in memory. Where does the unpacked tree live during conversion?
The honest answers are: yes, no, and “in RAM, hopefully.” Caching by image identity instead of by layer identity throws away the structure that registries already give you.
Kalahari’s storage layer keeps that structure. The unit of caching is the layer, keyed by the OCI blob digest: the SHA256 of the compressed layer blob the registry already published, not the uncompressed diffID. The builder decompresses that blob to build the EROFS image, but the cache key stays the registry-published digest. Two images that share a layer share its EROFS blob on disk, byte for byte.
```mermaid
flowchart LR
subgraph py312["python:3.12-slim"]
L1["layer sha256:abc..."]
L2["layer sha256:def..."]
L3["layer sha256:ghi..."]
end
subgraph py313["python:3.13-slim"]
L4["layer sha256:abc..."]
L5["layer sha256:def..."]
L6["layer sha256:jkl..."]
end
subgraph store["Content-addressed blob store"]
B1["blobs/abc... (EROFS)"]
B2["blobs/def... (EROFS)"]
B3["blobs/ghi... (EROFS)"]
B4["blobs/jkl... (EROFS)"]
end
L1 --> B1
L2 --> B2
L3 --> B3
L4 -. reused .-> B1
L5 -. reused .-> B2
L6 --> B4
```

The cache hit is content-addressed. It does not depend on the image tag, the registry, or any metadata that can drift.
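To make the lookup concrete, here is a minimal sketch of a digest-keyed store. The store root and function names are illustrative, not Kalahari’s actual API:

```typescript
import { existsSync } from 'node:fs';
import { join } from 'node:path';

// Hypothetical store root; the real location is an implementation detail.
const BLOB_STORE = '/var/lib/kalahari/blobs';

// The key is the manifest's digest for the *compressed* layer blob,
// e.g. "sha256:abc...", exactly as the registry publishes it.
function erofsPathFor(blobDigest: string): string {
  return join(BLOB_STORE, blobDigest.replace(/^sha256:/, ''));
}

// A hit means some image already fetched and converted this layer;
// the tag and registry of that image are irrelevant to the key.
function isCached(blobDigest: string): boolean {
  return existsSync(erofsPathFor(blobDigest));
}
```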
Why EROFS
EROFS is a read-only filesystem in the upstream Linux kernel, designed for compact images that mount fast. For a sandbox that wants to expose immutable image data to a guest VM, it’s a good fit for reasons that aren’t obvious until you go looking:
- It’s read-only by design, which matches the “this layer never changes” semantics of an OCI layer exactly.
- The on-disk layout is compact: small files can be tail-packed inline into inode metadata so they don’t pay a block-allocation cost.
- The block size can be set at build time, so it can be aligned with the guest page size.
- It supports DAX. Pages of an EROFS image, when backed by persistent memory or a virtio-pmem device, can be mapped directly into the guest’s page table without going through a guest page cache.
That last property is the one that decides the host/guest hand-off.
Streaming the Build
Before any of that matters, you have to actually build the EROFS image. This is where naïve approaches fall over.
A base image layer can be hundreds of megabytes uncompressed. Multiple layers can be in flight at once. If the builder buffers the entire layer in memory, importing a real image starts to look like an OOM test.
Kalahari’s EROFS builder is streaming. Tar entries arrive one at a time from a decompressed stream. File data is written straight through to the output as the entry is consumed. The builder never holds the full file body in memory, regardless of file size.
The on-disk layout is “data first, metadata last.” File blocks are written as they stream in. Inode tables, directory entries, and the superblock pointer to the metadata region are emitted at the end, after all file data has been placed. Memory usage is O(metadata): proportional to the number of inodes, not the total bytes.
```mermaid
flowchart LR
HTTP["HTTP fetch"] --> GZ["gunzip stream"] --> TAR["tar entries"] --> BLD["push_file(path, meta, size, reader)"]
BLD --> O1["file blocks streamed straight to output"]
BLD --> O2["inode table updated incrementally"]
BLD --> O3["on finalize:<br/>superblock + meta region"]
```

The builder API is small enough to describe in a sentence: hand it a tar reader and a writer that supports finalization, and it produces an EROFS image. There is no intermediate step where a flattened directory tree exists on disk or in RAM.
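To pin down the memory claim, here is a sketch of that push-file shape. The names are illustrative and it ignores tail-packing of small files; this is the streaming contract, not the real builder:

```typescript
// Per-file state the builder retains: O(metadata), independent of file size.
interface InodeRecord {
  path: string;
  mode: number;
  startBlock: number; // where this file's data begins in the output
  byteLen: number;
}

const BLOCK = 4096; // EROFS block size, pinned to the guest page size

async function pushFile(
  out: { write(chunk: Buffer): void },
  inodes: InodeRecord[],
  path: string,
  mode: number,
  size: number,
  reader: AsyncIterable<Buffer>,
  startBlock: number,
): Promise<number> {
  let written = 0;
  for await (const chunk of reader) {
    out.write(chunk); // data streams straight through; the body is never held whole
    written += chunk.length;
  }
  if (written !== size) throw new Error('tar entry size mismatch');
  // Pad to a block boundary so the next file's data starts block-aligned.
  const pad = (BLOCK - (written % BLOCK)) % BLOCK;
  if (pad > 0) out.write(Buffer.alloc(pad));
  inodes.push({ path, mode, startBlock, byteLen: written }); // emitted at finalize
  return startBlock + Math.ceil(written / BLOCK); // next free block for the caller
}
```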
Whiteouts Are Translated, Not Merged
OCI represents whiteouts as tar entries named .wh.NAME and .wh..wh..opq; overlayfs represents them on disk as 0/0 character devices and trusted.overlay.opaque=y xattrs. Both conventions are documented (OCI image-spec for the tar form, the kernel overlayfs docs for the on-disk form), and the translation between them is mechanical.
A .wh.foo entry in a higher layer means “in the merged view, foo from a lower layer should not exist.” .wh..wh..opq means “everything from lower layers is hidden under this directory.”
A flattening builder has to merge these. It needs to remember which paths each layer hides, walk the merged view across all lower layers, and emit one combined result. It is doing the overlay’s job at build time, and across the entire image.
A per-layer EROFS builder doesn’t merge. It translates from OCI tar form to overlayfs form: each .wh.NAME becomes a character device with rdev = 0 at NAME, and each .wh..wh..opq becomes a trusted.overlay.opaque=y xattr on the parent directory’s inode. Both translations are constant-time per entry, not a merge across layers. Each layer’s EROFS image still describes only that layer’s additions, modifications, and (now) deletions, expressed in the form overlayfs already knows how to consume.
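The translation is small enough to sketch in full. The marker strings come from the OCI image spec; the names here are illustrative:

```typescript
const OPAQUE_MARKER = '.wh..wh..opq';
const WHITEOUT_PREFIX = '.wh.';

type OverlayForm =
  | { kind: 'opaque'; dir: string }     // trusted.overlay.opaque=y xattr on dir's inode
  | { kind: 'whiteout'; path: string }  // 0/0 character device emitted at path
  | { kind: 'plain'; path: string };    // ordinary entry, carried through unchanged

function translate(tarPath: string): OverlayForm {
  const slash = tarPath.lastIndexOf('/');
  const dir = slash >= 0 ? tarPath.slice(0, slash) : '.';
  const name = tarPath.slice(slash + 1);
  if (name === OPAQUE_MARKER) return { kind: 'opaque', dir }; // check before the prefix
  if (name.startsWith(WHITEOUT_PREFIX)) {
    const hidden = name.slice(WHITEOUT_PREFIX.length);
    return { kind: 'whiteout', path: dir === '.' ? hidden : `${dir}/${hidden}` };
  }
  return { kind: 'plain', path: tarPath };
}
```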
The deletion semantics show up later, in the guest, where overlayfs reads those chardevs and that xattr exactly as it would on any other backing filesystem. Kalahari’s test suite covers whiteouts and opaque directories across lowerdir-only EROFS stacks specifically, since this is the compatibility point between OCI layer semantics and overlayfs semantics. The builder doesn’t need to model cross-layer overlay merging, and the host doesn’t need to keep a merged view at all.
This is the right division of labor. The Linux kernel already has overlayfs. Reimplementing its merge logic inside an EROFS builder would be duplicating the wrong half of the system; emitting the on-disk markers it expects is the small half that actually has to live somewhere.
Page-Aligned Block Size
The EROFS block size in Kalahari is pinned to the guest page size: 4 KiB. The guest kernel uses 4 KiB pages on every supported platform, including macOS Apple Silicon hosts where the host kernel uses 16 KiB pages. (On Apple Silicon, the pmem device is mapped at a 16 KiB-aligned guest physical address for hv_vm_map compatibility, but the guest still operates at 4 KiB granularity internally. The same EROFS images, the same overlayfs stack, the same dm-linear carving all run inside that Linux guest under Hypervisor.framework just as they do under KVM.)
This sounds like a small technicality. It is the thing that makes DAX work.
DAX (direct access) lets a filesystem expose its blocks to user space as memory mappings without going through a separate page cache. For DAX to work on this path, the filesystem block boundaries need to line up with the guest kernel’s page boundaries. Otherwise the kernel can’t install a clean direct mapping for those filesystem blocks and has to fall back to a less direct path.
When EROFS blocks and guest pages match, mapping a file from the EROFS image into the guest looks like:
- Guest userland calls mmap on a file in the mounted EROFS.
- The guest kernel resolves the file’s blocks to addresses inside the virtio-pmem device.
- The guest kernel installs page table entries that point at those addresses directly.
- There is no intermediate guest page cache. No copy. No double-buffering.

For mmap and exec paths, file access becomes page-table mapping plus ordinary memory access instead of a copy into the guest page cache.
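In guest-agent terms, each lower mount is a single call. A sketch, with shelling out to mount(8) standing in for whatever the real agent does, and illustrative device and target paths:

```typescript
import { execFileSync } from 'node:child_process';

// Mount one read-only EROFS layer with DAX, so file pages map straight
// from the pmem range into guest page tables with no guest page cache.
function mountLayer(device: string, target: string): void {
  execFileSync('mount', ['-t', 'erofs', '-o', 'ro,dax', device, target]);
}

mountLayer('/dev/pmem0', '/lower/0');
```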
virtio-pmem Is the Hand-off
Each EROFS layer is exposed to the guest as a DAX-capable block range, not as a virtio-blk disk and not as a virtio-fs share.
Virtio-pmem is the virtio device class for “this region of physical-address space is byte-addressable persistent memory.” From the guest’s point of view, it looks like an NVDIMM. The guest can mount a filesystem on it with DAX, and reads of files in that filesystem become loads against the device’s physical address range.
The host side is even simpler. The EROFS blob is mmap’d once on the host. The mapping is exposed to the guest as the pmem device’s backing memory. There is no host-side per-read request path for file data; the guest reads mapped memory directly.
In the simple model, each layer gets its own pmem-backed device. Those devices are mounted at temporary paths inside the guest before the workload starts.
```mermaid
flowchart LR
subgraph host["Host"]
BA["blobs/abc... (EROFS)"]
BD["blobs/def... (EROFS)"]
BG["blobs/ghi... (EROFS)"]
end
subgraph guest["Guest"]
P0["/dev/pmem0"] -- "mount EROFS dax" --> M0["/lower/0"]
P1["/dev/pmem1"] -- "mount EROFS dax" --> M1["/lower/1"]
P2["/dev/pmem2"] -- "mount EROFS dax" --> M2["/lower/2"]
end
BA -- "mmap" --> P0
BD -- "mmap" --> P1
BG -- "mmap" --> P2Once the lower mounts are in place, an in-guest agent assembles them as the lowerdirs of an overlayfs:
```mermaid
flowchart LR
L0["/lower/0<br/>read-only EROFS layer"]
L1["/lower/1<br/>read-only EROFS layer"]
L2["/lower/2<br/>read-only EROFS layer"]
U["/upper<br/>writable tmpfs scratch"]
W["/work<br/>(overlayfs internal)"]
M["/merged<br/>workload root"]
L0 -- "lowerdir" --> M
L1 -- "lowerdir" --> M
L2 -- "lowerdir" --> M
U -- "upperdir" --> M
W -. workdir .-> M
```

This is the only place stacking happens, and it happens in the place that already understands how to stack: the guest’s overlayfs implementation. The host has no merged view. The host has a content-addressed bag of EROFS blobs.
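Expressed as the mount the agent would issue (a sketch; overlayfs reads lowerdir left to right as top to bottom, so the topmost OCI layer comes first):

```typescript
import { execFileSync } from 'node:child_process';

// Leftmost lowerdir is the top of the stack, so the topmost OCI layer goes first.
const lowerdir = ['/lower/2', '/lower/1', '/lower/0'].join(':');

execFileSync('mount', [
  '-t', 'overlay', 'overlay',
  '-o', `lowerdir=${lowerdir},upperdir=/upper,workdir=/work`,
  '/merged',
]);
```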
The upper layer is tmpfs by default, which makes the sandbox ephemeral by default: the workload’s writes vanish when the sandbox stops. For RL rollouts, eval runs, and one-shot agent invocations, that’s the right shape. For longer-lived sessions or “resume this sandbox tomorrow” semantics, the upper can be backed by a host-mounted virtio-fs share instead of tmpfs, in which case writes persist on the host filesystem. The choice is per-sandbox, not built into the storage layer; the layer cache is durable in either case, but the workload’s writable state isn’t unless you ask for it to be.
What “Zero-Copy” Means Here
The title says “zero-copy,” so it is worth being precise about what is and is not being copied.
Trace one byte: the first byte of /usr/lib/python3.12/os.py inside the running sandbox. Follow it from where it lives on the host to where Python’s parser reads it.
With virtio-pmem and DAX:
1. The host opens the EROFS blob containing that layer and mmaps it. Those user-virtual addresses are backed by host physical pages holding the file’s contents.
2. The VMM publishes that mapping as the backing memory of the virtio-pmem device. The hypervisor’s GPA→host-PA translation now points the device’s guest physical address range at the same host physical pages.
3. The guest mounts the EROFS image with DAX. When the guest kernel resolves /usr/lib/python3.12/os.py, the lookup returns a guest physical address inside the pmem range.
4. The guest mmaps the file into the Python process. The guest page-table entry for that user-space virtual address points directly at the guest PA from step 3.
5. When Python reads the byte, the CPU walks the guest’s VA→GPA mapping, then the hypervisor’s GPA→host-PA mapping, then issues a load.
One mmap on the host. One page-table entry in the guest. The byte is never copied; it is read in place through two layers of address translation that have to happen for any guest memory access at all.
Without DAX, the same byte takes a longer route. The guest issues an I/O request (virtio-blk, or virtio-fs to a host-side virtiofsd); the host services it and the bytes land in a guest I/O buffer; the guest block driver places them into the guest page cache, a separate allocation in guest memory; and a user-space mmap or read either points into that page cache or copies again. Two or three copies plus a duplicate allocation, where the DAX path has zero of either.
Without DAX (virtio-blk or virtio-fs):
```mermaid
flowchart LR
H1["host file<br/>(or merged tree)"] -- "I/O request" --> GIO["guest I/O buffer"]
GIO -- "copy" --> GPC["guest page cache"]
GPC -- "map or copy" --> UM1["user-space read"]With DAX (Kalahari):
```mermaid
flowchart LR
H2["host EROFS blob<br/>(one mmap)"] -- "same host pages" --> PMEM["virtio-pmem<br/>backing memory"]
PMEM -- "page-table entry" --> UM2["user-space read"]That difference is what the title points at: the same host RAM page that holds the EROFS blob is the page the guest workload reads. The page-aligned block size from the previous section is the precondition that lets step 4 above hold; the single host mmap from “virtio-pmem Is the Hand-off” is the host-side prerequisite. With both in place, one EROFS layer becomes one mapping the guest’s CPU walks directly.
One Pmem Device, Many Layers
The “separate pmem-backed device per layer” picture works for small images. It stops working when the image has thirty layers, or when a VM hosts several images at once.
virtio-pmem device counts are limited per VM. Each device adds setup cost on the host and a separate virtio-mmio entry the guest must wire up at boot. A VM with thirty pmem devices has thirty MMIO devices to register, thirty entries in the guest’s NVDIMM table, thirty mount calls in the agent. None of that scales gracefully.
Kalahari solves this by packing layers. Multiple EROFS layer images are laid out in a single contiguous pmem region on the host. The VM gets one virtio-pmem device whose backing memory contains every layer’s bytes in order, with page-aligned ranges so device-mapper can carve them cleanly. Layer ordering stays in the guest’s overlay mount order, not in the pmem device itself.
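The packing arithmetic fits in a sketch-sized loop. The names are illustrative; the 4 KiB constant is the page-aligned block size from earlier:

```typescript
const PAGE = 4096; // EROFS block size == guest page size

interface Range { offset: number; len: number; }

// Lay layers back to back, rounding each start up to a page boundary so the
// guest can carve them with dm-linear and DAX mappings stay page-aligned.
function packLayers(sizes: number[]): Range[] {
  const ranges: Range[] = [];
  let cursor = 0;
  for (const len of sizes) {
    ranges.push({ offset: cursor, len });
    cursor += Math.ceil(len / PAGE) * PAGE;
  }
  return ranges;
}
```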
The carving happens in the guest. The agent uses Linux device-mapper to create a dm-linear target per layer, mapping a specific byte range of /dev/pmem0 to a virtual block device. Each dm-linear device looks, from EROFS’s point of view, like a private block device that holds exactly one image. EROFS mounts on top of /dev/mapper/layer-N, not on /dev/pmem0 directly.
```mermaid
flowchart LR
subgraph host["Host"]
REG["[ layer 1 | layer 2 | layer 3 ]<br/>single mmap'd region"]
end
subgraph guest["Guest"]
PMEM["/dev/pmem0"]
DM0["dm-linear<br/>off=0, len=L1"]
DM1["dm-linear<br/>off=O1, len=L2"]
DM2["dm-linear<br/>off=O2, len=L3"]
L0["/dev/mapper/layer-0<br/>EROFS dax → /lower/0"]
L1["/dev/mapper/layer-1<br/>EROFS dax → /lower/1"]
L2["/dev/mapper/layer-2<br/>EROFS dax → /lower/2"]
PMEM --> DM0 --> L0
PMEM --> DM1 --> L1
PMEM --> DM2 --> L2
end
REG -- "exposed as one virtio-pmem device" --> PMEMThe mappings are constant-time to set up. dm-linear adds a small range lookup rather than a data-copy path; it’s a remapping table the kernel walks once per request to translate logical block addresses to underlying ones. DAX can still work through this path as long as the device-mapper target preserves DAX support and the carved ranges remain page-aligned (the layer offsets within the pmem region are multiples of the EROFS block size, which is the page size).
This is what makes “many layers” tractable. From the guest’s point of view, each layer still has its own block device and its own EROFS mount. From the host’s, it’s one mapping, one virtio-pmem device, one virtio-mmio entry on the cmdline.
Why Not Flatten and Skip the Stack
The serious counter-argument for flattening isn’t “save a mount.” It’s that overlayfs adds per-lookup cost across N lowerdirs, and that hurts cold-cache metadata-heavy workloads. Every stat, every open, every readdir walks the lower stack from top to bottom looking for a hit. With 25 lowerdirs and a workload that opens thousands of files (Python startup, npm install, importing a Go module cache), that adds up.
That’s the rebuttal that needs an answer. The answer is in the numbers below. metadata_scan_stdlib walks the entire CPython 3.12 standard library, calling os.scandir and entry.stat on every directory entry it finds. It is the cold-cache metadata-heavy workload the argument is about. In the measurements below, Kalahari runs it ~4× faster than the comparison runtime that flattens layers into a single image and serves them through the host’s page cache. The per-lookup overlay cost is real but small, and what dominates is upstream of the overlay: EROFS metadata layout for sequential scan, plus DAX bypassing the guest page cache so each stat reads bytes that are already mapped.
So the flattening trade is “reduce some per-lookup overlay work on a path that’s already fast” against the following costs that a flatten-and-skip-the-stack design would have to absorb (these are properties Kalahari’s per-layer design keeps; a flattened design gives them up):
- Layer caching by digest. Under a flattened design the registry-published SHA of each layer would no longer be a cache key for anything stored on disk; the cache key would shift to the manifest digest of the merged result. A new image that differs only in its top layer would then trigger a full rebuild instead of reusing the lower-layer blobs.
- Cross-image deduplication. Two images that share a 200 MiB base image flatten into two 200 MiB+ blobs, not one shared one.
- Streaming-only build memory. Flattening with overlay semantics forces you to materialize the merged view somewhere. That somewhere is either disk (slow, lots of I/O) or memory (doesn’t work for big images).
- Whiteout responsibility. The host has to take it on, with all the edge cases that overlayfs has already worked out.
There’s a real ceiling on the per-layer approach: overlayfs has a hard limit on lowerdir count, set by the kernel. That limit has commonly been 128 on older kernels and higher on newer kernels. Real OCI images sit at 5–25 layers, so this is comfortable for almost everything. It can become a problem when a single VM composes several images at once or stacks per-session “agent layers” on top of a base. Kalahari’s importer doesn’t add its own cap on top of the kernel’s; if a VM ends up with more lowerdirs than the kernel allows, the mount(2) call in the guest agent fails and the sandbox boot fails with that error. The stack does not silently truncate.
Hardlinks: Yes Within a Layer, No Across
EROFS supports hardlinks, and Kalahari’s builder uses them within a single layer. When a tar entry is marked as a hardlink, the second occurrence becomes an inode that points to the first. This matches what the original image’s tar layer expresses.
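A sketch of the bookkeeping involved, with illustrative names (tar hardlink entries carry the path of the entry they link to):

```typescript
// Map each emitted path to its inode number so a later hardlink entry can
// reuse the same inode instead of allocating a new one.
const inodeByPath = new Map<string, number>();
let nextIno = 1;

function inodeFor(path: string, hardlinkTarget: string | null): number {
  if (hardlinkTarget !== null) {
    const existing = inodeByPath.get(hardlinkTarget);
    if (existing !== undefined) {
      inodeByPath.set(path, existing); // second occurrence points at the first inode
      return existing;
    }
  }
  const ino = nextIno++;
  inodeByPath.set(path, ino);
  return ino;
}
```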
Kalahari does not deduplicate identical files across different EROFS images. If two layers from two different images happen to contain a byte-identical 50 MiB file, that file appears twice on disk.
This is intentional. Cross-layer dedup would require host-side coordination across image imports: hashing every file in every layer, maintaining a dedup index, rewriting EROFS images to point into a shared content blob. The whole point of layer-level caching is that layer identity is the cache key. The registry already knows which layers two images share, by SHA. If you want shared bytes between two images, share a layer; that is the scheme the OCI specification gives you.
Within a layer, a tar’s hardlink semantics are preserved exactly. Across layers, the unit of dedup is the layer itself. That keeps the storage layer simple enough to reason about.
How This Compares to Kata Containers
Kata Containers is the closest peer in spirit: OCI workloads inside lightweight VMs, an in-guest agent, and several of the mechanisms in this post. The differences are about which role each mechanism plays.
Kata uses virtio-pmem with DAX for the guest VM rootfs: the kernel and userland the agent itself runs on. There is a documented path for building that rootfs as EROFS for compactness. The same mechanism does not, by default, carry container image layers; layers are typically assembled on the host by containerd’s overlay snapshotter and shared into the guest as a virtio-fs mount through virtiofsd. Newer paths move that work into the guest: Kata + Nydus uses RAFS-over-EROFS-over-fscache, and containerd 2.1 ships an EROFS snapshotter that produces one EROFS blob per layer on the host (currently exposed via virtio-fs, with virtio-blk proposed).
Lined up against Kalahari:
- Per-layer EROFS on the host. Same idea as Kata’s EROFS snapshotter path: layer-as-cache-key, content-addressed.
- Guest transport. Kata’s per-layer EROFS work carries blobs over virtio-fs (or proposed virtio-blk). Kalahari carries them over virtio-pmem with DAX, the same mechanism Kata reserves for the guest OS image, applied to layer blobs themselves. The guest reads layer bytes by faulting against host-mapped memory, not by going through a virtio request path or a guest page cache.
- Page-aligned EROFS block size and dm-linear carving. Kata does not pin EROFS block size to the guest page size, and mounts each layer as its own device. Kalahari does both, which is what lets DAX install direct guest page-table mappings and what keeps the per-layer-with-DAX approach tractable past virtio-pmem device-count limits.
- Where overlay merging happens. Kata’s default path merges on the host; the Nydus and EROFS-snapshotter paths merge in the guest. Kalahari merges in the guest and never has a merged view on the host.
Mainline Kata, with the EROFS snapshotter, is converging on the same per-layer EROFS shape Kalahari has. What does not exist there is the virtio-pmem-with-DAX hand-off for layer blobs and the packing-plus-dm-linear-carving that scales it.
Numbers
The shape of these decisions shows up at runtime. Kalahari’s filesystem stack (page-aligned EROFS over virtio-pmem with DAX, guest-side overlayfs) is doing less work per filesystem operation than a runtime that mounts an unpacked tree through a host overlay driver and then through the guest’s page cache.
The runtime measured below is microsandbox, which is unrelated to Kata. Microsandbox takes the flatten-and-share approach this post opens with; nothing in this section measures Kata, and these numbers should not be read as a Kata comparison.
Below is a single-machine comparison of bench_fs.py from the open-source microsandbox benchmark suite, against python:3.12-slim. Hardware: AMD Ryzen 9 9900X (12 cores / 24 threads), 96 GB RAM, KVM. Host kernel: Linux 6.19.11. msb’s bundled guest kernel: Linux 6.12.68 (libkrunfw 5.2.1). Kalahari’s guest kernel: Linux 7.0 with the config described in A Kernel Config You Can Read in One Sitting. The CPU governor is powersave with boost enabled; vCPUs were not pinned, and both runtimes ran on the same host with the same governor and turbo behavior, so any baseline distortion applies symmetrically. Both runtimes are configured with 512 MiB of guest memory (msb’s default, matched on Kalahari via memoryMb: 512) so memory size isn’t a confound. Each cell is the workload median ± one standard deviation across 100 iterations inside a freshly created sandbox; the raw per-iteration timing arrays are preserved in the benchmark artifact.
| Workload | msb 0.4.4 (med ± σ) | Kalahari (med ± σ) | speedup |
|---|---|---|---|
| metadata_scan_stdlib | 6.45 ± 0.75 ms | 1.49 ± 0.02 ms | 4.34× |
| read_all_py_stdlib | 10.91 ± 0.63 ms | 4.24 ± 0.22 ms | 2.57× |
| deep_tree_traverse | 10.33 ± 0.09 ms | 5.25 ± 0.18 ms | 1.97× |
| random_read_stdlib | 1.82 ± 2.20 ms | 0.77 ± 1.01 ms | 2.35× |
| small_file_create_1k | 11.86 ± 1.32 ms | 6.11 ± 0.59 ms | 1.94× |
| mid_file_create_100 | 2.76 ± 0.55 ms | 0.96 ± 0.24 ms | 2.87× |
| seq_write_fsync_16m | 4.51 ± 0.14 ms | 1.04 ± 0.53 ms | 4.32× |
| shm_write_fsync_16m | 4.57 ± 0.22 ms | 1.02 ± 0.08 ms | 4.47× |
| seq_read_16m | 1.22 ± 0.05 ms | 0.53 ± 0.03 ms | 2.28× |
| mmap_read_16m | 1.48 ± 0.07 ms | 0.54 ± 0.05 ms | 2.72× |
| file_delete_1k | 3.69 ± 0.39 ms | 1.14 ± 0.22 ms | 3.24× |
| rename_1k | 3.95 ± 0.09 ms | 1.55 ± 0.13 ms | 2.55× |
| mixed_read_write | 10.07 ± 0.30 ms | 4.87 ± 0.41 ms | 2.07× |
| concurrent_read_4t | 9.44 ± 0.45 ms | 3.65 ± 0.24 ms | 2.59× |
Most cells have σ well under 10% of the median, with two known fat-tail workloads: random_read_stdlib samples 200 files per iteration so occasional cold pages dominate a few runs (symmetric across runtimes; the median speedup still holds), and one outlier iteration on Kalahari’s seq_write_fsync_16m (max 6.34 ms vs median 1.04 ms) inflates that cell’s σ without moving its median.
The largest gaps are on the rootfs read workloads (metadata_scan_stdlib, read_all_py_stdlib) and on the fsync write workloads. The first set is the EROFS-over-virtio-pmem path, where reads do not go through a guest page cache and metadata is laid out for sequential scan. The second set is the in-guest tmpfs write path on top of an overlay upper.
seq_write_fsync_16m and shm_write_fsync_16m land within run-to-run noise of each other on each runtime (msb 4.51 vs 4.57 ms; Kalahari 1.04 vs 1.02 ms). That tracks: both targets are tmpfs in both runtimes, fsync on tmpfs does not flush durable storage, and the workload reduces to a 16 MiB memcpy plus 100 syscalls.
Why is that pair (4.3×–4.5×) so much faster than the matching reads (2.3×–2.7×)? At matched memory, probing both runtimes from inside the guest with the same workload shows:
| Probe | msb | Kalahari |
|---|---|---|
| transparent_hugepage/enabled | madvise | never |
| transparent_hugepage/shmem_enabled (governs tmpfs) | never | never |
| MemTotal (guest) | 528 MiB | 498 MiB |
| 16 MiB anon mmap + memset, no filesystem | 3.49 ms | 1.38 ms |
| 16 MiB write+fsync to /tmp (tmpfs) | 5.08 ms | 1.72 ms |
Both runtimes have shmem THP off, so tmpfs writes don’t benefit from huge pages on either side. The Python process doesn’t madvise(MADV_HUGEPAGE), so msb’s madvise setting doesn’t help it either. Despite that, Kalahari is ≈2.5× faster on the no-filesystem mmap+memset baseline: pure anonymous-page allocation plus memcpy, no storage layer involved at all.
That 2.5× gap on raw memory operations is upstream of anything in this post: it’s a difference in the VMM path and the guest kernel build, not in EROFS or virtio-pmem. It accounts for most of the write-side advantage, since the tmpfs write workloads are dominated by the same anonymous-page allocator path.
The read-side numbers are what this post is actually about. seq_read_16m and mmap_read_16m show ~2.3×–2.7×, which is the storage stack contribution on top of the same memory-bandwidth baseline. metadata_scan_stdlib, read_all_py_stdlib, and the directory-traversal workloads show wider gaps (2× to 4×) because they exercise the metadata layout and the absence of a guest page cache that the per-layer EROFS-over-virtio-pmem-with-DAX path makes possible.
These are filesystem benchmarks and should be read as such. They do not measure VM cold start, agent-perceived I/O, or anything that crosses a network boundary. What they measure is what an in-guest workload sees when it reads, writes, fsyncs, and traverses files on the root filesystem of a freshly-created sandbox. The numbers in the table are the sum of two effects: a VMM/guest-kernel baseline difference that gives Kalahari ≈2.5× on memory-bandwidth-bound work, and a storage-layer difference on top of that for workloads that touch the rootfs.
What Falls Out
Per-layer EROFS over virtio-pmem with DAX, plus guest-side overlayfs, is one decision with several quiet consequences:
- Layer caching is content-addressed by the SHA the registry already publishes.
- Cross-image dedup is automatic at the layer granularity.
- Image build is streaming; multi-gigabyte base layers don’t blow up host memory.
- Whiteout semantics live where they’re already implemented, in overlayfs.
- For mmap and exec paths, file access can be satisfied through page-table mappings against memory the host already mapped once.
The product surface is a small SDK call:
```typescript
const sandbox = await client.createSandbox({
  image: 'docker.io/library/python:3.12-slim',
});
```

Underneath, that one call walks the OCI manifest, resolves layer digests against the local content-addressed store, fetches and converts any missing layers into EROFS in a streaming pipeline, brings up a VM with packed virtio-pmem-backed layer ranges, and lets the in-guest agent compose them into an overlayfs root.
The user sees a sandbox running on a familiar image. The storage layer sees reusable, content-addressed layer blobs.