Two Kernel Bugs on the Way to Zero-Copy OCI Layers

While building Kalahari’s storage stack, described in Toward Zero-Copy OCI Layers, we hit two upstream Linux bugs.

This post is about how each bug was found, which design choice surfaced it, and why neither was caught earlier. The patches have the technical details. The discovery path is the interesting part: both bugs sit at intersections of features that are individually well-tested but rarely combined in production.

Bug 1: fs/dax kernel paging fault on virtio-pmem with altmap

The design choice that surfaced it

In March we implemented PFN altmap mode for the virtio-pmem devices that back EROFS layer blobs. The motivation was OOM avoidance.

struct page metadata for a memory region lives in vmemmap. By default the kernel allocates vmemmap pages from boot RAM, in proportion to the size of the region. The pmem ranges in Kalahari hold packed EROFS layers and can run from hundreds of megabytes for a single base image to tens of gigabytes when a VM hosts several container images at once. With the default layout, vmemmap for those ranges eats a significant fraction of the guest’s tiny boot RAM, and importing a real base image would OOM.
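To put a number on that overhead, here is a rough back-of-the-envelope sketch. It assumes 4 KiB base pages and a 64-byte struct page, which is typical for 64-bit kernels but is not stated in this post; the macro and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions (not from the post): 4 KiB base pages, 64-byte struct page. */
#define ASSUMED_PAGE_SIZE        4096ULL
#define ASSUMED_STRUCT_PAGE_SIZE 64ULL

/* Bytes of vmemmap needed to describe a memory region of region_bytes. */
uint64_t vmemmap_bytes(uint64_t region_bytes)
{
    /* Guard the round-up below against wrapping. */
    assert(region_bytes <= UINT64_MAX - (ASSUMED_PAGE_SIZE - 1));

    uint64_t pages = (region_bytes + ASSUMED_PAGE_SIZE - 1) / ASSUMED_PAGE_SIZE;
    return pages * ASSUMED_STRUCT_PAGE_SIZE;
}
```

Under those assumptions the overhead is 1/64 of the region, so vmemmap_bytes(32ULL << 30) comes out at 512 MiB of metadata for a 32 GiB pmem range, which is more than a small guest may have in boot RAM at all.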

The fix is altmap: write a PFN superblock at offset 4 KiB inside the pmem device, set PFN_MODE_PMEM, and let the kernel store vmemmap inside the pmem region itself instead of consuming boot RAM. The OOM goes away, but the layout invariant changes: only the device’s own PFN range has backed vmemmap pages now. PFNs outside the device range point into unmapped vmemmap.

That layout decision mattered six weeks later.

How it failed

On April 24 we upgraded the guest kernel from 6.12.6 to a mainline snapshot during the 7.0 development window. The first DAX fault on the upgraded kernel crashed the guest:

Unable to handle kernel paging request at virtual address ffff_fdff_bf00_0008 (vmemmap region)
Call trace:
 dax_disassociate_entry.isra.0+0x20/0x50
 dax_iomap_pte_fault
 dax_iomap_fault
 erofs_dax_fault

The fault address was inside the vmemmap region but outside the pmem device’s PFN range. The crash was reachable from the truncate and invalidate paths and from the PMD-downgrade branch of dax_iomap_pte_fault, all of which free a DAX entry. The downgrade case happens routinely as EROFS files get unmapped; on altmap layouts, every such unmap was now a guest crash.

Where the regression came from

The crash did not occur on 6.12.6, so we bisected against mainline. The culprit was commit 98c183a4fccf (“fs/dax: don’t disassociate zero page entries”), which added zero/empty-entry early returns to dax_associate_entry() and dax_disassociate_entry().

The bug was not that empty or zero DAX entries were mishandled; the bug was that the code converted the entry to a folio before checking whether the entry was meaningful:

static void dax_disassociate_entry(void *entry, ...)
{
    struct folio *folio = dax_to_folio(entry);   /* happens first */

    if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry))
        return;
    ...
}

dax_to_folio(entry) expands to page_folio(pfn_to_page(dax_to_pfn(entry))). page_folio() reads page->compound_head via READ_ONCE. For an empty or zero XA value, dax_to_pfn(entry) extracts a bogus PFN, pfn_to_page() indexes into vmemmap with it, and page_folio() dereferences whatever sits at that vmemmap offset.

What that dereference touches depends on the vmemmap layout. On a system where vmemmap covers all of RAM, it reads a mapped (if meaningless) page, and the early return on the next line discards the result. On altmap pmem, only the device’s PFN range has backed vmemmap pages. A bogus PFN can land in unmapped vmemmap, and the dereference traps.
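The ordering problem can be reproduced in a small userspace model. This is only a sketch: the flag values and shift loosely mirror the fs/dax entry encoding but should be treated as assumptions, and every model_* name is hypothetical. The "vmemmap" here is just a PFN range check that counts reads landing outside the device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative entry encoding: flag bits low, PFN above them (assumed values). */
#define MODEL_DAX_SHIFT     4
#define MODEL_DAX_ZERO_PAGE (1ULL << 2)
#define MODEL_DAX_EMPTY     (1ULL << 3)

/* Hypothetical altmap layout: only [dev_start, dev_end) has backed vmemmap. */
struct vmemmap_model {
    uint64_t dev_start;
    uint64_t dev_end;
    unsigned int unbacked_reads; /* each of these would be a guest crash */
};

uint64_t model_entry_to_pfn(uint64_t entry)
{
    return entry >> MODEL_DAX_SHIFT;
}

bool model_entry_is_zero_or_empty(uint64_t entry)
{
    return (entry & (MODEL_DAX_ZERO_PAGE | MODEL_DAX_EMPTY)) != 0;
}

/* Stand-in for page_folio(pfn_to_page(pfn)): records one read of vmemmap. */
void model_read_struct_page(struct vmemmap_model *m, uint64_t pfn)
{
    if (pfn < m->dev_start || pfn >= m->dev_end)
        m->unbacked_reads++;
}

/* Buggy order: convert first, check second. */
void model_disassociate_buggy(struct vmemmap_model *m, uint64_t entry)
{
    model_read_struct_page(m, model_entry_to_pfn(entry));
    if (model_entry_is_zero_or_empty(entry))
        return;
    /* ... the folio would be released here ... */
}

/* Fixed order: check first, convert only for meaningful entries. */
void model_disassociate_fixed(struct vmemmap_model *m, uint64_t entry)
{
    if (model_entry_is_zero_or_empty(entry))
        return;
    model_read_struct_page(m, model_entry_to_pfn(entry));
    /* ... the folio would be released here ... */
}
```

Feeding an empty entry to the buggy variant bumps unbacked_reads, because the bogus PFN 0 lies outside the device range; the fixed variant returns before touching the model vmemmap at all.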

dax_busy_page() has the identical pattern and was also affected. The original commit that added the early returns did not touch it.

The fix

Move the dax_to_folio() call after the zero/empty guard in both dax_disassociate_entry() and dax_busy_page():

 static void dax_disassociate_entry(void *entry, ...)
 {
-    struct folio *folio = dax_to_folio(entry);
+    struct folio *folio;

     if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry))
         return;

+    folio = dax_to_folio(entry);
     dax_folio_put(folio);
 }

Four lines. The patch is on lore.kernel.org with Cc: stable@vger.kernel.org # v6.15+, since 98c183a4fccf first reached v6.15.

Why it was missed

DAX is mature on real persistent memory. virtio-pmem is mature without altmap. EROFS is mature without DAX. The combination, and specifically the altmap-on-virtual-pmem layout where vmemmap has unmapped gaps, is what exposes the bogus-PFN dereference. EROFS files are also natural triggers for the unmap path during teardown: the truncate and invalidate paths fire whenever DAX entries are released as files get evicted from the cache.

A related ARM64/KVM issue surfaced during the same upgrade, but it was not an upstream fs/dax bug. The trap shape involved is specific to ARM64, so a paragraph of background first.

When a guest VM faults on memory the hypervisor needs to handle (typically emulated MMIO), the ARM64 CPU delivers a data abort to the hypervisor along with a syndrome register, ESR_EL2. One bit in that syndrome is “ISV” (Instruction Syndrome Valid). When ISV is set, the syndrome includes decoded information about the trapping load or store: access size, target register, sign-extension. KVM, or userspace, can emulate the access from that alone, without re-fetching and decoding the guest instruction. When ISV is clear (“NISV”, Non-ISV), the syndrome has no decoded information, and KVM cannot emulate the access without doing the instruction-fetch-and-decode itself. By default KVM treats NISV as something it cannot safely handle and returns -ENOSYS from KVM_RUN. Userspace can opt into KVM_CAP_ARM_NISV_TO_USER, after which NISV traps surface as a KVM_EXIT_ARM_NISV exit and the VMM decides what to do.

ARM64 cache-maintenance instructions, such as IC IVAU (instruction-cache invalidate by virtual address), always trap as NISV: the architecture defines no ISV decoding for them, because they are not loads or stores in the conventional sense. They also architecturally count as writes for permission-checking purposes, since they can have store-like side effects (flushing dirty data, invalidating coherence).
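As a rough illustration of the two facts above, the sketch below classifies a data-abort syndrome. The bit positions (ISV at bit 24, CM at bit 8, WnR at bit 6) follow my reading of the ESR_ELx data-abort ISS layout and should be checked against the architecture manual; the SYN_* macros, the enum, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Data-abort ISS bits (believed layout; verify before relying on it). */
#define SYN_ISV (1ULL << 24) /* instruction syndrome valid */
#define SYN_CM  (1ULL << 8)  /* fault came from cache maintenance */
#define SYN_WNR (1ULL << 6)  /* write, not read */

enum abort_kind {
    ABORT_DECODED_ACCESS,    /* ISV set: size/register info is available */
    ABORT_CACHE_MAINTENANCE, /* NISV caused by a cache-maintenance op */
    ABORT_UNDECODED_OTHER,   /* NISV for some other instruction */
};

enum abort_kind classify_data_abort(uint64_t iss)
{
    if (iss & SYN_ISV)
        return ABORT_DECODED_ACCESS;
    if (iss & SYN_CM)
        return ABORT_CACHE_MAINTENANCE;
    return ABORT_UNDECODED_OTHER;
}

/* Cache-maintenance faults report as writes for permission checks. */
bool abort_counts_as_write(uint64_t iss)
{
    return (iss & SYN_WNR) != 0;
}
```

A syndrome with CM and WnR set and ISV clear is the shape described here: a cache-maintenance trap that looks like a write and carries no decoded access information.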

With those two facts in place the bug chain is short. Another upstream fs/dax commit picked up in the same upgrade, 38607c62b34b (“fs/dax: properly refcount fs dax pages”), routes DAX PTEs through insert_page() instead of insert_pfn(). The latter produced PFN-only “special” PTEs; the former produces ordinary PTEs. Ordinary PTEs on ARM64 go through __sync_icache_dcache() when first made executable, which issues IC IVAU against the page to keep the instruction cache coherent with whatever was just written there. In Kalahari, EROFS layer bytes are mapped as a readonly KVM memslot, because the guest must not modify them. The first time the guest executes code from an EROFS-backed mapping, the kernel issues IC IVAU against that page; ARM treats it as a write to readonly memory and traps to KVM with NISV; KVM returns -ENOSYS; the VM is dead.

This one is fixed in the VMM, not as an upstream patch. Enable KVM_CAP_ARM_NISV_TO_USER so the trap surfaces as KVM_EXIT_ARM_NISV instead of -ENOSYS. When the exit’s fault address is inside an EROFS readonly memslot, skip the offending instruction by advancing PC by four. The cache-maintenance op is a no-op for our purposes anyway: the guest cannot actually write that memory, so there is nothing dirty to flush. This matches what KVM itself does internally for cache maintenance against non-memslot regions, via the kvm_vcpu_dabt_is_cm path; the userspace handler is extending the same treatment to readonly memslots.
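The decision the VMM makes on such an exit can be sketched as plain logic, independent of the KVM ioctl plumbing. Everything here is an assumption for illustration: struct ro_slot stands in for whatever bookkeeping the VMM keeps for its readonly memslots, and in a real handler the fault address would come from the KVM_EXIT_ARM_NISV exit data and the PC would be read and written through the vCPU register interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one readonly guest-physical range. */
struct ro_slot {
    uint64_t gpa_start;
    uint64_t len;
};

bool ipa_in_readonly_slot(const struct ro_slot *slots, size_t n, uint64_t ipa)
{
    for (size_t i = 0; i < n; i++) {
        /* Written as a subtraction so start + len cannot overflow. */
        if (ipa >= slots[i].gpa_start && ipa - slots[i].gpa_start < slots[i].len)
            return true;
    }
    return false;
}

/*
 * Returns true and advances *pc past the trapping instruction when the
 * NISV fault hit a readonly slot; returns false and leaves *pc untouched
 * otherwise, so the caller can treat the exit as fatal.
 */
bool handle_nisv_exit(const struct ro_slot *slots, size_t n,
                      uint64_t fault_ipa, uint64_t *pc)
{
    if (!ipa_in_readonly_slot(slots, n, fault_ipa))
        return false;
    *pc += 4; /* A64 instructions are a fixed four bytes */
    return true;
}
```

Keeping the range check separate from the PC update makes the policy easy to test without a running guest, and leaves any NISV exit outside the readonly slots as an error rather than silently skipping it.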

Bug 2: overlayfs ESTALE on tmpfile copy-up over a virtiofs upper

A note on overlayfs terminology first. An overlayfs mount stacks one or more read-only lower layers under a single writable upper layer. Reads see the merged view of all layers; writes go to the upper; deletions are recorded as whiteout markers in the upper, so a file from a lower layer can appear “removed” without the lower itself being modified. Overlayfs also keeps a hidden workdir, a directory on the same filesystem as the upper, which it uses to stage atomic operations like rename-with-replace and to materialize copy-up tmpfiles before they are linked into place. In Kalahari, the lowers are the per-layer EROFS images, and the upper is whatever filesystem holds the workload’s writes during its session.
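For readers who have not set one up by hand, the three roles map directly onto the overlayfs mount options. The helper below is a minimal sketch that only builds the option string; the function name and any paths passed to it are hypothetical and not Kalahari's actual layout.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Builds "lowerdir=...,upperdir=...,workdir=..." into buf.
 * lowers is a colon-separated list of lower layers, topmost first.
 * Returns 0 on success, -1 if the string did not fit.
 */
int build_overlay_opts(char *buf, size_t size, const char *lowers,
                       const char *upper, const char *work)
{
    int n = snprintf(buf, size, "lowerdir=%s,upperdir=%s,workdir=%s",
                     lowers, upper, work);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return 0;
}
```

The resulting string is what would be handed to mount(2) as the data argument with filesystem type "overlay". The upperdir and workdir must live on the same filesystem, which is why a virtiofs upper implies a virtiofs workdir too.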

The design choice that surfaced the overlayfs bug

On April 15 we added a virtiofs-backed overlay upper to the VMM. The motivation, as described at the end of Toward Zero-Copy OCI Layers, is that the writable upper of the workload’s overlayfs root is normally tmpfs, which makes the sandbox ephemeral. For sandboxes that need to persist writes (“resume this sandbox tomorrow”), the upper can instead be backed by a host-mounted virtiofs share, so writes land on the host filesystem.

How overlayfs failed

Two days later, when we ran dpkg-style workloads inside such a sandbox, every install failed:

dpkg: error processing archive ...:
 unable to install new file '...': Stale file handle

The same image with a tmpfs or local-fs upper worked fine.

Where the overlayfs regression came from

Overlayfs linked the tmpfile into the right place, then accidentally kept tracking the pre-link disconnected dentry.

Overlayfs has two copy-up paths. The “tmpfile” path, used when the upper supports O_TMPFILE, creates an unnamed tmpfile in the upper, populates it, and links it into place under the destination name. The “workdir” path uses an explicit working directory. The tmpfile path is the fast one and is taken whenever the upper advertises tmpfile support, including virtiofs and FUSE.

vfs_tmpfile() allocates the tmpfile dentry with d_alloc(parentpath->dentry, &slash_name), so the dentry’s d_name is "/" and its d_parent is the workdir. Local upper filesystems (ext4, btrfs, xfs, tmpfs) immediately rename it to #<inum> inside their own ->tmpfile() op via d_mark_tmpfile(). virtiofs and FUSE do not. The tmpfile dentry stays named "/" with d_parent = workdir.

That is fine while the tmpfile stays unnamed. It breaks when overlayfs publishes that dentry as the file’s permanent upper:

ovl_inode_update(d_inode(c->dentry), dget(temp));

temp is the disconnected O_TMPFILE dentry, named "/" and parented to the workdir. A few lines earlier, ovl_do_link() had linked it into the destination directory under the right name, and the link operation returned a separate dentry, upper, with the correct parent and name. upper is the dentry overlayfs should publish; the code publishes temp and drops upper on the floor.

When the upper filesystem implements ->d_revalidate(), as virtiofs and FUSE do, ovl_revalidate_real() later calls it with the dentry’s parent inode and a snapshot of d_name. The server tries to look up "/" inside the workdir, fails, and overlayfs returns -ESTALE to userspace. Every subsequent op against the copied-up file hits the same revalidation. The error is permanent for the lifetime of the inode, which is why dpkg’s rename-over-existing breaks immediately.
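The difference between the two dentries is easy to see in a toy model. This is a userspace sketch with hypothetical model_* types, not kernel code: a "directory" is just the list of names the server knows about, and revalidation is a lookup of the dentry's name in its parent.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* What the file server believes a directory contains. */
struct model_dir {
    const char *names[4];
    int count;
};

/* A cached (name, parent) pair, as the guest sees it. */
struct model_dentry {
    const char *name;
    const struct model_dir *parent;
};

bool model_server_lookup(const struct model_dir *dir, const char *name)
{
    for (int i = 0; i < dir->count; i++) {
        if (strcmp(dir->names[i], name) == 0)
            return true;
    }
    return false;
}

/* Revalidate by asking the server for name inside parent. */
int model_revalidate(const struct model_dentry *d)
{
    return model_server_lookup(d->parent, d->name) ? 0 : -ESTALE;
}
```

With an empty workdir and a destination directory that contains the linked name, revalidating the tmpfile dentry ("/" under the workdir) yields -ESTALE, while revalidating the linked dentry succeeds. That is the whole bug in miniature: the inode is fine, but the published dentry describes a path the server has never heard of.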

Bisecting in overlayfs history pointed at commit 6b52243f633e (“ovl: fold copy-up helpers into callers”). Before the fold, the tmpfile copy-up path used a dedicated helper ovl_link_tmpfile() that captured the linked destination dentry returned by ovl_do_link():

err = ovl_do_link(temp, udir, upper);
...
if (!err)
    *newdentry = dget(upper);

and the caller published newdentry via ovl_inode_update(). The fold inlined ovl_link_tmpfile() into ovl_copy_up_tmpfile() but lost a step along the way: the dget(upper) capture went away, and the publish line was rewritten as ovl_inode_update(d_inode(c->dentry), dget(temp)). The fold itself was mechanical; the regression came from the rewrite that replaced newdentry with temp.

The overlayfs fix

Restore the dget(upper) capture in the success branch after the ovl_do_link() call, and publish that dentry via ovl_inode_update() instead of the tmpfile dentry:

     err = ovl_do_link(ofs, temp, udir, upper);
+    if (!err) {
+        /*
+         * Record the linked dentry, not the disconnected
+         * O_TMPFILE dentry, so that ->d_revalidate() on
+         * the upper fs sees the real parent/name.
+         */
+        newdentry = dget(upper);
+    }
     ...
-    ovl_inode_update(d_inode(c->dentry), dget(temp));
+    ovl_inode_update(d_inode(c->dentry), newdentry);

The patch is on lore.kernel.org with Cc: stable@vger.kernel.org # v4.20+, since 6b52243f633e dates back to v4.20 and the bug is present in every stable tree since.

Why the overlayfs bug was missed

The bug is seven years old. It hid for that long because every common upper filesystem (ext4, btrfs, xfs, tmpfs) calls d_mark_tmpfile() from inside ->tmpfile(), which renames the disconnected dentry to #<inum> under workdir before overlayfs publishes it. A workdir-relative lookup of #<inum> happens to succeed for them, even though it is not what overlayfs intended. virtiofs and FUSE do not call d_mark_tmpfile(), and they implement ->d_revalidate(), so the disconnected dentry hits the path that exposes the bug. A workload like dpkg, which does rename-over-existing under copy-up, is a clean trigger, but the overlayfs deployments that can hit it (a virtiofs or FUSE upper rather than a local filesystem) are rare in practice.

Lightweight VMs running OCI workloads on virtio-pmem with overlayfs in the guest, plus an optional virtiofs upper for persistent sandboxes, land squarely in both intersections. That is the configuration this kind of work needs.

Patches

[PATCH] fs/dax: check for empty/zero entries before calling pfn_to_page() (lore)
[PATCH] ovl: use linked upper dentry in copy-up tmpfile (lore)