A Kernel Config You Can Read in One Sitting

How Kalahari's guest kernel boots in milliseconds: starting from allnoconfig and turning on only the devices the microVM's VMM actually exposes.

kalahari linux virtualization developer-tools

A general-purpose distro kernel is built to boot on a laptop, a server, a cloud instance, and hardware its authors have never seen. It probes ACPI tables, walks PCI buses, initializes RTC drivers, waits on entropy, and considers a few hundred filesystem types in case one of them shows up. That’s the right design for “kernel that runs on anything.”

It is the wrong design for a microVM whose hardware is a fixed list of paravirtualized devices that the VMM hands the guest at boot.

Kalahari’s guest kernel is built from a single Makefile that starts from allnoconfig and adds back only what the microVM is going to use. The whole config is short enough to read in one sitting. This post walks through it: what’s off, what’s on, and which decisions buy boot time.

The Baseline: allnoconfig, Not defconfig

Most kernel configs start from defconfig, which already has hundreds of options enabled, and then turn things off. We do the opposite. The build starts from allnoconfig, which leaves every CONFIG_* option set to n, then explicitly enables only what’s needed.

$(MAKE) -C $(KERNEL_SRC_DIR) O=$(KERNEL_BUILD_DIR) $(KMAKE_ARGS) \
    KCONFIG_ALLCONFIG=$(KERNEL_BUILD_DIR)/allno.config allnoconfig
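The KCONFIG_ALLCONFIG variable in that invocation is a stock kbuild feature: when an allnoconfig-style target runs, symbols listed in the pointed-to file are forced to the given values instead of n, before any further edits. The contents of Kalahari's allno.config aren't shown here, so the fragment below is purely illustrative of the shape such a seed file takes:

```
# allno.config (hypothetical contents): symbols forced on even under
# allnoconfig, because everything else in the build depends on them.
CONFIG_64BIT=y
```

Everything beyond that seed is applied as explicit --enable/--disable edits, which is what the rest of this post walks through.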

The advantage is not just a smaller kernel. It’s that every line in the config file is intentional. Nothing is on because some defconfig author thought it was useful for a 2014 server motherboard. If a feature is enabled, you can grep for it in the Makefile and find the call that turned it on.

That changes how you reason about the kernel. The default question is no longer “do we still need this?” It is “do we need this yet?”

What’s Off

The interesting part of an allnoconfig-based microVM kernel isn’t the list of things you turn on. It’s the list of things you don’t:

--disable MODULES        # No runtime module loading
--disable ACPI           # No firmware tables to parse
--disable PCI            # No bus to enumerate
--disable USB_SUPPORT    # No USB stack
--disable WIRELESS
--disable WLAN
--disable SOUND
--disable SECURITY       # No LSM stack inside the VM
--disable AUDIT
--disable DEBUG_INFO

On x86 specifically:

--disable RTC_HCTOSYS    # Don't read the RTC at boot
--disable RTC_DRV_CMOS
--disable ACPI
--disable PCI

Each of these is a small story:

  • MODULES=n kills the entire kmod machinery. Nothing is loaded at runtime; everything that ships in the kernel is built in. The cost is that we can’t ship a hot-fix as a .ko. The benefit is that boot does not pause to load modules, the userland init does not need depmod or modprobe, and there is no surface area for a workload to insert a new module into the running kernel.
  • ACPI=n removes the kernel’s ACPI table parsing path. On a microVM, the VMM tells the kernel about devices through other channels (cmdline for virtio-mmio, the device tree on arm64). ACPI is dead weight.
  • PCI=n removes PCI bus enumeration. A microVM has no PCI bus, so enumeration would only spend time proving that no devices are there.
  • USB_SUPPORT=n, WIRELESS=n, WLAN=n, SOUND=n are the easy ones. There is no USB, Wi-Fi, or sound card.
  • RTC_HCTOSYS=n and RTC_DRV_CMOS=n on x86 say “do not read the CMOS RTC at boot to set the system clock.” The VMM hands the guest paravirt time through kvm-clock, which is faster and doesn’t require a port-I/O sequence to read.
  • SECURITY=n and AUDIT=n are the controversial-looking ones. Kalahari does not run SELinux, AppArmor, or the kernel audit subsystem inside the guest. In this design, the VM boundary is the primary security boundary; host-side isolation, the VMM, and seccomp policy express the constraints we care about. Adding an in-guest LSM would add boot work and policy surface without improving that boundary for our target workloads. This is a design statement, not a general-purpose default.

The numbers add up. Every probe you don’t do, every subsystem you don’t __init, every kernel thread you don’t spawn at boot is time the workload doesn’t wait for.

What’s On (And Why)

The rest of the config is the minimum set the microVM actually uses. The options fall into a few groups.

Virtio Transport: cmdline, Not Discovery

--enable VIRTIO --enable VIRTIO_MENU --enable VIRTIO_MMIO
--enable VIRTIO_MMIO_CMDLINE_DEVICES
--enable VIRTIO_BLK
--enable VIRTIO_CONSOLE
--enable HW_RANDOM_VIRTIO
--enable VIRTIO_BALLOON
--enable VIRTIO_MEM
--enable VIRTIO_FS
--enable VIRTIO_PMEM
--enable VIRTIO_NET
--enable VIRTIO_VSOCKETS
--enable VIRTIO_INPUT

VIRTIO_MMIO is the transport: virtio devices appear as memory-mapped registers at fixed addresses. The classic alternative is VIRTIO_PCI, where the kernel walks a PCI bus to find virtio devices. We don’t have a PCI bus. With VIRTIO_MMIO_CMDLINE_DEVICES, the VMM tells the kernel about each virtio device as a kernel command-line argument:

virtio_mmio.device=4K@0xd0000000:5

The kernel parses one line per device. There is no bus to walk, no enumeration to do, and no probing of nonexistent slots. In this kernel, that makes device setup cheap and predictable.
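On the other side of that contract, the VMM has to emit one such argument per device. A hypothetical host-side helper (the function name and addresses are illustrative; the virtio_mmio.device=&lt;size&gt;@&lt;base&gt;:&lt;irq&gt; format is the kernel's own) might look like:

```shell
# Build one virtio-mmio command-line argument per exposed device.
# Fields: MMIO window size, guest-physical base address, IRQ number.
vmmio_arg() {
    local size=$1 base=$2 irq=$3
    printf 'virtio_mmio.device=%s@%s:%s\n' "$size" "$base" "$irq"
}

# e.g. a block device at 0xd0000000 and a vsock device one page up:
vmmio_arg 4K 0xd0000000 5
vmmio_arg 4K 0xd0001000 6
```

The guest kernel then instantiates exactly these devices and nothing else, which is what makes the setup cost linear in the number of devices rather than in the size of a bus.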

HW_RANDOM_VIRTIO is the entropy source. Without it, services that want entropy at startup (TLS, ssh-keygen, glibc’s arc4random) can block while the CRNG initializes. With it, the host hands entropy directly into the guest’s RNG and userland never blocks on entropy at boot.

KVM Paravirt

--enable HYPERVISOR_GUEST --enable KVM_GUEST --enable PARAVIRT

KVM_GUEST switches the kernel into paravirt mode under KVM: paravirt clocks, paravirt spinlocks, paravirt TLB shootdowns. Because the kernel is built for KVM from the start, it does not need to behave like bare metal first and discover the hypervisor later.

The Storage Stack: One EROFS Per Layer

--enable ZONE_DEVICE --enable DAX --enable FS_DAX
--enable LIBNVDIMM --enable BLK_DEV_PMEM --enable VIRTIO_PMEM
--enable MISC_FILESYSTEMS --enable EROFS_FS
--enable MD --enable BLK_DEV_DM --enable DM_LINEAR
--enable OVERLAY_FS --enable OVERLAY_FS_REDIRECT_DIR
--enable EXT4_FS
--enable TMPFS --enable TMPFS_XATTR --enable TMPFS_POSIX_ACL
--enable SHMEM
--enable FUSE_FS

This is the chain behind Kalahari’s image path. The guest can access host page cache through virtio-pmem with DAX, expose one EROFS region per OCI layer using dm-linear, and assemble those layers with overlayfs. The full storage story is in Toward Zero-Copy OCI Layers. For this post, the important point is that pmem, DAX, dm-linear, EROFS, overlayfs, and tmpfs are all built in. There is no module load on the boot path.
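To make the dm-linear step concrete, here is a sketch of the sizing arithmetic, assuming each EROFS layer sits at a known byte offset inside /dev/pmem0 (device name and numbers are illustrative). dmsetup tables count in 512-byte sectors, so byte values are divided down and must be sector-aligned to begin with:

```shell
# Emit the dm-linear table line for one layer:
#   <logical start> <length> linear <backing dev> <offset>, all in sectors.
layer_table() {
    local dev=$1 off_bytes=$2 len_bytes=$3
    echo "0 $((len_bytes / 512)) linear $dev $((off_bytes / 512))"
}

# A 4 MiB layer starting 1 MiB into the pmem device:
layer_table /dev/pmem0 $((1 << 20)) $((4 << 20))
# -> 0 8192 linear /dev/pmem0 2048
#
# The guest's init would then, roughly:
#   dmsetup create layer0 --table "<that line>"
#   mount -t erofs -o dax /dev/mapper/layer0 /layers/0
# and stack the mounted layers with an overlayfs mount.
```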

Virtio-mem for Memory Hotplug

--enable SPARSEMEM_VMEMMAP --enable MEMORY_HOTPLUG --enable MEMORY_HOTREMOVE
--enable MHP_MEMMAP_ON_MEMORY
--enable MEMORY_HOTPLUG_DEFAULT_ONLINE
--enable VIRTIO_MEM

VIRTIO_MEM lets the host grow and shrink the guest’s memory at runtime by plugging and unplugging memory blocks. MHP_MEMMAP_ON_MEMORY puts the memory map for a hotplugged block inside that block, instead of allocating it from existing memory. MEMORY_HOTPLUG_DEFAULT_ONLINE means freshly hotplugged blocks are immediately usable without a userland online step.

The boot-time consequence is that we can boot the guest with a small initial memory size and add memory only if the workload asks for it.
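The granularity matters for that sizing decision: memory is plugged in whole blocks, so a request gets rounded up. A sketch of the arithmetic, assuming 128 MiB memory blocks (a common x86-64 value; the authoritative number is whatever the guest reports in /sys/devices/system/memory/block_size_bytes):

```shell
# Round a requested memory target up to whole hotplug blocks.
BLOCK_BYTES=$((128 << 20))   # assumed block size: 128 MiB

round_to_block() {
    local want=$1
    echo $(( (want + BLOCK_BYTES - 1) / BLOCK_BYTES * BLOCK_BYTES ))
}

# A workload asking for 300 MiB gets three 128 MiB blocks, i.e. 384 MiB.
round_to_block $((300 << 20))
```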

The Workload Surface

--enable NAMESPACES --enable PID_NS --enable NET_NS --enable UTS_NS
--enable IPC_NS --enable USER_NS
--enable CGROUPS --enable CGROUP_SCHED --enable CGROUP_PIDS
--enable CGROUP_DEVICE --enable MEMCG --enable CPUSETS --enable CGROUP_FREEZER
--enable SECCOMP --enable SECCOMP_FILTER
--enable BPF_SYSCALL --enable CGROUP_BPF
--enable FUTEX --enable EPOLL --enable SIGNALFD --enable TIMERFD
--enable MEMFD_CREATE
--enable BINFMT_ELF --enable BINFMT_SCRIPT

These are turned on because the workload, running inside the guest, expects them. seccomp, epoll, futex, namespaces, and cgroups are the surface modern Linux userland builds against. They are not a meaningful boot-time cost; they are simply what any contemporary workload assumes exists.
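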

SECURITY=n does not turn off these primitives. Seccomp filters and cgroup limits do not require an LSM. They’re how the guest’s own init shapes what the workload process can do, independent of any host-side LSM policy.

The Console

# x86
--enable SERIAL_8250 --enable SERIAL_8250_CONSOLE
--set-val SERIAL_8250_NR_UARTS 1
--set-val SERIAL_8250_RUNTIME_UARTS 1

# arm64
--enable SERIAL_AMBA_PL011 --enable SERIAL_AMBA_PL011_CONSOLE

One UART. The 8250 nr_uarts is forced down to 1 instead of the default 4 or 8: the kernel doesn’t probe for UARTs that aren’t there.
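For completeness, these drivers register the standard console device names, so a hypothetical kernel command line selecting the console (the VMM sets this; the fragment is illustrative) would be:

```
console=ttyS0      # x86: the single 8250 UART
console=ttyAMA0    # arm64: the PL011
```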

Architecture Splits

The Makefile is structured so the shared options work on every architecture, and the small per-arch blocks deal with what’s actually different:

x86:

--enable HYPERVISOR_GUEST --enable KVM_GUEST --enable PARAVIRT
--disable RTC_HCTOSYS
--disable RTC_DRV_CMOS
--disable ACPI
--disable PCI

arm64:

--enable ARM64
--enable ARM64_VA_BITS_48
--enable ARCH_VIRT
--enable ARM_GIC_V3
--enable ARM_ARCH_TIMER
--enable ARM_PSCI_FW
--enable RTC_DRV_PL031
--disable ACPI
--disable PCI

PSCI (Power State Coordination Interface) is how the arm64 guest talks to the hypervisor for power management; on SYSTEM_OFF, the VMM intercepts and shuts the VM down. The 48-bit VA layout is the standard arm64 server layout. ARM64_VA_BITS_52 is explicitly disabled because we don’t need that much address space and the larger layout has its own per-syscall costs.

The shared bits dominate. The arch-specific bits are the irreducible differences: which interrupt controller, which timer, which serial port, which paravirt clock.

The Boot Image Itself

ifeq ($(KERNEL_ARCH),arm64)
  KERNEL_IMAGE = $(KERNEL_BUILD_DIR)/arch/arm64/boot/Image
  KERNEL_TARGET := Image
else
  KERNEL_IMAGE = $(KERNEL_BUILD_DIR)/vmlinux
  KERNEL_TARGET := vmlinux
endif

On x86, the VMM loads vmlinux directly. There is no GRUB, no bzImage decompression preamble, no boot loader. The VMM jumps straight to the kernel’s 64-bit entry point. On arm64, the VMM does the same with the flat Image binary.

This is the boot path that papers on fast boot usually call “direct kernel boot.” It is also conspicuously absent from many cloud images, which still take an EFI boot path before reaching the kernel.

The init that runs first is not systemd. It is a single static guest agent that mounts a few filesystems, opens the shared-memory ring buffer to the host, and starts accepting commands. The agent is its own crate; the kernel’s job is to hand it a working environment as quickly as possible.

What Falls Out

Today, the build produces a 20 MB unstripped vmlinux ELF on x86. About 14 MB is loadable text, data, and bss; the rest is metadata. On arm64, the flat Image is 9.1 MB. Those are the artifacts the VMM consumes directly. Both numbers will move around based on kernel version and what you compile in, and neither is the dominant boot-time cost in the steady state. The measurements in Introducing Kalahari show about 117 ms p50 from sandbox-create to a runnable microVM when the image is already cached; the latency budget at that point is the kernel’s own __init work plus the guest agent’s setup, not the I/O of loading the kernel image.

The honest caveat: this kernel will not boot your laptop. If you plug a USB stick into it, nothing happens, because there’s no USB stack. If you put it on a server with NVMe drives, it won’t see them. The whole point is that the hardware list is fixed at build time and matches the VMM’s device model exactly.

If you want to read the whole config, it is the guest-kernel Makefile. You can read it top to bottom, and that is the design goal: a kernel config small enough that every change is reviewable.