How Kalahari Branches VM Memory on macOS with Mach Memory Entries

Why Kalahari uses Mach memory entries, mach_vm_map(copy=TRUE), Mach IPC port descriptors, and registered-port bootstrap to make VM branching portable.

virtualization rust macos mach

The API Kalahari wants to expose is deliberately small:

const zygote = await sandbox.zygote();

const childA = await zygote.spawn();
const childB = await zygote.spawn();
const childC = await zygote.spawn();

A user takes a configured sandbox, freezes it as a zygote, and spawns several independent children. Each child can write to its own filesystem, modify its own memory, and become a zygote in turn for further branching. The implementation has to make all of that real without copying guest RAM each time.

That last requirement is the hard one. Guest RAM is the largest piece of VM state, often gigabytes per VM. If every spawn eagerly copied that memory, the API would be unusable. Branching has to be copy-on-write at the kernel level: the child shares the parent’s pages until it writes, and only then does the kernel allocate a private page for the child.
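To put rough numbers on that, here is a back-of-the-envelope sketch. The figures are illustrative assumptions, not measurements: a 4 GiB guest, 16 KiB pages (the page size on Apple silicon), and a child that dirties 2,000 distinct pages.

```rust
const GIB: u64 = 1 << 30;
const PAGE_SIZE: u64 = 16 * 1024; // assumed: 16 KiB pages, as on Apple silicon

/// Bytes a spawn has to copy up front if guest RAM is duplicated eagerly.
fn eager_copy_bytes(guest_ram: u64) -> u64 {
    guest_ram
}

/// Bytes of private memory a copy-on-write child ends up owning:
/// one page for every distinct page it has written to.
fn cow_private_bytes(dirty_pages: u64, page_size: u64) -> u64 {
    dirty_pages * page_size
}
```

Under those assumptions, eager_copy_bytes(4 * GIB) is the full 4 GiB per spawn, while cow_private_bytes(2_000, PAGE_SIZE) is 32,768,000 bytes, a little over 31 MiB, and that cost is paid only as the child actually writes.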

On Linux, Kalahari can build that abstraction from file-descriptor-named memory plus a copy-on-write backend. On macOS, the answer is something different: a Mach memory entry. This post is about why, and what the implementation has to do to make that primitive behave the way Kalahari needs.

Mach in Brief

macOS has two layers of kernel personality. The BSD layer gives you what looks like Unix: file descriptors, sockets, processes, signals. Underneath that, the Mach kernel exposes its own object model: tasks (close to processes), threads, virtual memory, and ports.

A Mach port is the kernel’s name for a generic capability. It can name an IPC channel, a thread, a task, a memory object, a hardware device, or many other things. Two facts about Mach ports matter for this post:

  • Ports are integers in your task’s port namespace, but they carry rights. A “send right” lets you send messages or operations to the port. A “receive right” lets you wait for them. Cloning a send right increments a kernel reference count. Dropping it decrements. Ports are not file descriptors. They encode capability, not just identity.
  • Ports can be transferred between tasks. Mach IPC has a port-descriptor message format that names a port from the sender and inserts an equivalent port into the receiver’s namespace. Unlike SCM_RIGHTS, which only knows how to move file descriptors, Mach IPC can move any port type, with the right kind and disposition preserved.
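
Both facts can be illustrated with a toy model in plain Rust. Nothing here talks to the kernel: Task, KernelObject, and copy_right_to are hypothetical names, and an Rc stands in for the kernel’s reference to the underlying object. The point is only that a name is local to one namespace, while the right it carries can be copied into another.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Stand-in for a kernel object that a port can name.
#[derive(Debug)]
struct KernelObject {
    label: &'static str,
}

/// Toy model of one task's port namespace: integer names mapped to rights.
#[derive(Default)]
struct Task {
    next_name: u32,
    rights: HashMap<u32, Rc<KernelObject>>,
}

impl Task {
    /// Insert a right and return the task-local name chosen for it.
    fn insert_right(&mut self, object: Rc<KernelObject>) -> u32 {
        self.next_name += 1;
        let name = self.next_name;
        self.rights.insert(name, object);
        name
    }

    /// Model of sending a right with a "copy" disposition: the sender keeps
    /// its right, and the receiver gets a new name for the same object.
    fn copy_right_to(&self, name: u32, receiver: &mut Task) -> Option<u32> {
        let object = self.rights.get(&name)?;
        Some(receiver.insert_right(Rc::clone(object)))
    }
}
```

If a parent task holds a right under name 1 and copies it to a worker whose namespace is already partly used, the worker sees a different integer for the same object. Sending the integer as bytes would mean nothing on the other side.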

A Mach memory entry is a Mach port that names a virtual memory object. You can map it into your address space, transfer it to another task as a port descriptor, and let the kernel manage copy-on-write between mappings of the same entry. That last property is the one Kalahari needs.

The Handle Is a Port

It is tempting to start a cross-platform memory layer by asking for the macOS equivalent of memfd_create. That question points in the wrong direction. The closest thing is a memory entry, but a memory entry is not a file descriptor. It is a port with rights.

Kalahari’s macOS memory handle is built around that fact. The shape is roughly:

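use std::path::Path;

pub struct MemHandle {
    entry: u32,      // Mach memory entry send right (a task-local port name)
    size: usize,     // region length in bytes
    writable: bool,  // Kalahari-side mapping policy
}

// Signatures only; bodies are elided with todo!().
impl MemHandle {
    pub fn allocate(size: usize) -> Self { todo!() }                  // anonymous shared region
    pub fn from_file(path: &Path) -> Self { todo!() }                 // read-only host file
    pub fn from_mach_port(port: u32, size: usize) -> Self { todo!() } // received over IPC

    pub fn try_clone(&self) -> Self { todo!() }                       // bump send-right refcount
}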

A few details are worth stating explicitly:

  • entry is an integer in the current task’s port namespace. The integer itself is meaningless to other tasks. What is transferred over IPC is the right the port represents, not the integer.
  • try_clone does not copy memory or pages. It calls mach_port_mod_refs to increment the send-right reference count and returns a fresh handle pointing at the same VM object.
  • Drop calls mach_port_deallocate, which decrements that count. When the last reference is gone, the kernel releases the port and (if nothing else references the underlying VM object) the pages.
  • writable is a Kalahari-side bit. It controls how this handle is mapped, not whether the underlying entry could in principle be mapped writable. The reason that distinction exists is explained below.
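
The clone and drop accounting can be sketched without any Mach calls. In this toy version, ToyHandle and SendRefs are hypothetical names, and a shared counter stands in for the send-right count that the kernel really keeps; try_clone and Drop only adjust that counter, which is the behavior the bullets above describe.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Stand-in for the kernel's count of send-right references on one port name.
/// In the real handle this count lives in the kernel, not in Rust.
type SendRefs = Rc<Cell<u32>>;

struct ToyHandle {
    refs: SendRefs,
    size: usize,
    writable: bool,
}

impl ToyHandle {
    fn new(size: usize, writable: bool) -> Self {
        ToyHandle { refs: Rc::new(Cell::new(1)), size, writable }
    }

    /// Analogue of try_clone: add one reference, copy no memory.
    fn try_clone(&self) -> Self {
        self.refs.set(self.refs.get() + 1);
        ToyHandle {
            refs: Rc::clone(&self.refs),
            size: self.size,
            writable: self.writable,
        }
    }
}

impl Drop for ToyHandle {
    /// Analogue of the Drop path: release exactly one reference.
    fn drop(&mut self) {
        self.refs.set(self.refs.get() - 1);
    }
}
```

Cloning a handle takes the counter from 1 to 2, dropping either handle brings it back to 1, and only the last drop reaches 0, which is the point where the real kernel could release the port.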

Allocating Fresh Anonymous Memory

When Kalahari needs a new buffer of guest RAM, it does this:

  1. Ask the Mach VM system for an anonymous, page-aligned range of virtual memory: mach_vm_allocate. This returns a memory range in the current task.
  2. Wrap that range in a memory entry with mach_make_memory_entry_64, passing the MAP_MEM_VM_SHARE flag.

mach_vm_allocate
        |
        v
mach_make_memory_entry_64(MAP_MEM_VM_SHARE)
        |
        v
Mach memory entry port

The MAP_MEM_VM_SHARE flag is what makes the entry transferable to other tasks. Without it the entry would be local-only and could not be sent to a worker.

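After step 2, Kalahari has a port that refers to the same physical pages as the original allocation. The port can be mapped locally or sent over IPC to a worker process, where it can be mapped again. Mapped as shared memory, both mappings see the same backing pages, writes included. A copy-on-write view comes from mapping the entry with copy=TRUE, which is what the branching section below relies on.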

The original VM range has to stay alive as long as the entry exists, because the entry only references it. That is what the Drop-time deallocation in the Rust handle is for. When the last MemHandle referring to the locally-allocated range goes away, Kalahari tells the kernel to release the range, the entry’s reference on the underlying VM object drops with it, and the pages become eligible for reclaim.

File-Backed Memory and the Hypervisor Wrinkle

Read-only filesystem images, like an EROFS rootfs, want a memory entry too, but a read-only one. Naively that should be simple: open the file read-only, map it read-only, wrap it as an entry. There is a wrinkle that comes from the hypervisor.

The macOS kernel tracks two protection values for every VM range:

  • Current protection: what is allowed right now (read, write, execute).
  • Maximum protection: the ceiling. Current protection can never exceed maximum.

Hypervisor.framework’s hv_vm_map checks the maximum protection of the host VM range it is asked to map for the guest. If the maximum lacks write, hv_vm_map refuses, even if Kalahari is only asking for guest read and execute access. The reason is conservative: the hypervisor wants the host range to permit any future writes the guest might be granted, and a host range whose maximum protection is read-only cannot ever do that.

So Kalahari constructs file-backed entries carefully:

  • Open the file read-only.
  • Map it privately with read-and-write maximum protection. The file is still read-only on disk, but the resulting in-memory range has the right ceiling for hv_vm_map.
  • Create a memory entry from that mapping. Mark the resulting MemHandle as not writable at the Kalahari API level.
  • When that handle is mapped for actual use, the current protection is read-only. The maximum protection still includes write, which is what hv_vm_map needs to see to accept the mapping.

That keeps the read-only semantics honest. Kalahari never asks for the current protection to include write, so the guest cannot write through it. The dance is only there to satisfy hv_vm_map’s maximum-protection check.
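
The rule can be written down as a small model. This is a sketch of the logic as described above, not of Hypervisor.framework itself: Range, set_current, and hv_map_allowed are hypothetical names, and the three bit values mirror VM_PROT_READ, VM_PROT_WRITE, and VM_PROT_EXECUTE.

```rust
const PROT_READ: u8 = 0b001;
const PROT_WRITE: u8 = 0b010;
const PROT_EXEC: u8 = 0b100;

/// Toy model of the two protection values tracked for a host VM range.
#[derive(Debug, Clone, Copy)]
struct Range {
    current: u8,
    maximum: u8,
}

impl Range {
    /// Current protection can change, but never above the maximum.
    fn set_current(&mut self, wanted: u8) -> Result<(), &'static str> {
        if (wanted & !self.maximum) != 0 {
            return Err("requested protection exceeds maximum");
        }
        self.current = wanted;
        Ok(())
    }

    /// Model of the check described for hv_vm_map: the host range's maximum
    /// protection must include write, whatever access the guest is given.
    fn hv_map_allowed(&self) -> bool {
        (self.maximum & PROT_WRITE) != 0
    }
}
```

A file-backed range built as { current: PROT_READ, maximum: PROT_READ | PROT_WRITE } passes the check while staying unwritable in practice. A range whose maximum is only PROT_READ | PROT_EXEC fails it, and cannot be upgraded afterwards because set_current refuses anything above the ceiling.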

Branching Is a VM Mapping Operation

The interesting operation is branch(): take an existing memory entry and produce a new memory entry that starts as a copy-on-write child of it. The implementation does three things:

  1. Allocate a fresh VM range for the child, the same size as the parent.
  2. Call mach_vm_map on the parent’s memory entry with copy=TRUE and VM_FLAGS_OVERWRITE. That replaces the freshly allocated range with copy-on-write references to the parent’s pages.
  3. Wrap the new child range in its own MAP_MEM_VM_SHARE memory entry.

parent memory entry
        |
        | mach_vm_map(copy=TRUE)
        v
child VM region with CoW references
        |
        | mach_make_memory_entry_64(MAP_MEM_VM_SHARE)
        v
child memory entry

mach_vm_map(copy=TRUE) is the kernel-side primitive. No physical pages are copied at this point. The child range simply points at the parent’s pages with a copy-on-write attribute. When either the parent or the child writes to a page, the kernel allocates a private page for whichever side wrote, and the other side keeps the original. Conceptually, this is the same copy-on-write behavior Unix fork() relies on for cheap child processes.
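
The fault behavior is easy to model in user space. The sketch below is a toy, not the kernel mechanism: Region is a hypothetical type, pages are four bytes long to keep things readable, and Rc::make_mut plays the role of the write fault by copying a page only if it is still shared.

```rust
use std::rc::Rc;

const PAGE: usize = 4; // tiny pages keep the example readable

/// Toy address space: a list of reference-counted pages.
#[derive(Clone)]
struct Region {
    pages: Vec<Rc<[u8; PAGE]>>,
}

impl Region {
    fn zeroed(page_count: usize) -> Self {
        Region {
            pages: (0..page_count).map(|_| Rc::new([0u8; PAGE])).collect(),
        }
    }

    /// Analogue of branching: clone the page references, not the contents.
    fn branch(&self) -> Region {
        self.clone()
    }

    /// Writing privatizes only the touched page. Rc::make_mut copies the
    /// page if it is still shared and writes in place if it is not.
    fn write(&mut self, addr: usize, value: u8) {
        let page = Rc::make_mut(&mut self.pages[addr / PAGE]);
        page[addr % PAGE] = value;
    }

    fn read(&self, addr: usize) -> u8 {
        self.pages[addr / PAGE][addr % PAGE]
    }

    /// Number of pages this region still shares with `other`.
    fn shared_pages(&self, other: &Region) -> usize {
        self.pages
            .iter()
            .zip(&other.pages)
            .filter(|&(a, b)| Rc::ptr_eq(a, b))
            .count()
    }
}
```

Branching a four-page parent yields a child that shares all four pages. After the child writes one byte, it shares three, and neither the parent nor any sibling branch observes the write. A later write by the parent privatizes the parent’s page in the same way, leaving the children with the original.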

What makes step 3 important is that the child becomes a memory entry of its own, not just a private mapping. That means the child can be sent over IPC, mapped by a different process, and branched again. Each generation of branch produces a handle of the same shape as its parent, which is what makes zygote.spawn() an N-way operation rather than a one-shot.

A simpler answer using MAP_PRIVATE would not work for the same reason fork() is not enough. MAP_PRIVATE gives one mapping private fault semantics, but it does not give you a transferable, durable object you can branch from later. Kalahari needs the latter.

IPC Has to Carry Rights, Not Just Bytes

Kalahari’s VM workers are separate processes from the parent. Whenever the parent creates a memory entry, the worker has to receive it before the worker can map it into the hypervisor. On Linux, the equivalent transport problem is solved with SCM_RIGHTS over a Unix-domain socket, which knows how to transfer file descriptors. On macOS, the resource is not a file descriptor. It is a Mach port with a send right, and SCM_RIGHTS cannot carry it: a file-descriptor transport has no notion of port rights or named VM objects, so the capability semantics would be lost along the way.

The macOS IPC channel is therefore a hybrid:

shared ring buffer      anonymous Mach VM, carried as a memory-entry handle
doorbell                AF_LOCAL socketpair
aux transport           Mach messages with port descriptors

  • The shared ring buffer is where the actual message data goes. It is plain bytes in shared memory, accessed by both processes. Kalahari serializes messages with postcard and writes them into the ring.
  • The doorbell is an AF_LOCAL socketpair used only for wakeups. When one side has produced a frame, it writes a byte to the doorbell to nudge the other side.
  • The aux transport carries Mach port descriptors. When a ring frame contains a memory handle, the actual port right is carried over a Mach message in the aux transport, in parallel with the bytes in the ring.

The aux transport uses Mach messages because Mach IPC is the kernel’s native way to copy or move port rights between tasks. Each port descriptor is sent with the disposition MACH_MSG_TYPE_COPY_SEND, which means the sender keeps its own send right and the receiver gets a fresh one of its own.

The receive side treats incoming Mach messages as untrusted transport input. It caps the Mach receive buffer at 64 MiB so a peer cannot force the receiver to allocate gigabytes per message. It checks the message id, the declared message size, and the descriptor count in the complex-message header. It rejects non-port descriptors, unexpected port dispositions, and null ports. It validates that the number of port descriptors matches the aux-slot count the ring frame declared.

Those checks are not cosmetic. A malformed port-bearing message can leak send rights into the receiving task, desynchronize the ring and aux streams (so a future frame thinks a port belongs to the wrong handle), or trigger a worker failure that surfaces far away from the original IPC bug.
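
A simplified version of that checklist might look like the following. Everything here is an assumption made for illustration: AuxMessage is a pre-parsed stand-in for a real Mach message, the type and variant names are hypothetical, and EXPECTED_ID is an invented value. Only the order and kind of checks follow the description above.

```rust
/// Kind of right a received port descriptor carries.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Right {
    Send,
    SendOnce,
    Receive,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Descriptor {
    Port { name: u32, right: Right },
    Other, // any non-port descriptor
}

/// Hypothetical, already-parsed view of a received aux message.
#[derive(Debug, Clone)]
struct AuxMessage {
    id: u32,
    declared_size: usize,
    actual_size: usize,
    descriptors: Vec<Descriptor>,
}

#[derive(Debug, PartialEq)]
enum Reject {
    BadId,
    BadSize,
    NonPortDescriptor,
    BadDisposition,
    NullPort,
    SlotMismatch,
}

const EXPECTED_ID: u32 = 0x4B41; // hypothetical message id
const MAX_MESSAGE_BYTES: usize = 64 * 1024 * 1024; // the 64 MiB cap

/// Returns the received port names only if every check passes.
fn validate(msg: &AuxMessage, expected_slots: usize) -> Result<Vec<u32>, Reject> {
    if msg.id != EXPECTED_ID {
        return Err(Reject::BadId);
    }
    if msg.declared_size != msg.actual_size || msg.actual_size > MAX_MESSAGE_BYTES {
        return Err(Reject::BadSize);
    }
    let mut ports = Vec::with_capacity(msg.descriptors.len());
    for descriptor in &msg.descriptors {
        match *descriptor {
            Descriptor::Port { name: 0, .. } => return Err(Reject::NullPort),
            Descriptor::Port { name, right: Right::Send } => ports.push(name),
            Descriptor::Port { .. } => return Err(Reject::BadDisposition),
            Descriptor::Other => return Err(Reject::NonPortDescriptor),
        }
    }
    // The ring frame declared how many handles it carries; the aux message
    // must deliver exactly that many ports or the two streams desynchronize.
    if ports.len() != expected_slots {
        return Err(Reject::SlotMismatch);
    }
    Ok(ports)
}
```

The sketch only decides accept or reject. A real receiver would also have to release any rights that arrived in a rejected message, since otherwise the rejection path itself leaks send rights.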

Bootstrapping the Channel Is the Hard Part

Once a parent and a child can exchange Mach messages, transferring memory entries is routine. The awkward problem is creating that channel for the first time, before any IPC exists.

A child process at posix_spawn time has no shared-memory ring with its parent. It also has no Mach send right it can use to receive aux messages. Kalahari therefore has to place an initial Mach send right into the child during posix_spawn, before the child has done any work of its own. The child then calls mach_ports_lookup (a public API) to retrieve registered Mach ports installed at spawn time, uses one of them as a bootstrap channel, sends the parent a send right to its own receive port in return, and completes the bidirectional Mach transport from there.
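
The shape of that handshake can be shown with standard-library channels, which make a reasonable analogy: a Sender behaves like a send right and a Receiver like a receive right. This is only an analogy under loud assumptions. The "child" is a thread rather than a spawned process, and Bootstrap and handshake are hypothetical names.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Messages on the toy bootstrap channel.
enum Bootstrap {
    /// Child to parent: "here is a sender for my own receive end".
    ReplyChannel(Sender<String>),
}

/// Runs the handshake and returns what the child received over the channel
/// it handed back to the parent.
fn handshake() -> String {
    // Parent side: create the bootstrap channel and keep the receive end.
    let (bootstrap_tx, bootstrap_rx) = channel::<Bootstrap>();

    // "Spawn" the child with the bootstrap sender already in hand. This
    // stands in for the registered port placed into the child at spawn time.
    let child = thread::spawn(move || {
        let (child_tx, child_rx) = channel::<String>();
        // Give the parent a way to reach the child's own receive end.
        bootstrap_tx.send(Bootstrap::ReplyChannel(child_tx)).unwrap();
        // From here on, the child can receive from the parent.
        child_rx.recv().unwrap()
    });

    // Parent: learn how to reach the child, then use that path.
    let Bootstrap::ReplyChannel(to_child) = bootstrap_rx.recv().unwrap();
    to_child.send("shared-ring memory entry".to_string()).unwrap();

    child.join().unwrap()
}
```

The one thing the analogy cannot show is the hard part: a thread inherits the sender through a closure, while a freshly spawned process needs the kernel to place the initial right in its namespace.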

There is a wrinkle: there is no public posix_spawn API to attach registered Mach ports at spawn time. The kernel mechanism exists, the child-side lookup exists, but Apple does not expose a way to set the registered-port array from the parent side. The slot lives inside _posix_spawnattr, the heap-allocated struct behind the opaque posix_spawnattr_t, and Apple does not promise that internal layout is stable across releases.

Kalahari cannot ship hard-coded offsets per macOS version. The probe instead works at runtime:

  1. Allocate a fresh posix_spawnattr_t. This calls the public posix_spawnattr_init, which allocates the underlying heap struct.
  2. Call posix_spawnattr_setspecialport_np, a Darwin-specific (_np) extension, with TASK_BOOTSTRAP_PORT and a zero new_port value. This is a known sentinel: it forces the heap struct to allocate its internal port-action buffer and write a recognizable record into it.
  3. Scan up to 64 pointer-sized words inside the heap struct looking for the heap allocation the previous step just produced. Each candidate pointer is checked with malloc_size first, so the scan never dereferences anything that is not a live heap block.
  4. A candidate is accepted only when its embedded record matches the sentinel: an allocation count in a small range, a port-action count of 1, and an embedded action whose fields are exactly port_type == SPECIAL, which == TASK_BOOTSTRAP_PORT, and new_port == 0.
  5. Cache the discovered byte offset in a OnceLock and reuse it for every subsequent spawn.

If the scan fails, the layout has shifted in a way the probe does not understand. Kalahari returns ENOTSUP and the spawn fails with a clear error. There is no fallback to a fake transport, no silent skip of port bootstrap, no test stub. The layout has been verified on macOS 13 through 15 on aarch64 at the time of writing, but every major macOS release has to be re-verified before it is trusted.
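
The scan logic can be modeled without touching real spawn attributes. In this sketch the attribute struct is a slice of words, the heap is a HashMap keyed by "pointer", and a lookup in that map plays the role of the malloc_size check. PortActions, its field layout, PORT_TYPE_SPECIAL, and the 1..=16 range for the allocation count are all assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Toy record standing in for the port-action buffer the probe looks for.
#[derive(Clone, Copy)]
struct PortActions {
    alloc_count: u32,
    action_count: u32,
    port_type: u32,
    which: u32,
    new_port: u32,
}

const PORT_TYPE_SPECIAL: u32 = 1; // hypothetical encoding
const TASK_BOOTSTRAP_PORT: u32 = 4;
const MAX_WORDS: usize = 64;

/// Toy heap: maps a "pointer" to the live block stored there. Words that are
/// not keys in this map are never dereferenced.
type Heap = HashMap<usize, PortActions>;

fn matches_sentinel(record: &PortActions) -> bool {
    (1..=16).contains(&record.alloc_count) // assumed "small range"
        && record.action_count == 1
        && record.port_type == PORT_TYPE_SPECIAL
        && record.which == TASK_BOOTSTRAP_PORT
        && record.new_port == 0
}

/// Scan up to MAX_WORDS words and return the byte offset of the word that
/// points at the sentinel record, if there is one.
fn probe(attr_words: &[usize], heap: &Heap) -> Option<usize> {
    attr_words
        .iter()
        .take(MAX_WORDS)
        .position(|word| heap.get(word).map_or(false, matches_sentinel))
        .map(|index| index * std::mem::size_of::<usize>())
}

static OFFSET: OnceLock<Option<usize>> = OnceLock::new();

/// Probe once, cache the answer, and fail loudly if the layout is unknown.
fn cached_offset(attr_words: &[usize], heap: &Heap) -> Result<usize, &'static str> {
    (*OFFSET.get_or_init(|| probe(attr_words, heap)))
        .ok_or("ENOTSUP: spawn attribute layout not recognized")
}
```

One design choice in this sketch is that a failed probe is cached too, so a process that cannot recognize the layout keeps failing consistently instead of rescanning on every spawn. Whether Kalahari caches failures is not stated in the post.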

Once the child has the aux channel, the parent sends the real shared-ring memory entry through that same Mach-message path. The child maps it, validates the ring header, consumes the bootstrap frame, and from that point on uses the same generic sender and receiver code as the rest of the channel.

That is a lot of work to make a normal-looking child process able to receive guest memory. It also belongs in the VM layer, not in user code.

The User Should Not See Any of This

The portable abstraction is not that every operating system has the same primitive. They do not. Linux exposes file-descriptor-named memory and a separate copy-on-write backend. macOS exposes VM objects named by ports. Windows exposes section handles. The portable abstraction is that every backend produces a branchable, transferable memory handle with the same lifecycle: allocate or import, map, branch, send to a worker, branch again, drop.
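
One way to picture that shared lifecycle is as a trait that each backend implements. The post does not show Kalahari’s actual trait, so BranchableMemory, EagerCopy, and spawn_children below are hypothetical. The toy backend copies eagerly on branch, which is exactly what a real backend must avoid; it is here only to show that code above the trait does not care how branching is implemented.

```rust
/// Hypothetical trait for the lifecycle every backend has to provide.
trait BranchableMemory: Sized {
    fn allocate(size: usize) -> Self;
    fn branch(&self) -> Self;
    fn write(&mut self, offset: usize, byte: u8);
    fn read(&self, offset: usize) -> u8;
}

/// Deliberately naive backend: branch() copies every byte. A real backend
/// would branch copy-on-write; this one only demonstrates the API shape.
struct EagerCopy(Vec<u8>);

impl BranchableMemory for EagerCopy {
    fn allocate(size: usize) -> Self {
        EagerCopy(vec![0; size])
    }

    fn branch(&self) -> Self {
        EagerCopy(self.0.clone())
    }

    fn write(&mut self, offset: usize, byte: u8) {
        self.0[offset] = byte;
    }

    fn read(&self, offset: usize) -> u8 {
        self.0[offset]
    }
}

/// Backend-independent code: spawn `n` independent children from a zygote.
fn spawn_children<M: BranchableMemory>(zygote: &M, n: usize) -> Vec<M> {
    (0..n).map(|_| zygote.branch()).collect()
}
```

Because every branch returns the same type as its parent, spawn_children works on a child just as well as on the original zygote, which is the property that lets branching continue for any number of generations.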

On macOS, that means the handle is a Mach memory entry, branching is mach_vm_map(copy=TRUE) followed by a fresh memory entry, and cross-process transfer is Mach IPC with port descriptors. The bootstrap probe, send-right accounting, writable metadata, and receive-side validation are all in service of one promise:

const zygote = await sandbox.zygote();
const child = await zygote.spawn();

The child should be a real, isolated VM branch. The caller should not have to learn Mach.

The trick is that Kalahari is not pretending macOS is Linux. It is using the kernel object model macOS already exposes (VM objects named by ports, rights transferred by messages, copy-on-write enforced by the kernel) and wrapping it in something that looks the same as the Linux backend at the public boundary.