Process Handles Are Scheduler Contracts

How Kalahari keeps process handles, PTYs, callbacks, and zygote boundaries coherent across parked VM runs.


Kalahari’s process API is intentionally plain:

const server = await sandbox.startShell('python -m http.server 8000', {
  cwd: '/workspace',
  onStdout(chunk) {
    console.log(chunk);
  },
});

await server.kill();

That handle looks like a normal TypeScript object. It has a Kalahari-assigned pid, sendStdin(), resize(), kill(), and wait().
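
Spelled out as a type, it is roughly this (a simplification; parameter shapes beyond the fields this post names are assumptions):

// Simplified sketch of the handle's surface as this post uses it.
interface CommandResult {
  exitCode: number;
  stdout: string;
  stderr: string;
  durationMs: number; // measured from process start
}

interface ProcessHandle {
  readonly pid: number; // Kalahari-assigned process-handle ID
  sendStdin(data: string): Promise<void>;
  resize(cols: number, rows: number): Promise<void>;
  kill(): Promise<void>;
  wait(): Promise<CommandResult>; // rejects if the process was cancelled
}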

Underneath, it is a scheduler contract.

The command may outlive the API call that started it. The VM may be running inside one host operation, parked between operations, preempted so another VM can run, spawned from a zygote, or shutting down. The same process may also be reached through the native Kalahari API, an E2B-style wrapper, or a ComputeSDK adapter.

A process handle only feels simple if every one of those layers agrees on what it means to own a command.

Parked VMs Change the Problem

The scheduler exposes a logical VM that is usually parked. A caller enters guest execution through a scoped run operation, does work through a temporary VM handle, and gets a parked VM back when the operation completes. That boundary is important because it gives Kalahari a place to multiplex backend shells, freeze a zygote, close resources, and enforce invariants.
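
A minimal sketch of that boundary, with invented TypeScript names purely for illustration (the real scheduler is internal, and in Rust):

// Invented names for illustration; the real scheduler is not exposed like this.
interface ParkedVm { readonly vmId: string }
interface RunningVm { readonly vmId: string }

// The parked handle is consumed on entry and a fresh one is handed back on
// exit, so every run ends at a boundary where the scheduler can multiplex
// backend shells, detach or abandon commands, close resources, and enforce
// invariants before the VM parks again.
declare function run<T>(
  parked: ParkedVm,
  body: (vm: RunningVm) => Promise<T>,
): Promise<{ parked: ParkedVm; result: T }>;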

Processes do not naturally fit into that shape.

A command can start in one scheduler run and still be alive when that run wants to return. Later, another operation may need to write stdin, resize a PTY, collect output, or kill the same command. If the scheduler treats each run as a disposable host call, the process API turns into a set of races:

  • output delivered to one host path but not another
  • stdin EOF sent just because a temporary command owner was dropped
  • a command ID attached twice
  • a zygote child inheriting half-owned host state
  • destroy() making a cancelled process look successful

Those are not TypeScript edge cases. They are VM lifecycle questions.

Handles Are More Than IDs

The scheduler has two forms of command ownership.

During a running epoch, a command owner holds the live stdout, stderr, exit, stdin, and PTY control paths. To survive a parked boundary, it must become a lightweight reattachment handle that can be attached during a later run.

That conversion is deliberately strict. It can fail if stdout or stderr has already reached transient host channels, if output is still queued for the scheduler to forward, if the command has exited, or if there is pending control state such as stdin, EOF, or resize. In those cases the original command owner is returned with the error so the caller can keep draining or close it deliberately.

The rule is conservative because cloning a command stream after bytes have escaped would split reality. One attachment would see data another cannot. A resize or EOF could be applied in one run and silently disappear from the next. Kalahari would rather reject reattachment than let a handle lie.
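
The contract reads roughly like this as a TypeScript sketch; the names are invented, and the real types live in the Rust scheduler:

// Invented names; the failure reasons mirror the cases described above.
interface CommandOwner { readonly commandId: string }
interface ReattachHandle { readonly commandId: string }

type DetachReason =
  | 'output-in-transient-host-channels'
  | 'output-queued-for-forwarding'
  | 'command-exited'
  | 'pending-control-state'; // stdin, EOF, or resize still in flight

type DetachOutcome =
  | { ok: true; handle: ReattachHandle }
  | { ok: false; owner: CommandOwner; reason: DetachReason };

// On failure the live owner comes back with the error, so the caller can
// keep draining output or close the command deliberately.
declare function detach(owner: CommandOwner): DetachOutcome;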

The same rule shows up at VM-run boundaries. When a run finishes, the scheduler tries to detach active commands into handles only when it can do so without losing host-visible state. If a command is still attached to the user-visible run, the run cannot complete successfully. If the command was dropped, the scheduler sends EOF or abandons it instead of pretending it is still safely owned.

Duplicate Attachments Are Structural Bugs

A detached command handle is a capability to reconnect one live guest command. Attaching it twice would create two host owners for one stdout stream, one stdin writer, and one exit event.

Both the VMM layer and the scheduler reject duplicate attachments before the VM starts. A paused VM cannot queue the same command ID twice for reattachment, and the scheduler keeps command state in a registry keyed by execution ID so each epoch knows which detached primitive handles may be reattached.
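
The attach-once invariant itself is small. A hypothetical sketch, reusing the ReattachHandle shape from above:

// Hypothetical sketch of the attach-once rule.
class ReattachQueue {
  private readonly pending = new Map<string, ReattachHandle>();

  queue(handle: ReattachHandle): void {
    if (this.pending.has(handle.commandId)) {
      // Two host owners for one stdout stream, one stdin writer, and one
      // exit event would be a structural bug, so duplicates are rejected
      // before the VM starts rather than debugged afterward.
      throw new Error(`command ${handle.commandId} already queued for reattachment`);
    }
    this.pending.set(handle.commandId, handle);
  }
}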

Kalahari mirrors that discipline in the JavaScript layer. The process registry is keyed by native sandbox identity, not by wrapper object. A Kalahari handle, an E2B-compatible handle, and a connected handle all route through the same registry for that sandbox:

const server = await sandbox.startProcess('node', {
  args: ['server.js'],
  cwd: '/workspace',
});

const sameServer = sandbox.connectProcess(server.pid);
await sameServer.sendStdin('status\n');

console.log(sandbox.listProcesses());
await server.kill();

The pid here is Kalahari’s process-handle ID. It is stable inside the registry, but it is not a guest OS PID, and it cannot be resolved from a different sandbox.

Output Collection Starts at Launch

Background process support forced a separate distinction: collecting output is not the same thing as waiting for a process to finish.

When a process starts, Kalahari creates a PTY session, records metadata, and immediately starts a single collection promise. wait() returns that promise. Callbacks are invoked as chunks are read, and repeated waits on the same handle observe the same result.
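
Sketched as a registry entry, with illustrative names rather than Kalahari's actual internals (CommandResult as sketched above):

// Illustrative sketch: one collection promise per process, started at launch.
interface RegistryEntry {
  readonly pid: number;
  readonly startedAt: number; // durationMs is measured from here
  readonly collected: Promise<CommandResult>; // the single collection promise
}

function register(pid: number, collect: () => Promise<CommandResult>): RegistryEntry {
  const entry: RegistryEntry = {
    pid,
    startedAt: Date.now(),
    collected: collect(), // starts draining the PTY and firing callbacks now
  };
  // Observe the rejection on a derived promise so an un-awaited failure does
  // not surface as an unhandled rejection; wait() still sees the rejection.
  entry.collected.catch(() => {});
  return entry;
}

// wait() returns entry.collected, so repeated waits observe the same result.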

Because collection starts eagerly, this works even if wait() is never called during the interesting part of the process lifetime:

const build = await sandbox.startShell('npm run build -- --watch', {
  onStdout(chunk) {
    console.log(chunk);
  },
  onStderr(chunk) {
    console.error(chunk);
  },
});

await build.sendStdin('help\n');

This fixes the product-level behavior: callbacks are live from launch, durationMs measures from process start, and compatibility wrappers do not need their own output-drain loops.

It is not a hard memory cap.

The current JavaScript registry appends stdout and stderr into strings so wait() can later return a CommandResult. That eager drain prevents “nobody called wait yet” from being the reason native PTY reads stall, but a noisy long-running background process can still grow JS memory while Kalahari caches its output. One-shot run() has an outputLimitBytes path; process handles do not currently impose an equivalent cap in the JS registry.
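
For contrast, the one-shot path looks roughly like this; only the outputLimitBytes option is attested above, and the exact call shape is an assumption:

// Assumed call shape; only the outputLimitBytes option is attested above.
const capped = await sandbox.run('npm run build', {
  outputLimitBytes: 1024 * 1024, // cap collected output at 1 MiB
});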

That distinction matters. The accurate guarantee is prompt collection and callback delivery, not bounded background-log storage.

Callback Failure Ends the Process

onStdout and onStderr are part of the user’s requested observation path. If one throws, Kalahari treats that as a process lifecycle failure.

The registry does not log the callback error and keep the process running invisibly. The collection loop terminates the PTY session, removes the process from the registry, and makes wait() reject with the callback error.

The behavior is intentionally sharp:

  • process emits output
  • callback throws
  • Kalahari interrupts and closes the PTY session
  • the process leaves the registry
  • wait() rejects
  • killProcess(pid) returns false after cleanup

That is better than leaving a process alive after its only registered observer failed.
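
A sketch of that collection loop, with PtySession and the registry as invented stand-ins and stderr elided for brevity:

// Illustrative sketch of the collection loop's failure handling.
interface PtySession {
  readonly pid: number;
  stdout(): AsyncIterable<string>;
  result(): Promise<CommandResult>;
  close(): Promise<void>;
}

declare const registry: Map<number, PtySession>;

async function collect(
  pty: PtySession,
  onStdout?: (chunk: string) => void,
): Promise<CommandResult> {
  try {
    for await (const chunk of pty.stdout()) {
      onStdout?.(chunk); // a throw here is treated as a lifecycle failure
    }
    return await pty.result();
  } catch (err) {
    await pty.close(); // interrupt and close the PTY session
    registry.delete(pty.pid); // the process leaves the registry
    throw err; // wait() rejects with the callback error
  }
}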

Zygote Boundaries Need Clean Ownership

Zygotes are where loose lifecycle rules become dangerous.

When a sandbox becomes a zygote, the parent sandbox is consumed. Children spawned from the zygote need a coherent container init process, but they must not inherit arbitrary user-visible PTYs or process handles with host-side output, stdin, or resize state in flight.

Kalahari enforces that at two layers.

At the TypeScript layer, sandbox.zygote() rejects while the process registry has active process handles. At the native runtime layer, zygote creation also rejects while PTY sessions are active. That catches lower-level createPty() users and compatibility paths that are not represented as normal Kalahari process records.

The init process is the special case because Kalahari owns it. Before freezing, the native actor drops the visible init output streams and repeatedly attempts to convert init into reattachable scheduler state. If stdout or stderr was already in transient host channels, it waits briefly for that state to drain. On success, the zygote stores the init handle. Each spawned child attaches that handle during its first run and resumes with a live init process.
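
That retry reads roughly like this, reusing the detach() sketch from above; the real code is the native Rust actor, and its exact retry policy is not spelled out here:

// Illustrative only: retry init detachment while transient output drains.
async function detachInit(owner: CommandOwner): Promise<ReattachHandle> {
  for (;;) {
    const outcome = detach(owner);
    if (outcome.ok) return outcome.handle; // the zygote stores this handle
    if (outcome.reason !== 'output-in-transient-host-channels') {
      throw new Error(`init cannot be detached: ${outcome.reason}`);
    }
    owner = outcome.owner; // keep the live owner and retry
    await new Promise((resolve) => setTimeout(resolve, 10)); // brief drain wait
  }
}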

User process handles do not cross that boundary:

const shell = await sandbox.startShell('cat');

await sandbox.zygote(); // rejects while shell is active

await shell.kill();
await shell.wait(); // rejects because the process was terminated

The boundary is strict because a zygote is a template. It cannot contain host-owned leftovers from a parent that no longer exists as a running sandbox.

Destroy Is Cancellation, Not Success

Sandbox destruction has its own lifecycle rule: cleanup should be idempotent, but user-facing waits should remain truthful.

KalahariSandbox.destroy() asks the existing process registry to close all active processes before destroying the native sandbox. Closing an already-finished PTY is tolerated during cleanup. A process that is being waited on, however, should not resolve as exitCode: 0 just because destruction closed its session.

So termination records an error, closes the PTY, deletes the registry entry, and causes in-flight or later wait() calls on that handle to reject. The cleanup path can be forgiving without turning cancellation into a successful command result.
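
The user-visible consequence:

const task = await sandbox.startShell('sleep 1000');

await sandbox.destroy(); // closes the task's PTY session during cleanup

await task.wait(); // rejects: cancellation is not exitCode 0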

The API Is Boring Because the Scheduler Is Precise

The public surface stays small:

const task = await sandbox.startProcess('node', {
  args: ['worker.js'],
  cwd: '/workspace',
  onStdout: (chunk) => console.log(chunk),
});

await task.sendStdin('status\n');
const result = await task.wait();

Behind that are parked VM runs, epoch attachments, reattachable command handles, output backpressure, callback failures, registry ownership, PTY cleanup, zygote conversion, and sandbox destruction.

That complexity belongs below the API. Kalahari users should not need to know whether a command survived a parked VM run or whether an E2B-compatible wrapper created the handle. They should get one coherent process lifecycle.

The lesson is that a sandbox SDK cannot paper over scheduler ambiguity. If the VM lifecycle is precise, the product API can be ordinary.