Introducing amla-sandbox: Secure Code Execution for AI Agents
A secure execution environment that lets AI agents write and run code safely—with 98% fewer tokens than tool-call loops.
Today we’re releasing amla-sandbox, a Python package that gives AI agents the ability to write and execute code—safely.
TL;DR: amla-sandbox is a secure WASM-based execution environment. Agents write Bash or JavaScript that calls your tools. Token usage drops ~98% compared to tool-call loops. `pip install amla-sandbox` and you’re running.
The Problem: Agents Want to Code
LLMs are trained on code. When you ask an agent to process data, it wants to write a script—loop through records, filter by condition, transform and aggregate. That’s what it knows how to do.
But most popular agent frameworks execute LLM-generated code through subprocess calls on your host system:
| Framework | Execution Method | Source |
|---|---|---|
| LangChain | Direct Python exec | CVE-2023-29374 |
| AutoGPT | Shell subprocess | CVE-2024-6091 |
| AutoGen | Python subprocess | Docs: “LLM can generate arbitrary code” |
| SWE-Agent | Bash subprocess | Trend Micro: AI Agent Code Execution |
| MetaGPT | Python subprocess | Issue #731: Arbitrary code execution |
That’s arbitrary code execution on your host. One prompt injection away from disaster.
The standard response is to wrap execution in Docker. But Docker is infrastructure. It’s slow to start. It requires orchestration. And it still doesn’t solve the capability problem: the agent either has access to everything in the container, or you’re back to allow-listing individual operations.
The Cost Problem
Security aside, there’s an economic problem.
Traditional agent loops work like this:
```mermaid
sequenceDiagram
    participant LLM
    participant Tools
    LLM->>Tools: tool call 1
    Tools-->>LLM: result (enters context)
    Note right of LLM: Re-process full context
    LLM->>Tools: tool call 2
    Tools-->>LLM: result (enters context)
    Note right of LLM: Re-process full context
    LLM->>Tools: tool call 3
    Tools-->>LLM: result (enters context)
    Note right of LLM: Context keeps growing...
```

Each round trip requires an LLM invocation, and the context accumulates. In a typical agent conversation, tool responses make up 67.6% of the total tokens: over two-thirds of what the agent actually processes is tool output.
The agent knows how to write a single script that does all 10 operations. The framework forces it into a call-by-call loop because that’s the only way to maintain control.
What if there were a way to let agents code—actually code—while maintaining security guarantees?
```mermaid
sequenceDiagram
    participant LLM
    participant Sandbox
    participant Tools
    LLM->>Sandbox: script (calls tools 1, 2, 3...)
    Sandbox->>Tools: tool call 1
    Tools-->>Sandbox: result
    Sandbox->>Tools: tool call 2
    Tools-->>Sandbox: result
    Sandbox->>Tools: tool call 3
    Tools-->>Sandbox: result
    Sandbox-->>LLM: final output only
    Note right of LLM: One LLM call, minimal context
```

Anthropic’s engineering team documented this approach: code execution reduced token usage from 150,000 tokens to 2,000 tokens—a 98.7% reduction. Large data flows through the code execution environment, not the LLM’s context window.
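The cited figure checks out as quick arithmetic:

```python
# Quick check of the cited reduction: 150,000 tokens down to 2,000.
before, after = 150_000, 2_000
reduction = (before - after) / before
print(f"{reduction:.1%}")  # 98.7%
```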
amla-sandbox: Let Agents Code Safely
amla-sandbox is a secure WASM-based execution environment. The agent writes code in JavaScript or shell. The code runs in isolation. Tool access is controlled by constraints you define.
```python
from amla_sandbox import create_sandbox_tool

def get_weather(city: str) -> dict:
    """Get current weather for a city."""
    return {"city": city, "temp": 72, "conditions": "sunny"}

def send_email(to: str, subject: str, body: str) -> dict:
    """Send an email."""
    return {"status": "sent", "to": to}

# Create a sandbox with your functions
sandbox = create_sandbox_tool(tools=[get_weather, send_email])

# JavaScript: async/await for tool calls
result = sandbox.run("""
const weather = await get_weather({ city: "San Francisco" });
console.log("Weather:", weather);
console.log("Conditions:", weather.conditions);
""", language="javascript")

# Shell: Unix pipelines with the `tool` command
result = sandbox.run("""
tool get_weather --city "Tokyo" | jq '.temp'
""", language="shell")
```

The code runs inside a WASM sandbox—not a subprocess, not a container. The sandbox can’t make syscalls, access the network, or touch the filesystem except through the tools you provide (see the WebAssembly security model). The agent can only call tools you’ve registered.
What’s Inside the Sandbox
The WASM runtime includes:
A full shell environment — Agents write Bash with pipes, variables, and boolean operators (`&&`, `||`).
Shell utilities — `grep`, `jq`, `sort`, `uniq`, `head`, `tail`, `wc`, `cut`, `tr`, `xxd`. Text processing without shelling out to your host.
```shell
# Agent can use Unix pipelines inside the sandbox
cat /workspace/logs.json |
  jq '.[] | select(.level == "error")' |
  sort -k2 |
  uniq -c |
  head -20
```

A virtual filesystem — /workspace/ for input files and scratch space.
Tool calling via the `tool` command — Your Python functions become shell commands: `tool get_weather --city SF`.
JavaScript via QuickJS — Full ES2020 support with async/await. Call tools directly as async functions:
```javascript
const weather = await get_weather({ city: 'San Francisco' });
const users = await search_database({ query: 'active' });
for (const user of users) {
  console.log(user.name);
}
```

All of this runs in WASM. No Docker. No VMs. No infrastructure beyond `pip install`.
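As a rough picture of the `tool` command mapping described above, here is a hypothetical sketch of how keyword arguments could be rendered as command-line flags. The flag convention is inferred from the examples in this post; it is not amla-sandbox's actual serialization.

```python
import shlex

def to_tool_command(name: str, **kwargs) -> str:
    """Render a keyword-argument tool call as an in-sandbox `tool` command
    line (hypothetical illustration, not the package's real logic)."""
    flags = [f"--{key} {shlex.quote(str(value))}" for key, value in kwargs.items()]
    return " ".join(["tool", name, *flags])

print(to_tool_command("get_weather", city="SF"))
# tool get_weather --city SF
print(to_tool_command("send_email", to="ops@example.com", subject="Daily report"))
```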
Framework Integration
amla-sandbox works with your existing tools and frameworks.
LangGraph (recommended)
```python
from langgraph.prebuilt import create_react_agent
from amla_sandbox import create_sandbox_tool

sandbox = create_sandbox_tool(tools=[get_weather, search_db, send_notification])
agent = create_react_agent(model, [sandbox.as_langchain_tool()])
result = agent.invoke({"messages": [("user", "Process the daily reports")]})
```

Tool ingestion — Convert existing tools from LangChain, OpenAI, or Anthropic formats:
```python
from amla_sandbox.tools import from_langchain, from_openai_tools

# Your existing LangChain tools
sandbox_tools = [from_langchain(tool) for tool in langchain_tools]

# Or OpenAI function definitions
sandbox_tools = from_openai_tools(openai_function_schemas)
```

Constraints and Call Limits
You can constrain what agents are allowed to do:
```python
sandbox = create_sandbox_tool(
    tools=[transfer_money, get_weather],
    constraints={
        # Limit transfers to $1000 max, only USD/EUR
        "transfer_money": {
            "amount": "<=1000",
            "currency": ["USD", "EUR"],
        },
    },
    max_calls={
        "transfer_money": 5,   # Max 5 transfers per session
        "get_weather": 100,    # Weather is cheap, allow more
    },
)
```

When the agent’s code calls `await transfer_money({amount: 5000})`, the call is rejected before it executes. The constraint `amount <= 1000` is verified at the gateway.
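To make "verified at the gateway" concrete, here is a minimal pure-Python sketch of that kind of check: only registered tools are reachable, arguments are validated against constraints, and per-session call counts are enforced. The semantics are assumed from the example above (`<=N` compares numerically, a list means membership); this is an illustration, not amla-sandbox's implementation.

```python
class Gateway:
    """Illustrative tool gateway: registration, constraints, call limits."""

    def __init__(self, tools, constraints=None, max_calls=None):
        self._tools = {fn.__name__: fn for fn in tools}
        self._constraints = constraints or {}
        self._max_calls = max_calls or {}
        self._counts = {}

    def call(self, name, **kwargs):
        # Only registered tools are reachable from sandboxed code.
        if name not in self._tools:
            raise PermissionError(f"tool not registered: {name}")
        # Enforce per-session call limits.
        self._counts[name] = self._counts.get(name, 0) + 1
        if self._counts[name] > self._max_calls.get(name, float("inf")):
            raise PermissionError(f"call limit exceeded: {name}")
        # Validate arguments before the tool ever executes.
        for field, rule in self._constraints.get(name, {}).items():
            value = kwargs.get(field)
            if isinstance(rule, list) and value not in rule:
                raise PermissionError(f"{field} not allowed: {value!r}")
            if isinstance(rule, str) and rule.startswith("<=") and float(value) > float(rule[2:]):
                raise PermissionError(f"{field} over limit: {value!r}")
        return self._tools[name](**kwargs)

def transfer_money(amount: float, currency: str) -> dict:
    return {"status": "ok", "amount": amount, "currency": currency}

gw = Gateway(
    [transfer_money],
    constraints={"transfer_money": {"amount": "<=1000", "currency": ["USD", "EUR"]}},
    max_calls={"transfer_money": 5},
)
print(gw.call("transfer_money", amount=250, currency="EUR"))   # allowed
# gw.call("transfer_money", amount=5000, currency="USD")  would raise PermissionError
```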
Why WASM?
We evaluated several isolation approaches:
| Approach | Startup | Portability | Isolation | Complexity |
|---|---|---|---|---|
| Subprocess | <1ms | High | None | Low |
| Docker | ~500ms | Medium | Good | High |
| Firecracker | ~125ms | Low | Excellent | Very High |
| gVisor | 50-100ms | Low | Good | High |
| WASM | <1ms | High | Good | Low |
WASM gives us:
- No syscalls — The runtime can’t access the network, filesystem, or any host resources except through explicit tool bindings (WebAssembly security model).
- Portability — One binary runs on macOS, Linux, Windows. No platform-specific containers.
- Fast startup — Microseconds, not milliseconds. Wasmtime’s instantiation time went from ~2ms to 5µs. Critical when you’re running thousands of agent executions.
The tradeoff: WASM is slower than native code for compute-heavy tasks. For agent workloads (mostly I/O-bound tool calls), the overhead is negligible.
Getting Started
```shell
pip install amla-sandbox
```

Minimal example:
```python
from amla_sandbox import create_sandbox_tool

def get_weather(city: str) -> dict:
    """Get current weather for a city."""
    return {"city": city, "temp": 72, "conditions": "sunny"}

sandbox = create_sandbox_tool(tools=[get_weather])

result = sandbox.run("""
const sf = await get_weather({ city: "San Francisco" });
const ny = await get_weather({ city: "New York" });
console.log(sf, ny);
""", language="javascript")
print(result)
```

The documentation covers everything from basic usage through production patterns.
What This Enables
When code execution is safe and cheap, new architectures become viable:
Data processing agents that write actual data pipelines instead of calling tools one row at a time.
Research agents that fetch, parse, filter, and synthesize information in a single execution—not a 50-step tool loop.
Agentic workflows where agents can branch and compose operations in code rather than being limited to single tool calls.
Open Source
The Python code is MIT licensed. The WASM binary is bundled with the package.
- Repository: github.com/amlalabs/amla-sandbox
- Documentation: amla-sandbox.readthedocs.io
- PyPI: pypi.org/project/amla-sandbox
We’re building the infrastructure that lets agents act safely. amla-sandbox is the execution layer. Try it, break it, tell us what’s missing.