Introducing amla-sandbox: Secure Code Execution for AI Agents
A secure execution environment that lets AI agents write and run code safely—with 98% fewer tokens than tool-call loops.
Today we’re releasing amla-sandbox, a Python package that gives AI agents the ability to write and execute code—safely.
TL;DR: amla-sandbox is a secure WASM-based execution environment. Agents write Bash or JavaScript that calls your tools. Token usage drops ~98% compared to tool-call loops. `pip install amla-sandbox` and you’re running.
The Problem: Agents Want to Code
LLMs are trained on code. When you ask an agent to process data, it wants to write a script—loop through records, filter by condition, transform and aggregate. That’s what it knows how to do.
But most popular agent frameworks execute LLM-generated code through subprocess calls on your host system:
| Framework | Execution Method | Source |
|---|---|---|
| LangChain | Direct Python exec | CVE-2023-29374 |
| AutoGPT | Shell subprocess | CVE-2024-6091 |
| AutoGen | Python subprocess | Docs: “LLM can generate arbitrary code” |
| SWE-Agent | Bash subprocess | Trend Micro: AI Agent Code Execution |
| MetaGPT | Python subprocess | Issue #731: Arbitrary code execution |
That’s arbitrary code execution on your host. One prompt injection away from disaster.
The standard response is to wrap execution in Docker. But Docker is infrastructure. It’s slow to start. It requires orchestration. And it still doesn’t solve the capability problem: the agent either has access to everything in the container, or you’re back to allow-listing individual operations.
The Cost Problem
Security aside, there’s an economic problem.
Traditional agent loops work like this:
```mermaid
sequenceDiagram
    participant LLM
    participant Tools
    LLM->>Tools: tool call 1
    Tools-->>LLM: result (enters context)
    Note right of LLM: Re-process full context
    LLM->>Tools: tool call 2
    Tools-->>LLM: result (enters context)
    Note right of LLM: Re-process full context
    LLM->>Tools: tool call 3
    Tools-->>LLM: result (enters context)
    Note right of LLM: Context keeps growing...
```

Each round trip requires an LLM invocation, and the context accumulates. In a typical agent conversation, tool responses make up 67.6% of the total tokens: over two-thirds of what the agent actually processes is tool output.
The agent knows how to write a single script that does all 10 operations. The framework forces it into a call-by-call loop because that’s the only way to maintain control.
What if there were a way to let agents code—actually code—while maintaining security guarantees?
```mermaid
sequenceDiagram
    participant LLM
    participant Sandbox
    participant Tools
    LLM->>Sandbox: script (calls tools 1, 2, 3...)
    Sandbox->>Tools: tool call 1
    Tools-->>Sandbox: result
    Sandbox->>Tools: tool call 2
    Tools-->>Sandbox: result
    Sandbox->>Tools: tool call 3
    Tools-->>Sandbox: result
    Sandbox-->>LLM: final output only
    Note right of LLM: One LLM call, minimal context
```

Anthropic’s engineering team documented this approach: code execution reduced token usage from 150,000 tokens to 2,000 tokens—a 98.7% reduction. Large data flows through the code execution environment, not the LLM’s context window.
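The cited figure checks out as quick arithmetic:

```python
# Quick check of the cited reduction: 150,000 tokens down to 2,000.
before, after = 150_000, 2_000
reduction = (before - after) / before
print(f"{reduction:.1%}")  # 98.7%
```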
amla-sandbox: Let Agents Code Safely
amla-sandbox is a secure WASM-based execution environment. The agent writes code in JavaScript or shell. The code runs in isolation. Tool access is controlled by constraints you define.
```python
from amla_sandbox import create_sandbox_tool

def get_weather(city: str) -> dict:
    """Get current weather for a city."""
    return {"city": city, "temp": 72, "conditions": "sunny"}

def send_email(to: str, subject: str, body: str) -> dict:
    """Send an email."""
    return {"status": "sent", "to": to}

# Create a sandbox with your functions
sandbox = create_sandbox_tool(tools=[get_weather, send_email])

# JavaScript: async/await for tool calls
result = sandbox.run("""
const weather = await get_weather({ city: "San Francisco" });
console.log("Weather:", weather);
console.log("Conditions:", weather.conditions);
""", language="javascript")

# Shell: Unix pipelines with the `tool` command
result = sandbox.run("""
tool get_weather --city "Tokyo" | jq '.temp'
""", language="shell")
```

The code runs inside a WASM sandbox—not a subprocess, not a container. The sandbox can’t make syscalls, access the network, or touch the filesystem except through the tools you provide (see the WebAssembly security model). The agent can only call tools you’ve registered.
What’s Inside the Sandbox
The WASM runtime includes:
A full shell environment — Agents write Bash with pipes, variables, and boolean operators (`&&`, `||`).
Shell utilities — `grep`, `jq`, `sort`, `uniq`, `head`, `tail`, `wc`, `cut`, `tr`, `xxd`. Text processing without shelling out to your host.
```shell
# Agent can use Unix pipelines inside the sandbox
cat /workspace/logs.json |
  jq '.[] | select(.level == "error")' |
  sort -k2 |
  uniq -c |
  head -20
```

A virtual filesystem — /workspace/ for input files and scratch space.
Tool calling via the `tool` command — Your Python functions become shell commands: `tool get_weather --city SF`.
JavaScript via QuickJS — Full ES2020 support with async/await. Call tools directly as async functions:
```javascript
const weather = await get_weather({ city: 'San Francisco' });
const users = await search_database({ query: 'active' });
for (const user of users) {
  console.log(user.name);
}
```

All of this runs in WASM. No Docker. No VMs. No infrastructure beyond `pip install`.
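As a rough picture of the `tool` command mapping described above, here is a hypothetical sketch of how keyword arguments could be rendered as command-line flags. The flag convention is inferred from the examples in this post; it is not amla-sandbox's actual serialization.

```python
import shlex

def to_tool_command(name: str, **kwargs) -> str:
    """Render a keyword-argument tool call as an in-sandbox `tool` command
    line (hypothetical illustration, not the package's real logic)."""
    flags = [f"--{key} {shlex.quote(str(value))}" for key, value in kwargs.items()]
    return " ".join(["tool", name, *flags])

print(to_tool_command("get_weather", city="SF"))
# tool get_weather --city SF
print(to_tool_command("send_email", to="ops@example.com", subject="Daily report"))
```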
Framework Integration
amla-sandbox works with your existing tools and frameworks.
LangGraph (recommended)
```python
from langgraph.prebuilt import create_react_agent
from amla_sandbox import create_sandbox_tool

sandbox = create_sandbox_tool(tools=[get_weather, search_db, send_notification])
agent = create_react_agent(model, [sandbox.as_langchain_tool()])
result = agent.invoke({"messages": [("user", "Process the daily reports")]})
```

Tool ingestion — Convert existing tools from LangChain, OpenAI, or Anthropic formats:
```python
from amla_sandbox.tools import from_langchain, from_openai_tools

# Your existing LangChain tools
sandbox_tools = [from_langchain(tool) for tool in langchain_tools]

# Or OpenAI function definitions
sandbox_tools = from_openai_tools(openai_function_schemas)
```

Constraints and Call Limits
You can constrain what agents are allowed to do:
```python
sandbox = create_sandbox_tool(
    tools=[transfer_money, get_weather],
    constraints={
        # Limit transfers to $1000 max, only USD/EUR
        "transfer_money": {
            "amount": "<=1000",
            "currency": ["USD", "EUR"],
        },
    },
    max_calls={
        "transfer_money": 5,   # Max 5 transfers per session
        "get_weather": 100,    # Weather is cheap, allow more
    },
)
```

When the agent’s code calls `await transfer_money({amount: 5000})`, the call is rejected before it executes. The constraint `amount <= 1000` is verified at the gateway.
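To make "verified at the gateway" concrete, here is a minimal pure-Python sketch of that kind of check: only registered tools are reachable, arguments are validated against constraints, and per-session call counts are enforced. The semantics are assumed from the example above (`<=N` compares numerically, a list means membership); this is an illustration, not amla-sandbox's implementation.

```python
class Gateway:
    """Illustrative tool gateway: registration, constraints, call limits."""

    def __init__(self, tools, constraints=None, max_calls=None):
        self._tools = {fn.__name__: fn for fn in tools}
        self._constraints = constraints or {}
        self._max_calls = max_calls or {}
        self._counts = {}

    def call(self, name, **kwargs):
        # Only registered tools are reachable from sandboxed code.
        if name not in self._tools:
            raise PermissionError(f"tool not registered: {name}")
        # Enforce per-session call limits.
        self._counts[name] = self._counts.get(name, 0) + 1
        if self._counts[name] > self._max_calls.get(name, float("inf")):
            raise PermissionError(f"call limit exceeded: {name}")
        # Validate arguments before the tool ever executes.
        for field, rule in self._constraints.get(name, {}).items():
            value = kwargs.get(field)
            if isinstance(rule, list) and value not in rule:
                raise PermissionError(f"{field} not allowed: {value!r}")
            if isinstance(rule, str) and rule.startswith("<=") and float(value) > float(rule[2:]):
                raise PermissionError(f"{field} over limit: {value!r}")
        return self._tools[name](**kwargs)

def transfer_money(amount: float, currency: str) -> dict:
    return {"status": "ok", "amount": amount, "currency": currency}

gw = Gateway(
    [transfer_money],
    constraints={"transfer_money": {"amount": "<=1000", "currency": ["USD", "EUR"]}},
    max_calls={"transfer_money": 5},
)
print(gw.call("transfer_money", amount=250, currency="EUR"))   # allowed
# gw.call("transfer_money", amount=5000, currency="USD")  would raise PermissionError
```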
Why WASM?
We evaluated several isolation approaches:
| Approach | Startup | Portability | Isolation | Complexity |
|---|---|---|---|---|
| Subprocess | <1ms | High | None | Low |
| Docker | ~500ms | Medium | Good | High |
| Firecracker | ~125ms | Low | Excellent | Very High |
| gVisor | 50-100ms | Low | Good | High |
| WASM | <1ms | High | Good | Low |
WASM gives us:
- No syscalls — The runtime can’t access the network, filesystem, or any host resources except through explicit tool bindings (WebAssembly security model).
- Portability — One binary runs on macOS, Linux, Windows. No platform-specific containers.
- Fast startup — Microseconds, not milliseconds. Wasmtime’s instantiation time went from ~2ms to 5µs. Critical when you’re running thousands of agent executions.
The tradeoff: WASM is slower than native code for compute-heavy tasks. For agent workloads (mostly I/O-bound tool calls), the overhead is negligible.
Getting Started
```shell
pip install amla-sandbox
```

Minimal example:
```python
from amla_sandbox import create_sandbox_tool

def get_weather(city: str) -> dict:
    """Get current weather for a city."""
    return {"city": city, "temp": 72, "conditions": "sunny"}

sandbox = create_sandbox_tool(tools=[get_weather])

result = sandbox.run("""
const sf = await get_weather({ city: "San Francisco" });
const ny = await get_weather({ city: "New York" });
console.log(sf, ny);
""", language="javascript")
print(result)
```

The documentation covers everything from basic usage through production patterns.
What This Enables
When code execution is safe and cheap, new architectures become viable:
Data processing agents that write actual data pipelines instead of calling tools one row at a time.
Research agents that fetch, parse, filter, and synthesize information in a single execution—not a 50-step tool loop.
Agentic workflows where agents can branch and compose operations in code rather than being limited to single tool calls.
Open Source
The Python code is MIT licensed. The WASM binary is bundled with the package.
- Repository: github.com/amlalabs/amla-sandbox
- Documentation: amla-sandbox.readthedocs.io
- PyPI: pypi.org/project/amla-sandbox
We’re building the infrastructure that lets agents act safely. amla-sandbox is the execution layer. Try it, break it, tell us what’s missing.