Sandboxed execution: why it matters for “run code” tools

December 18, 2025 · 11 min read

The fastest way to make an agent feel powerful is to let it run code. It can transform data, test hypotheses, automate repetitive tasks, and bridge the gap between natural language and programmable systems. It can also turn a small product mistake into a severe security incident. The problem is not only malicious input. It is the ordinary unpredictability of models acting on behalf of users in environments that may contain sensitive files, network access, credentials, and side effects the team forgot to account for.

A “run code” feature should therefore start from the assumption that every execution request is untrusted, even when the user seems benign and the prompt looks harmless. Code generation frequently combines user-provided material, model inference, and hidden environment details. That mixture is exactly what makes it valuable and exactly what makes it risky. Sandboxing is the mechanism that allows experimentation without granting the model ambient authority over the rest of your system.

Isolation is what turns code execution from an incident waiting to happen into a bounded product capability.

Assume the execution environment will be probed

Whether intentionally or accidentally, executed code will test the boundaries of the environment. It will try to read from the filesystem, open network connections, inspect environment variables, and consume more CPU or memory than expected. Even simple scripts can behave badly because of infinite loops, large intermediate objects, package installation attempts, or overly broad file access patterns. Designing a sandbox therefore means designing for curiosity, mistakes, and abuse, not just for the happy path.

The first requirement is isolation from host secrets and host state. If code can access deployment credentials, shared disks, or unrestricted network routes, you have not built a sandbox; you have built a remote execution engine with a polite interface. Execution should happen in an ephemeral environment with a sharply constrained view of the world and no durable foothold after completion.

Minimum guarantees worth insisting on

  • An ephemeral filesystem that is discarded after the run completes.
  • Resource quotas for CPU, memory, wall-clock time, and process count.
  • Network controls that default to deny and allow only the destinations the workflow truly needs.
  • No host credentials, shell history, or persistent user tokens mounted into the environment.
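A minimal sketch of those guarantees in Python, assuming a POSIX host. The function name `run_sandboxed` is illustrative, and a real sandbox would enforce the boundary with a container or microVM rather than a plain subprocess; the point here is the shape of the contract, not its enforcement:

```python
import shutil
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout_s: int = 5) -> subprocess.CompletedProcess:
    # Ephemeral filesystem: a throwaway working directory, discarded afterwards.
    workdir = tempfile.mkdtemp(prefix="sandbox-")
    try:
        return subprocess.run(
            [sys.executable, "-c", code],
            cwd=workdir,
            env={"PATH": "/usr/bin:/bin"},  # no host credentials or tokens leak in
            timeout=timeout_s,              # wall-clock quota
            capture_output=True,
            text=True,
        )
    finally:
        shutil.rmtree(workdir, ignore_errors=True)  # no durable foothold
```

Even in this toy version, the guarantees map one-to-one onto the list above: ephemeral filesystem, a runtime quota, and a stripped environment.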

Treat network access as a product decision

Many teams focus on filesystem isolation and forget that outbound network access is often the more meaningful capability. With open egress, even a heavily restricted container can exfiltrate data, contact unexpected systems, or trigger external side effects. In some workflows, network access is required. In many others, it is only convenient. The safe default is to deny it entirely unless a concrete user task requires it and you can explain that requirement clearly.

This is where execution policy becomes product policy. A data-cleaning sandbox may need no network at all. A package-audit workflow may need access to a narrow allowlist of registries. A support debugging workflow may need internal API access but only through a proxy that enforces identity, logging, and request shaping. The point is not merely to toggle network on or off. It is to define which workflows need which routes and to make that explicit in design reviews.
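One way to make that policy explicit and reviewable is a per-workflow egress allowlist with default deny. The workflow names and hosts below are hypothetical, chosen to mirror the examples above:

```python
# Hypothetical workflow names and hosts, for illustration only.
EGRESS_ALLOWLIST: dict[str, set[str]] = {
    "data_cleaning": set(),                                  # no network at all
    "package_audit": {"pypi.org", "files.pythonhosted.org"},
    "support_debugging": {"internal-api-proxy.example"},     # proxy adds identity + logging
}

def egress_allowed(workflow: str, host: str) -> bool:
    # Default deny: unknown workflows get an empty allowlist.
    return host in EGRESS_ALLOWLIST.get(workflow, set())
```

Because the table is data rather than scattered firewall rules, it can sit in a design review next to the feature spec that justifies each route.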

# Example: bounded execution contract
python -c "print('hello sandbox')"
# runtime: 5s max
# memory: 256MB max
# network: denied by default
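The caps in a contract like this can be enforced mechanically rather than by convention. A Linux-only sketch using the standard `resource` module (the function name and defaults are assumptions; other platforms may lack `resource` or apply limits differently):

```python
import resource
import subprocess
import sys

def run_with_contract(code: str, timeout_s: int = 5, mem_mb: int = 256):
    def apply_limits():
        # Runs in the child just before exec: cap its address space.
        cap = mem_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (cap, cap))

    return subprocess.run(
        [sys.executable, "-c", code],
        timeout=timeout_s,           # runtime cap
        preexec_fn=apply_limits,     # memory cap
        capture_output=True,
        text=True,
    )
```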

Bound the work, not just the permissions

A surprising number of reliability issues in code execution are not about data theft at all. They are about runaway work. The agent writes a script that expands a dataset unexpectedly, starts a recursive process, or waits forever on a dependency. If your only protection is “the container is isolated,” users can still experience timeouts, stalled sessions, and noisy-neighbor problems across shared infrastructure. Resource limits are therefore as much a user experience feature as a security feature.

Practical sandboxes set CPU, memory, runtime, and output limits up front and communicate those limits back to the user. When a run exceeds a quota, the product should say so plainly. This matters because users often interpret silent termination or generic errors as model incompetence. A transparent explanation—execution exceeded the 5 second runtime cap, for example—gives them a path to refine the task instead of abandoning the workflow.
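Translating a quota violation into plain language can be as simple as catching the timeout and naming the cap. A sketch, with hypothetical message wording:

```python
import subprocess

def run_and_explain(cmd: list[str], timeout_s: int = 5) -> str:
    # Translate quota violations into plain language instead of a generic error.
    try:
        result = subprocess.run(cmd, timeout=timeout_s, capture_output=True, text=True)
    except subprocess.TimeoutExpired:
        return f"Execution exceeded the {timeout_s} second runtime cap; try narrowing the task."
    if result.returncode != 0:
        return f"Code exited with status {result.returncode}; see the error output."
    return "Completed."
```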

Make execution legible after the fact

Isolation alone is not enough. Operators and users need a record of what happened inside the sandbox. That does not require invasive tracing of every instruction, but it does require basic execution receipts: command invoked, runtime duration, resource usage summary, exit status, and high-level outputs or artifacts produced. Without this, debugging turns into a guess about whether the code failed, the environment blocked it, or the product lost the result on the way back to the interface.
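An execution receipt can be a small structured record emitted with every run. The field names here are illustrative, not a standard schema:

```python
import subprocess
import sys
import time
from dataclasses import dataclass

@dataclass
class ExecutionReceipt:
    command: str        # what was invoked
    duration_s: float   # wall-clock runtime
    exit_status: int    # how it ended
    stdout_bytes: int   # size of captured output

def run_with_receipt(code: str, timeout_s: int = 5) -> ExecutionReceipt:
    start = time.monotonic()
    result = subprocess.run(
        [sys.executable, "-c", code],
        timeout=timeout_s,
        capture_output=True,
    )
    return ExecutionReceipt(
        command=f"python -c {code!r}",
        duration_s=time.monotonic() - start,
        exit_status=result.returncode,
        stdout_bytes=len(result.stdout),
    )
```

A record this small is enough to distinguish "the code failed," "the environment blocked it," and "the result was lost in transit," which is exactly the ambiguity the paragraph above warns about.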

The receipt also supports trust. If the system says it analyzed a CSV file and generated a summary, the user should be able to inspect the generated artifact and understand what environment constraints were in place. A sandbox is not only a hidden security layer. It is part of the product contract around what code execution means and what it does not mean.

Users trust code execution more when the system can show its work and its limits, not when it simply says “done.”

The safest power feature is the one with real boundaries

Code execution can unlock extraordinary value, especially for technical users. But the feature only remains viable if the system around it treats isolation as a first-class requirement. That means ephemeral environments, strong resource controls, narrow network policy, no ambient credentials, and clear receipts. Anything less leaves too much to chance.

If your team is considering a “run code” tool, the design review should begin with the sandbox, not end with it. Once the boundaries are real, the product can responsibly explore where execution adds value. Without those boundaries, the feature is powerful in exactly the wrong way.
