AI Agent Code Execution API

Luis Héctor Chávez


Faris Masad


Lately, there has been a proliferation of new ways to leverage Large Language Models (LLMs) to do all sorts of things that were previously thought infeasible. But the current generation of LLMs still has limitations: they cannot reliably produce exact answers to questions that require specific kinds of reasoning (solving some math problems, for example); similarly, they cannot react to knowledge that falls outside both their training data and their context window (anything that happened after their training cutoff comes to mind). Despite these shortcomings, progress has not stopped: systems built around LLMs now augment their capabilities so that these weaknesses are no longer hard limits. We are now in an age where AI agents can interact with multiple underlying LLMs optimized for different aspects of a complex workflow. We are truly living in exciting times!

Code execution applications

LLMs are pretty good at generating algorithms in the form of code, and the most prominent application of that ability has been coding assistants. But a more significant use case that applies to everyone (not just software engineers) is the ability to outsource other kinds of reasoning. One way to do that is to express the reasoning as a sequence of instructions that solves a problem, which is pretty much the textbook definition of an algorithm. Currently, doing that at a production-level scale is challenging because leveraging LLMs' code generation capabilities for reasoning involves running untrusted code, which is difficult for most users to do safely. Providing an easy path for AI agents to evaluate code in a sandboxed environment, so that any accidents or mistakes are not catastrophic, will unlock all sorts of new use cases. And we already see the community building upon this idea in projects like open-interpreter.

Two options

But how should this sandbox behave? We have seen examples of multiple use cases. Google's Bard recently released "implicit code execution," which seems to be used primarily for math problems. The problem boils down to evaluating a function over a single input and returning the result. As such, it is inherently stateless and should be able to handle a high volume of requests at low latency.
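To make "stateless" concrete, here is a minimal sketch of the idea, not the actual server: each request runs in a fresh Python process, and nothing survives between calls. The name `eval_stateless` is hypothetical, and a plain subprocess is only a stand-in for a real sandbox, so this should not be pointed at untrusted code as written.

```python
import subprocess
import sys


def eval_stateless(code: str, timeout_s: float = 5.0) -> str:
    """Run `code` in a fresh Python process and return what it printed.

    Hypothetical helper for illustration. A subprocess is NOT a sandbox;
    it only demonstrates that no state is shared between requests.
    """
    result = subprocess.run(
        # -I runs Python in isolated mode (ignores environment variables
        # and the user's site-packages).
        [sys.executable, "-I", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.strip()


if __name__ == "__main__":
    # One input in, one result out.
    print(eval_stateless("print(sum(range(1, 101)))"))  # prints 5050
```

Because every call starts from scratch, a variable defined in one request is not visible in the next, which is what makes this style easy to scale horizontally.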

On the other hand, ChatGPT sessions could benefit from a more stateful execution, where there is a complete project with added files and dependencies, and outputs that can be fetched later. The project can then evolve throughout the session to minimize the amount of context needed to keep track of the state. With this use case, it's fine for the server to take a bit longer to initialize since the project will be maintained for the duration of the chat session.
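As a contrast with the stateless case, the following sketch shows what a stateful session could look like under simplifying assumptions: the "project" is just a temporary directory that persists across calls, so files written in one step can be read in a later one. The `ProjectSession` class and its methods are hypothetical, and there is again no real sandboxing here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


class ProjectSession:
    """Hypothetical stateful session backed by a temporary directory."""

    def __init__(self) -> None:
        self._tmp = tempfile.TemporaryDirectory()
        self.root = Path(self._tmp.name)

    def write_file(self, name: str, content: str) -> None:
        (self.root / name).write_text(content)

    def read_file(self, name: str) -> str:
        return (self.root / name).read_text()

    def run(self, filename: str, timeout_s: float = 10.0) -> str:
        """Run a script inside the project directory and return its stdout."""
        result = subprocess.run(
            [sys.executable, filename],
            cwd=self.root,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        if result.returncode != 0:
            raise RuntimeError(result.stderr.strip())
        return result.stdout

    def close(self) -> None:
        """Delete the project directory, ending the session."""
        self._tmp.cleanup()


SCRIPT = (
    "from pathlib import Path\n"
    "nums = [int(x) for x in Path('data.txt').read_text().split()]\n"
    "Path('total.txt').write_text(str(sum(nums)))\n"
)

if __name__ == "__main__":
    session = ProjectSession()
    session.write_file("data.txt", "1\n2\n3\n")  # step 1: add a file
    session.write_file("total.py", SCRIPT)       # step 2: add a program
    session.run("total.py")                      # step 3: run it
    print(session.read_file("total.txt"))        # step 4: fetch output later; prints 6
    session.close()
```

The slower setup cost (creating the project) is paid once, and every later step reuses the same directory instead of resending all of the context.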

Since we know that there are a lot of people with these requirements out there, we made a prototype of both of these approaches where the sandbox runs in the Replit infrastructure (since we already have the technology to run untrusted code). We're releasing one as a self-serve platform for the community to experiment with!

Code execution API

The first approach is a stateless API container server that is deployable through Replit Autoscale Deployments: https://replit.com/@luisreplit/eval-python.

Demo of py eval using agent code exec

You can easily customize the Docker container image you’ll use and add all your necessary dependencies. Requests are handled in as little as 100ms and use the omegajail unprivileged container sandbox. This solution works best for simple math evaluation using Python and can be easily integrated with your OpenAI application with GPT-3.5 and GPT-4 support for custom function invocation.

An example of the code-exec API

Here's a simple example that allows you to ask arbitrary math questions and get the final answer instead of evaluating the code yourself. You can integrate it into your code by installing the replit-code-exec package.

To set up your copy of the API server, follow these steps (the second is optional):

  • Open https://replit.com/@luisreplit/eval-python in your browser and fork it to your account.
  • (Optional): if you want to change the Docker container, run evalctl image ${DOCKER_IMAGE} (e.g. evalctl image python:3 or evalctl image replco/python-kitchen-sink:latest).
    • Open .replit and change the EVAL_FILENAME, EVAL_RUN_COMMAND, EVAL_ENV to suit the new container, if needed.
  • Deploy the Repl! (Just pressing Run is not enough.)
    • This is only compatible with Autoscale Deployments.
    • Make sure you set the EVAL_TOKEN_AUTH Deployment secret when doing so, since it is used for authentication.

Once the server is deployed, the math solver looks like this:

import openai
import replit_code_exec

# Build the client for your deployed server; the arguments are elided here.
code_exec = replit_code_exec.build_code_exec(...)

def solve_math(prompt: str, model: str = 'gpt-3.5-turbo-0613') -> str:
    completion = openai.ChatCompletion.create(
        model=model,
        temperature=0.7,
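        # Expose the code-exec function to the model and force it to call that
        # function instead of answering in prose.
        functions=[code_exec.openai_schema],
        function_call={"name": "code_exec"},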
        messages=[
            {
                "role": "system",
                "content": ("You are an assistant that knows how to solve math " +
                            "problems by converting them into Python programs."),
            },
            {
                "role": "user",
                "content": ("Please solve the following problem by creating a Python " +
                            "program that prints the solution to standard output using " +
                            "`print()`: " + prompt),
            },
        ],
    )
    return code_exec.from_response(completion)
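To make the last step of solve_math less opaque, here is a rough sketch of how the generated program can be pulled out of a function-call response. It is not the replit-code-exec implementation: `extract_code` is a hypothetical helper, the response is treated as a plain dict in the shape returned by the pre-1.0 openai SDK, and the argument name "code" is an assumption (the real schema is whatever code_exec.openai_schema defines).

```python
import json


def extract_code(completion: dict) -> str:
    """Return the Python source the model passed to the `code_exec` function.

    Hypothetical helper; assumes the function takes a single "code" argument.
    """
    call = completion["choices"][0]["message"]["function_call"]
    if call["name"] != "code_exec":
        raise ValueError(f"unexpected function: {call['name']}")
    # The SDK delivers the arguments as a JSON-encoded string, not a dict.
    args = json.loads(call["arguments"])
    return args["code"]


if __name__ == "__main__":
    # A hand-written stand-in for a model response.
    fake_completion = {
        "choices": [
            {
                "message": {
                    "role": "assistant",
                    "content": None,
                    "function_call": {
                        "name": "code_exec",
                        "arguments": json.dumps({"code": "print(6 * 7)"}),
                    },
                }
            }
        ]
    }
    print(extract_code(fake_completion))  # prints print(6 * 7)
```

The extracted source would then be sent to the deployed server for execution, and whatever the program prints becomes the answer.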

You can deploy this very cheaply for experimentation since you’re only charged for the time your Deployment actively uses CPU. For more information, check out the full docs in the GitHub repository.

Video of math solver being executed in Replit

Stateful agent environment

We've concluded this experiment and have decided not to support it at this time. For most use cases, code-exec should support your needs.

The second prototype is a more stateful one, and it uses a full Repl as the sandbox. With this, you can install any packages after creation, read and write files, and run arbitrary programs, giving your agent complete control over the execution environment. The Repl will automatically terminate after some period of inactivity. This prototype is more experimental, but you can read the documentation for more information.

The future

We will keep experimenting with new ways of augmenting the capabilities of LLMs: this is just the beginning for us. We wanted to release these tools because we know you'll build something amazing with Replit. We're hiring, so if this interests you, make sure to apply for one of our open positions.
