How Replit makes sense of code at scale

Gian Segato

The Data Team @ Replit

Data privacy and data security are among the most stringent constraints in the design of our information architecture. As mentioned in past blog posts, we only use public Repls for analytics and AI training: any user code that's not public, including everything from enterprise accounts, is never reviewed. And even for public Repls, all user code is anonymized and all PII removed before we train or run analyses.

For any company making creative tools, being able to tell what its most engaged users are building on the platform is critical. When the range of what can be created is effectively limitless, as it is with code, answering this deceptively simple question requires sophisticated data. For this reason, at Replit we built an infrastructure that leverages some of the richest coding data in the industry.

Replit data is unique. We store over 300 million software repositories, in the same ballpark as the world's largest code hosts like GitHub and Bitbucket. We have a deeply granular understanding of each project's timeline, thanks to a protocol called Operational Transformation (OT) and to the execution data logged when developers run their programs. The timeline is also enriched with error stack traces and LSP diagnostic logs, giving us full debugging trajectories. On top of all that, we have data describing each project's development environment and deployment data for its production systems.

It’s the world’s most complete representation of how software is made, at a massive scale, and constitutes a core strategic advantage. Knowing what our users are interested in creating allows us to offer them focused tools that make their lives easier. We can streamline certain frameworks or apps. Knowing, for instance, that many of our users are comfortable with relational databases persuaded us to improve our Postgres offering; knowing that a significant portion of our most advanced users are building API wrappers pushed us to develop Replit ModelFarm. We can also discover untapped potential in certain external integrations with third-party tools, like with LLM providers. It informs our growth and sales strategy while supporting our anti-abuse efforts. And, of course, it allows us to train powerful AI models.

Ultimately, this data provides us with a strong feedback loop to better serve our customers and help them write and deploy software as fast as possible.

While all this is true, making the best use of this dataset is no easy feat. Several challenges come with it, and all stem from the astronomical magnitude of the data. We store several petabytes of code alone on Google Cloud Storage, and our users edit on average 70 million files each day, writing more than 1 PiB of data to disk each month. That doesn't even count OT data and execution logs, both in the same ballpark for storage footprint and update frequency. Going from this gargantuan pile of raw programming files to answering the question "What is being built?", all while being mindful of user privacy, is, well, hard.

There are three fundamental challenges: writing the data, reading it, and finally making sense of it.

Reading and writing are engineering challenges. At their core, the problem is that the format, location, and layout of the data are constrained to begin with: it's optimized for runtime, not for post-hoc analysis. From a writing standpoint, the Replit filesystem has been designed to serve and run Repl containers fast, securely, and cheaply, not to make them easy to analyze. The ultimate stakeholder of the Repl FS is the user, not Replit employees.

Assuming the filesystem lives somewhere (something for the platform team to figure out), we had to deal with egress, network latency, and compute costs. The naive approach would be to spin up a Spark cluster. This is suboptimal for a few reasons: if your Spark instance doesn't support the same regions where your data lives, network latency is painful, and answering even basic questions can take hours. Moreover, handling millions of different filesystems and OT timelines as plain Python objects is cumbersome, and some things can't be done out of the box, like installing system-wide executables in the cluster. Dealing with loose files is also hard and expensive because nothing about the process is atomic: we need access to the whole FS, and being able to run executables against it would make our lives a lot easier.

None of this is impossible to overcome, but the biggest problem in operating a Spark cluster would ultimately be the developer experience. One of the design constraints we set for the project was that non-data teams at Replit interested in tapping into this trove of information should have to write, at most, simple Python, and in most cases just SQL. Reading, interacting with, and analyzing this data needed to be as accessible as possible across the whole company, from product managers to the sales team, and we didn't want to create a hard dependency on our data engineering team.

An alternative to Spark would be to pipe the data into an external specialized tool, like Elastic. The user experience of searching for things would be much better. However, it would also mean sacrificing a lot of flexibility in the kinds of questions we can ask, which are fuzzy and require more than just string matching. On top of that, specialized tools are inherently redundant, and thus expensive: you're essentially replicating your original dataset in another format on another service, doubling storage and egress, and increasing the surface area to monitor for user privacy.

All this is only half the challenge.

Here comes the third problem: making sense of the data.

Assuming we can read tens of millions of files each day, run arbitrary code to analyze them, and make the experience simple enough that a bit of scrappy Python yields the first insights, we still have to determine how to classify and then summarize what we're reading, which is fundamentally a product challenge more than an engineering one.

Code is unstructured data. There's no "average" function for projects with 150+ programming files, and aggregating their signal at scale is hard for a variety of reasons. For one, how do you do it while preserving privacy expectations? What's the most expressive yet concise taxonomy? Use cases are extremely hard to pin down because you can go as deep as you need or want to. You could classify users macroscopically, for instance, by saying that many users want to build "web apps". But that's not very useful, because it's not actionable: are those users prototyping an AI backend, or brainstorming ideas for an e-commerce client? And you can go even further: does the e-commerce site have user authentication or not? Does it fetch its data from a NoSQL database or a relational one? Is it a full-stack JavaScript app, a serverless backend, or just a frontend? Does it have an ML stack of some kind? What's the developer's experience level? What are their pre-existing skill sets?

These are all very important questions when the final goal is orienting the product strategy, running a growth experimentation program, qualifying leads, identifying abuse threats, closing partnerships, optimizing Replit's infrastructure, and training new AI models. We need both granularity and aggregation.

"But wait, there's a solution!", you may say. One of the core innovations of the AI wave is that very powerful LLMs let us loosely define a mapping from unstructured data to structured, tabular data. The primary issue is that indiscriminately sending 100% of the Repls' contents to a third-party LLM is not something most folks would be comfortable with, privacy-wise. And then there's the cost: according to some napkin math, running a decently smart LLM over the almost two billion public file edits each month would cost us hundreds of millions of dollars, which we don't think we can afford (just yet).

To summarize: we have a humongous number of bytes that take a very long time to read, that have an internal structure we can't easily parse, and that we need to make sense of according to vaguely defined questions.

That’s why we built something custom.

Introducing Backer

The first thing we did was split the problem into two very high-level categories: timeline data and static code data.

The two operate at very different time scales. Timeline data is a microsecond-level description of virtually everything happening while a user is programming: it comprises execution logs, LSP diagnostics, and OT actions, and it describes the fine-grained mechanics of building software. As such, its main use is as training data for our AI models, like Code Repair.

Static code data, instead, operates on the scale of days. It's the evolution of what our users' code looks like over time, how their files change as projects progress. It carries signals about user intent, and its main purpose is to inform everything tied to that intent, which is virtually everything: product strategy, sales, marketing, growth channel optimization, anti-abuse, infra, and so on.

We decided to keep the timeline data for AI in the Spark ecosystem (specifically, Databricks), with ad-hoc GCP cloud functions spun up only when needed. First, the AI team is perfectly comfortable with Spark; second, the data doesn't need to stay fresh all the time, so a dependency on our data engineering stack is fine; and finally, Databricks offers the regional support we care about for latency reasons. This repository is connected to our internal warehouse via dbt to make sure we only use public data. We described the architecture and nature of that particular training set in its own blog post here.

That left us to deal with coding data, which meant interacting with the Repls' filesystems.

Last year, to overcome the mounting limitations of off-the-shelf solutions, the Replit platform team launched a custom filesystem called Margarine. The ins and outs of that system are well described in another blog post (here), but for our purposes the key thing to understand is that Margarine acts as a smart caching layer: it moves blocks of bytes between GCS, a local SSD, and memory based on usage patterns. The fresher the data needs to be, the higher the probability that it lives toward the memory end of the stack: it's like a glorified multi-petabyte cache. This approach allows for fast boot times and efficient access to large amounts of data without having to load entire Repls all at once.

This filesystem's architecture is already inherently distributed, and it already guarantees access to the freshest version of what users are building. So here came the light-bulb moment: what if we built an ETL layer directly on top of Margarine?

As it turns out, the Replit platform team had already built something to that effect: Backer.

Backer is a service living in the Margarine orbit that is responsible for backing up the latest state of users' projects in case anything goes wrong. When a user modifies their Repl's filesystem, a chain reaction begins: the kernel detects the change, Btrfs processes it, and Margarine eventually persists it to stable storage, typically within about a minute. Once the changes are persisted, Margarine signals Backer via Pub/Sub, which then stores these updates in cold storage as a backup.

What we did was simply extend this backup service to also run an arbitrary number of Docker-wrapped scripts every time the user changes something in their project.

To make this work, Backer mounts a volume pointing to the updated Repl's filesystem when it runs the Docker extraction image, doing so within 24 hours of the update in order to debounce events. The developer experience is nice: from the script's point of view, the contents of the project are just a folder. In case fancier things are needed, this is a fully configurable Docker container, so it's as flexible as it can get. Backer also injects an environment variable containing a GCP access token: the image can use it to securely authenticate with any Google Cloud service (mostly BigQuery) and pipe out the scan results while retaining internal user permissions. To do so, though, Backer needs its own Pub/Sub message queue: we tried writing results directly to GCS and BQ, but we hit rate limits almost immediately due to the sheer frequency of updates.
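To make this concrete, here's a minimal sketch of what one of these extraction scripts could look like. The mount path, environment variable names, and Pub/Sub topic below are hypothetical placeholders, not Backer's actual interface; the point is just that the script sees a plain folder, authenticates with the injected token, and ships a small result payload out via Pub/Sub.

```python
import json
import os
from pathlib import Path

from google.cloud import pubsub_v1
from google.oauth2.credentials import Credentials

# Hypothetical names: Backer mounts the Repl at a fixed path and injects
# a short-lived access token plus identifiers as environment variables.
REPL_ROOT = Path(os.environ.get("REPL_MOUNT_PATH", "/repl"))
TOKEN = os.environ["GCP_ACCESS_TOKEN"]
REPL_ID = os.environ["REPL_ID"]
RESULTS_TOPIC = "projects/replit-analytics/topics/backer-scan-results"  # illustrative

def scan(root: Path) -> dict:
    """Walk the mounted Repl and compute a tiny summary row."""
    files = [p for p in root.rglob("*") if p.is_file()]
    return {
        "repl_id": REPL_ID,
        "file_count": len(files),
        "extensions": sorted({p.suffix for p in files if p.suffix}),
    }

def main() -> None:
    # Publish the result to a Pub/Sub topic; downstream batching moves it into
    # BigQuery (writing to BQ directly would hit rate limits at this frequency).
    publisher = pubsub_v1.PublisherClient(credentials=Credentials(token=TOKEN))
    payload = json.dumps(scan(REPL_ROOT)).encode("utf-8")
    publisher.publish(RESULTS_TOPIC, data=payload).result()

if __name__ == "__main__":
    main()
```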

The system is elegant for a few reasons. First, it decouples responsibilities very nicely. The platform team doesn't need to care about the particular schema or shape of the data: it just exposes a transparent, secure, and private way to access the files (enterprise data excluded), run some custom logic, and save the results. Second, it's distributed by design. Infra also takes care of spinning the fleet of GCP machines up and down as needed and balancing the load across regions. This effectively reduced the compute, latency, and egress bottlenecks to zero, because all of them are already absorbed by the rest of our infrastructure.

The data engineering team decided to use a pre-existing Airflow project to build and deploy the Docker images via Continuous Integration. Extracting custom data from any recently edited project is as simple as pushing a new folder containing a simple main.py to the Data Platform project.

Great. We could read the data.

At this point, the question became: but what data?

Creating a taxonomy for code

Now that we had a scalable way to read coding files, we needed a method that was both efficient and effective at making sense of them.

As mentioned above, just piping everything through an LLM would be economically infeasible. We could retrieve the most relevant chunks of each project and use them for in-context learning, but different teams have different questions, and building an entire Retrieval-Augmented Generation (RAG) system outside of the Replit workspace would have nullified the benefits of Backer. Most importantly, not all questions can be efficiently addressed with chunking and LLMs. Take the question "How many users working on multiplayer projects are building sophisticated web applications?". There's a lot going on here: the notion of "multiplayer user", the notion of "sophisticated", and finally that of "web app".

We could randomly sample a few projects to classify (say, 1,000) and let an LLM take care of them, but that keeps the underlying distribution intact, which makes the economics even worse for rare segments. Say we want to tag only a niche set of behaviors representing 1% of the total population (e.g., the L120 cohort, or all projects importing a certain package): blind sampling would require roughly 100x more tokens than first evening out the distribution and tagging that 1% alone.

Given all these considerations, we came up with a Progressive Classification design, whereby we use progressively more precise (and expensive) filters the deeper we get.

It starts with connecting the Backer results to our dbt data ecosystem. Technically, this was very easy: just a matter of injecting an environment variable containing the project ID into the Backer Docker image and reporting it with every extraction. But the implications were huge. It meant we could correlate coding data with behavioral data, isolating project segments just by looking at user behavior around coding.

Moving deeper, we then extract critical metadata from the actual files. We divided this effort into three large buckets: general statistics, code-specific statistics, and high-signal bits in code. For the first bucket, we started saving things like lines of code, the distribution of file extensions, and the file tree (with some limitations: some projects are incredibly large). This alone is enough to answer questions like "Are users with CSV files in their projects more or less likely to convert?". It's all basic string matching and ASCII counting. No LLM needed!
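As a rough illustration, general statistics of this kind need nothing more than the standard library. The mount path and the cap on very large projects below are assumptions for the sketch, not our actual thresholds.

```python
from collections import Counter
from pathlib import Path

def general_stats(root: Path, max_files: int = 10_000) -> dict:
    """Cheap, LLM-free project statistics: file counts, extension mix, LOC."""
    extensions: Counter[str] = Counter()
    total_lines = 0
    file_tree: list[str] = []

    for i, path in enumerate(p for p in root.rglob("*") if p.is_file()):
        if i >= max_files:  # guard against incredibly large projects
            break
        extensions[path.suffix or "<none>"] += 1
        file_tree.append(str(path.relative_to(root)))
        try:
            total_lines += sum(1 for _ in path.open(errors="ignore"))
        except OSError:
            pass  # unreadable files are simply skipped

    return {
        "file_count": len(file_tree),
        "lines_of_code": total_lines,
        "extension_distribution": dict(extensions),
        "file_tree": file_tree,
    }

# e.g. general_stats(Path("/repl")) against the mounted Repl volume
```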

Code-specific statistics involve more powerful tools, like parsing Abstract Syntax Trees (ASTs) and running linters. An AST is a high-level representation of program logic. It lets us track things like the number of variables defined in the code, the number of methods, and cyclomatic complexity. This metadata is extremely powerful for segmenting coders: a developer in the bottom decile for number of methods defined in their code, for instance, is overwhelmingly likely to be a beginner. We found that training classifiers on such tabular data is powerful, precise, and very cheap compared to more sophisticated deep-learning approaches. As it turns out, there's still room for old-school machine learning in modern stacks! BigQuery's embedded ML stack has proven handy here.
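For Python files, the standard ast module is enough to pull out this kind of signal. The sketch below is illustrative rather than our exact feature set: it counts function definitions and assignments, and approximates cyclomatic complexity in the usual McCabe style (one plus the number of branching constructs).

```python
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.BoolOp, ast.ExceptHandler, ast.IfExp)

def ast_stats(source: str) -> dict:
    """Structural statistics for a single Python source file."""
    tree = ast.parse(source)
    functions = assignments = 0
    complexity = 1  # McCabe-style approximation: start at 1, +1 per branch point
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions += 1
        elif isinstance(node, (ast.Assign, ast.AnnAssign, ast.AugAssign)):
            assignments += 1
        elif isinstance(node, BRANCH_NODES):
            complexity += 1
    return {
        "num_functions": functions,
        "num_assignments": assignments,
        "cyclomatic_complexity": complexity,
    }
```

Features like these feed directly into the tabular classifiers mentioned above: each file becomes a handful of numeric columns instead of thousands of tokens.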

Among code-specific information, there’s a special place for external dependencies. They’re very hard to pin down precisely, due to dependency hell, but if done correctly they yield a tremendous amount of signal about user intent.

Since we operate in a Nix environment, system dependencies (for example, a system-installed Chrome driver used for web-scraping jobs) were fairly easy to extract by simply parsing the replit.nix file. Language-specific packages (like scikit-learn, express, or openai) were a different story. Fully describing all the nuances of getting from coding files to structured external packages would take too long, but the high-level strategy has been taking the intersection of manually parsed import statements in code and formally declared requirements in metadata files like pyproject.toml and package.json.
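As a simplified sketch of that intersection for the Python side (assuming Python 3.11+ for tomllib and a Poetry-style pyproject.toml, and deliberately ignoring the import-name vs. package-name mismatches, e.g. sklearn vs. scikit-learn, that make this hard in practice):

```python
import ast
import tomllib  # Python 3.11+
from pathlib import Path

def imported_modules(root: Path) -> set[str]:
    """Top-level module names actually imported in the project's .py files."""
    modules: set[str] = set()
    for py in root.rglob("*.py"):
        try:
            tree = ast.parse(py.read_text(errors="ignore"))
        except SyntaxError:
            continue  # half-written files are common; skip them
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                modules.add(node.module.split(".")[0])
    return modules

def declared_dependencies(root: Path) -> set[str]:
    """Dependencies formally declared in pyproject.toml (Poetry layout assumed)."""
    pyproject = root / "pyproject.toml"
    if not pyproject.exists():
        return set()
    data = tomllib.loads(pyproject.read_text())
    deps = data.get("tool", {}).get("poetry", {}).get("dependencies", {})
    return {name.lower() for name in deps if name.lower() != "python"}

def external_packages(root: Path) -> set[str]:
    # The intersection drops both unused declarations and stdlib imports;
    # real-world code also needs an import-name -> package-name mapping.
    return imported_modules(root) & declared_dependencies(root)
```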

Finally, we looked at very specific strings in the code. Some of them are far more efficient signals of intent than chunking entire portions of the code, as typical RAG systems do; you don't need that many tokens. The final output of this extraction is an extremely compact representation of a project that expresses most of the developer's intent in at least three orders of magnitude fewer tokens than the code itself. We found that the call graph and variable names alone, along with file names and directory structure, retain most of the predictive power while using <0.1% of the token count.
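To give a flavor of what such a compact representation might contain (the exact fields we extract aren't spelled out here, so treat this as an illustrative sketch): per file, just the function names, a capped set of identifiers, and simple call-graph edges.

```python
import ast
from pathlib import Path

def compact_summary(py_file: Path) -> dict:
    """A token-frugal description of one file: names and call-graph edges only."""
    tree = ast.parse(py_file.read_text(errors="ignore"))
    functions, identifiers, call_edges = [], set(), []

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(node.name)
            # Record "caller -> callee" edges for simple name calls.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    call_edges.append((node.name, inner.func.id))
        elif isinstance(node, ast.Name):
            identifiers.add(node.id)

    return {
        "file": py_file.name,
        "functions": functions,
        "identifiers": sorted(identifiers)[:50],  # cap to keep the payload tiny
        "call_edges": call_edges,
    }
```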

Only at the very end is this compressed description handed to an (internally hosted, private, purposely fine-tuned) LLM for the most elusive questions, like "What's the final purpose of this web app?" or "How many years of experience does the user who wrote this have?".

It's a progression that goes deeper and deeper, adding more sophisticated filters at each step: it starts with behavioral information, moves on to string matching, then to AST and LSP parsing, and reaches for LLMs only when needed, when all the prior information can't answer the question at hand.
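Conceptually, the cascade looks something like the sketch below: each stage is cheaper than the next and prunes the set of projects that ever reach the expensive ones. The stage names, predicates, and costs are stand-ins for the extractors described above, not our actual pipeline code.

```python
from typing import Callable, Iterable

# Each stage: (name, predicate, relative cost). The predicates stand in for
# behavioral filters, string matching, AST features, and an LLM call.
Stage = tuple[str, Callable[[dict], bool], float]

def progressive_classify(projects: Iterable[dict], stages: list[Stage]) -> list[dict]:
    """Run cheap filters first; only survivors pay for the expensive stages."""
    survivors = list(projects)
    for name, keep, cost in stages:
        print(f"{name}: scanning {len(survivors)} projects (relative unit cost ~{cost})")
        survivors = [p for p in survivors if keep(p)]
    return survivors

# Illustrative use: narrow down to "sophisticated multiplayer web apps".
stages: list[Stage] = [
    ("behavioral", lambda p: p.get("is_multiplayer", False), 0.0001),
    ("string-match", lambda p: "package.json" in p.get("file_tree", []), 0.001),
    ("ast-features", lambda p: p.get("num_functions", 0) > 20, 0.01),
    ("llm", lambda p: p.get("llm_label") == "sophisticated_web_app", 1.0),
]
```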

This architecture allows us to get powerful insights on what’s being built on Replit, and how we can better serve our users in their endeavors. We get to do this without spending unreasonable sums on GPU inference, while still retaining all the critical precision we need to make good decisions.

Conceptually, it's roughly the same idea as introducing inductive bias into a model architecture: we can afford to make strong assumptions about the task at hand (in this case, understanding the user intent behind code), which informs the architecture design and lets us save on both data and compute costs, so we can build a better product, and ultimately a better business.

Every tech company knows there's gold to mine in its data, but figuring out how to extract and structure it efficiently when dealing with petabytes of unstructured information is challenging. We hope our journey gives you some inspiration for doing this in your own products, especially if you're thinking of using AI at scale. Progressive classification is a key component of our architecture that generalizes to other industries and domains: by starting with simple metadata, moving to more complex analyses only when necessary, and creating compact representations of complex data before applying expensive AI techniques, it's possible to balance cost and depth of insight. By applying these principles, you can turn your data challenges into strategic advantages, just as we did at Replit.

Work at Replit

Come work with us to solve technically challenging and highly ambitious problems like these, and enable the next billion software creators online.
