Dec 12 Incident Update: Secrets and Static Hosting Unavailable

Luis Héctor Chávez

Between Dec 12 and Dec 16, Secrets in interactive Repls and files in our legacy static hosting were unavailable. The root cause was a configuration push to Google Cloud Storage (GCS) that was misinterpreted as a request to evict all files from storage. Deployments and Secrets in Deployments were unaffected. We have since recovered all known user Secrets and instituted new data retention procedures in our storage systems to keep this issue from recurring and to allow faster recovery going forward. This post summarizes what happened and what we're doing to improve Replit so that this does not happen again.

Technical details

Here is the timeline of what happened:

  • On Dec 11, we pushed an update to our Terraform configuration for the Google Cloud Storage lifecycle of several buckets. This included a bucket where Secrets, legacy static hosting, and some Extensions are stored. This was done in an attempt to automatically delete any noncurrent object to save space. These buckets had object versioning enabled beforehand, and the update inadvertently set the daysSinceNoncurrentTime field to zero.
  • Because the update did not also explicitly set isLive to false, setting this field to zero caused the request to be interpreted as equivalent to setting age to zero, which matches every object in the bucket. This behavior was unexpected and is tracked in a GitHub issue against the underlying Terraform provider.
  • We quickly discovered the issue, and between Dec 12 and Dec 16 a recovery process was able to recover all known user Secrets.
  • For any user who re-entered their Secrets during the recovery process, we made sure not to overwrite their newly entered data.
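To make the difference between the intended rule and the interpreted rule concrete, here is a minimal Python sketch. It is a simplified model of how lifecycle conditions are combined (every condition present in a rule must hold), not Google Cloud Storage's actual implementation. The object names, the seven-day threshold, and the helper functions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoredObject:
    name: str
    is_live: bool
    age_days: int
    # Days since the object became noncurrent; None while it is still live.
    days_noncurrent: Optional[int] = None


def matches(condition: dict, obj: StoredObject) -> bool:
    """Return True if every condition present in the rule holds for obj."""
    if "age" in condition and obj.age_days < condition["age"]:
        return False
    if "isLive" in condition and obj.is_live != condition["isLive"]:
        return False
    if "daysSinceNoncurrentTime" in condition:
        threshold = condition["daysSinceNoncurrentTime"]
        if obj.days_noncurrent is None or obj.days_noncurrent < threshold:
            return False
    return True


def deleted(condition: dict, objects: List[StoredObject]) -> List[str]:
    """Names of the objects a Delete rule with this condition would match."""
    return [o.name for o in objects if matches(condition, o)]


# Hypothetical bucket contents: one live object and one old version.
objects = [
    StoredObject("secrets/current", is_live=True, age_days=30),
    StoredObject("secrets/old-version", is_live=False, age_days=90,
                 days_noncurrent=10),
]

# What a "delete noncurrent objects" rule might look like (assumed values).
intended = {"isLive": False, "daysSinceNoncurrentTime": 7}
# What the request was effectively interpreted as, per the timeline above.
as_interpreted = {"age": 0}

print(deleted(intended, objects))
print(deleted(as_interpreted, objects))
```

In this model the intended rule matches only the old version, while the interpreted rule matches both objects, because every object is at least zero days old and nothing restricts the rule to noncurrent versions.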

What are the next steps for Replit?

The root cause of this incident lay in the process and validation applied to our Infrastructure-as-Code states before they are pushed to production. We have improved this going forward. Additionally, we are taking several actions:

  • We're adding static validation of our Terraform states to ensure that they always have object versioning enabled and a daysSinceNoncurrentTime setting of at least two, so that we do not experience this issue in the future.
  • We're going to improve our storage of Secrets for better isolation and backup performance.
  • We're auditing our Google Cloud Storage buckets more broadly to make sure that each contains a narrower scope of objects, limiting the impact of any future incident.
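As an illustration of the kind of static validation described in the first item above, here is a minimal Python sketch. It is not our actual tooling. It assumes each bucket's configuration has already been loaded into a dictionary that uses the Google Cloud Storage JSON field names mentioned in this post; a real check would need to map these from the names used in the Terraform plan or state. The validate_bucket function, the example buckets, and the extra isLive check are assumptions made for illustration.

```python
from typing import List

# Minimum allowed daysSinceNoncurrentTime, per the first action item above.
MIN_NONCURRENT_DAYS = 2


def validate_bucket(name: str, bucket: dict) -> List[str]:
    """Return a list of human-readable problems found in one bucket config."""
    errors = []

    if not bucket.get("versioning", {}).get("enabled", False):
        errors.append(f"{name}: object versioning is not enabled")

    rules = bucket.get("lifecycle", {}).get("rule", [])
    for i, rule in enumerate(rules):
        if rule.get("action", {}).get("type") != "Delete":
            continue
        condition = rule.get("condition", {})

        days = condition.get("daysSinceNoncurrentTime")
        if not isinstance(days, int) or days < MIN_NONCURRENT_DAYS:
            errors.append(
                f"{name}: rule {i} needs daysSinceNoncurrentTime >= "
                f"{MIN_NONCURRENT_DAYS}, got {days!r}"
            )

        # Assumption: also require isLive to be explicitly false, so that a
        # Delete rule can never match live objects.
        if condition.get("isLive") is not False:
            errors.append(f"{name}: rule {i} must set isLive to false")

    return errors


# Hypothetical configurations used only to exercise the validator.
good_bucket = {
    "versioning": {"enabled": True},
    "lifecycle": {"rule": [{
        "action": {"type": "Delete"},
        "condition": {"isLive": False, "daysSinceNoncurrentTime": 7},
    }]},
}
bad_bucket = {
    "versioning": {"enabled": True},
    "lifecycle": {"rule": [{
        "action": {"type": "Delete"},
        "condition": {"daysSinceNoncurrentTime": 0},
    }]},
}

for problem in validate_bucket("example-bucket", bad_bucket):
    print(problem)
```

A check like this could run before a configuration is pushed and block the push whenever the returned list is not empty.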

What if my Repl is still affected?

If, for some reason, your Repl is still affected, you can email [email protected] with a link to your Repl(s).
