Engineering · 6 min read

Your Rewrite Will Fail (And You Should Do It Anyway)

There's a famous Joel Spolsky post from the year 2000 — "Things You Should Never Do" — where he argues that rewriting software from scratch is the single worst strategic mistake a company can make. Netscape did it. It killed them.

He's right. Most rewrites fail. They take twice as long as estimated, the new system ships with half the features of the old one, and somewhere around month eight the team realizes they've been rebuilding institutional knowledge that was encoded in 10,000 lines of "ugly" code they didn't understand.

And yet.

The Codebase That Time Forgot

I got brought into a company last year — mid-size SaaS, decent revenue, growing team. The product worked. Customers paid for it. By every business metric, things were fine.

Then I looked at the code.

The primary application was a Rails monolith that had been "temporarily" running on a single EC2 instance since 2019. Deploys took 45 minutes and required a Slack message to the team saying "don't push anything for the next hour." The test suite took 90 minutes to run, so nobody ran it. The database had 340 tables, and the engineer who understood the most critical 40 of them had left two years ago.

Every new feature took 3-4x longer than it should have. Every bug fix introduced two more. The team was demoralized. Three engineers had quit in the last year, all citing the codebase in their exit interviews.

This is the part Joel's post doesn't cover: what happens when the cost of not rewriting exceeds the cost of rewriting.

The Honest Math

Here's how I think about it. A rewrite is a bad idea when:

- The existing system works and can be incrementally improved
- You're rewriting because the tech is "old" (old is fine, unmaintainable is not)
- The team doing the rewrite doesn't deeply understand the existing system
- There's no clear forcing function (scaling wall, security hole, regulatory deadline)

A rewrite starts making sense when:

- You've lost more than two senior engineers in a year who cited the codebase
- Feature velocity has dropped below 30% of where it was two years ago
- You're spending more time on workarounds than on actual features
- The architecture fundamentally cannot support where the business needs to go

That SaaS company hit all four. The rewrite wasn't optional — it was the only path that didn't end with the product dying slowly.

How to Not Screw It Up

We did the rewrite. It took seven months (we estimated five — classic). But it worked, because we followed a few rules:

Strangle, don't replace. We didn't go dark for seven months and emerge with a new system. We built the new platform behind a feature flag and migrated functionality piece by piece. The old system kept running. Customers never noticed the transition.
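If "behind a feature flag" sounds abstract, here's a rough sketch of the shape. This is not our actual routing code: StranglerRouter, the handler names, and the "invoice" feature are all made up for illustration, and a real setup would usually keep the flags in a config store rather than in memory. The idea is just that every request goes through one switch, and each feature flips independently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

Handler = Callable[[dict], str]


@dataclass
class StranglerRouter:
    """Sends each feature to the legacy or the new handler based on a flag set."""

    legacy: Dict[str, Handler]
    replacement: Dict[str, Handler] = field(default_factory=dict)
    migrated: Set[str] = field(default_factory=set)  # features flagged onto the new system

    def migrate(self, feature: str) -> None:
        # Refuse to flip a flag for something the new system cannot serve yet.
        if feature not in self.replacement:
            raise KeyError(f"no new implementation registered for {feature!r}")
        self.migrated.add(feature)

    def rollback(self, feature: str) -> None:
        # Flipping the flag back returns traffic to the legacy path.
        self.migrated.discard(feature)

    def handle(self, feature: str, request: dict) -> str:
        table = self.replacement if feature in self.migrated else self.legacy
        return table[feature](request)


# Hypothetical handlers standing in for real endpoints.
def legacy_invoice(request: dict) -> str:
    return f"legacy invoice for {request['customer']}"


def new_invoice(request: dict) -> str:
    return f"new invoice for {request['customer']}"


router = StranglerRouter(
    legacy={"invoice": legacy_invoice},
    replacement={"invoice": new_invoice},
)
```

Calling router.migrate("invoice") moves that one feature to the new handler, and router.rollback("invoice") moves it back. The rollback path is the part that matters: a bad migration costs you a flag flip, not an outage.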

Steal the tests, not the code. The old test suite was slow and flaky, but it encoded years of edge cases. We used those test cases as the spec for the new system. Every test we ported was a bullet we didn't have to dodge later.
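One way to treat old behavior as the spec is a parity check: feed the same inputs to both implementations and list where they disagree. The sketch below is an illustration, not our real harness. It assumes the legacy code can still be called directly, and the discount functions and cases are invented to show how a forgotten edge case surfaces.

```python
from typing import Any, Callable, Iterable, List, Tuple

Case = Tuple[str, Any]  # (case name, input value)


def find_mismatches(
    cases: Iterable[Case],
    legacy: Callable[[Any], Any],
    replacement: Callable[[Any], Any],
) -> List[str]:
    """Return the names of cases where the new code disagrees with the old."""
    mismatches = []
    for name, value in cases:
        expected = legacy(value)  # the old behavior is the spec
        try:
            actual = replacement(value)
        except Exception as exc:  # a crash in the new code counts as a mismatch
            actual = exc
        if actual != expected:
            mismatches.append(name)
    return mismatches


# Hypothetical legacy rule: 10% off large orders, and non-positive totals clamp to zero.
def legacy_discount(total_cents: int) -> int:
    if total_cents <= 0:
        return 0
    return total_cents * 90 // 100 if total_cents >= 10_000 else total_cents


# Hypothetical naive port that forgets the clamp.
def new_discount(total_cents: int) -> int:
    return total_cents * 90 // 100 if total_cents >= 10_000 else total_cents


CASES = [("small order", 500), ("large order", 20_000), ("refund", -300)]
```

Here find_mismatches(CASES, legacy_discount, new_discount) returns ["refund"], which is exactly the kind of ghost nobody remembers until a customer finds it.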

Staff it with people who hate the old system. Not people who want to build something shiny — people who deeply understand what's broken and are angry about it. That anger is fuel, and their knowledge of the old system's ghosts keeps the new one honest.

Set a time box and cut scope ruthlessly. Seven months. If it's not in the new system by then, it doesn't exist. We killed about 30% of the old feature surface — features that analytics showed nobody used. The product got smaller and better.

The Result

Deploys went from 45 minutes to 3. Test suite from 90 minutes to 8. Feature velocity tripled in the first quarter after migration. They've hired four engineers since, and none of them have to learn why there's a column called legacy_flag_do_not_use_2 in the users table.

Joel was right that most rewrites fail. But "never rewrite" is advice for the average case, and if you're reading this, your codebase probably isn't average. It's probably on fire.

Just don't kid yourself about the timeline.