sketch blog: Our first outage from LLM-written code

We had a series of mini-outages at sketch.dev on July 15th, caused by LLM-written code.

Timeline

A deployment initially appeared stable. Some time afterwards, CPU spiked, and the service gradually slowed to a crawl.

Profiles pointed to some absurdly complicated SQL queries that were doing multiple full table scans. The hypothesis was that they had reached a tipping point at which the database could no longer keep up. Either way, those queries needed reform.

We fixed the queries and deployed. The same thing happened: initial stability followed by a downward spiral.

A bit of discussion indicated that the trigger for the CPU spikes both times was our CEO logging in. We re-deployed to get a clean start, permanently banned him from the service, and moved on.

Profiles continued to indicate database contention, but that no longer seemed plausible. Walking up the stack, there was one infrequent code path responsible for triggering all of those queries. This code path had recently been refactored. We reverted the refactoring, deployed, un-banned the CEO, and set about analysis.

Proximate cause

We (obviously) use Sketch to build Sketch. This refactoring was done by an LLM, and then reviewed by a human. The bug was in a large chunk of code that moved from one file to another.

Before the move, it looked like this:

for {
	repos, repoResp, err := ghClient.Apps.ListUserRepos(ctx, *installation.ID, repoOpt)
	if err != nil {
		// Log error but continue with other installations
		log.Printf("Error fetching repositories for installation %d: %v", *installation.ID, err)
		break
	}
	// ...
}

After the move, it looked like this:

for {
	repos, repoResp, err := ghClient.ListUserRepos(ctx, *installation.ID, repoOpt)
	if err != nil {
		// Log error but continue with other installations
		log.Printf("Error fetching repositories for installation %d: %v", *installation.ID, err)
		continue
	}
	// ...
}

The break became a continue, turning errors into infinite loops.
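To make the failure mode concrete, here is a minimal sketch of the same loop shape. It is not the Sketch code: fetchPage, errUnavailable, and countCalls are hypothetical names, fetchPage always fails to stand in for a persistent API error, and maxCalls exists only so the demonstration terminates.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable stands in for a persistent upstream failure (hypothetical).
var errUnavailable = errors.New("upstream unavailable")

// fetchPage is a hypothetical stand-in for the paginated API call.
// It always fails, like a request that keeps returning the same error.
func fetchPage() error { return errUnavailable }

// countCalls runs a pagination-style loop and reports how many calls it made.
// maxCalls is a safety cap so the demonstration terminates; the real loop had none.
func countCalls(retryOnError bool, maxCalls int) int {
	calls := 0
	for calls < maxCalls {
		calls++
		if err := fetchPage(); err != nil {
			if retryOnError {
				continue // re-enters the loop with nothing changed
			}
			break // gives up after the first error
		}
		break // success: real code would advance to the next page or stop
	}
	return calls
}

func main() {
	fmt.Println("break:   ", countCalls(false, 1000)) // 1 call
	fmt.Println("continue:", countCalls(true, 1000))  // 1000 calls, i.e. until the cap
}
```

With break, a persistent error costs one call. With continue, nothing in the loop state changes between iterations, so the same failing call repeats until something outside the loop stops it.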

Root causes

The change was small, and it was buried in a much larger code move, so we didn’t notice it during code review.

We as an industry could use better tooling on this front. Git will detect move-and-change at the file level, but not at the patch hunk level, even for pretty large hunks. (To be fair, there are API challenges.)

It’s very easy to miss important changes in a sea of green and red that’s otherwise mostly identical. That's why we have diffs in the first place.
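As a rough illustration of what hunk-level move-and-change detection could look like, here is a naive sketch. It is not how git works and not a proposal for its API: Change, movedWithChanges, and the minMatch threshold are all hypothetical. It assumes the deleted and inserted blocks have already been paired up and have the same number of lines, which sidesteps the hard part of the problem.

```go
package main

import (
	"fmt"
	"strings"
)

// Change records one line that differs between a deleted block and its
// apparent re-insertion elsewhere.
type Change struct {
	Line     int // 1-based offset within the block
	Old, New string
}

// movedWithChanges compares a deleted block against an inserted block of the
// same length. If at least minMatch (0..1) of the lines agree after trimming
// whitespace, it treats the pair as a move and returns the lines that differ.
// Otherwise it returns nil. This is a naive heuristic for illustration only.
func movedWithChanges(removed, added []string, minMatch float64) []Change {
	if len(removed) == 0 || len(removed) != len(added) {
		return nil
	}
	var changes []Change
	for i := range removed {
		o, n := strings.TrimSpace(removed[i]), strings.TrimSpace(added[i])
		if o != n {
			changes = append(changes, Change{Line: i + 1, Old: o, New: n})
		}
	}
	matched := float64(len(removed)-len(changes)) / float64(len(removed))
	if matched < minMatch {
		return nil // too different to call it a move
	}
	return changes
}

func main() {
	removed := []string{
		"if err != nil {",
		"\tlog.Printf(\"error: %v\", err)",
		"\tbreak",
		"}",
	}
	added := []string{
		"if err != nil {",
		"\tlog.Printf(\"error: %v\", err)",
		"\tcontinue",
		"}",
	}
	for _, c := range movedWithChanges(removed, added, 0.7) {
		fmt.Printf("line %d: %q -> %q\n", c.Line, c.Old, c.New)
	}
}
```

Run on the two blocks above, this reports only line 3, where break turned into continue. A review tool built on this idea could show that single line instead of the whole deletion and insertion.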

This kind of error has bitten me before, long before LLMs were around. But this problem is exacerbated by LLM coding agents. A human doing this refactor would select the original text, cut it, move to the new file, and paste it. Any changes after that would be intentional.

LLM coding agents work by writing patches. That means that to move code, they write two patches, a deletion and an insertion. This leaves room for transcription errors.

Looking at the code, it’s easy to see why a transcription error occurred:

if err != nil {
	// Log error but continue with other installations
	log.Printf("Error fetching repositories for installation %d: %v", *installation.ID, err)
	break
}

The comment said "but continue". The code said "break".

There were two competing sources of signal here for what token to predict at the critical moment: transcription and local prediction. Transcription said break. Local prediction said continue. Unfortunately for us, local prediction won.

Prevention

We just added clipboard support to Sketch's agent environment. While patching files, the agent can now copy to and read from a clipboard. Because clipboards reproduce code byte-for-byte, but refactors sometimes require indentation level changes, particularly in Python, the tool also supports adjusting indentation while pasting. (Integrating an LSP for automatic re-indent-on-paste is future work.) It doesn’t have many miles on it yet, but initial results look promising, at least with the more powerful models.
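For readers curious what paste-with-indentation might involve, here is a minimal sketch. It is an assumption about the general technique, not Sketch's actual tool: commonIndent and pasteWithIndent are hypothetical names. The idea is to strip the whitespace prefix shared by all non-blank lines and prepend the target indentation, leaving everything after the indentation byte-for-byte intact.

```go
package main

import (
	"fmt"
	"strings"
)

// commonIndent returns the longest whitespace prefix shared by all
// non-blank lines.
func commonIndent(lines []string) string {
	prefix := ""
	first := true
	for _, l := range lines {
		if strings.TrimSpace(l) == "" {
			continue // blank lines do not constrain the indent
		}
		indent := l[:len(l)-len(strings.TrimLeft(l, " \t"))]
		if first {
			prefix, first = indent, false
			continue
		}
		// Shrink the candidate until this line's indent starts with it.
		for !strings.HasPrefix(indent, prefix) {
			prefix = prefix[:len(prefix)-1]
		}
	}
	return prefix
}

// pasteWithIndent re-bases the clipboard text so that its least-indented
// line sits at newIndent. Relative indentation inside the block is kept.
// Whitespace-only lines are emitted as empty lines.
func pasteWithIndent(clip, newIndent string) string {
	lines := strings.Split(clip, "\n")
	old := commonIndent(lines)
	for i, l := range lines {
		if strings.TrimSpace(l) == "" {
			lines[i] = ""
			continue
		}
		lines[i] = newIndent + strings.TrimPrefix(l, old)
	}
	return strings.Join(lines, "\n")
}

func main() {
	// A Python fragment copied from inside a function body (4-space indent).
	clip := "    x = 1\n    if x:\n        y = 2"
	fmt.Println(pasteWithIndent(clip, "")) // pasted at top level
}
```

Because the function only touches leading whitespace, a token like break cannot turn into continue on the way through.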

And I’d love to see git add cross-hunk change detection. It would always have been helpful, but in a world in which more code is written by imperfect transcription, it will be even more important.