How to use a jagged chunk of a git repository

2025-01-30 by Josh Bleecher Snyder

Alan Kay: "Simple things should be simple, complex things should be possible." Git absolutely nails part two. (And in git's defense, it is possible that there are no simple things in version control.)

This blog post is about manipulating git (both "intentionally deceiving" and "running git commands") to send a very precise fragment of a git repository to a server, do some work with it, and transfer that work back to the original repository, smoothly and transparently. Prepare to roll up your sleeves.

Background

merde.ai uses LLMs to resolve git merge/rebase conflicts for you (website, blog post). In order to do that well, it needs part of a git repository: relevant history with diffs and commit messages. And yet it doesn't need all of your git history. That would be bad in many ways.

What we'd really like is to grab a chunk of a repository that is just big enough for the problem at hand, but no bigger.

False starts

Git being git, it has lots of officially supported built-in ways to do a partial clone. Here are a few of them:

- Shallow clones (git clone --depth <n>) truncate history to the most recent commits.
- Sparse checkouts (git sparse-checkout) restrict the working tree to chosen directories.

And there are more (--single-branch, --filter=blob:none, and lots more filtering options).

None of these fit the bill for us. The shape of what we need from the repository doesn't fall along any tidy lines like time or directory.

And even if it did, there's another problem: We want to push information rather than fetch it. In theory, the two directions are symmetrical, but in practice, it gets ugly fast. HTTP wasn't really built for servers to send requests to clients, and websockets add a bunch of goo and layers to manage. Plus the implementation of fetching involves running coordinated processes on both sides, and it is scary to spin up a fairly unrestricted git serving process from client code.

Git being git, there's an answer for that, git bundle: git bundle create repo.bundle main~5..main. This will create a bundle that contains only the new information from the last five commits on main. You can take this bundle, which is a single file, encrypt it with age, burn it to a CD, attach it to your carrier pigeon, and then trap and read and decrypt and apply it on the other end, assuming that your airgapped machine already has the rest of the history. git bundle is also good for making a simple, clean, complete, single-file backup of a git repository.
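
On the receiving end, a bundle acts like a read-only remote, so applying it is just a fetch. A minimal sketch (file and branch names are illustrative):

git bundle verify repo.bundle          # confirm this repo has the prerequisite history
git fetch repo.bundle main:from-bundle # pull the bundled commits into a local branch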

We could use this to grab only the commits in the part of the git graph involved in the merge/rebase. (From here out, I'll just say merge. The same ideas apply to rebase.)

Unfortunately, this is too little information. Suppose that you edit foo.txt, and it is the source of a merge conflict. It is very useful to know the original contents of foo.txt. But git bundle assumes that you're doing an incremental backup, and you already have the complete state of the world prior to the conflict. We don't.

Also, git bundle reasonably assumes that you're using it as intended, and it refuses to apply bundles to repositories unless you can supply the history that the bundle lacks. This is good! But not for us.

Doing it the hard way

Git being git, there is always a way. We just need to step outside the well-trodden, well-supported paths and get our hands dirty.

Let's go back to first principles. What do we really want? We want the commits on each side of the merge, back to their merge base: their commit messages, their diffs, and the contents of the files involved in the conflict. No more, no less.

Let's do that now.

Everything we want here is a git object: commits, trees, and blobs. (merde doesn't yet support merging submodules.) So let's start by building a complete list of the git objects we want.

First, we need to identify the commits we are interested in. There are three inputs here: the main branch, the topic branch, and their merge base, which is the shared ancestor from which they diverged. (merde doesn't support octopus merges.)

We can get this with:

git rev-list main topic --not base

(This excludes base, but we want base. Embarrassingly, I fussed around a bunch with adding base via the git command, and adjusting my parsing, until I remembered I could just…add base directly in my code. White line fever strikes again!)
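
Putting those together, here's a sketch of the commit-selection step (branch names are illustrative):

base=$(git merge-base main topic)     # the shared ancestor
git rev-list main topic --not "$base" # every commit on either side since divergence
# ...and then add $base itself to the list, in code.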

Now we have a list of commits. In order to select the relevant repository contents, we need to find the trees for those commits. These trees contain references to other trees (subdirectories) and to the blobs for the files.

There's a git for that. For every commit SHA in our list, append ^{tree} (so it looks like 1a34fe51^{tree}), and then pipe that through:

git cat-file --buffer --batch-check='%(objectname)'

This bulk-extracts the trees for our commits. Note that if one of the commits was a reversion of a previous commit, such that the entire repo contents were in the same state as before, we might end up with some duplicate trees. (Content addressability ftw!) That's OK.
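
As a sketch, the whole step is one pipeline:

git rev-list main topic --not "$base" |
    sed 's/$/^{tree}/' |
    git cat-file --buffer --batch-check='%(objectname)' |
    sort -u    # deduplicate; identical repo states yield identical trees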

Now we're in a position to figure out which directories and files we should collect. We iterate over all the trees and run:

git ls-tree -r -t <tree-sha>

This is like a filesystem walk for a git tree. It prints all objects (subtrees/blobs) and their associated path, as reachable from the root tree. What we're looking for is paths that have different objects when viewed across all trees. If any given path has the same object across all trees, then it hasn't changed, so it won't be part of any diff or any conflict, so we can ignore it.
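
Here's one way to sketch that comparison in shell, printing only paths that appear with more than one distinct object (real code should use -z output to cope with unusual path names):

for tree in $trees; do    # $trees: the tree SHAs from the previous step
    git ls-tree -r -t "$tree"
done |
    awk '{ print $4, $3 }' | sort -u |    # unique (path, object) pairs
    awk '{ n[$1]++ } END { for (p in n) if (n[p] > 1) print p }'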

We have now identified all the objects we care about! It's the list of commits we generated earlier, each of their trees, and all the trees/objects corresponding to paths whose contents vary.

Lastly, we need to collect all of this into a tidy parcel. More git! We pipe the list of the SHAs of all of these objects to:

git pack-objects --stdout

This will make a packfile (of course). A packfile is a single, compressed file containing a set of git objects. A perfect packet of bytes to send to the server.
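
Assuming the object list lives in a file, one SHA per line, that's (file names illustrative):

git pack-objects --stdout < objects.txt > slice.pack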

The code for everything we described above is available on GitHub.

Potemkin repositories

Of course, now the server has to actually do something with this packfile. git unpack-objects will unpack all the objects into a repo, but the instant you try to do anything with them, git will become very unhappy very quickly. There are lots of missing objects! This is the situation that git bundle was protecting us from.
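
Unpacking into a fresh repository is simple enough (names illustrative); it's everything after this that takes work:

git init --bare scratch.git
git -C scratch.git unpack-objects < slice.pack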

The next step is to turn this unwieldy, ragged, ripped corner of a git repository into a fully-formed repository that we can use. We do that by erecting facades everywhere.

git fsck helps you diagnose problems in your repository. We have plenty of those! And we are going to lie in order to fix them.

There's a lot of bookkeeping in this next part, so we'll gloss over some of the gory details and stay focused on the high-level approach.

git fsck gives us a list of all the blobs, trees, and commits that are broken or missing. (Somewhat reluctantly. But we ask oh so nicely.) We now walk through and repair them, from the leaves inward.
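
Something like this gets at the gist, though merde's actual invocation involves more coaxing (fsck reports errors on stderr, hence the redirect):

git fsck --full 2>&1 | grep '^missing'
# missing blob 77e0a8...
# missing tree 9c12de...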

Start with blobs. A missing blob corresponds to a file that the client has, but we don't. That's fine. We will invent a file to put there, and hash it. Everywhere we see this client blob hash, we will swap in the new server blob hash.
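
In shell terms, inventing a blob is a one-liner; the placeholder content here is our own invention:

server_blob=$(echo 'merde: content unavailable' | git hash-object -w --stdin)
# record the mapping: $client_blob -> $server_blob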

What about trees? We need to rewrite all the trees, recursively. The trees we received refer to client blob hashes, but that's no good. For every tree, do the blob hash swap as needed and create a new tree. And now this generates a new hash swap we need to do: everywhere we see this client tree hash, we will swap in the new server tree hash.
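
git has plumbing for exactly this round trip: ls-tree prints a tree in the same format that mktree reads. A sketch of one tree rewrite, with a single sed swap standing in for the full hash mapping:

server_tree=$(git ls-tree "$client_tree" |
    sed "s/$client_blob/$server_blob/" |
    git mktree)
# record the mapping: $client_tree -> $server_tree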

Having set up fake trees, we need to fix the commits. Each commit gets the tree hash swap, which changes its contents and therefore its hash. Commits also refer to their parents by hash, and those are client commit hashes. So we rewrite commits from oldest to newest, swapping in server tree and parent hashes, and track the mapping from client commit hashes to server commit hashes as we go.
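
Same idea as the trees, sketched with single swaps standing in for the full mapping:

server_commit=$(git cat-file commit "$client_commit" |
    sed -e "s/$client_tree/$server_tree/" \
        -e "s/$client_parent/$server_parent/" |
    git hash-object -t commit -w --stdin)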

At the end of all of this, we have a fully functioning, legit potemkin git repo. You can check out commits, diff, merge, revert, branch, cavort, and whirl. But of course, lots of the files are mysteriously…devoid of useful content. That smells like success.

But we're not done yet.

Once we've completed the hard work of LLM-assisted merging (that was the whole point!), we need to send the results back to the client. git pack-objects and git unpack-objects look like the right tools for the job again. We can calculate which objects the server has that the client needs, much like we did on the client initially.

There's only one hitch: all of the hashes will be wrong! Our new merge commit hash is wrong, because its parent is a server commit hash, not a client commit hash. Our new merge commit's tree hash is wrong, because its subtrees are server tree hashes, not client tree hashes.

We need to undo all of the client/server hash translation we did on the way in. This creates a bunch of objects that the client will understand (but the server doesn't!). We then make a packfile from those and return it to the client, which adds them to the repository, none the wiser about the shenanigans involved in creating those objects. It also makes a new branch pointing to the merge commit, so that there's an easy way for a human to refer to it.
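
On the client, applying the result is the mirror image of the way out (branch name illustrative):

git unpack-objects < result.pack
git branch merde-result "$merge_commit"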

Home safe

Adding objects to a repository doesn't alter the working directory or the index. Aside from the extra disk space used (which would eventually get garbage collected), there's no downside, no risk that we accidentally trample your data.

This makes using merde very low risk. The worst case scenario is you end up with some extra stuff in the repository's object store which disappears in a month or so. Or merde fails to complete the merge. That happens. Success is better than failure, but failure is better than generating garbage merges. Of course, garbage merges still get through sometimes. That's LLMs; that's software; that's life.

The best case scenario, of course, is that you are happily oblivious to the fact that a server somewhere just lied through its teeth to git and that an LLM did a bunch of token burbling, you forget you had a conflict, and you just…get back to whatever you really wanted to be doing in the first place.

sketch.dev · merde.ai · pi.dev