Munging Git version control history

I've been digging up my old Lua modules to republish them on the new site, but most of them are in one big Git repository. I've since learned my lesson, and now create a small single-purpose repo for every separate thing. (One of the best things about Git is that when you start something new, even if it's unlikely to turn into anything serious, you might as well just do git init and keep a history, just in case it turns out to be useful.)

I want to put my Lua modules up in public like everyone does nowadays, but I don't want to publish a monolithic repository, which contains not only my open source projects, but also some experimental modules that turned out to be a bad idea, and are best kept to myself to avoid embarrassment. So to do this I've had to figure out how to split out the bits of the code tree and history relevant to each project and turn them into stand-alone repositories. Fortunately, Git has some nifty features to do this.

The notes below are mostly for my own reference, in case I have to do this kind of thing again. Warning: most of these commands are destructive and could cause all kinds of problems if used wrongly, so work on a copy of the original repo. Also, I'm not exactly a Git wizard, so I'm fumbling around with Git features I'm not familiar with, and probably not doing things in the best way.

Preparing the new repository

I start by making a copy of the original repository (just with cp -a or whatever, not bothering to figure out the right git clone incantation). I do everything in the copy in case I screw it up (which I did a few times).

I delete the remotes stuff from .git/config, because it's going to be a new repository, and I don't want to accidentally break a remote copy of the original one.

Then I clean up any uncommitted changes in the working directory with git reset --hard and then use git clean -f -x -d to delete any untracked files.
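
The preparation steps above can be sketched as a short script. This is only an illustration: it builds a throwaway repository in a temporary directory to stand in for the original, the names orig and copy and the remote URL are made up, and it uses git remote remove as an alternative to editing .git/config by hand. Note that git remote remove also drops the remote-tracking refs, so it is only appropriate if nothing under refs/remotes needs to survive.

```shell
# Throwaway stand-in for the original repository (hypothetical names throughout).
work=$(mktemp -d)
cd "$work"
git init -q orig
git -C orig config user.name "Example Author"
git -C orig config user.email "author@example.com"
echo 'return {}' > orig/module.lua
git -C orig add module.lua
git -C orig commit -q -m "Add module"
git -C orig remote add origin https://example.com/orig.git

# Work on a copy so that mistakes cannot damage the original.
cp -a orig copy
cd copy

# Detach the copy from the original's remote (instead of editing .git/config).
git remote remove origin

# Simulate some leftover mess, then discard it.
echo 'local change' >> module.lua
touch scratch.tmp
git reset -q --hard      # throw away uncommitted changes to tracked files
git clean -q -f -x -d    # delete untracked and ignored files and directories
```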

After the history has been rewritten, the tags in the new repository will still point at the old commits, not the rewritten ones. It's probably possible to fix this automatically with the --tag-name-filter option to git filter-branch, but I haven't tried that. I've only been dealing with a couple of release tags, which I want to rename anyway, so I've just deleted all the old tags like so:

for name in $(git tag -l); do
    git tag -d "$name"
done

Later I'll recreate the ones I want by finding the commit IDs in the new history that match up to the appropriate ones in the original repository, based on the commit messages.
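
Matching tags up by commit message can be done with git log --grep. The following is a minimal sketch in a throwaway repository; the commit messages and the tag name v1.0 are invented for the example, and it assumes the release commit has a message distinctive enough to match exactly once.

```shell
# Throwaway repository standing in for the rewritten history (names are made up).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Author"
git config user.email "author@example.com"
echo 1 > file.txt; git add file.txt; git commit -q -m "Release version 1.0"
echo 2 > file.txt; git commit -q -a -m "Start work on 1.1"

# Find the commit whose message contains the release text. -F treats the
# pattern as a fixed string rather than a regular expression.
id=$(git log --all -F --grep="Release version 1.0" --format=%H)

# Only tag when exactly one commit matched; otherwise pick one by hand.
if [ "$(printf '%s\n' "$id" | grep -c .)" -eq 1 ]; then
    git tag v1.0 "$id"
fi
```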

Separating out part of a repository

In most cases, I can get rid of all the other projects, leaving just the one I'm trying to extract, with a command like this, where dir is the subdirectory in the repository which contains the project I'm trying to extract:

git filter-branch --subdirectory-filter dir -- --all

That command will create a new history containing only the commits that touch that directory, and with everything outside it removed from the trees for each commit. It also moves the contents of the subdirectory to the top level, which is what I want.
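
To see the effect on a small scale, here is a sketch that runs the subdirectory filter on a throwaway repository. The layout (proj and other) and the tag name release-1 are hypothetical. It also passes --tag-name-filter cat, the option mentioned earlier as untried, which the git filter-branch documentation describes as the way to keep tags under their existing names; in this toy case it works, but treat it as an assumption for anything more complicated, such as signed tags.

```shell
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning and pause in newer Git

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Author"
git config user.email "author@example.com"

# Two unrelated projects sharing one repository (hypothetical layout).
mkdir proj other
echo 'return {}' > proj/init.lua
git add proj; git commit -q -m "Add proj"
git tag release-1
echo 'junk' > other/notes.txt
git add other; git commit -q -m "Add other"
echo 'return { version = 2 }' > proj/init.lua
git commit -q -a -m "Update proj"

# Keep only proj/, hoisting its contents to the top level. The
# --tag-name-filter cat part rewrites each tag under its existing name.
git filter-branch --tag-name-filter cat --subdirectory-filter proj -- --all

git log --oneline              # only the two commits that touched proj/
git ls-tree --name-only HEAD   # init.lua, now at the top level
```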

There was one project (my DataFilter module) which was a bit more tricky. After trying the subdirectory filter above I found that I only had part of the history, because I'd originally put it in a subdirectory with a different name, and renamed it half-way through the development. There doesn't seem to be a way to get --subdirectory-filter to accept more than one directory, at least that I could find from the man page and grubbing through the source. If you use the option more than once then all but the last will be ignored. It might be that you can somehow include multiple subdirectories in one argument, but I couldn't find out whether that's possible. All I know is that the subdirectory name is added onto the end of something the Git documentation calls a ‘tree-ish’, but the syntax of those isn't documented anywhere I can find.

So in this case I gave up on the easy approach and used the --tree-filter option instead. This accepts an argument which should be a bit of shell code. The code will be run for each commit being revised, and should adjust the directory hierarchy it's given to match what you want. So in my case I use a bit of shell to delete everything that isn't one of the two directories whose content should be kept in the history:

git filter-branch --prune-empty --tree-filter \
    'rm -rf $(ls | grep -v "datafilter\\|data\\.filter")' \
    -- --all

Running that takes ages, because it's checking out so many working copies, then deleting most of their contents, and then snaffling up everything that's left to make a new Git tree object. The --prune-empty option will discard commits that don't touch the files that are left (which is almost all the commits).

Changing email address of committer

Update: My solution below isn't great, because it changes the email address for all commits. If your repository has commits by multiple people that you want to keep distinct, then you'll want a conditional bit of scripting (in --env-filter or --commit-filter) so that only the appropriate commits are changed.

While I was doing this (and certainly before I uploaded to a public Git host) I wanted to update my commits to identify me with my new email address. That can also be done with git filter-branch:

git filter-branch -f --env-filter \
    "GIT_AUTHOR_EMAIL='new-addr';
     GIT_COMMITTER_EMAIL='new-addr';" HEAD

That will do another filter, this time updating the metadata about each commit based on what the fragment of shell code does with its environment. Of course this would be much more complicated if more than one person had committed. I believe I could have combined this filtering step with the previous one, but doing them separately works as well. The -f option will give it permission to overwrite the backup ref that filter-branch leaves behind.
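
For a repository with more than one contributor, the same --env-filter can carry a condition so that only matching commits are touched. This sketch follows the pattern shown in the git filter-branch documentation; the addresses old@example.com, new@example.com and else@example.com are placeholders, and the throwaway repository exists only to make the example runnable.

```shell
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning and pause in newer Git

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Author"
git config user.email "old@example.com"

echo 1 > a.txt; git add a.txt; git commit -q -m "My commit"
echo 2 > b.txt; git add b.txt
git commit -q -m "Contributed commit" --author="Someone Else <else@example.com>"

# Rewrite the address only where it matches the old one, leaving other
# people's author and committer details alone.
git filter-branch -f --env-filter '
    if [ "$GIT_AUTHOR_EMAIL" = "old@example.com" ]; then
        GIT_AUTHOR_EMAIL="new@example.com"
    fi
    if [ "$GIT_COMMITTER_EMAIL" = "old@example.com" ]; then
        GIT_COMMITTER_EMAIL="new@example.com"
    fi
    export GIT_AUTHOR_EMAIL GIT_COMMITTER_EMAIL
' -- --all

git log --format='%ae %ce %s'
```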

Cleaning out the old stuff

After checking that my new history looks right, and manually recreating the tags that apply to this project, I still have a bunch of Git objects left in the repo that are now not needed. To clean up this cruft I need to delete the references to those objects so that they can be garbage collected. The command git for-each-ref will list the refs that exist. I've had to get rid of the old origin ones, as well as the stuff under refs/original which is left as a backup by git filter-branch. Something like this does the trick:

git update-ref -d refs/remotes/origin/master
rm .git/refs/remotes/origin/HEAD
git update-ref -d refs/original/refs/heads/master
git update-ref -d refs/original/refs/remotes/origin/master

Then force Git to forget about what those refs used to contain, and then throw away objects that are no longer referenced:

git reflog expire --expire=now --all
git gc --prune=now
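
When there are more backup refs than is comfortable to type out, the list from git for-each-ref can drive the deletion. This is a sketch of that idea on a throwaway repository (the proj and other directories are hypothetical); it only removes refs under refs/original/, so any leftover remote-tracking refs would still need handling separately.

```shell
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning and pause in newer Git

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Author"
git config user.email "author@example.com"
mkdir proj other
echo a > proj/a.txt; echo b > other/b.txt
git add .; git commit -q -m "Initial"
git filter-branch --subdirectory-filter proj -- --all

# List every backup ref under refs/original/ and delete each one.
git for-each-ref --format='%(refname)' refs/original/ |
while read -r ref; do
    git update-ref -d "$ref"
done

# Forget the old ref values, then drop the objects nothing points at.
git reflog expire --expire=now --all
git gc -q --prune=now
```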

Rebuilding history piecemeal

There was one project where I had to do a bit more to get a nice tidy history, even after breaking it out into its own repository. The DataFilter project mentioned above ended up with a pointless merge (an artifact of something going on in a different module's code), but also with a strange first commit from a different project, which didn't add any files to the surviving part. The latter is presumably something that happens when you use --tree-filter instead of --subdirectory-filter, although I'm not sure why.

I decided the easiest way to fix these warts was to do a final phase of filtering by hand, building a new history from scratch and then making that the master branch. Since I wanted to strip off the first commit, I had to create a new commit by hand based on the one I wanted to be first, with the same log message and time stamp, and the same tree of files, but with no parents. To do that, first get the necessary information about the original commit you're trying to clone:

git cat-file commit sha1-id

Copy the commit message to a file (in this example called msg), and build a command line like this with the tree ID, timestamps, and if necessary the name and email environment variables:

GIT_AUTHOR_DATE="1189329695 +0100" \
    GIT_COMMITTER_DATE="1189329695 +0100" \
    git commit-tree tree-sha1-id <msg

That command will print out the SHA1 of the new commit, so you can work with it. I gave it a temporary tag to keep track of it.
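
The copying by hand can also be scripted. The sketch below, on a throwaway repository with invented file names and commit messages, reads the tree, message, names, and dates from the commit being cloned and feeds them to git commit-tree. The tag name new-root is just the temporary label used for this example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Author"
git config user.email "author@example.com"
echo junk > junk.txt; git add junk.txt; git commit -q -m "Unwanted first commit"
git rm -q junk.txt; echo 'return {}' > init.lua; git add init.lua
git commit -q -m "First real commit"
orig=$(git rev-parse HEAD)    # the commit to clone without its parent

# Gather the pieces that would otherwise be copied from git cat-file commit.
tree=$(git rev-parse "$orig^{tree}")
msg=$(mktemp)
git log -1 --format=%B "$orig" > "$msg"

# Create a parentless commit with the same tree, message, identity, and dates.
# --date=raw prints dates as "seconds-since-epoch timezone", as used above.
new_root=$(
    GIT_AUTHOR_NAME="$(git log -1 --format=%an "$orig")" \
    GIT_AUTHOR_EMAIL="$(git log -1 --format=%ae "$orig")" \
    GIT_AUTHOR_DATE="$(git log -1 --format=%ad --date=raw "$orig")" \
    GIT_COMMITTER_NAME="$(git log -1 --format=%cn "$orig")" \
    GIT_COMMITTER_EMAIL="$(git log -1 --format=%ce "$orig")" \
    GIT_COMMITTER_DATE="$(git log -1 --format=%cd --date=raw "$orig")" \
    git commit-tree "$tree" < "$msg"
)
git tag new-root "$new_root"   # temporary tag to keep track of it
```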

You won't have to create every commit like that, because you can use git rebase to paste sections of history onto the new root commit. For example, the first piece I pasted on was done like this. First go to the last commit in the history that you want pasted:

git checkout last-copied-from

Then use git rebase to copy every commit starting from the one whose parent is old-root up to and including last-copied-from, as descendants of new-parent:

git rebase --onto new-parent old-root
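
Put together on a toy history, the whole manoeuvre looks roughly like this. The tags old-root, new-parent, and last-copied-from mirror the placeholder names used above, and the four commits are invented. For brevity the parentless copy here keeps only the tree and message, not the original dates.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Author"
git config user.email "author@example.com"

# History: unwanted root -> A -> B -> C, where A should become the new root.
echo junk > junk.txt; git add junk.txt; git commit -q -m "Unwanted root"
git rm -q junk.txt
echo 1 > file.txt; git add file.txt; git commit -q -m "A"
git tag old-root
echo 2 > file.txt; git commit -q -a -m "B"
echo 3 > file.txt; git commit -q -a -m "C"
git tag last-copied-from

# Parentless copy of A: same tree and message, no parents.
git tag new-parent "$(git log -1 --format=%B old-root |
    git commit-tree 'old-root^{tree}')"

# Replay everything after old-root, up to last-copied-from, onto the copy.
git checkout -q last-copied-from
git rebase -q --onto new-parent old-root

git log --format=%s    # C, B, A, with no trace of the unwanted root
```

The result is left on a detached HEAD, so it still has to be given a branch name afterwards.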

That's all for now. Next I need to upload this stuff somewhere, on the off chance that anybody else is interested.