Editor’s note: This article was taken from the slidetocode.com blog. As that website is now defunct and the content is licensed under Creative Commons Attribution 3.0, I have decided to republish it here.
Originally published 25 August 2013 at http://slidetocode.com/2013/8/25/how-git-works by the author(s) of the slidetocode blog.
There is always a moment, during that manly ritual of viewing a friend’s new car, when he will pop open the bonnet to show me the engine powering his new steed. I politely comment on its elegance and power, perhaps throwing in an admiring whistle if I feel his new fan belt and spark plugs really deserve that extra modicum of praise, but I am inevitably disappointed when he quickly closes the bonnet again and moves on to show me its hubcaps, or how capacious its boot is. It is disappointing to me because I don’t want to move on from the engine. I want to find out how it works. Can we not disassemble it? Compare the exhaust manifold to that in his previous car? See what improvements they have made? Maybe we could improve it even further? Sadly not.
Given this natural hacker-reflex to probe and tinker, it is odd how little I knew about Git until recently. Perhaps my daily life is so reliant on Git doing its job that I don’t want to poke around lest I find a flaw. Until now I have been content with kicking the tyres and going for a spin around the block, rather than dismantling that mysterious .git directory. Nevertheless, I recently became curious and dug in. I was thrilled to discover that Git is even more beautiful internally than it is functional externally.
At heart a Git repository is a key-value object store where all objects are indexed by their SHA-1 hash value. All commits, files, tags and filesystem tree nodes are different types of objects living in this repository.
When an object is added to the repository it is hashed, and from then on it is referred to by its SHA-1 hash value. Effectively a Git repository is a large hash table with no provision made for hash collisions. Luckily, with SHA-1 the probability of hash collisions is so vanishingly small that it is nothing to be concerned about.
To see an example of some simple objects, initialise a super-simple git repository with the following commands.

    git init .
    echo Hello world Git! > Readme.md
    git add Readme.md
    git commit -m "Added a readme"

Now type find .git/objects -type f. It will print something similar to the following:

    .git/objects/02/b365d4af3ef6f74b0b1f18c41507c82b3ee571
    .git/objects/37/ce98f6635fa1192d85243bcaa4622537b2eb87
    .git/objects/f0/5245cba72f23f998a5e372812d1a390375314c

The first line corresponds to an object with SHA-1 hash 02b365d4af3ef6f74b0b1f18c41507c82b3ee571. When stored, the first two hex digits determine the directory, and the remaining digits determine the filename.
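The directory and filename split described above can be sketched in a few lines of Python. This is only an illustration: the function name `loose_object_path` and the default `.git` location are my own assumptions, not anything Git itself exposes.

```python
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")

def loose_object_path(object_id: str, git_dir: str = ".git") -> str:
    """Return where a loose object with this hex SHA-1 would be stored."""
    if len(object_id) != 40 or not set(object_id) <= HEX_DIGITS:
        raise ValueError("expected a 40-character lowercase hex SHA-1")
    # The first two hex digits name the directory, the other 38 the file.
    return str(PurePosixPath(git_dir, "objects", object_id[:2], object_id[2:]))

print(loose_object_path("02b365d4af3ef6f74b0b1f18c41507c82b3ee571"))
```

Run against the first hash from the listing, it prints the same path that find reported.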
To read the contents of these files you must decompress them. You can do this with a Python one-liner. For example, to read the 02b365d... object with Python 3 I type

    python3 -c "import sys, zlib; print(repr(zlib.decompress(sys.stdin.buffer.read())))" < .git/objects/02/b365d4af3ef6f74b0b1f18c41507c82b3ee571
    b'blob 17\x00Hello world Git!\n'

You can immediately recognise this as the contents of the Readme.md file. The header before the NUL byte (\x00) shows the type (blob) and size (17 bytes) of the object; the remainder is a snapshot of the Readme.md file.
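To make the layout concrete, here is a minimal Python sketch that builds the same header-plus-content bytes, hashes them, and splits a decompressed object back apart. The function names are hypothetical, and the sketch assumes the ID is the SHA-1 of the uncompressed header and content together.

```python
import hashlib
import zlib

def encode_blob(content: bytes) -> bytes:
    """Prefix the content with 'blob <size>' and a NUL byte."""
    return b"blob " + str(len(content)).encode("ascii") + b"\x00" + content

def object_id(stored: bytes) -> str:
    """SHA-1 of the uncompressed header plus content, as 40 hex digits."""
    return hashlib.sha1(stored).hexdigest()

def decode_loose_object(compressed: bytes):
    """Inflate a loose object and split it into (type, size, body)."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\x00")
    kind, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind.decode("ascii"), int(size), body

stored = encode_blob(b"Hello world Git!\n")
print(stored)
print(object_id(stored))
print(decode_loose_object(zlib.compress(stored)))
```

If the listing above is faithful, the printed ID can be compared against the 02b365d... object name; I have deliberately not hard-coded that expectation into the sketch.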
To inspect the contents of the other two objects we use the git cat-file command, which pretty prints the object, omitting its type and size.

    git cat-file -p 37ce98f6635fa1192d85243bcaa4622537b2eb87
    100644 blob 02b365d4af3ef6f74b0b1f18c41507c82b3ee571    Readme.md

This is a tree object. Git stores the file system structure in these tree objects. The first column shows the Unix permissions, the second column is either blob or tree depending on whether it is a pointer to a file or another directory, the third is the hash of the object pointed to, and the fourth is the filename. In this case there is only one file tracked by git, Readme.md, and you can see that this tree node reflects that by listing one file and pointing to the blob holding its contents.
A more interesting example from a different repo is

    100644 blob 5fe92a0481023dfa3d2e64a0556dda3bbb852e5d    init.scm
    100644 blob 20fa5e19fcb963f8a4ff249a815413153fb6b4e3    opdefines.h
    100644 blob 69c742cc2544e336230d637b8115d69f0c050720    scheme-private.h
    100644 blob badef17026a45893a7b3174db325e868c3a688b7    scheme.c
    100644 blob fedc7b4cc4ef9a746fb9b6c4a22679e58c7ad133    scheme.h
    040000 tree 7d6df008df749a86cc6d82b6fb6c42889df97c6b    tsx-1.1

Here you can see that not only are there five files, but there is also another tree node pointed to. This is a subdirectory. As with file snapshots, tree nodes are created, hashed, stored in the object database, then referred to from then on by their hash value.
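The pretty-printed listing hides how compact the stored form is. As far as I understand it, each entry in the raw tree object is the mode, a space, the name, a NUL byte, and then the hash as 20 raw bytes rather than 40 hex digits. The sketch below encodes the single Readme.md entry under that assumption; the function names are mine, and a real tree with several entries would also need them in Git's sort order, which this sketch does not attempt.

```python
def encode_tree_entry(mode: str, name: str, object_id_hex: str) -> bytes:
    """One raw entry: mode, space, name, NUL, then the 20-byte binary SHA-1."""
    return (mode.encode("ascii") + b" " + name.encode("utf-8")
            + b"\x00" + bytes.fromhex(object_id_hex))

def encode_tree(entries) -> bytes:
    """Concatenate entries (assumed already sorted) behind a 'tree <size>' header."""
    body = b"".join(encode_tree_entry(*entry) for entry in entries)
    return b"tree " + str(len(body)).encode("ascii") + b"\x00" + body

readme = ("100644", "Readme.md", "02b365d4af3ef6f74b0b1f18c41507c82b3ee571")
tree = encode_tree([readme])

# 17 bytes of mode, space, name and NUL, plus 20 bytes of hash.
print(len(encode_tree_entry(*readme)))
print(tree[:8])
```

The entry comes to 37 bytes, which agrees with the size that git verify-pack reports for this tree object later in the article.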
The final object in our original repository is a commit.

    git cat-file -p f05245cba72f23f998a5e372812d1a390375314c
    tree 37ce98f6635fa1192d85243bcaa4622537b2eb87
    author Duncan Steele <firstname.lastname@example.org> 1377416934 +0100
    committer Duncan Steele <email@example.com> 1377416934 +0100

    Added a readme

Here you can see the format of the commit, with a header containing author, committer details and timestamp, followed by the commit message itself. If you type git log you will recognise that the commit number is just the hash of this commit object.
The first line of the commit is a pointer to the tree object that stores the snapshot of the files at this revision number. In this case, this is the tree object we just discussed.
You now have everything needed to reconstruct the entire repository at that commit. Read the root tree object from the commit object, traverse that tree object recursively if necessary to rebuild all files and permissions from the tree objects, and finally fill the files with the contents stored in the blobs that the tree objects point to.
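That traversal is easy to sketch against a toy in-memory object store. Everything here is made up for illustration: the short IDs such as `c1` and `t1` stand in for real SHA-1 hashes, and the dictionary layout is my own simplification rather than Git's on-disk format.

```python
# Toy object store keyed by made-up IDs (real IDs would be SHA-1 hashes).
STORE = {
    "c1": ("commit", {"tree": "t1"}),
    "t1": ("tree", [("100644", "blob", "b1", "Readme.md"),
                    ("040000", "tree", "t2", "src")]),
    "t2": ("tree", [("100755", "blob", "b2", "run.sh")]),
    "b1": ("blob", b"Hello world Git!\n"),
    "b2": ("blob", b"#!/bin/sh\necho hi\n"),
}

def checkout(store, commit_id):
    """Return {path: (mode, content)} for every file reachable from a commit."""
    kind, commit = store[commit_id]
    if kind != "commit":
        raise ValueError(f"{commit_id} is a {kind}, not a commit")
    files = {}

    def walk(tree_id, prefix):
        _, entries = store[tree_id]
        for mode, entry_kind, object_id, name in entries:
            path = prefix + name
            if entry_kind == "tree":
                walk(object_id, path + "/")  # descend into the subdirectory
            else:
                files[path] = (mode, store[object_id][1])

    walk(commit["tree"], "")
    return files

for path, (mode, content) in sorted(checkout(STORE, "c1").items()):
    print(mode, path, len(content))
```

The result maps each path to its mode and contents, which is all a checkout needs in order to write the working directory.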
An uncommon aspect of this commit is the lack of a parent commit - this is because it is the first commit in the repository. All other commits will have one or more parents specified in the header, where multiple parents imply a merge commit.
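A small parser shows how little structure a commit needs. This is a simplified sketch with a hypothetical `parse_commit` function: it assumes one header per line and ignores any multi-line headers that real commits may carry.

```python
def parse_commit(text: str):
    """Split a commit body into (headers, message); parents are collected in a list."""
    header, _, message = text.partition("\n\n")
    fields = {"parent": []}
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)  # zero, one, or several parents
        else:
            fields[key] = value
    return fields, message

COMMIT = (
    "tree 37ce98f6635fa1192d85243bcaa4622537b2eb87\n"
    "author Duncan Steele <firstname.lastname@example.org> 1377416934 +0100\n"
    "committer Duncan Steele <email@example.com> 1377416934 +0100\n"
    "\n"
    "Added a readme\n"
)

fields, message = parse_commit(COMMIT)
print(fields["tree"])
print(len(fields["parent"]), "parent(s)")
print(message, end="")
```

For the root commit above the parent list is empty; an ordinary commit would have one entry and a merge commit two or more.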
If you were to change that Readme.md file and commit again, you would see three new objects in the database: a new blob containing a second snapshot of Readme.md, a new tree object updated for that snapshot, and a second commit object. You may wonder why it is a snapshot, not the diff you are familiar with seeing.
Don’t let Git’s interface fool you: all those diffs are calculated on the fly. When you commit, git stores snapshots; it does not store diffs from the previous commit.
Much of the compression in Git comes from the fact that if a file or tree node has not changed since the previous commit, it will have the same hash as before and will not take up space twice in the database. In fact, if you have multiple copies of the same file, the tree nodes may show different filenames and permissions, but they will all point to the same blob object. Add to all this the compression of the objects themselves and you can see that the repository is already remarkably compact. Nevertheless Git has one further trick up its sleeve - packfiles.
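The deduplication falls out of content addressing almost for free, as this toy store illustrates. The `ObjectStore` class is a hypothetical in-memory stand-in for the .git/objects directory, not Git's implementation.

```python
import hashlib
import zlib

class ObjectStore:
    """Toy content-addressed store: identical content is kept only once."""

    def __init__(self):
        self._objects = {}

    def put(self, kind: str, body: bytes) -> str:
        stored = kind.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\x00" + body
        object_id = hashlib.sha1(stored).hexdigest()
        # Same bytes give the same ID, so a repeat put stores nothing new.
        self._objects.setdefault(object_id, zlib.compress(stored))
        return object_id

    def get(self, object_id: str) -> bytes:
        raw = zlib.decompress(self._objects[object_id])
        return raw.partition(b"\x00")[2]

    def __len__(self) -> int:
        return len(self._objects)

store = ObjectStore()
first = store.put("blob", b"Hello world Git!\n")
copy = store.put("blob", b"Hello world Git!\n")   # e.g. the same file under another name
changed = store.put("blob", b"Hello again Git!\n")
print(first == copy, len(store))
```

Two identical files cost one stored object; only the changed content adds a second.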
As a repository grows, the object count climbs from the hundreds, to the thousands, and clearly it becomes inefficient to store the data in flat files. Instead, git can store these objects in a single, indexed, pack file.
Run git repack -a -d to pack all objects so far into a pack file and remove the now unnecessary loose files. Running find .git/objects -type f again will yield something similar to

    .git/objects/info/packs
    .git/objects/pack/pack-17e2136f90aef681851aa4ffc2f6441ab35908f4.idx
    .git/objects/pack/pack-17e2136f90aef681851aa4ffc2f6441ab35908f4.pack

All the loose objects have been packed together in the .pack file, which is indexed via the .idx file. The repository contains the same objects; they are just packed in a single file to speed up access and reduce the repository’s disk space usage. You can see this with

    git verify-pack -v .git/objects/pack/pack-17e2136f90aef681851aa4ffc2f6441ab35908f4.pack
    f05245cba72f23f998a5e372812d1a390375314c commit 195 130 168
    37ce98f6635fa1192d85243bcaa4622537b2eb87 tree   37 48 377
    02b365d4af3ef6f74b0b1f18c41507c82b3ee571 blob   17 27 425
    non delta: 3 objects
    .git/objects/pack/pack-17e2136f90aef681851aa4ffc2f6441ab35908f4.pack: ok

You can see exactly the same objects are stored in the pack as were stored in the flat files, and the results of running git cat-file on the objects are unchanged.
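The columns in the verify-pack listing are, per the Git documentation, the object ID, type, size, size in the pack file, and offset in the pack file; deltified objects carry two extra columns for the delta depth and the base object. A rough parser for those object lines might look like this, with `parse_verify_pack_line` being a name of my own choosing.

```python
def parse_verify_pack_line(line: str) -> dict:
    """Parse one object line of `git verify-pack -v` output into named fields."""
    parts = line.split()
    if len(parts) not in (5, 7):
        raise ValueError("not an object line: " + line)
    entry = {
        "id": parts[0],
        "type": parts[1],
        "size": int(parts[2]),
        "size_in_pack": int(parts[3]),
        "offset": int(parts[4]),
    }
    if len(parts) == 7:  # deltified objects also list depth and base ID
        entry["depth"] = int(parts[5])
        entry["base"] = parts[6]
    return entry

LINES = [
    "f05245cba72f23f998a5e372812d1a390375314c commit 195 130 168",
    "37ce98f6635fa1192d85243bcaa4622537b2eb87 tree   37 48 377",
    "02b365d4af3ef6f74b0b1f18c41507c82b3ee571 blob   17 27 425",
]

for line in LINES:
    entry = parse_verify_pack_line(line)
    print(entry["type"], entry["size"], entry["size_in_pack"])
```

The blob's size of 17 is the same figure that appeared in its header when the loose object was decompressed.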
An additional benefit of pack files is that they allow git to compress your repository even further. My earlier statement that git stores snapshots, not deltas, is not entirely true. The objects themselves are snapshots, but when they are stored in a pack file, git will compare each object to other similar objects; then, rather than store both objects in full, git will store one object in full and the other as a delta from that object. Thus a large file with a number of small changes will be stored internally as a single snapshot and a number of deltas from that snapshot (known as a delta chain). If you run git verify-pack on a less trivial repository you will see the details of these delta chains as well.
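To give a feel for the idea, here is a toy delta built from "copy a range of the base" and "insert these literal bytes" instructions. It is not Git's binary delta format or its algorithm for choosing base objects; it uses Python's difflib purely as a stand-in, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copies from base plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))         # reuse bytes from the base
        elif j2 > j1:                                 # 'replace' or 'insert'
            ops.append(("insert", target[j1:j2]))     # new bytes stored literally
        # 'delete' contributes nothing to the target, so it needs no op
    return ops

def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base and the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

base = b"line one\nline two\nline three\n" * 20
target = base.replace(b"line two", b"line 2", 1)
delta = make_delta(base, target)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(target), "bytes rebuilt from", literal_bytes, "literal bytes")
print(apply_delta(base, delta) == target)
```

Only the bytes that differ need to be stored literally; everything else is a reference back into the base snapshot.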
I hope you have enjoyed kicking the tyres of Git with me. There are many complexities beneath the surface, but I have been stunned to discover how simple Git really is, something that I am certain has contributed to its robustness, speed and success.
Amended the post to clarify a few points as a result of feedback and the Reddit discussion. Additionally, please note that git does not support all permission modes; the only supported modes are:

    100644 normal file
    100755 executable
    120000 symlink
    040000 directory
    160000 submodule

If you want to read more about Git’s internals, read this
Posted on 25 August 2013
Based on a work at http://slidetocode.com/blog
Slide to code blog is licensed under a Creative Commons Attribution 3.0 Unported License