Git guts

Today I will dive into the guts of git to showcase the simplicity and elegance with which git manages content internally in its own content-addressable file system. Armed with this knowledge, you will have a deeper understanding of the underlying data structures, which will help you troubleshoot the issues that inevitably come up as you use git.

To start, I shall create a new directory and initialize git.

$ mkdir git-guts
$ cd git-guts
$ ls -a
. ..
$ git init
Initialized empty Git repository in /Users/anuradha/dev/workbench/git-guts/.git/
$ ls -a
. .. .git

At this point, there are no files under version control yet. Here are the files that have been created during initialization:


Of these, the hooks are boilerplate and none are yet active. To activate a hook, rename it to remove the .sample suffix.

In this post, I shall focus on the .git/objects directory, as that is where all the content is stored as hashed “objects”. To show what happens, let’s add a file to source control and observe the changes:

$ echo "bar" > foo
$ git add foo
$ git commit -m "initial commit"
[master (root-commit) 64f3e97] initial commit
1 files changed, 1 insertions(+), 0 deletions(-)
create mode 100644 foo
$ find .git/objects/ -type f

Adding a single file to the repository caused the creation of three objects. Each object is uniquely identified by a 40-character SHA-1 hash of its content. This brings us to one of the key aspects of git: it is nearly impossible to alter the contents of any object without changing its hash. Unlike version control systems that pre-date this approach of cryptographically verifying the integrity of content, git makes it quite hard to tamper with a file or maliciously rewrite history unnoticed. Coupled with the ability to sign tags using a private key, this adds a further level of authenticity and non-repudiation to the release process.
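To make the "hash of its content" idea concrete, here is a minimal sketch that reproduces a blob identifier without using git at all. It relies on the fact that git hashes a short header (the object type, the size in bytes and a NUL byte) followed by the content. The sha1sum utility is assumed to be installed; on macOS, shasum should work in its place.

```shell
# The file content from the example: "bar" plus a trailing newline (4 bytes).
content='bar'

# Git hashes "blob <size>\0<content>", not the bare content.
size=$(printf '%s\n' "$content" | wc -c | tr -d ' ')
blob_hash=$( { printf 'blob %s\000' "$size"; printf '%s\n' "$content"; } | sha1sum | cut -d' ' -f1 )

echo "$blob_hash"
```

This should print 5716ca5987cbf97d6bb54920bea6adde242d87e6, the same identifier git assigns to the blob for foo.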

Let’s analyze the three objects. The git cat-file -t HASH command shows the type of an object. Here it reports three different types:

  • blob
  • commit
  • tree

To see the contents of an object, the git cat-file -p HASH command can be used as shown below:

$ git cat-file -p 5716ca5987cbf97d6bb54920bea6adde242d87e6
bar

This is the first of the three objects, which is the “blob”. It is the actual contents of the file. Note that the file is addressable using the hash, making this structure a content-addressable filesystem. But you may wonder, how does git know what the file name is? This object is only named by the hash. I will get to that shortly.
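A small experiment shows that a blob really is addressed by content alone. This is only a sketch: it works in a throwaway directory created with mktemp, and the second file name is made up for the illustration.

```shell
# Work in a throwaway repository so nothing else is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Two files with different names but identical content.
echo "bar" > foo
echo "bar" > another-name

hash_foo=$(git hash-object -w foo)
hash_other=$(git hash-object -w another-name)

# Both names map to one and the same blob object.
echo "$hash_foo"
echo "$hash_other"

# The object records a type, a size and the content, but no file name.
obj_type=$(git cat-file -t "$hash_foo")
obj_size=$(git cat-file -s "$hash_foo")
echo "$obj_type $obj_size"
```

Since both files hash to the same identifier, git stores the content only once, no matter how many names refer to it.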

Let’s look at the next object.

$ git cat-file -p 64f3e9762509b0ce9cbb252f69847957e5368632
tree 6a09c59ce8eb1b5b4f89450103e67ff9b3a3b1ae
author Anuradha Weeraman 1358159197 +0530
committer Anuradha Weeraman 1358159197 +0530

initial commit

This is the “commit” object, which is also stored as an object in the file system. Note that there are two fields for the author and the committer, since the two can be different individuals in the case of a large distributed development project. This way original contributions are acknowledged and not lost during the merging and contribution incorporation process. This file also has a hash reference to the commit “tree”. Let’s look at the tree object next.
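The author and committer fields can be seen diverging in a quick sketch like the one below. Git reads both identities from environment variables when they are set; the names and e-mail addresses here are invented for the example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "bar" > foo
git add foo

# Author: who wrote the change. Committer: who recorded it in this repository.
# Both identities below are placeholders.
GIT_AUTHOR_NAME="Alice Author" GIT_AUTHOR_EMAIL="alice@example.com" \
GIT_COMMITTER_NAME="Carol Committer" GIT_COMMITTER_EMAIL="carol@example.com" \
git commit -q -m "initial commit"

# %an is the author name, %cn the committer name.
names=$(git log -1 --format='%an / %cn')
echo "$names"
```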

$ git cat-file -p 6a09c59ce8eb1b5b4f89450103e67ff9b3a3b1ae
100644 blob 5716ca5987cbf97d6bb54920bea6adde242d87e6 foo

This is the last of the three objects, the “tree” object. It describes all the files that are part of the commit. Git builds it at commit time from the information in the staging area (the index). Each line starts with the file mode, in a somewhat different format from standard UNIX file permissions: the last three digits tell you what the permissions of the file were when it was committed (git only tracks whether a file is executable, so you will see 644 or 755). The line also gives the object type, the hash of the blob, and the name of the file. This is how git knows what the blob should be called in the file system when the code is checked out.
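To see how the mode and the name live in the tree rather than in the blob, the following sketch registers one blob under two names with two different modes. The file name run.sh is hypothetical, and the index entries are created directly so that no files need to exist on disk.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Store one blob, then register it in the index twice: once as a regular
# file (100644) and once as an executable (100755).
blob=$(echo "bar" | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" foo
git update-index --add --cacheinfo 100755 "$blob" run.sh

# Snapshot the index as a tree object and list its entries.
tree=$(git write-tree)
listing=$(git ls-tree "$tree")
echo "$listing"
```

Each line of the listing has the form "mode type hash name", and both entries point at the same blob hash.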

Let’s also take a look at what HEAD is pointing to:

$ cat .git/HEAD
ref: refs/heads/master
$ cat .git/refs/heads/master
64f3e9762509b0ce9cbb252f69847957e5368632

It now has a reference to the last “commit” object. So when you clone or pull down master, git knows which commit was most recently added to the branch.
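The two-step lookup from HEAD to a commit can be repeated by hand and compared with what git itself reports. This is a sketch for a freshly created repository, where each branch is still a plain file under .git/refs (git may later pack refs into a single .git/packed-refs file). The identity passed with -c is a placeholder.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "bar" > foo
git add foo
git -c user.name="Example" -c user.email="example@example.com" commit -q -m "initial commit"

# .git/HEAD holds a symbolic reference such as "ref: refs/heads/master".
ref=$(sed 's/^ref: //' .git/HEAD)

# The referenced file holds the hash of the latest commit on that branch.
by_hand=$(cat ".git/$ref")

# git rev-parse performs the same resolution for us.
by_git=$(git rev-parse HEAD)

echo "$ref"
echo "$by_hand"
echo "$by_git"
```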

All I’ve described so far was a single commit. How does git keep track of the history and the commit graph based on this structure, you might wonder. Let’s make a change to the foo file and commit it.

$ echo foo > foo
$ git add foo
$ git commit -m "Second commit"
[master 2c8200f] Second commit
1 files changed, 1 insertions(+), 1 deletions(-)
$ find .git/objects -type f


There are three new objects in the system now: a new blob, a tree, and a commit. The blob and tree objects are similar to the ones discussed earlier, but there’s a change to the commit object:

$ git cat-file -p 2c8200f75860bede9aaa0c156c133d15fa418bd5
tree 205f6b799e7d5c2524468ca006a0131aa57ecce7
parent 64f3e9762509b0ce9cbb252f69847957e5368632
author Anuradha Weeraman 1358161997 +0530
committer Anuradha Weeraman 1358161997 +0530

Second commit

It references the parent commit. This way the entire commit graph can be traversed and mapped using these commit objects. The .git/refs/heads/master file is updated to refer to the latest commit. git reflog is a very useful tool that shows how HEAD has moved over time, and it can be used to diagnose problems that you might otherwise consider unrecoverable. Git is very protective of data, so it’s actually quite hard to lose any unless you manually trash the object repository. In most cases, the “lost” work turns out to be a dangling, unreferenced commit that you can track down with git reflog and recover. Here’s a post that explains this process for those who are interested.
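Walking the history by following parent lines can be sketched in a few lines of shell. This is only an illustration of the idea (git log and git rev-list do the real work, and handle merges properly); the commit helper and the identity it uses are made up for the example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical helper that commits with a placeholder identity.
commit() { git -c user.name="Example" -c user.email="example@example.com" commit -q -m "$1"; }

echo "bar" > foo; git add foo; commit "initial commit"
echo "foo" > foo; git add foo; commit "Second commit"

# Start at HEAD and keep following the "parent" header of each commit object.
current=$(git rev-parse HEAD)
count=0
while [ -n "$current" ]; do
  subject=$(git log -1 --format='%s' "$current")
  echo "$current $subject"
  count=$((count + 1))
  # Only look at the header (up to the first blank line). A root commit
  # has no parent line, which ends the loop.
  current=$(git cat-file -p "$current" | sed -n -e '/^$/q' -e 's/^parent //p' | head -n 1)
done
```

The loop visits "Second commit" first and "initial commit" last, which is the same order git log uses.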

Now, to make things a little more interesting and to create some awareness of what the git utilities are doing behind the scenes to make our lives easy, let’s create these objects manually using a few low level commands with the help of this new knowledge that we just acquired. For the purpose of this exercise, I will create a brand new repository and initialize git.

Let’s create the blob object for the file “foo” with the content “bar” as in the original example:

$ echo bar | git hash-object -w --stdin
5716ca5987cbf97d6bb54920bea6adde242d87e6

The -w switch tells git to write the object to the repository, and --stdin instructs it to read the contents from standard input. It then outputs the hash of the object that it just created.

Let’s look at the repository to see if it really was created:

$ find .git/objects -type f
.git/objects/57/16ca5987cbf97d6bb54920bea6adde242d87e6

So far git has been telling us the truth.

Now, let’s create a tree object. Since git relies on the index, or staging area, to determine the contents of the tree, we will use the git update-index command to set things up there. Note that the current directory is still empty; there is no “foo” file in it. The content is only available as a hashed object inside .git, and git still doesn’t know it’s called “foo”. To add the entry to the staging area so that the tree object can be written:

$ git update-index --add --cacheinfo 100644 5716ca5987cbf97d6bb54920bea6adde242d87e6 foo

This is equivalent to performing git add foo. Now git knows the file name of the object, but the tree object is not yet written to the object repository. To do that:

$ git write-tree
6a09c59ce8eb1b5b4f89450103e67ff9b3a3b1ae

This writes the tree object, and returns its hash. Let’s look at the file system again:

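$ find .git/objects -type f
.git/objects/57/16ca5987cbf97d6bb54920bea6adde242d87e6
.git/objects/6a/09c59ce8eb1b5b4f89450103e67ff9b3a3b1ae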
$ git cat-file -p 6a09c59ce8eb1b5b4f89450103e67ff9b3a3b1ae
100644 blob 5716ca5987cbf97d6bb54920bea6adde242d87e6 foo

The working directory still does not contain a “foo” file. Right now these objects are dangling, as there’s no commit object referencing them, and there is no commit from which to check out a copy of foo yet. Let’s create the commit object now:

$ echo "initial commit" | git commit-tree 6a09c5
c3352776341945bcdddd400d3765635bb2be5671

The short hash of the tree object, and optionally any parent commits (each given with -p), are passed as arguments to the git commit-tree command, which returns the hash of the new commit object. At this point the repository still has no idea what the last commit was, so running git log results in an error:

$ git log
fatal: bad default revision 'HEAD'

To fix this:

$ echo c3352776341945bcdddd400d3765635bb2be5671 > .git/refs/heads/master

Let’s look at the log again:

$ git log
commit c3352776341945bcdddd400d3765635bb2be5671
Author: Anuradha Weeraman
Date: Mon Jan 14 18:06:51 2013 +0530

initial commit

There you have it. Git now recognizes your last commit.
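Writing the hash into the ref file by hand works, but git also provides a plumbing command for it, git update-ref. The sketch below repeats the whole sequence with it and then adds a second commit whose parent is given with -p. The identity values are placeholders, and the branch name is read from HEAD rather than assumed.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Identity for the commit objects; these values are placeholders.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# blob -> index -> tree -> commit, as in the walkthrough.
blob=$(echo "bar" | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" foo
tree=$(git write-tree)
first=$(echo "initial commit" | git commit-tree "$tree")

# Point the current branch at the new commit without editing .git/refs by hand.
branch=$(git symbolic-ref HEAD)
git update-ref "$branch" "$first"

# A follow-up commit names its parent with -p.
blob2=$(echo "foo" | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob2" foo
tree2=$(git write-tree)
second=$(echo "Second commit" | git commit-tree "$tree2" -p "$first")
git update-ref "$branch" "$second"

git log --format='%h %s'
```

The tree hash depends only on the content, so the first tree comes out as 6a09c59ce8eb1b5b4f89450103e67ff9b3a3b1ae again. The commit hashes will differ from the ones in this post because they include the author, committer and timestamps.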

If you now list the directory where you initialized the git repository, you will not see any files, since all these objects were created directly in the git object repository. Now that we have created the commit object and the log shows the last commit, we can load the file into the directory to create a working copy. We do that by resetting the working directory to HEAD, which points at the latest commit.

To illustrate this more clearly:

$ ls -a
. .. .git (empty directory)
$ git reset --hard
HEAD is now at c335277 initial commit
$ ls -a
. .. .git foo
$ cat foo
bar

And voilà.
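In keeping with the low-level theme, the same result can be reached with plumbing instead of git reset --hard. The sketch below, which uses placeholder identity values, loads the tree of HEAD into the index with git read-tree and then writes the index entries to disk with git checkout-index.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Build the commit with plumbing only; the identity values are placeholders.
blob=$(echo "bar" | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" foo
tree=$(git write-tree)
commit=$(echo "initial commit" | \
  GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com" \
  GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com" \
  git commit-tree "$tree")
git update-ref "$(git symbolic-ref HEAD)" "$commit"

before=$(ls)            # nothing in the working directory yet

git read-tree HEAD      # load the tree of HEAD into the index
git checkout-index -a   # write every index entry to the working directory

after=$(ls)
echo "before: [$before] after: [$after]"
cat foo
```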

I hope this helps, and that you now have a better understanding of git’s guts.

A boy’s first computer

The week so far has been an eventful one. Being bed-ridden has made me pensive and nostalgic about my childhood, and made me long for simpler days. I was specifically dwelling on the subject of interpreters and compilers, which took me back to when I was nine, when I asked my uncle, whom I considered the pre-eminent guru in all things computers at the time, how to compile a .bat file.

At that time I was given an old 286 to play with. And when we moved around, so did my computer. Every week I would flip open the large case using a convenient latch on each side and peek in. I was enamored of the machine and eventually learnt what some of its parts were, and wondered how they worked. I sought out books and the help of my uncle to learn about it. I once spent a weekend at my uncle’s, where he showed me the difference between dir /p and dir /w and told me to try it out for about twenty minutes while he went to speak to someone. He taught me the basics of DOS, which I was always eager to try out on my own machine.

After I was done peeking, I would usually close the box up and meticulously clean it. It was kept in perfect condition next to my work desk on a custom-built blue computer table, which had a pull-out keyboard tray, a place to keep a printer, and some shelves below for various things. It was pretty big by today’s standards, but then everything was. The computer case was about 2.5′ x 2.5′ x 10″. It was too big for me to carry by myself, but I made sure I packed it safely for my vacation trips.

It also featured a 4 MB hard drive, 1 MB of RAM, a 14″ monochrome monitor and a 5.25″ floppy disk drive. It took a lot of trial and error to figure it out, and I spent many late nights trying to understand DOS, WordStar, WordPerfect, DBase III+, BASIC and Lotus 123. The command line baffled me and piqued my interest, and I learnt to love the blinking cursor on the green screen waiting for the next command.

It was a used machine, so it came with some customizations and a funky DOS-shell-like interface that was navigable through function keys but also let you escape into the shell. I spent a lot of time trying to figure out how it worked and how to modify it. It also came with a couple of games which I still fondly remember: digger, pacman and paratrooper. I played very few games after that. I recently traced this to skill-envy (a pseudo-psychological construct that I just coined): playing games made me wonder too much about how they were constructed, and not having the skills to build a similar game myself made me envious of the game’s author and the knowledge of the black art he possessed. Hence I preferred to stay away from games. I know, it’s childish, but in my defense I was a child.

Part of this black art was machine language. Printing the contents of .exe files showed a series of unintelligible characters, and yet the only executable programs I could create at the time were plain-text, readable .bat files. That was when I asked my uncle how I could convert a .bat file to an .exe file, which I viewed as inherently superior due to its mysterious nature. Knowing what I was trying to get at, he suggested I learn QuickBasic. I only had GWBasic installed on the computer. I came to realize that the syntax of QuickBasic was more or less the same, minus the explicit line numbers, so I taught myself GWBasic on my 286. Later I got a copy of QuickBasic and, lo and behold, there was an option to compile programs into the mysterious .exe file format that I could execute directly from the command line. This revelation was a turning point for me and I was hooked on QB.

Having outgrown the 286, I pestered my father to purchase a newer computer, this time a 66 MHz 486 DX2 with the “turbo” button. If turbo was turned off the computer ran slower, which baffled me. The computer also featured a 40 MB hard drive. That should last forever, I thought at the time.