I've seen many posts recently about SCM user interfaces and how one system is easier to learn, more powerful, or better suited to a particular development style than another. I submit that these arguments fail to capture the most salient feature of any source code management system: how the system manages the actual source code. This fundamental underpinning of the system, the repository structure, limits the kind of information the system can capture, determines the robustness and reliability of the data, and to a great extent constrains the kinds of repository interactions possible.

A few days ago, Havoc made a push for Subversion as a reasonable choice for projects. His complaints focus on the Git user interface, while repeating the mistaken claim that Git forces users to engage in distributed development.

I agree with Havoc that few projects are large enough in scale to require the kind of hierarchy seen in the Linux kernel. In fact, most projects have fewer than 10 developers working on them, and with close coordination, rarely see the need for any branching and merging at all.

However, as far as I know, none of the SCMs that provide distributed development insist that developers hide their work on long-lived branches and send patches up to a master maintainer. The distributed SCMs all allow either centralized or distributed development; it all depends on the conventions used within a project and individual developer style.

At X.org, we migrated from CVS to Git and yet have retained our largely centralized development model. There are few people publishing alternate trees, and we grant direct repository access to the same set of developers who used to have CVS access.

For really experimental work, we occasionally publish a temporary alternate repository as a way to distance it from the mainline further than a branch within the master repository would. We allow developers to publish such trees on a public server visible through the same web interface as the master repositories, so there remains a single central location to discover what work is going on within a given module.

Git provides us with three principal functional advantages:

  1. Offline repository access. Until you've used it, it's hard to understand just how often one ends up committing changes to a repository when the operation takes mere seconds. Havoc himself likes to save editor state every few minutes; with Git, he would be free to commit that state to the repository without significant additional delay.

    The ability to make very fine grained changes to the code encourages people to separate work into small comprehensible pieces. Both proactive review and reactive debugging benefit substantially from this kind of detail, allowing people to highlight significant small changes which would otherwise be lost in large functionally-neutral restructuring.

    Offline repository access is not the same as distributed development; changes are still pushed to a single shared public repository and included in a single line of development. Of course, simultaneous offline development often results in conflicts, but we've had that with CVS forever, and Git provides better merge-resolution tools than CVS ever did.

  2. Private branches. For those of us with ultra-secret hardware plans, we develop drivers for unreleased hardware in parallel with the development of the public project. Git makes this supremely easy by allowing us to keep the new hardware changes in a private repository while still tracking the public repository. When we're allowed to release the source code for the new hardware, we simply merge the private branch to the upstream master and push that to the public repository. All of the development history for the new hardware then becomes a part of the public source repository. (A sketch of this workflow follows this list.)

  3. Distributed backups. Even given freedesktop.org's reasonably reliable RAID disk array and daily tape backups, it's nice to know that around the world there are hundreds of people with complete backups of our source code repositories. If freedesktop.org is destroyed by earthquake, fire, flood or volcano, we can be confident that somewhere on the planet there will be complete and recent backups.

    Alternatively, if the freedesktop.org administration becomes evil and starts to manipulate source code to subvert users' machines, the distributed nature of our system means that the external developers will detect such changes and can easily repair them.
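
Here is a minimal sketch of that private-branch workflow, driving git from Python. Everything in it is a hypothetical stand-in: a throw-away "public" repository is created locally so the script is self-contained, and empty commits stand in for the real (secret) work.

```python
#!/usr/bin/env python3
"""Sketch of the private-branch workflow; paths and messages are invented."""
import os
import subprocess
import tempfile

def git(*args, cwd):
    # Identity is pinned so the sketch also runs where git is unconfigured.
    subprocess.run(["git", "-c", "user.name=Example Dev",
                    "-c", "user.email=dev@example.org", *args],
                   cwd=cwd, check=True)

top = tempfile.mkdtemp()
public = os.path.join(top, "public.git")   # stand-in for the public master repository
private = os.path.join(top, "private")     # the repository holding the secret work

# Build a little public history, then publish it as a bare repository.
os.makedirs(private)
git("init", cwd=private)
git("commit", "--allow-empty", "-m", "public history", cwd=private)
git("clone", "--bare", private, public, cwd=top)
git("remote", "add", "origin", public, cwd=private)
mainline = subprocess.run(["git", "symbolic-ref", "--short", "HEAD"],
                          cwd=private, check=True, capture_output=True,
                          text=True).stdout.strip()

# The ultra-secret hardware work lives on a branch that is never pushed,
# while the private repository keeps tracking the public one.
git("checkout", "-b", "secret-hardware", cwd=private)
git("commit", "--allow-empty", "-m", "support for unreleased hardware", cwd=private)
git("fetch", "origin", cwd=private)

# On release day, merge the private branch back to the mainline and push;
# the new hardware's whole history becomes part of the public repository.
git("checkout", mainline, cwd=private)
git("merge", "secret-hardware", cwd=private)
git("push", "origin", mainline, cwd=private)
```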

That's nice for us, but none of these may be compelling for people new to the distributed revision control world. Similarly, Git provides some nice tools to view and manage the repository (gitk, git bisect, etc.), again, useful but not compelling.

I would like to argue that none of the user-interface and high-level functional details are nearly as important as the fundamental repository structure. When evaluating source code management systems, I primarily researched the repository structures and essentially ignored the user interface details. We can fix the user interface over time and even add features. We cannot, however, fix a broken repository structure without all of the pain inherent in changing systems.

Given this argument, it should be clear that I think Git's repository structure is better than the others, at least for X.org's usage model. It has several interesting properties:

  1. Files containing object data are never modified. Once written, every file is read-only from that point forward.

  2. Compression is done off-line and can be delayed until after the primary objects are saved to backup media. This method provides better compression than any incremental approach, allowing data to be re-ordered on disk to match usage patterns.

  3. Object data is inherently self-checking; you cannot modify an object in the repository and escape detection the first time the object is referenced. (The sketch after this list illustrates this property along with the first.)
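
To make the first and third properties concrete, here is a toy content-addressed object store in Python. The function names and the flat "objects" directory are invented for illustration, and this is not Git's real on-disk layout (loose objects live under a fan-out directory and are usually repacked), but the idea is the same: an object's name is the SHA-1 of its contents, so files are written once and never modified, and corruption is caught the moment an object is read back.

```python
import hashlib
import os
import zlib

def store(objects_dir, data):
    # The object's name is the SHA-1 of a short header plus its contents,
    # so a stored file never needs to change once written.
    payload = b"blob %d\0" % len(data) + data
    name = hashlib.sha1(payload).hexdigest()
    path = os.path.join(objects_dir, name)
    if not os.path.exists(path):    # write-once: existing objects are left alone
        with open(path, "wb") as f:
            f.write(zlib.compress(payload))
    return name

def load(objects_dir, name):
    # Recompute the hash on every read; a damaged object is detected the
    # first time anything references it.
    with open(os.path.join(objects_dir, name), "rb") as f:
        payload = zlib.decompress(f.read())
    if hashlib.sha1(payload).hexdigest() != name:
        raise IOError("object %s is corrupt" % name)
    return payload.split(b"\0", 1)[1]

os.makedirs("objects", exist_ok=True)
oid = store("objects", b"int main(void) { return 0; }\n")
assert load("objects", oid) == b"int main(void) { return 0; }\n"
```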

Many people have complained about Git's off-line compression strategy, seeing it as a weakness that the system does not handle compression automatically. Admittedly, automatic is always nice, but in this case the off-line process gains significant performance advantages (all objects, independent of their original source file names, are grouped into a single compressed file), as well as reliability benefits (original objects can be backed up before being removed from the server). From measurements made on a wide variety of repositories, Git's compression techniques are far and away the most successful in reducing the total size of the repository. The reduced size benefits both download times and overall repository performance, as fewer pages must be mapped to operate on objects within a Git repository than within any other repository structure.
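
A toy comparison shows why grouping wins. This is not Git's actual pack algorithm (which computes explicit deltas between similar objects before compressing), but it demonstrates the underlying point: compressing many similar objects as one stream lets the compressor exploit redundancy between objects that per-object compression never gets to see.

```python
import zlib

# 200 revisions of a slowly growing file, compressed one object at a time
# versus as a single grouped stream.
lines = ["int table_%d[16];" % i for i in range(200)]
revisions = ["\n".join(lines[:n]).encode() for n in range(1, 201)]

per_object = sum(len(zlib.compress(rev)) for rev in revisions)
grouped = len(zlib.compress(b"\0".join(revisions)))

# Grouping lets the compressor reuse the redundancy between neighbouring
# revisions; compressing each object separately cannot.
print("compressed one object at a time: %d bytes" % per_object)
print("compressed as one group:         %d bytes" % grouped)
```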

Subversion appears to me to have the worst repository structure of all; worse even than CVS. It supports multiple backends, with two available in open source and one (by Google) in closed source. The old Berkeley DB-based backend has been deprecated as unstable and subject to corruption, so we will ignore that as obviously unsuitable. The new FSFS backend uses simple file-based storage and is more reliable, if somewhat slower in some cases.

The FSFS backend places one file per revision in a single directory; a test import of Mozilla generated hundreds of thousands of files in this directory, causing performance to plummet as more revisions were imported. I'm not sure what each file contains, but revisions appear to be written as deltas against an existing revision, so damage to one file propagates down through the generations. The lack of strong error detection means such errors will go undetected by the repository. CVS used to suffer badly from this when NFS would randomly zero out blocks of files.
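
My reading of that failure mode, expressed as a toy delta-chained store (the format below is invented for illustration and bears no relation to FSFS's actual file contents): each revision is a delta against its parent, reconstruction replays the chain, and a single corrupted delta silently damages every descendant.

```python
# Toy model of a delta-chained store: one record per revision, each holding
# a delta against its parent.

store = {0: {"full": ["original line %d" % i for i in range(5)]}}

def commit(rev, parent, changes):
    # 'changes' maps a line number to its replacement text.
    store[rev] = {"parent": parent, "changes": changes}

def checkout(rev):
    # Reconstruct a revision by replaying every delta back to the root.
    node = store[rev]
    if "full" in node:
        return list(node["full"])
    lines = checkout(node["parent"])
    for i, text in node["changes"].items():
        lines[i] = text
    return lines

commit(1, 0, {1: "fix for bug 1"})
commit(2, 1, {3: "fix for bug 2"})
commit(3, 2, {4: "fix for bug 3"})

# One flipped byte in revision 1's delta silently damages revisions 2 and 3
# as well, and without strong checksums nothing ever notices.
store[1]["changes"][1] = "fix for b\x00g 1"
print(checkout(3))
```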

The Mozilla CVS repository was 2.7GB; imported to Subversion, it grew to 8.2GB. Under Git, it shrank to 450MB. Given that a Mozilla checkout is around 350MB, it's fairly nice to have the whole project history (from 1998) in only slightly more space.

Mercurial uses a truncated forward delta scheme in which file revisions are appended to the repository file as a string of deltas, with occasional complete copies of the file (to provide a time bound on operations). This suffers from two possible problems. The first is fairly obvious: corrupted writes of new revisions can affect old revisions of the file. The second is more subtle: a system failure during a commit will leave the file contents half written. Mercurial has recovery techniques to detect this, but they involve truncating existing files, a corner of the Linux kernel which has constantly suffered from race conditions and other adventures.
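
Roughly, the scheme looks like the following toy model (not Mercurial's actual revlog format, and the "delta" here is just the text appended since the previous revision): deltas are appended to a log, with a complete copy written every few revisions so that reading any revision only has to walk back to the nearest snapshot.

```python
# Toy append-only revision log with periodic full snapshots.
SNAPSHOT_EVERY = 4
log = []   # entry r describes revision r: ("snap", full_text) or ("delta", added_text)

def commit(new_text, previous_text):
    if len(log) % SNAPSHOT_EVERY == 0:
        log.append(("snap", new_text))                        # occasional complete copy
    else:
        log.append(("delta", new_text[len(previous_text):]))  # appended text only

def checkout(rev):
    # Walk back only as far as the nearest snapshot; the complete copies
    # are what put a bound on how much work any single read can take.
    base = rev
    while log[base][0] != "snap":
        base -= 1
    text = log[base][1]
    for r in range(base + 1, rev + 1):
        text += log[r][1]
    return text

text = ""
for n in range(10):
    new_text = text + "change %d\n" % n
    commit(new_text, text)
    text = new_text

assert checkout(9) == text
assert checkout(0) == "change 0\n"
```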

I was looking seriously at Mercurial for X.org development, and was fortunate to spend a week last January with key developers from both Mercurial and Git. Discussions with both groups led me to understand that Git provided more of what X.org needed in terms of repository flexibility and stability than Mercurial did. The key detractor for Git was (and remains) the steep learning curve of the native Git interface, ameliorated for some users by alternate interfaces (such as Cogito), but not for core developers.

The other killer Git feature is speed. We've all gotten very spoiled by Git; many operations which take minutes under CVS now complete fast enough to leave you wondering if anything happened at all. This alone should be enough to convince anyone leaning towards Subversion or Bzr; fine-grained commits are only reasonable if the commit operation takes almost no time.

We were not particularly interested in the kind of massive distributed development model seen in the kernel, but the ability to work off-line (some of us spend an inordinate amount of time on airplanes) and still provide fine-grained detail about our work makes a purely central model less than ideal. Plus, the powerful merge operations that Git provides for the kernel developers are still useful in our environment, if not as heavily exercised.

I know Git suffers from its association with the wild and wooly kernel developers, but they've pushed this tool to the limits and it continues to shine. Right now, there's nothing even close in performance, reliability and functionality. Yes, the user interface continues to need improvements. Small incremental changes have been made which make the tools more consistent, and I hope to see those discussions continue. Mostly, the developers respond to cogent requests (with code) from the user community; if you find the UI intolerable, fix it. But, know that while the UI improves, the underlying repository remains fast, stable and reliable.

And yes, Havoc, anyone seriously entertaining moving to SVN should have their heads examined.