This post is the first part of a series on how we built Bitbucket Data Center to scale. Check out the entire series here.
Today is Bitbucket Data Center's second birthday! It really was two years ago that Stash Data Center 3.5 (as it was then called) became the first collaborative Git solution on the market built for massive scale.
On the day it was born, Bitbucket Data Center had just 7 customers (who had worked with us throughout the development and pre-release phases), and a small handful of add-ons whose vendors had made sure their products had already earned the Data Center compatible badge on day one.
Since those humble beginnings, Bitbucket Data Center has changed its name and experienced enormous growth in adoption, functionality, and deployment flexibility. Some highlights we're particularly proud of include:
Hundreds of thousands of users worldwide.
Over 100 Data Center compatible add-ons in Atlassian Marketplace.
Smart Mirroring, enabling large enterprises with teams around the world (but less-than-stellar network connections) to enjoy git clone and git fetch speeds just as fast as those in local offices.
Disaster Recovery and integrity checking, providing peace of mind for large enterprises that their repository data will always remain safe and available.
The Amazon CloudFormation template and Quick Start guide, taking the hassle out of deploying Bitbucket Data Center in the Amazon Web Services (AWS) environment and taking advantage of its managed services and auto-scaling capabilities.
And most recently, SAML support, enabling single sign-on for Bitbucket Data Center users across not just Atlassian products but all SAML-compatible products used by your development teams.
But the number one feature provided by Bitbucket Data Center since the beginning, and still the primary reason why many customers adopt it, is performance at scale. Large enterprises with thousands of active users and build agents hitting their central Bitbucket instance can't serve all their load with a single machine. Instead, sysadmins must use the scale features of Bitbucket Data Center to handle heavy loads without sacrificing performance for their users.
To celebrate Bitbucket Data Center's latest milestone, we'll describe some of the behind-the-scenes work we've been doing to make Bitbucket Data Center the first massively scalable Git solution, and still the leading and most performant product available today.
The scaling challenge
When it comes to scale, the most demanding load many Bitbucket instances deal with is managing Git hosting requests (simultaneous user-initiated commands like git clone, fetch, push, and so on).
This is because when you run a Git command that must communicate with a remote repository on Bitbucket, your Git client opens one or more connections to your Bitbucket instance (depending on whether you are using HTTP or SSH). When each of these connections reaches the backend server, after authentication and other processing, the server spawns a Git process for it and streams that process's standard input, output, and error output between the process and the client.
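To make the "spawn a process and stream its standard streams" step concrete, here is a minimal Python sketch. It is not Bitbucket's implementation: the function name `run_hosting_command` and the `send` callback are hypothetical, and the demo uses the Python interpreter as a stand-in for a Git process so that the example runs without a repository.

```python
import subprocess
import sys
import threading


def run_hosting_command(argv, request, send, chunk_size=65536):
    """Spawn a child process, feed it `request` on stdin, and pass each chunk
    of its stdout to `send`. Returns (exit_code, stderr_bytes).

    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    stderr_chunks = []

    def feed_stdin():
        # Write the client's request, then close stdin so the child sees EOF.
        try:
            proc.stdin.write(request)
            proc.stdin.close()
        except OSError:
            pass  # the child exited before reading all of its input

    def drain_stderr():
        # Drain stderr on its own thread so a full pipe cannot block the child.
        stderr_chunks.append(proc.stderr.read())

    workers = [
        threading.Thread(target=feed_stdin),
        threading.Thread(target=drain_stderr),
    ]
    for worker in workers:
        worker.start()

    # Stream stdout to the caller chunk by chunk instead of buffering it all.
    for chunk in iter(lambda: proc.stdout.read(chunk_size), b""):
        send(chunk)

    for worker in workers:
        worker.join()
    return proc.wait(), b"".join(stderr_chunks)


if __name__ == "__main__":
    # Stand-in child: echoes its input in upper case and logs to stderr.
    child = [
        sys.executable,
        "-c",
        "import sys; d = sys.stdin.buffer.read(); "
        "sys.stdout.buffer.write(d.upper()); sys.stderr.write('done')",
    ]
    received = []
    code, err = run_hosting_command(child, b"hello", received.append)
    print(code, b"".join(received), err)
```

The point of the sketch is that each incoming connection owns a whole operating-system process for the lifetime of the request, which is why the cost of that process dominates the server's load.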
These Git processes on the server are CPU and memory intensive, especially when they generate packfiles to transfer repository contents over the network. By comparison, most other kinds of operation you can perform against a Bitbucket instance (like browsing around, interacting with pull requests, and so on) are generally much lighter and faster.
These graphs of the CPU and memory usage of a typical git clone operation on a server might help to illustrate just how resource intensive Git hosting operations can be. The blue lines show the resource consumption of the Git process, and the red lines show that of Bitbucket. Bitbucket does a bit of work when the connection comes in and then hands it off to Git, using very little CPU itself. The first thing Git does is create a packfile, consuming about 100% of a CPU core for a while. It then does some compression, which consumes even more (Git is multi-threaded). Its memory consumption also climbs during these phases, often by a few hundred MBytes or more. After the packfile has been generated, streaming it back to the client uses hardly any CPU in Git or Bitbucket, but the memory allocated by Git isn't released until the request has been fully served and the process exits.
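If you want a rough feel for these numbers on your own machine, the following Python sketch measures the CPU time and peak memory of a single child process. It is an illustration under assumptions, not the tooling used to produce the graphs: the helper name `measure_child` is hypothetical, it relies on the Unix-only `resource` module, and the demo runs a small Python workload rather than a real git clone.

```python
import resource
import subprocess
import sys


def measure_child(argv):
    """Run `argv` to completion and report its CPU time and peak memory.

    Hypothetical helper; Unix only, because it uses the `resource` module.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    completed = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # User plus system CPU time added by the child we just waited for.
    cpu_seconds = (after.ru_utime - before.ru_utime) + (
        after.ru_stime - before.ru_stime
    )
    return {
        "returncode": completed.returncode,
        "cpu_seconds": cpu_seconds,
        # High-water mark over all children waited for so far, not only this
        # one. Reported in kilobytes on Linux and in bytes on macOS.
        "max_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a Git command to measure Git itself.
    print(measure_child([sys.executable, "-c", "sum(range(10**6))"]))
```

This only gives totals once the process has exited. Plotting usage over time, as in the graphs, would need periodic sampling while the process is still running.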