What is Git LFS?
As you can read on the Git LFS project page:
Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server.
What is important here is that the contents of large files are stored outside the Git repository, on a remote server: the Git LFS server. The LFS support in TeamForge is based on the open source Gerrit LFS plugin. This plugin provides a Git LFS server implementation that offers two different backends:
- fs, which simply stores the file content on the filesystem of the server, and
- S3, which uses AWS S3 storage.
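To make the "text pointers" from the quote above a bit more concrete, here is a minimal Python sketch of what such a pointer looks like. It follows the publicly documented Git LFS pointer format (a version line, a SHA-256 object ID, and the size in bytes). The function names and the sample content are made up for illustration; this is not code from git-lfs or from the Gerrit LFS plugin.

```python
import hashlib

# Version line defined by the Git LFS pointer file specification.
POINTER_SPEC = "https://git-lfs.github.com/spec/v1"


def make_pointer(content: bytes) -> str:
    """Return the pointer text that Git would store instead of `content`."""
    oid = hashlib.sha256(content).hexdigest()
    return f"version {POINTER_SPEC}\noid sha256:{oid}\nsize {len(content)}\n"


def parse_pointer(text: str) -> dict:
    """Split a pointer back into its three fields."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    return {
        "version": fields["version"],
        "oid": fields["oid"].split(":", 1)[1],
        "size": int(fields["size"]),
    }


# Hypothetical stand-in for a large binary file.
large_file = b"pretend this is a large video file"
pointer = make_pointer(large_file)
print(pointer)
```

Only these few lines of text end up in the Git repository. The actual bytes are stored by the Git LFS server under the object ID, in whichever backend (fs or S3) is configured.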
How does Git LFS work?
Before we continue, let’s have a brief look at how Git LFS works. First of all, on the client side, there is the git-lfs extension that needs to be installed on the user machines. This extension is responsible for filtering out large files, storing them on the Git LFS server, and retrieving them from there. I just want to make clear that the git-lfs package is able to talk to any Git LFS server implementation. Have a look here to see more of them. Finally, on the server side, there is the Git LFS server. As mentioned before, in our integration the Git LFS server is the part provided by the Gerrit LFS plugin.
So why do we use the Gerrit plugin and not another implementation? Well, because without this plugin the Git LFS server is completely independent of our TeamForge integration with Gerrit. That means one needs to configure the whole thing manually: set up the required URLs on the client side, and take care of user authorization and access rights for LFS files separately. Certainly, that’s all possible. There is even a blog post about such a configuration with Artifactory used as the Git LFS server. However, with our TeamForge integration this all works out of the box. No need to worry about configuration, users, and permissions. All this is already in place.
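To give a rough idea of what the client and server talk about, the sketch below builds (but does not send) the kind of request a Git LFS client issues before downloading objects. It is based on the public Git LFS batch API and on the default endpoint convention of the git-lfs client (remote URL plus `.git/info/lfs`). The host name, project name, and function names are hypothetical, and the URL layout of a particular server may differ.

```python
import json
import urllib.request


def default_lfs_endpoint(remote_url: str) -> str:
    """Derive the LFS endpoint the way the git-lfs client does by default."""
    base = remote_url.rstrip("/")
    if not base.endswith(".git"):
        base += ".git"
    return base + "/info/lfs"


def build_batch_request(lfs_endpoint, operation, objects):
    """Build a batch API request asking the server where the objects live.

    `objects` is a list of (oid, size) pairs taken from pointer files.
    """
    body = {
        "operation": operation,  # "download" or "upload"
        "transfers": ["basic"],
        "objects": [{"oid": oid, "size": size} for oid, size in objects],
    }
    return urllib.request.Request(
        lfs_endpoint + "/objects/batch",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/vnd.git-lfs+json",
            "Content-Type": "application/vnd.git-lfs+json",
        },
        method="POST",
    )


# Hypothetical remote; no network traffic happens here.
endpoint = default_lfs_endpoint("https://gerrit.example.com/my-project")
request = build_batch_request(endpoint, "download", [("0" * 64, 1024)])
print(request.get_method(), request.full_url)
```

The server answers such a request with the actual download or upload locations for each object. Wiring up the endpoint, authentication, and permissions for this exchange by hand is exactly the manual work that the integrated plugin takes care of.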
Git LFS in TeamForge
Back to our LFS setup. By default, Git LFS in TeamForge is configured to use the file system backend. As a result, the files are located on the local file system, outside of the Git repository. This has consequences for the replication feature. Now, I don’t want to go too deep into the details of the TeamForge replication feature, and the view I provide here is simplified a bit. If you are interested in more details on replication, you might refer to this blog post. As a starting point, let’s have a look at the relevant details of the replication mechanism presented below. To keep things simple, there is only a single repository, without LFS.
What we see here is the replication of a single Gerrit repository between two geographical locations. Below we have two users who use the corresponding repository. The blue one (in the United States) can access the data directly from the master server located in the same location. The green one (in Europe) fetches data from the replica. One thing to note: the replication destination is read-only from the client perspective. Hence, the replica user still pushes to the master location. As a result, the push operations, which typically contain less data, go to the more distant location (the master), while the costly fetch operations are performed in the same geographical location.
Git LFS and TeamForge replication
Now let’s have a look at Git LFS in the context of replication. In short, the idea of LFS is to store the large files outside of the Git repository. However, Git replication only replicates Git repositories; it knows nothing about Git LFS. That basically means that LFS data simply will not be replicated by the Git replication mechanism. This is what happens if one decides to use Git LFS with file system storage:
As you can see above, while the Git data gets replicated to the replicas, the LFS content stays on the master. This works fine, but every time a user fetches data from the replica, the client has to go to the master to get the content of the LFS files. Depending on the LFS usage pattern, this might work OK or be unacceptably slow, as the advantages of replication simply do not apply to any files stored in LFS.
Ideally, the LFS data should be replicated to the replica location as well, so that we fetch the data from the same geographical location. One solution to this problem is to use the LFS S3 backend, configured to take advantage of cross-region replication for Amazon S3.
In the picture above we see Git LFS on the master in the us-west-2 region, which replicates to the eu-west-2 region in Europe. Consequently, all the fetches in Europe are again served from the same region, and only push operations go to the master server.
Of course, this is just an example. When you configure the LFS S3 backend, you can specify any regions as long as cross-region replication is enabled between them.
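As a rough preview of what "cross-region replication enabled" means on the S3 side, the sketch below builds the kind of replication rule that S3 expects on the source bucket. The bucket names, the IAM role ARN, and the rule ID are hypothetical placeholders, and the function name is made up. The region of each bucket is fixed when the bucket is created (us-west-2 and eu-west-2 in the example above), and S3 requires versioning to be enabled on both buckets.

```python
def replication_config(role_arn: str, destination_bucket: str) -> dict:
    """Build an S3 replication configuration that copies every object."""
    return {
        # IAM role that S3 assumes to copy objects (hypothetical ARN below).
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-lfs-objects",  # arbitrary, illustrative name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: match all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


config = replication_config(
    role_arn="arn:aws:iam::123456789012:role/lfs-replication",  # placeholder
    destination_bucket="my-lfs-bucket-eu-west-2",                # placeholder
)

# With boto3 installed and credentials configured, this dictionary could be
# passed to the source bucket in us-west-2 roughly like this (not executed):
#   boto3.client("s3").put_bucket_replication(
#       Bucket="my-lfs-bucket-us-west-2", ReplicationConfiguration=config)
print(config["Rules"][0]["Destination"]["Bucket"])
```

This only illustrates the shape of the setting; it is not a complete or authoritative setup.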
Finally, the question is: how to get there? Well, this is what this blog post series is about.
- The next blog post will tell you how to configure an S3 bucket with cross-region replication so that it is ready to be used with Git LFS.
- In the last blog post I will show you how to configure Git LFS to use the S3 setup prepared in the preceding post and how to move the existing Git LFS data from a filesystem backend to S3.