On my current project, Git worktrees have become an essential part of my workflow for running test suites in parallel. Nevertheless, when I look at my colleagues, I think worktrees are still not used as much as they could be. In this post I will shed some light on when I use worktrees (and when I do not), how to think about them, and how I use them in practice.
The code in this post has been executed at build time of this website with the following version of Git:
git --version
git version 2.50.1
Worktrees or stashes?
To start learning more about Git worktrees, the official Git documentation is a great starting point. The documentation helpfully provides an example of when you might want to use them:
You are in the middle of a refactoring session and your boss comes in and demands that you fix something immediately. You might typically use git-stash[1] to store your changes away temporarily, however, your working tree is in such a state of disarray (with new, moved, and removed files, and other bits and pieces strewn around) that you don’t want to risk disturbing any of it. Instead, you create a temporary linked worktree to make the emergency fix, remove it when done, and then resume your earlier refactoring session.
Although the example uses some technical terms, the general gist should be clear if we take a worktree or working tree to mean a directory you use for coding, staging and committing your changes. A linked worktree is then the equivalent of a second working directory, containing a copy of the code at a commit of your choosing. We will clarify these terms when we dive into the details, but for now this is a fine working definition.
Unfortunately, the example does not clarify what is meant by a “state of disarray”, and I will admit it does not convince me that you should use worktrees. Most of the time I actually prefer stashing my changes and switching to a different branch. In fact, the only time I use linked worktrees is when stashing would not solve my problem. Of course, this raises the question: which problem do worktrees solve that stashing cannot?
When a single directory is not enough
When working on a data engineering project, it is generally a good practice to have a suite of end-to-end tests. Even though end-to-end tests should preferably be run in CI, it can still be a good idea to first run the test suite locally to avoid unnecessarily blocking a runner if your test would have failed anyway. Indeed, such an end-to-end test suite will often take a while to run, as it has to push a substantial amount of data through your pipelines to cover most of the happy path. At my current project, running the full end-to-end test suite takes around an hour. Because the test suite takes such a long time to finish, we have also set it up such that it writes its output to disk.
Due to this persistence to disk and the long execution time, I am effectively blocked from making changes in my working tree while the test is running. If in the meantime a colleague requests a code review and I want to run the test suite for their branch, I now have to choose between stopping the already running test suite and putting the code review on hold. Neither of these options is ideal.
The solution to this problem is to create a second directory with the code from my colleague's branch and without the output of the already running test suite. This is precisely what a linked worktree is!
Of course, long-running test jobs are only one particular example. The same line of reasoning holds for any long-running job that could block your initial directory for a long time, compilation being another such example. In other words,
Git worktrees allow me to cleanly run long-running jobs in parallel.
For this use case, I sometimes like to think of this workflow as similar to cloning a repository multiple times, but without any of the downsides of trying to keep multiple clones in sync.
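To make this concrete, here is a minimal sketch of the situation described above. It runs in a throwaway directory; the branch name colleague-feature, the directory name review and the test-output folder are placeholders I made up for illustration, not names from my actual project.

```shell
set -eu
# Scratch area so nothing outside a temporary directory is touched.
root=$(cd "$(mktemp -d)" && pwd -P)
git init -q "$root/project"
cd "$root/project"
touch foo
git add foo
git -c user.name=Demo -c user.email=demo@example.com commit -q -m "Add foo"

# A colleague's branch that I want to test (hypothetical name).
git branch colleague-feature

# Pretend a long-running job is writing its output into the main worktree.
mkdir -p test-output
echo "still running" > test-output/status

# Check the colleague's branch out in a second directory next to the first.
git worktree add "$root/review" colleague-feature

# The new directory holds the code of that branch, but none of the test output.
ls -A "$root/review"
```

A second test run can now be started from the review directory while the first one keeps running undisturbed in project.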
Working trees and repositories
Before we explain what a worktree is and how the git worktree commands operate, let us briefly go over what the terms “working trees” and “repositories” mean and how they relate to each other. When we initialize (or clone) a non-bare repository, Git creates both a repository and a working tree:
git init /tmp/project
Initialized empty Git repository in /tmp/project/.git/
The directory /tmp/project is known as the working tree and can roughly speaking be thought of as a workspace that allows you to stage and commit your files. The /tmp/project/.git directory is called the repository and is fully managed by Git. It contains all the data and metadata Git requires for its version control capabilities. Although you might colloquially call /tmp/project the repository, it is important to keep in mind that the repository is actually the .git directory it contains.
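If you want to check which directory Git considers to be the working tree and which one the repository, git rev-parse can report both. The sketch below uses a temporary directory instead of /tmp/project so that it can be run safely anywhere:

```shell
set -eu
# Resolve the scratch directory to its real path so the printed paths match.
root=$(cd "$(mktemp -d)" && pwd -P)
git init -q "$root/project"
cd "$root/project"

# Top-level directory of the working tree.
git rev-parse --show-toplevel

# Absolute path of the repository (the .git directory).
git rev-parse --absolute-git-dir
```

The first command prints the project directory itself, the second one the .git directory inside it.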
Let us create a simple commit and sketch the structure so far:
cd /tmp/project
touch foo
git add foo
git commit -m "Add foo"
[main (root-commit) 1ad00df] Add foo
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 foo
Figure 1: Project structure after initialization of a Git repository.
In this figure and in all the next ones, we will depict the repository in orange and the working tree(s) in green.
It is important to understand that this is only the default setup. Even without resorting to worktrees, we can instruct Git to create the repository outside of the working tree, or to leave out the working tree altogether, by specifying additional initialization flags. Let us have a brief look at how this works.
Bare repositories
Bare repositories are mainly used on remote servers to host Git repositories to push and pull code to and from. As such, you do not typically directly work with a bare repository. Unlike a non-bare repository, a bare repository therefore is not initialized with an associated working tree (although it can have one).
A bare repository is initialized by using the --bare flag as follows:
git init --bare /tmp/project-2.git
Initialized empty Git repository in /tmp/project-2.git/
Figure 2: A bare repository without any working trees.
If you are using a bare repository, it is common practice to name it after the project name, but to suffix it with .git.
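As a quick sanity check, you can ask Git whether a repository is bare. This is a small sketch in a temporary directory; the repository name simply mirrors the example above.

```shell
set -eu
root=$(cd "$(mktemp -d)" && pwd -P)
git init -q --bare "$root/project-2.git"
cd "$root/project-2.git"

# Prints "true" for a bare repository.
git rev-parse --is-bare-repository

# The files that would normally live inside .git sit directly in this directory.
ls
```

Note that there is no .git subdirectory here: the directory itself is the repository.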
Separate repository
Let us now consider the other example of separating the repository from its working tree. We can create a non-bare repository which is not contained in its associated working tree by using the --separate-git-dir option:
git init --separate-git-dir /tmp/project-3.git /tmp/project-3
Initialized empty Git repository in /tmp/project-3.git/
Figure 3: A non-bare repository not contained in the initial working tree.
If the repository is not contained in the working tree, how is Git informed of the location of the working tree? You might assume the repository contains a pointer to the working tree, but it is actually the other way around! The working tree itself contains a .git file which points back to the repository:
cat /tmp/project-3/.git
gitdir: /tmp/project-3.git
This is why you can still use all Git commands in the working tree, even when it is separated from its repository.
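The following sketch repeats the experiment in a temporary directory and asks Git, from inside the working tree, where it thinks the repository lives. The directory names mirror the example above.

```shell
set -eu
root=$(cd "$(mktemp -d)" && pwd -P)
git init -q --separate-git-dir "$root/project-3.git" "$root/project-3"
cd "$root/project-3"

# .git is a regular file here, not a directory.
cat .git

# Git follows that file and reports the separate repository location.
git rev-parse --absolute-git-dir

# Ordinary commands work as usual from the working tree.
git status --short
```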
I am not aware of a naming convention for repositories that are separate from their working tree, but I think it is good practice to follow the same naming scheme as for bare repositories. Just be aware that project-3.git does not refer to a bare repository in this case.
Adding more working trees
We can extend this idea with the git worktree commands: with Git worktrees, we can create any number of working trees for a repository at any location we desire. Adding working trees is done through the git worktree add command:
git worktree add ../project-worktree -b project-worktree
Preparing worktree (new branch 'project-worktree')
HEAD is now at 1ad00df Add foo
By default, Git refuses to check out a branch in more than one working tree at a time. Upon creation of a working tree, we therefore also create a new branch using the -b flag, circumventing this limitation.
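You can see this restriction in action with the sketch below. It tries to check out the branch of the main worktree a second time, which Git refuses, and then creates a new branch instead. The directory names duplicate and topic and the branch name topic are made up for this illustration.

```shell
set -eu
root=$(cd "$(mktemp -d)" && pwd -P)
git init -q "$root/project"
cd "$root/project"
git -c user.name=Demo -c user.email=demo@example.com commit -q --allow-empty -m "Initial commit"

# Name of the branch that is checked out in the main worktree.
current=$(git symbolic-ref --short HEAD)

# Asking for that same branch in a second working tree is refused.
if git worktree add "$root/duplicate" "$current" 2>/dev/null; then
    refused=no
else
    refused=yes
fi
echo "second checkout of $current refused: $refused"

# Creating a fresh branch for the new working tree works fine.
git worktree add -b topic "$root/topic"
git worktree list
```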
So far we have loosely used the terms worktree and working tree interchangeably, but there is actually a slight difference: to keep track of different working trees, Git uses some metadata, and it is the combination of this metadata and the working tree that is known as the worktree. The initially created worktree is the main worktree and any worktrees added with the git worktree add command are called linked worktrees.
Similar to the case of a working tree separated from its repository, linked worktrees do not contain a .git directory, but a .git file. This file points back to the worktrees subdirectory of the repository:
cd ../project-worktree
cat .git
gitdir: /tmp/project/.git/worktrees/project-worktree
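To explore this a bit further, the sketch below recreates the setup in a temporary directory and compares two locations that git rev-parse can report from inside a linked worktree: the per-worktree metadata directory and the repository shared by all worktrees.

```shell
set -eu
root=$(cd "$(mktemp -d)" && pwd -P)
git init -q "$root/project"
cd "$root/project"
git -c user.name=Demo -c user.email=demo@example.com commit -q --allow-empty -m "Initial commit"
git worktree add -b project-worktree "$root/project-worktree"

cd "$root/project-worktree"

# The pointer file of the linked worktree.
cat .git

# Metadata that belongs to this worktree only (its own HEAD, index, ...).
git rev-parse --absolute-git-dir

# The repository that all worktrees share.
git rev-parse --git-common-dir

# The per-worktree metadata lives inside the shared repository.
ls "$root/project/.git/worktrees/project-worktree"
```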
Because of this pointer, we can interact with the repository from our linked worktree /tmp/project-worktree with the exact same commands as from our main worktree. For instance, let us add and commit a file bar and sketch the structure of the repository and its working trees:
touch bar
git add bar
git commit -m "Add bar"
[project-worktree d2b5a23] Add bar
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 bar
Figure 4: A repository with a main worktree and a single linked worktree.
Git provides a command to list all worktrees associated with a repository. The main worktree is listed first:
git worktree list
/tmp/project           1ad00df [main]
/tmp/project-worktree  d2b5a23 [project-worktree]
If you do not need a worktree anymore, you can remove it as follows:
git worktree remove project-worktree
By default, Git tries to ensure you do not lose any data when you remove a worktree. As such, Git will refuse to remove the main worktree or any linked worktree that contains modified or untracked files.
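The sketch below shows this safety check at work. It creates a linked worktree with an untracked file, tries to remove it, and then removes it with --force, which discards the untracked file. The names scratch and notes.txt are placeholders for this illustration.

```shell
set -eu
root=$(cd "$(mktemp -d)" && pwd -P)
git init -q "$root/project"
cd "$root/project"
git -c user.name=Demo -c user.email=demo@example.com commit -q --allow-empty -m "Initial commit"
git worktree add -b scratch "$root/scratch"

# An untracked file makes the linked worktree "unclean".
touch "$root/scratch/notes.txt"

# A plain remove is refused, so no data is lost by accident.
if git worktree remove "$root/scratch" 2>/dev/null; then
    removed=yes
else
    removed=no
fi
echo "removed without --force: $removed"

# With --force the worktree and its untracked file are deleted.
git worktree remove --force "$root/scratch"
git worktree list
```

Removing a worktree does not delete the branch that was checked out in it, so the scratch branch still exists afterwards.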
How I structure my worktrees
Keeping track of multiple worktrees causes a bit of extra overhead, so if you do start using them, I would recommend keeping some structure to them. Personally, I like to create worktrees either for a long-lived branch, such as a development branch, or for a recurring type of work like developing, bug fixing or code reviewing. In both of these cases, I like to use specific naming conventions to keep track of them.
If I want to follow the development branch dev of project, I would create a worktree called project@dev. For a worktree used for a specific workflow, I would use a + sign instead. For instance, the worktree for code reviews would be called project+review. In both cases, I keep these linked worktrees in the same directory as the main worktree. The main worktree retains its original name project and I often configure it to simply track the main branch of my repository.
These conventions allow me to start with only the main worktree and add and remove linked worktrees as I feel necessary. Any working tree containing a + or @ in its name is a linked worktree and can safely be deleted without permanently corrupting or deleting my repository. This works even if I accidentally delete the directory instead of using git worktree remove.
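As an illustration of this layout, the sketch below sets up both kinds of worktrees next to a main worktree in a temporary directory and then simulates the accidental deletion. The branch names dev and review are assumptions for this example. After a directory is deleted by hand, the leftover metadata can be cleaned up with git worktree prune.

```shell
set -eu
root=$(cd "$(mktemp -d)" && pwd -P)
git init -q "$root/project"
cd "$root/project"
git -c user.name=Demo -c user.email=demo@example.com commit -q --allow-empty -m "Initial commit"

# One worktree following a long-lived branch, one for a recurring workflow.
git worktree add -b dev "$root/project@dev"
git worktree add -b review "$root/project+review"
git worktree list

# Accidentally delete the directory instead of using "git worktree remove".
rm -rf "$root/project+review"

# The repository is intact; prune only discards the stale worktree metadata.
git worktree prune
git worktree list
```

The commits on the review branch are not affected by any of this, since they live in the repository and not in the deleted directory.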
Portable worktrees
To round off this post, I would like to share an interesting discovery I made while reading through the worktree documentation:
If the working tree for a linked worktree is stored on a portable device or network share which is not always mounted, you can prevent its administrative files from being pruned by issuing the git worktree lock command, optionally specifying --reason to explain why the worktree is locked.
Why would you want to use this? One reason I can think of is that it can be useful when you are working in a memory-constrained environment. Since linked worktrees (similar to repositories with separate working trees) only contain a pointer to the Git repository, you could, for example, create them on an embedded system. In this way you could still use Git to develop, but without the memory footprint of storing the entire Git repository. This is precisely the use case I came across in Developing for CircuitPython with git-worktree. It is a nice trick to have in your toolbox and a good example of the value of clear documentation!
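As a rough sketch of what locking does, the example below locks a linked worktree, simulates an unmounted device by moving the directory away, and shows that pruning leaves the metadata of the locked worktree alone. The name portable and the reason text are made up for this illustration, and a real portable device would of course be a mount point rather than a directory that gets moved.

```shell
set -eu
root=$(cd "$(mktemp -d)" && pwd -P)
git init -q "$root/project"
cd "$root/project"
git -c user.name=Demo -c user.email=demo@example.com commit -q --allow-empty -m "Initial commit"
git worktree add -b portable "$root/portable"

# Lock the worktree and record why.
git worktree lock --reason "lives on a removable drive" "$root/portable"

# Simulate unplugging the drive: the working tree is no longer reachable.
mv "$root/portable" "$root/unplugged"

# Pruning skips locked worktrees, so the administrative files survive.
git worktree prune
ls "$root/project/.git/worktrees"

# "Plug the drive back in" and unlock the worktree again.
mv "$root/unplugged" "$root/portable"
git worktree unlock "$root/portable"
git worktree list
```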