This writeup discusses some of the things that I thought about when designing my backup strategy.
For now, there are two types of data that I care about:
The real difference between the two is that entries in group 1 are expected to change incrementally, whereas entries in group 2 aren’t expected to change. They either exist in my collection or they don’t.
For the first group, I use git to version control repositories of related things, like notes or code by project. For the second group, I use git-annex. These tools don’t serve as backups, but they play a role in the strategy. In both cases, they provide a nice mechanism for being able to pack and unpack a lot of data with simple commands.
The first part of a backup strategy is to determine the level and quality of risk to your data. That process begins with an inventory.
I’m backing up the following kinds of things:
It’s a digital pile of my thoughts and experience over my lifetime. I definitely don’t want to lose that. But I’m also not guarding state secrets, here. Here’s a breakdown of the risk:
It seems like I can cover almost all of my known risk by having multiple backups that are separated by space and time. That should be enough to get started developing an appropriate backup strategy.
More factors are worth considering eventually, but focus first on a minimum viable strategy. Afterwards, you might consider things like:
There are a few ways you could make a backup of a Git repository. My general impression is the following:
- Use git clone --mirror <repo> <backup> when you want easy, in-place updates.
- Use git bundle when you want an archived copy.
- Avoid just copying the .git directory. In theory that contains everything, but the copy isn’t an atomic operation and you risk corrupting files, at least in multi-user, hot-copy scenarios.

The experiments that I ran are in Git Repo Backups, but the summary above describes the results pretty well.
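As a rough sketch of the first two approaches (the repository and backup paths here are placeholders, not my actual layout):

# Mirror clone: a bare copy that can be refreshed in place later.
git clone --mirror git@host.local:/srv/git/project.git /backups/git/project.git
git -C /backups/git/project.git remote update   # refresh an existing mirror

# Bundle: a single-file archive of all refs, convenient for cold storage.
git -C /srv/git/project.git bundle create /backups/bundles/project.bundle --all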
The exact procedure varies based on what went wrong, but the goal is the same: restore the repository. To do it without disrupting client repos, restore it to the same remote location, so that existing clones’ configured remotes keep working.
Restoring the repository could be as simple as copying it back to the old location, but it could be more involved. If the disk fails, you have to replace it. Worst case, you have to redeploy on a new host, which means setting up a whole bunch of prerequisite infrastructure first. Honestly, if it comes to that, since I’m the only user, I’m probably just adding the backup as a remote on each of my projects, then making a new backup somewhere else.
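In the simple case, restoring a mirror back to its old location might look something like this (paths illustrative):

# Copy the mirrored backup back into place on the Git host.
mkdir -p /srv/git
rsync -a /backups/git/project.git/ /srv/git/project.git/

# Sanity check: the restored repository should pass a consistency check.
git -C /srv/git/project.git fsck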
This is how you create a backup of a central annex.
# Mirror the central annex into a bare backup repository.
git clone --mirror git@host.local:/path/to/annex.git
cd annex.git

# Initialize git-annex, put this repository in the "backup" group, and use the
# standard preferred content expression for that group (which wants all content).
git annex init
git annex group . backup
git annex wanted . standard

# Pull the annexed content into the backup.
git annex sync --content
Because this is intended as a backup and not something that contributes back to the main repo, don’t add the backup as a remote of the main repo.
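As a quick sanity check (not part of the procedure above), running git annex info inside the backup should report a non-trivial local annex size once the sync finishes:

cd annex.git
# Reports local annex size and the repositories git-annex knows about.
git annex info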
To update, you might be able to just do git annex sync --content. You might also consider doing git remote update beforehand.
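Putting those together, an update pass over the backup might look like this (assuming, for the sake of the example, that the backup lives at /backups/annex.git):

cd /backups/annex.git
# Fetch the latest refs from the main annex, then pull any new content.
git remote update
git annex sync --content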
The backup is a lagging copy of the main annex.
Honestly, if you put the duplicated backup in the same place as the original, you probably only have to do step 1. Of course, it goes without saying that you should diagnose and resolve whatever issue made you need the backup in the first place.
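As a sketch of that simple case, putting the backup back where the original annex lived could be little more than a copy, followed by a sanity check from a client clone (the backup path here is a placeholder):

# Copy the bare backup over the original annex’s location.
rsync -a /backups/annex.git/ git@host.local:/path/to/annex.git/

# From any client clone, confirm the restored remote is reachable and
# that git-annex still knows where the content lives.
git fetch origin
git annex whereis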
I made a shell script to back up Git repositories. It takes a source directory and a target directory. It scans the source directory for Git repositories (as indicated by a .git suffix or a .git subdirectory), then decides whether to mirror each one (if the target doesn’t already exist) or to update it (if the target already exists).
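The script itself isn’t reproduced here, but a minimal sketch of the same idea might look like the following; the directory layout, naming, and error handling are assumptions, not the original script.

#!/bin/sh
# Usage: backup-git-repos.sh SOURCE_DIR TARGET_DIR
# Mirrors every Git repository found directly under SOURCE_DIR into TARGET_DIR,
# or updates the existing mirror if one is already there.
set -eu

src=$1
dst=$2
mkdir -p "$dst"

for repo in "$src"/*; do
    name=$(basename "$repo")

    # A repository is either bare (name ends in .git) or has a .git subdirectory.
    if [ -d "$repo/.git" ] || [ "${name%.git}" != "$name" ]; then
        target="$dst/${name%.git}.git"
        if [ -d "$target" ]; then
            # The mirror already exists: refresh it in place.
            git -C "$target" remote update
        else
            # No mirror yet: create one.
            git clone --mirror "$repo" "$target"
        fi
    fi
done

Invoked as backup-git-repos.sh ~/projects /backups/git, it would mirror each project into /backups/git/<name>.git on the first run and refresh the mirrors on later runs.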
For Git Annex, it’s simpler because I only really have one annex. Using git annex sync --content with the right preferred content settings is good enough. The reason I wrote a script for normal Git repositories is simply that I have so many of them and they are constantly in flux, but I keep them in the same few directories.
It’s important to note that the script or the git annex sync command must be run by someone in the git user group.
To schedule it, choose an appropriate git user, run crontab -e, and add the script and commands to the crontab.
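For example, that user’s crontab might contain entries along these lines (the schedule, script path, and backup locations are all illustrative):

# m h dom mon dow  command
# Mirror or update the Git repositories nightly at 02:00.
0 2 * * * /usr/local/bin/backup-git-repos.sh /srv/git /backups/git

# Pull new annex content nightly at 03:00.
0 3 * * * cd /backups/annex.git && git remote update && git annex sync --content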