Deciding My Backup Strategy

Kyle Bowman

This writeup discusses some of the things that I thought about when designing my backup strategy.

Begin with an Inventory

For now, there are two types of data that I care about:

  1. Text-based things that I produce: code and notes are the main ones.
  2. Data artifacts and media: pictures, videos, PDFs, etc.

The real difference between the two is that entries in group 1 are expected to change incrementally, whereas entries in group 2 aren’t expected to change. They either exist in my collection or they don’t.

For the first group, I use git to version control repositories of related things, like notes or code by project. For the second group, I use git-annex. These tools don’t serve as backups, but they play a role in the strategy. In both cases, they provide a nice mechanism for being able to pack and unpack a lot of data with simple commands.

Assessing Risk

The first part of a backup strategy is to determine the level and nature of the risk to your data. That process begins with an inventory.

Everything in that inventory is a digital pile of my thoughts and experience over my lifetime. I definitely don’t want to lose it, but I’m also not guarding state secrets here.

It seems like I can cover almost all of my known risk by keeping multiple backups that are separated in space and time. That should be enough to get started developing an appropriate backup strategy.

Eventually, it’s worth considering more factors, but focus first on a minimum viable strategy; refinements can come after that’s in place.

Backing Up Data in Git

There are a few ways you could make a backup of a Git repository. The experiments that I ran are written up in Git Repo Backups; in short, a mirror clone that gets refreshed periodically covers my case well.
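As a sketch of the mirror approach, using a throwaway repository so the commands are runnable anywhere (the real source and backup paths would differ):

```shell
# Demonstration of a mirror backup with a throwaway repository.
tmp=$(mktemp -d)
git init -q "$tmp/project"
git -C "$tmp/project" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# The actual backup step: a mirror clone copies every ref.
git clone --mirror "$tmp/project" "$tmp/backup/project.git"

# Later runs refresh the mirror in place; --prune drops refs that
# were deleted upstream, so the mirror stays an exact copy.
git --git-dir="$tmp/backup/project.git" remote update --prune
```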

Recovering Data from a Git Backup

What recovery looks like depends on what went wrong, but in every case you have to restore the repository. To do it without disrupting client repos, restore it to the same remote location, so existing clones keep working without reconfiguration.

Restoring the repository could be as simple as copying it back to the old location, but it could be more involved. If the disk failed, you have to replace it first. The worst case is redeploying on a new host, which means standing up a bunch of prerequisite infrastructure. Honestly, if it comes to that, since I’m the only user, I’m probably just adding the backup as a remote on each of my projects and making a new backup somewhere else.
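That last option, adding the backup as a remote, can be sketched like this (a throwaway repository stands in for a real project and backup):

```shell
# Set up a throwaway project and a mirror backup of it.
tmp=$(mktemp -d)
git init -q "$tmp/project"
git -C "$tmp/project" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
git clone --mirror "$tmp/project" "$tmp/backup.git"

# The recovery itself: in the working clone, point a new remote at
# the backup and fetch from it.
git -C "$tmp/project" remote add backup "$tmp/backup.git"
git -C "$tmp/project" fetch backup
```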

Backing Up Data in Git Annex

This is how you create a backup of a central annex.

git clone --mirror git@host.local:/path/to/annex.git
cd annex.git
git annex init
git annex group . backup
git annex wanted . standard
git annex sync --content

Because this is intended as a backup and not something that contributes back to the main repo, don’t add the backup as a remote of the main repo.

To update, you might be able to just run git annex sync --content. You might also consider running git remote update beforehand.

Recovering Data from a Git Annex Backup

The backup is a lagging copy of the main annex.

  1. You could duplicate the backup in the same way that you created the first backup.
  2. Then, be sure to update the remotes of all the client annexes.
  3. Be sure to update your cron/backup script.

Honestly, if you put the duplicated backup in the same place as the original, you probably only have to do step 1. Of course, it goes without saying that you should diagnose and resolve whatever issue made it so you needed the backup in the first place.

Putting It All Together

I made a shell script to back up Git repositories. It takes a source directory and a target directory. It scans the source directory for Git repositories (as indicated by a .git suffix or a .git subdirectory), then decides whether to mirror each one (if the target doesn’t already exist) or to update it (if the target already exists).
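A minimal sketch of that logic, written as a shell function (the name backup_repos and the exact detection rules here are my assumptions, not the real script):

```shell
#!/bin/sh
# Sketch of the backup script described above.
# backup_repos SOURCE_DIR TARGET_DIR
backup_repos() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    for repo in "$src"/*; do
        name=$(basename "$repo")
        case "$name" in
            *.git) ;;                              # bare repo, by .git suffix
            *) [ -d "$repo/.git" ] || continue ;;  # non-bare, by .git subdirectory
        esac
        target="$dst/${name%.git}.git"
        if [ -d "$target" ]; then
            # Backup already exists: refresh all refs.
            git --git-dir="$target" remote update --prune
        else
            # First run for this repo: create a mirror clone.
            git clone --mirror "$repo" "$target"
        fi
    done
}
```

Calling backup_repos /home/git/repos /backups/git (hypothetical paths) would mirror every repository it finds on the first run and refresh each mirror on later runs.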

For Git Annex, it’s simpler because I only really have one annex. Using git annex sync --content with the right preferred content settings is good enough. The reason I wrote a script for normal Git repositories is simply that I have so many of them and they are constantly in flux, but I keep them in the same few directories.

It’s important to note that the script and the git annex sync command must be run by a user in the git group.

To schedule it, choose an appropriate git user, run crontab -e as that user, and add the script and the sync commands to the crontab.
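As an illustration, nightly runs might look like this in the crontab (the script path, annex location, and schedule are placeholders, not my real ones):

```
# m  h  dom mon dow  command
0  2  *   *   *   /home/git/bin/backup-git-repos.sh /home/git/repos /backups/git
30 2  *   *   *   cd /backups/annex.git && git annex sync --content
```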