Testing a Centralized Git Annex Repository

Kyle Bowman

Overview

The goal of this line of experiments is to determine the following:

  1. How I can add files to a central repository from another device.
  2. How I can disseminate files from that central repository to other devices.
  3. How I can use that central repository for long-term storage and backups.

Key Results

Treat the central repository mostly like a normal bare repository. But remember that git annex uses more than one branch (main and git-annex), so you need to be sure that you are pushing both.

Setting up a Central Repository

  1. Follow these instructions
  2. Note that the central repository is a bare/shared repository that also uses git annex init, but doesn’t do anything else.
  3. Optionally, set up preferred content with a standard group like “backup”.

The key is that it must be a bare repository. Because it has no work tree, other repositories can push directly into each of its branches. That way, you can update all the files in the central repository without having to navigate to it to handle merges.
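
As a rough sketch of the setup steps above, the following shell function creates the bare repository, initializes the annex, and applies the optional preferred content. The function name make_central, its arguments, and the example path are placeholders and not part of the original notes; the last two commands assume the standard "backup" group from step 3.

```shell
# Hypothetical helper: create a bare, shared central annex at PATH.
make_central() {
    local path=$1
    local description=${2:-central}   # free-form label for this repository

    git init --bare --shared "$path" || return 1
    # git -C runs the following command inside the new repository.
    git -C "$path" annex init "$description" || return 1

    # Optional (step 3): preferred content via the standard "backup" group.
    git -C "$path" annex wanted . standard || return 1
    git -C "$path" annex group . backup
}

# Usage (the path is illustrative):
#   make_central /srv/annex/central.git
```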

Cloning a Central Annex

  1. Do a normal clone.
  2. Run git annex init after cloning.

Adding Files to the Central Repository

  1. Use git annex commands to handle the symlink shenanigans.
  2. Use normal git commands to push the files to the central repository.
  3. (?) You might need to run git annex sync --content to move the files to the central repository (and not just the links).

Remember: You must push the main branch and the git-annex branch. You can do both in a single command with git push origin main git-annex.
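
The three steps can be bundled into one small function. This is only a sketch: the name annex_publish and its defaults (remote "origin", branch "main") are assumptions chosen to match the rest of these notes, and the final sync is the step marked (?) above.

```shell
# Hypothetical helper: add FILE to the annex and publish it to REMOTE.
annex_publish() {
    local file=$1
    local remote=${2:-origin}
    local branch=${3:-main}

    git annex add "$file" || return 1
    git commit -m "Add $file" || return 1
    # Push the working branch and git-annex's bookkeeping branch together.
    git push "$remote" "$branch" git-annex || return 1
    # Transfer the file content itself, not just the symlink.
    git annex sync --content
}

# Usage (the file name is illustrative):
#   annex_publish photo.jpg
```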

Note About the Git Annex Doc

The doc is a wiki. It is updated continuously by a variety of users. Changes are tracked in Git.

The punchline of all these facts is that it is up to you to check the dates of each part of the documentation against the software version that you have installed. Many pages of the wiki are likely more recent than your software.

I think git annex push / pull might be what I want, but those features come from a newer version than the one I installed via Debian.
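
One way to make that check less manual is to compare the installed version against whatever version a wiki page or changelog entry names. The sketch below assumes that the first line of git annex version looks like "git-annex version: <version>" and that GNU sort -V is available; the helper names are made up for illustration, and the required version is deliberately left as a parameter.

```shell
# Print the installed git-annex version string.
# Assumes the first output line is "git-annex version: <version>".
annex_version() {
    git annex version | sed -n '1s/^git-annex version: *//p'
}

# Succeed if version $1 is at least version $2 (relies on GNU `sort -V`).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage: NEEDED is whatever version the documentation page mentions.
#   version_at_least "$(annex_version)" "$NEEDED" || echo "doc is newer than my install"
```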

Guiding Questions

How does sync work?

Sync Reference

I think I understand that, when git annex sync runs, it does the following:

  1. (At some point, staged files are committed.)
  2. Merge the local synced/main branch (the “sync” or “staging” branch) into local main.
  3. Fetch / merge remote main into local main and smart resolve conflicts.
  4. Push local main into the remote’s synced/main branch.

Question: If that’s true, can I determine all incoming changes to the local main by looking at the local synced/main + remote main for every remote that has not been synced yet?
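
A possible way to explore that question with plain git is to fetch and then list the commits that the local branch is missing. This is a sketch under the assumption that sync publishes to a synced/<branch> ref on the remote; the function name incoming_changes and its defaults are illustrative only.

```shell
# Hypothetical helper: list commits that a sync would merge into BRANCH.
incoming_changes() {
    local branch=${1:-main}
    local remote=${2:-origin}

    git fetch "$remote" || return 1
    # Commits on the remote's branch that the local branch lacks.
    git log --oneline "$branch..$remote/$branch"
    # Commits that other repositories pushed to the remote's synced/ branch,
    # if that branch exists.
    if git show-ref --verify --quiet "refs/remotes/$remote/synced/$branch"; then
        git log --oneline "$branch..$remote/synced/$branch"
    fi
}

# Usage:
#   incoming_changes main origin
```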

Question: It also sounds like pushed changes are never applied to the checked-out branch of the remote repository.

  1. Does that mean that I could use a bare repository since there’s technically no “checked-out” branch of a bare repository? Does it automatically get content?
  2. Would I have to create a way to automatically sync that repository to get the content?

Using Git Annex Assistant

BLUF: I don’t want assistant. I want to be able to pack/unpack files, not to sync everything by default. By extension, this rules out the webapp too.

The assistant is essentially Syncthing. You don’t have to use commands: connect two repositories via remotes, start the assistant on each, add a file in one place, and watch it appear in the other. It does not work transitively, though. Suppose you have A <-> B <-> C. If you add a file to A, it will transfer to B but not to C. If you change a file in B, it will change both A and C.

Question: You know, I didn’t totally figure out the role of preferred content here. Maybe that’s why it didn’t sync from A to C? In my case, A and C are both manual. (I think. At least C is.)

Using Git Annex Watch

I can’t tell if this is useful or not. It looks like it auto adds and that’s about it.

Using Preferred Content

The command git annex wanted . standard && git annex group . backup works.

The trick is to understand how syncing works. Specifically, you have to use preferred content with --auto or git annex sync --content.
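
To make the --auto variant concrete, here is a minimal sketch that applies preferred content by hand instead of through git annex sync --content. The function name and the default remote are assumptions; the commands themselves are the ordinary get, drop, and copy with --auto.

```shell
# Hypothetical helper: apply preferred content settings without a full sync.
apply_preferred_content() {
    local remote=${1:-origin}

    # Fetch files that are preferred content here (or lack enough copies).
    git annex get --auto || return 1
    # Drop files this repository does not want; numcopies is still enforced.
    git annex drop --auto || return 1
    # Send the remote the files that its own preferred content asks for.
    git annex copy --auto --to "$remote"
}

# Usage:
#   apply_preferred_content origin
```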

Lingering Questions:

Is there a way to get a remote repository to automatically sync its content?

What if I just poll it periodically with a cron job…?
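
The cron idea could look roughly like the sketch below. It is untested, and it can only help if the polled repository has remotes of its own that it can reach. The function name, the script path, the repository path, and the 15-minute schedule are all placeholders.

```shell
# Hypothetical polling helper: sync every annex given on the command line.
# A crontab entry might wrap it in a script, for example:
#   */15 * * * * /usr/local/bin/annex-poll.sh /srv/annex/central.git
annex_poll() {
    local repo status=0
    for repo in "$@"; do
        # Keep going if one repository fails, but remember the failure.
        git -C "$repo" annex sync --content || status=1
    done
    return "$status"
}

# Usage (paths are illustrative):
#   annex_poll /srv/annex/central.git /srv/annex/photos.git
```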

Experiments

Centralized Server

mkdir central.git
cd central.git
git init --bare --shared
git annex init
cd ..
git clone central.git local # 1

Notes:

I did not get the expected standard out:

warning: You appear to have cloned an empty repository.
Checking connectivity... done.

Instead, I got the following standard out:

warning: remote HEAD refers to nonexistent ref, unable to checkout

Not getting the connectivity check is expected: I didn’t clone over SSH. I’m a little confused about the nonexistent ref. I’m proceeding anyway, but noting this quirk for future investigation if needed.
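
A likely explanation, offered as a hypothesis rather than a verified result: git annex init creates the git-annex branch, so the bare repository is no longer empty, while its HEAD still points at a branch that has no commits yet. The sketch below shows one way to check that with plain git; the function name describe_head is illustrative.

```shell
# Hypothetical helper: report where HEAD points and whether that branch exists.
describe_head() {
    local repo=$1 ref
    ref=$(git -C "$repo" symbolic-ref HEAD) || return 1
    if git -C "$repo" show-ref --verify --quiet "$ref"; then
        echo "$ref exists"
    else
        echo "$ref has no commits yet"
    fi
}

# Usage:
#   describe_head central.git
```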

Central.git files
Central.git init data

Now, compare that to local:

Local init data
cd local 
git annex init local
echo "hello, world" > test
git annex add test
git commit -m "Add test file."
git status
On branch main
Your branch is based on 'origin/main', but the upstream is gone.
  (use "git branch --unset-upstream" to fixup)

nothing to commit, working tree clean

That might explain the problem. My git config is set to use “main” as the canonical branch. I think git-annex might default to “master” as the canonical branch. We will play along with the message and see if we can work it out.

git push origin main git-annex
git annex sync --content
Git Push Stdout
Git Annex Sync Stdout

I think that worked! I didn’t have to go into central to use sync, and I see an annex object that has the content!

Annex object contents

Let’s verify by creating another repo and ensuring that we can copy contents down to it.

cd ..
git clone central.git other
cd other
git annex sync --content
ls

Huzzah! I see the test file and its contents! Let’s now make sure that the metadata-only approach works.

# Starting in other/
echo "lots of content" > largefile
git annex add largefile
git commit -m "Add largefile"
git push origin main git-annex
git annex sync --content
# Cd to local/
cd ../local
git annex sync
ls
# Expect new symlink to appear, but the symlink to be broken

Yippee! I see a broken symlink to largefile! Let’s test dropping files locally.

# Starting in local/
git annex drop test
ls
# Expect that test is now a broken symlink

Yep! Now, let’s ensure that all the metadata is accounted for.

# Start in local/
git annex sync # Notify other repos that test has been dropped
git annex whereis test 
# Expect origin and other
cd ../other
git annex sync
git annex whereis test
# Expect origin and other

Heck yeah. Everything is working as expected. The big news is that now I don’t have to go into the central repository to sync changes.

# Start from other/
git annex drop test
git rm test
git commit -m "Drop test file."
git push origin main git-annex
git annex sync # I don't think this is strictly needed for bare repo
cd ../local
git annex sync
ls
# Expect test to be completely gone.

Let’s check the assumption that I need to run git annex sync --content when I’m using git push origin main git-annex.

# Start from local
echo "hello, world" > test
git annex add test
git commit -m "Add test file again."
git push origin main git-annex
# Skip syncing from local
cd ../other
git annex sync --content
ls # Expect test with functional symlink
cat test # Expect contents

Deployment

If you use a git user for your remotes (you will recognize that if you clone remotes with git@hostname:/path/to/repo.git), then you have probably also set the git user’s shell to git-shell. That’s a nice security feature because you don’t want people to be able to ssh git@hostname and then wreak havoc. The problem is that using git-shell means that you can’t use git-annex. You have a few solutions:

  1. I think you can add custom commands to git-shell. Adding git-annex to the list of approved custom commands might work.
  2. You could also use an annex user that uses git-annex-shell by default. This is 100% analogous to the git user case and is probably the way to go if you want to have different groups for annex and git. Doing this, you will clone repos with annex@hostname:/path/to/repo.git.
  3. You could change the git user’s shell to git-annex-shell. AFAIK, the git-annex-shell is basically the same as git-shell. I didn’t find any differences in my ad hoc checks.
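
Option 3 might be scripted along these lines. This is a sketch with several assumptions: the function name is made up, the path to git-annex-shell varies by system, and checking /etc/shells first is a precaution because chsh may reject or warn about shells that are not listed there. Changing another user's shell normally requires root.

```shell
# Hypothetical helper: switch USER's login shell to SHELL_PATH.
# SHELLS_FILE defaults to /etc/shells and can be overridden for testing.
use_annex_shell() {
    local user=$1
    local shell_path=$2
    local shells_file=${3:-/etc/shells}

    # Refuse to continue unless the shell is listed as a valid login shell.
    if ! grep -qxF "$shell_path" "$shells_file"; then
        echo "$shell_path is not listed in $shells_file" >&2
        return 1
    fi
    chsh -s "$shell_path" "$user"
}

# Usage (the path is an assumption; check `command -v git-annex-shell`):
#   use_annex_shell git /usr/bin/git-annex-shell
```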

Key Learnings: