Implementing a Docker registry in a Git repo
Categories: Docker, Ideas, Outline

Introduction

I’ve noticed several things with Docker registries:

  • The APIs for interrogating them tend to be cumbersome, unfinished, or only partially implemented
  • Free private registries are rarely found on the web, probably because of the cost of storage and data transfer

GitLab are one of the few providers that offer free registry access, but today I bumped into some unexpected and infuriating access problems, which set me wondering whether there is a solution that would give a greater choice of providers.

It occurred to me that we need a system:

  • to synchronise an image layer by layer
  • to store each layer only once, and where a layer is duplicated, store a pointer to it instead
  • to store layers in a compressed format
  • that does not degrade performance when it contains a lot of large files
  • that allows both push and pull
  • that has a well-tested authentication and authorisation system around it

That list describes Git LFS quite well: an extension to Git that allows large blobs to be stored under version control. So the question I’ve been pondering is: could LFS be used to store Docker layers efficiently?

Tasks

There would be a few things I’d need to look into:

  1. Get a simple example working first. This can be done by using docker save, untarring the result, then git committing the resulting folder. The image layers will themselves be tar files.
  2. Investigate the data storage limits with the main Git hosts, such as GitHub, Bitbucket, GitLab, etc.
  3. Investigate whether the main Git hosts have data transfer limits.
  4. Investigate whether the main Git hosts have storage/bandwidth meters, to help users avoid nasty surprises.
  5. Most Git access is authorised via an on-machine private key that grants access to all repositories under a single user. This would need to be per-project where a CI server does a push, to avoid giving overly-wide permissions. Look at how easy that is to set up.
  6. On the remote repository side, I assume that LFS will handle de-duplication of layer files automatically. This is worth checking.
  7. Create a command to list all tags, optionally filtering by image and/or tag
  8. Create a command to list all images, optionally filtering by image and/or tag
  9. Create a push command, which takes a list of one or more image:tag strings (a rough sketch appears just after this list)
  10. Create a pull command, which takes a list of one or more image:tag strings
  11. I need to understand whether docker pull does any hash checking or security analysis, and if so, see if that can be added to the proposal here.
  12. Run some timing tests for push and pull operations.
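
To make tasks 1 and 9 a little more concrete, here is a rough Bash sketch of what a push command might look like against the repo layout proposed later in this post. The script name, the assumption that “registry” is an existing clone with LFS already set up, and the 10-character short IDs are all mine, and there is no error handling:

#!/usr/bin/env bash
# push.sh IMAGE:TAG [IMAGE:TAG ...]
# Sketch: save each image, unpack it into the proposed layout, commit and push.
set -euo pipefail

cd registry   # assumed: an existing clone with "*.tar" already tracked by LFS

for ref in "$@"; do
    tag="${ref##*:}"
    # Short Image ID (10 hex chars), e.g. 6c6084ed97
    id="$(docker images --no-trunc --format '{{.ID}}' "$ref" | head -n1 | sed 's/^sha256://' | cut -c1-10)"
    dest="images/$id/$tag/image"
    mkdir -p "$dest"
    docker save "$ref" | tar -xf - -C "$dest"
    git add "$dest"
done

git commit -m "Push: $*"
git push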

Task progress

  1. Done
  2. Done
  3. Done
  4. Partially done
  5. Not done
  6. Confirmed in the Bitbucket GUI (two copies of a 4 MB layer consume 4 MB in total)
  7. Not done
  8. Not done
  9. Not done
  10. Not done
  11. Not done
  12. Not done

Research

The storage limits I have found are:

Provider     Storage
GitHub       1 GB (source)
Bitbucket    1 GB by default; 100 GB costs US$10/month (source)
GitLab       10 GB across a whole project (source)

In terms of data transfer:

Provider     Transfer
GitHub       1 GB/month (same source as before)
Bitbucket    Nothing found, though there is a mention of bandwidth in the AUP
GitLab       No limit (same source as before)

Availability of meters:

Provider     Meter availability
GitHub       Not checked
Bitbucket    There is a good storage space meter
GitLab       Not checked

Suggested repo format

I want a repo to be able to hold many images, along with tags, labels and so on. In the commands below, the top-level directory is called “registry”, but that name is not part of the format; it is simply the folder that holds the Git repository.

  • “images” (folder)
    • <Image ID> (folder, as many of these as desired)
      • <tag name> (folder, one per tag)
        • “image” (folder)
          • (unpacked image tarball) (files/folders)

Labels are baked into images when they are built, so they will appear in an image’s JSON metadata file in the root of the unpacked folder. Tag names are also included in the unpacked image data (in a file called repositories), so we need a separate directory level per tag to prevent conflicts, since a single Image ID can carry multiple tags.
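
To illustrate how the listing commands (tasks 7 and 8) might work against this layout, here is a minimal Bash sketch. It assumes the current directory is a clone of the registry repo, and filtering by image or tag is left out:

#!/usr/bin/env bash
# list-tags.sh: walk images/<Image ID>/<tag name>/ and print one
# "image-id <TAB> tag" line per tag. Sketch only, no filtering.
set -euo pipefail
shopt -s nullglob

for id_dir in images/*/; do
    image_id="$(basename "$id_dir")"
    for tag_dir in "$id_dir"*/; do
        printf '%s\t%s\n' "$image_id" "$(basename "$tag_dir")"
    done
done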

General commands

Here are a few commands for general experimentation.

Pull an image:

docker pull alpine:3.5

Convert an image to a file:

mkdir alpine
cd alpine
docker save alpine:3.5 > alpine.tar

Unpack the image (using short Image IDs for now):

mkdir -p registry/images/6c6084ed97/3.5/image
tar -xvf alpine.tar -C registry/images/6c6084ed97/3.5/image
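
For orientation, the unpacked tree should look roughly like this; the exact file names vary with the Docker version, and the hashes below are placeholders:

find registry/images/6c6084ed97/3.5/image -maxdepth 2
# Typical output (illustrative):
#   .../image/manifest.json
#   .../image/repositories
#   .../image/<config hash>.json       <- image config, including any labels
#   .../image/<layer hash>/VERSION
#   .../image/<layer hash>/json
#   .../image/<layer hash>/layer.tar   <- the layer itself, to be tracked by LFS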

Create a Git repo and initialise it for LFS:

cd registry
git init
git lfs install
git lfs track "*.tar"
git add .
git commit -m "Add alpine:3.5"
git remote add origin git@bitbucket.org:username/docker-lfs.git
git push -u origin master

An introduction to Git LFS is here.
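
As a reminder of how LFS behaves here: each file matching the tracked pattern is committed to Git as a small text pointer rather than the blob itself, along these lines (the digest and size are placeholders):

version https://git-lfs.github.com/spec/v1
oid sha256:<64-hex-character digest of layer.tar>
size 4194304

Since identical layers hash to the same oid, the remote LFS store should only need to hold one copy, which matches the de-duplication observed in task 6.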

Extra notes

On the local side, a repo holding a Docker image with several tags would contain every layer repeated once per tag. In practice this may not matter, since images would be pulled one tag at a time, and the local Git repo would only exist briefly before being converted to an importable image, after which it would be deleted.
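
A pull command (task 10) could then be little more than a clone followed by re-tarring the unpacked folder and feeding it to docker load. A minimal sketch, reusing the repo URL and layout from above:

git clone git@bitbucket.org:username/docker-lfs.git registry
tar -C registry/images/6c6084ed97/3.5/image -cf alpine.tar .
docker load -i alpine.tar
rm -rf registry alpine.tar   # the clone is only needed temporarily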

Language

If a utility were to emerge from this idea, I think it would be written in either Bash or Go.
