Introduction
I’ve noticed several things with Docker registries:
- The APIs to interrogate them tend to be cumbersome, unfinished, or only partially implemented
- Free private registries are rarely found on the web, probably because of the cost of storage and data transfer
GitLab is one of the few providers that offer free registry access, but today I bumped into some unexpected and infuriating access problems, which set me wondering whether there is a solution that would give a greater choice of providers.
It occurred to me that we need a system:
- to synchronise an image layer by layer
- to only store layers once, and if they are duplicated, store a pointer to the layer
- to store layers in a compressed format
- that does not degrade performance when it contains a lot of large files
- that allows both push and pull
- that has a well-tested authentication and authorisation system around it
It occurs to me that this describes Git LFS quite well. This is an extension to Git that allows the storage of large blobs inside version control. So, the question I’ve been pondering is: could LFS be used to efficiently store Docker layers?
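For context, LFS works by committing a small text pointer in place of each tracked file, while the real content goes to a separate LFS store. A tracked layer tarball would therefore appear in Git history as something like the following (the digest and size here are illustrative, not real values):
version https://git-lfs.github.com/spec/v1
oid sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
size 4194304
Because objects are addressed by that content hash, two identical layers should map to the same stored object, which is what the de-duplication question below sets out to check.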
Tasks
There would be a few things I’d need to look into:
- Get a simple example working first. This can be done by using docker save, untarring the result, then running git commit on the resulting folder. The image layers will themselves be tar files.
- Investigate the data storage limits with the main Git hosts, such as GitHub, Bitbucket, GitLab, etc.
- Investigate whether the main Git hosts have data transfer limits.
- Investigate whether the main Git hosts have storage/bandwidth meters, to help users avoid nasty surprises.
- Most Git access is authorised via an on-machine private key that grants access to all repositories under a single user. This would need to be per-project where a CI server does a push, to avoid giving overly-wide permissions. Look at how easy that is to set up.
- On the remote repository side, I assume that LFS will handle de-duplication of layer files automatically. This is worth checking.
- Create a command to list all tags, optionally filtering by image and/or tag
- Create a command to list all images, optionally filtering by image and/or tag
- Create a push command, which takes a list of one or more image:tag strings
- Create a pull command, which takes a list of one or more image:tag strings (a rough sketch of how such commands might parse their arguments follows this list)
- I need to understand whether docker pull does any hash checking or security analysis, and if so, see if that can be added to the proposal here.
- Run some timing tests for push and pull operations.
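As flagged in the push and pull items above, here is a minimal sketch of how such a command might parse its image:tag arguments. The script name and the echoed actions are hypothetical; the real save/unpack/commit steps are only hinted at in comments:
#!/bin/bash
# Hypothetical usage: ./lfs-registry-push alpine:3.5 myapp:latest
set -euo pipefail

for ref in "$@"; do
    image="${ref%%:*}"   # text before the first colon
    tag="${ref##*:}"     # text after the last colon
    if [ "$image" = "$ref" ]; then
        tag="latest"     # no tag supplied, so assume the usual default
    fi
    # Note: this deliberately ignores registry hosts with ports, which also contain a colon
    echo "Would push ${image}:${tag}"
    # Real steps would follow: docker save, untar into the repo layout, git add/commit/push
done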
Task progress
- Simple example: done
- Storage limits: done
- Transfer limits: done
- Storage/bandwidth meters: partially done
- Per-project access keys: not done
- Layer de-duplication: confirmed in the Bitbucket GUI (two copies of a 4M layer consume 4M in total)
- List-tags command: not done
- List-images command: not done
- Push command: not done
- Pull command: not done
- Hash checking/security in docker pull: not done
- Timing tests: not done
Research
The storage limits I have found are:
| Provider | Storage |
| --- | --- |
| GitHub | 1G (source) |
| Bitbucket | 1G by default, 100G costs 10USD/month (source) |
| GitLab | 10G across a whole project (source) |
In terms of data transfer:
| Provider | Transfer |
| --- | --- |
| GitHub | 1G/month (same source as before) |
| Bitbucket | Nothing found, but there’s a mention of bandwidth in the AUP |
| GitLab | No limit (same source as before) |
Availability of meters:
| Provider | Meter availability |
| --- | --- |
| GitHub | Not checked |
| Bitbucket | There’s a good storage space meter |
| GitLab | Not checked |
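Whatever meters the hosts offer, recent versions of the LFS client can at least report what a local clone holds, which gives a rough guide to what a push will consume:
git lfs ls-files --size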
Suggested repo format
I want a repo to be able to hold many images, along with tags, labels, etc. In the commands below, the Git directory is called “registry”, but that name is not part of the format; it is simply the folder that holds the Git repository.
- “images” (folder)
  - <Image ID> (folder, as many of these as desired)
    - <tag name> (folder, as many of these as desired)
      - “image” (folder)
        - (unpacked image tarball) (files/folders)
Labels are baked into images when they are built, so they will appear in an image JSON metadata file in the root of the unpacked folder. Tag names are also included in the unpacked image data (in a file called repositories), so we need a separate directory level to prevent conflicts (images can have multiple tags for the same Image ID).
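For reference, the unpacked folder for a single tag looks roughly like this with a recent Docker; the hash-style names are placeholders, not real digests:
manifest.json            # lists the config file, the layer order, and the repo tags
repositories             # maps repository name and tag to the top layer
<config hash>.json       # image metadata, including any labels
<layer hash>/VERSION
<layer hash>/json
<layer hash>/layer.tar   # the layer itself, which is what the *.tar LFS rule catches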
General commands
Here are a few commands for general experimentation.
Pull an image:
docker pull alpine:3.5
Convert an image to a file:
mkdir alpine
cd alpine
docker save alpine:3.5 > alpine.tar
Unpack the image (using short Image IDs for now):
mkdir -p registry/images/6c6084ed97/3.5/image
tar -xvf alpine.tar -C registry/images/6c6084ed97/3.5/image
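It is worth checking at this point that the tag data mentioned earlier really is in the unpacked image, and that the layer content sits in files the *.tar tracking rule (used below) will catch. Assuming the layout described above:
cat registry/images/6c6084ed97/3.5/image/repositories
find registry/images -name "*.tar"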
Create a Git repo, initialise it for LFS, commit the image, and push it:
cd registry
git init
git lfs install
git lfs track "*.tar"
git add .
git commit -m "Add alpine 3.5"
git remote add origin git@bitbucket.org:username/docker-lfs.git
git push -u origin master
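To confirm that the layer tarballs went through LFS rather than into ordinary Git objects, list the files LFS is managing:
git lfs ls-files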
An introduction to Git LFS is here.
Extra notes
On the local side, a repo containing a Docker image with several tags would contain all layers repeated for however many tags there are. However, this may not be a problem in practice, since images would be pulled one tag at a time, and would only temporarily exist in a local Git repo before being converted to an importable image. At that point, the repo would be destroyed.
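To illustrate that conversion, a minimal and untested sketch of the pull direction, using the alpine example layout from earlier, might be:
# Re-pack the unpacked folder and hand it back to Docker
cd registry/images/6c6084ed97/3.5/image
tar -cf - . | docker load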
Language
If a utility binary were to emerge from this idea, I think it would either be written in Bash or Go.