Introduction
I’ve noticed several things with Docker registries:
- The APIs to interrogate them tend to be cumbersome, unfinished, or only partially implemented
- Free private registries are rarely found on the web, probably because of the cost of storage and data transfer
GitLab is one of the few providers that offer free registry access, but today I bumped into some unexpected and infuriating access problems, which set me wondering whether there is a solution that would give a greater choice of providers.
It occurred to me that we need a system:
- to synchronise an image layer by layer
- to only store layers once, and if they are duplicated, store a pointer to the layer
- to store layers in a compressed format
- that does not degrade performance when it contains a lot of large files
- that allows both push and pull
- that has a well-tested authentication and authorisation system around it
It occurs to me that this describes Git LFS quite well. This is an extension to Git that allows the storage of large blobs inside version control. So, the question I’ve been pondering is: could LFS be used to efficiently store Docker layers?
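For context, LFS works by committing a small text pointer in place of each tracked file, while the real content goes to a separate LFS store. A tracked layer tarball would therefore appear in Git history as something like the following (the digest and size here are illustrative, not real values):
version https://git-lfs.github.com/spec/v1
oid sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
size 4194304
Because objects are addressed by that content hash, two identical layers should map to the same stored object, which is what the de-duplication question below sets out to check.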
Tasks
There would be a few things I’d need to look into:
- Get a simple example working first. This can be done by using docker save, untarring the result, then running git commit on the resulting folder. The image layers will themselves be tar files.
- Investigate the data storage limits with the main Git hosts, such as GitHub, Bitbucket, GitLab, etc.
- Investigate whether the main Git hosts have data transfer limits.
- Investigate whether the main Git hosts have storage/bandwidth meters, to help users avoid nasty surprises.
- Most Git access is authorised via an on-machine private key that grants access to all repositories under a single user. This would need to be per-project where a CI server does a push, to avoid giving overly-wide permissions. Look at how easy that is to set up.
- On the remote repository side, I assume that LFS will handle de-duplication of layer files automatically. This is worth checking.
- Create a command to list all tags, optionally filtering by image and/or tag
- Create a command to list all images, optionally filtering by image and/or tag
- Create a push command, which takes a list of one or more image:tag strings
- Create a pull command, which takes a list of one or more image:tag strings (a rough sketch of how such commands might parse their arguments follows this list)
- I need to understand whether docker pull does any hash checking or security analysis, and if so, see if that can be added to the proposal here.
- Run some timing tests for push and pull operations.
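As flagged in the push and pull items above, here is a minimal sketch of how such a command might parse its image:tag arguments. The script name and the echoed actions are hypothetical; the real save/unpack/commit steps are only hinted at in comments:
#!/bin/bash
# Hypothetical usage: ./lfs-registry-push alpine:3.5 myapp:latest
set -euo pipefail

for ref in "$@"; do
    image="${ref%%:*}"   # text before the first colon
    tag="${ref##*:}"     # text after the last colon
    if [ "$image" = "$ref" ]; then
        tag="latest"     # no tag supplied, so assume the usual default
    fi
    # Note: this deliberately ignores registry hosts with ports, which also contain a colon
    echo "Would push ${image}:${tag}"
    # Real steps would follow: docker save, untar into the repo layout, git add/commit/push
done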
Task progress
- Simple example: done
- Storage limits: done
- Transfer limits: done
- Storage/bandwidth meters: partially done
- Per-project access keys: not done
- Layer de-duplication: confirmed in the Bitbucket GUI (two copies of a 4M layer consume 4M in total)
- List-tags command: not done
- List-images command: not done
- Push command: not done
- Pull command: not done
- Hash checking/security in docker pull: not done
- Timing tests: not done
Research
The storage limits I have found are:
| Provider | Storage |
| --- | --- |
| GitHub | 1G (source) |
| Bitbucket | 1G by default, 100G costs 10USD/month (source) |
| GitLab | 10G across a whole project (source) |
In terms of data transfer:
| Provider | Transfer |
| --- | --- |
| GitHub | 1G/month (same source as before) |
| Bitbucket | Nothing found, but there’s a mention of bandwidth in the AUP |
| GitLab | No limit (same source as before) |
Availability of meters:
| Provider | Meter availability |
| --- | --- |
| GitHub | Not checked |
| Bitbucket | There’s a good storage space meter |
| GitLab | Not checked |
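Whatever meters the hosts offer, recent versions of the LFS client can at least report what a local clone holds, which gives a rough guide to what a push will consume:
git lfs ls-files --size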
Suggested repo format
I want a repo to be able to hold many images, along with tags, labels, etc. In the commands below, the Git directory is called “registry”, but that name is not part of the format; it is simply the folder that holds the Git repository.
- “images” (folder)
  - <Image ID> (folder, as many of these as desired)
    - <tag name> (folder, as many of these as desired)
      - “image” (folder)
        - (unpacked image tarball) (files/folders)
Labels are baked into images when they are built, so they will appear in an image JSON metadata file in the root of the unpacked folder. Tag names are also included in the unpacked image data (in a file called repositories), so we need a separate directory level to prevent conflicts (images can have multiple tags for the same Image ID).
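For reference, the unpacked folder for a single tag looks roughly like this with a recent Docker; the hash-style names are placeholders, not real digests:
manifest.json            # lists the config file, the layer order, and the repo tags
repositories             # maps repository name and tag to the top layer
<config hash>.json       # image metadata, including any labels
<layer hash>/VERSION
<layer hash>/json
<layer hash>/layer.tar   # the layer itself, which is what the *.tar LFS rule catches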
General commands
Here are a few commands for general experimentation.
Pull an image:
docker pull alpine:3.5
Convert an image to a file:
mkdir alpine
cd alpine
docker save alpine:3.5 > alpine.tar
Unpack the image (using short Image IDs for now):
mkdir -p registry/images/6c6084ed97/3.5/image
tar -xvf alpine.tar -C registry/images/6c6084ed97/3.5/image
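It is worth checking at this point that the tag data mentioned earlier really is in the unpacked image, and that the layer content sits in files the *.tar tracking rule (used below) will catch. Assuming the layout described above:
cat registry/images/6c6084ed97/3.5/image/repositories
find registry/images -name "*.tar"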
Create a Git repo, initialise it for LFS, commit the image, and push it:
cd registry
git init
git lfs install
git lfs track "*.tar"
git add .
git commit -m "Add alpine 3.5"
git remote add origin git@bitbucket.org:username/docker-lfs.git
git push -u origin master
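To confirm that the layer tarballs went through LFS rather than into ordinary Git objects, list the files LFS is managing:
git lfs ls-files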
An introduction to Git LFS is here.
Extra notes
On the local side, a repo containing a Docker image with several tags would contain all layers repeated for however many tags there are. However, this may not be a problem in practice, since images would be pulled one tag at a time, and would only temporarily exist in a local Git repo before being converted to an importable image. At that point, the repo would be destroyed.
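To illustrate that conversion, a minimal and untested sketch of the pull direction, using the alpine example layout from earlier, might be:
# Re-pack the unpacked folder and hand it back to Docker
cd registry/images/6c6084ed97/3.5/image
tar -cf - . | docker load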
Language
If a utility binary were to emerge from this idea, I think it would either be written in Bash or Go.