author | Zeger-Jan van de Weg <git@zjvandeweg.nl> | 2019-08-19 11:22:23 +0300 |
---|---|---|
committer | Zeger-Jan van de Weg <git@zjvandeweg.nl> | 2019-08-19 11:22:23 +0300 |
commit | 53218e3384567f6b23a7dffac1683173ab0c2be0 (patch) | |
tree | cecacadbaaa522990696f46053711eea279db2a0 | |
parent | d1a53c93536bcf0672594ffabd0fb019c135074d (diff) | |
parent | 4f9d7cb6a17269ff1888ba5b974b94cbe10afd38 (diff) |
Merge branch 'jv-proposal-snapgraph' into 'master'
Snapshot storage for repositories
See merge request gitlab-org/gitaly!1279
-rw-r--r-- | doc/README.md | 6 |
-rw-r--r-- | doc/proposals/snapshot-storage.md | 97 |
2 files changed, 102 insertions, 1 deletions
diff --git a/doc/README.md b/doc/README.md
index 88f223f7c..39e60caf8 100644
--- a/doc/README.md
+++ b/doc/README.md
@@ -26,4 +26,8 @@ For configuration please read [praefects configuration documentation](doc/config
 
 #### Technical explanations
 
-- [Delta Islands](doc/delta_islands.md)
+- [Delta Islands](delta_islands.md)
+
+#### Proposals
+
+- [Snapshot storage](proposals/snapshot-storage.md)
diff --git a/doc/proposals/snapshot-storage.md b/doc/proposals/snapshot-storage.md
new file mode 100644
index 000000000..a504c9877
--- /dev/null
+++ b/doc/proposals/snapshot-storage.md
@@ -0,0 +1,97 @@
+# Proposal: snapshot storage for Git repositories
+
+## High level summary
+
+Gitaly as it exists today is a service that stores all its state on a
+local filesystem. This filesystem must therefore be durable. In this
+document we describe an alternative storage system which can store
+repository snapshots in SQL and object storage (e.g. S3).
+
+Key properties:
+
+- Use a SQL database as a catalogue of the repositories in snapshot storage
+- Git data (objects and refs) is stored as cold "snapshots" in object
+  storage
+- snapshots can have a "parent", so a repository is stored as a linked
+  list of snapshots. The linked list relation is stored in SQL.
+- to use the repository it must first be copied down to a local
+  filesystem
+
+Possible applications:
+
+- incremental repository backups
+- cold storage for repositories
+
+## Primitives
+
+### Git repository snapshots
+
+In [MR 1244](https://gitlab.com/gitlab-org/gitaly/merge_requests/1244)
+we have an example of how we can use Git plumbing commands to
+efficiently create incremental snapshots of an entire Git repository,
+where each snapshot may be stored as a single blob. We do this by
+combining a full dump of the ref database of the repository,
+concatenated with either a full or incremental Git packfile.
+
+A snapshot can either be full (no "parent") or it can be incremental
+relative to a previous snapshot (its parent). The snapshots are
+incremental in exactly the same way that `git fetch` is incremental.
+
+### Snapshot list
+
+Once we can make full and incremental snapshots of a repository, we can
+represent that repository as a linked list of snapshots where the first
+element must be a full snapshot, and each later element is incremental.
+
+Within this snapshot list, we can think of a project as a reference to
+its latest snapshot: it is the head of the list.
+
+### Rebuilding a repository from its snapshots
+
+To rebuild a repository from its snapshots we must "install" all
+packfiles in its list on the Gitaly server we are using. This means more
+than just downloading, because a snapshot only contains the data that
+goes in `.pack` files, and this data is useless without a corresponding
+`.idx`. This works just the same as `git clone` and `git fetch`, where
+it is up to the client (the user) to have their local computer compute
+`.idx` files. Once all the packfiles in the graph of the repository have
+been instantiated along with their `.idx` companions, we bulk-import the
+ref database from the most recent snapshot.
+
+After this it is possible that we have a lot of packfiles, which is not
+good for performance. We also won't have a `.bitmap` file. So a final
+`git repack -adb` will be needed for performance reasons.
+
+### Compacting a snapshot list
+
+The only reason we represent a repository as a list of multiple
+snapshots is that this makes it faster to make new snapshots. For faster
+restores, and to keep the total list size in check, we can collapse
+multiple snapshots into one. This comes down to restoring the repository
+in a temporary directory, up to a known snapshot. Then we make a new
+full (i.e. non-incremental) snapshot from that point-in-time copy, and
+replace all snapshots up to and including that point with a single
+(full) snapshot.
+
+### Snapshot graph representation
+
+We could represent snapshot lists with a SQL table `snapshots` with a
+1-to-1 relation mapping back into itself (the "parent" relation).
+
+Each record in the `snapshots` table would have a corresponding object
+storage blob at some immutable URL.
+
+We need this SQL table as a catalogue of our object storage objects.
+
+## Where to build this
+
+Considering that Praefect will have a SQL database tracking all its
+repositories, and that Praefect is aware of when repositories change and
+a new snapshot is warranted, it would be a candidate for managing
+snapshots.
+
+However, we could also build this in gitlab-rails. That should work fine
+for periodic snapshots, where we take snapshots regardless of whether we
+know/think there was a change in the repository.
+
+We probably don't want to build this in Gitaly itself.
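
The `snapshots` catalogue and its "parent" relation, as described in the proposal, could look roughly like the sketch below. It is only an illustration under stated assumptions: the proposal names the table `snapshots` and nothing else, so the column names (`parent_id`, `blob_url`), the helper functions, the use of SQLite, and the example URLs are all hypothetical. It shows how the restore order (full snapshot first, head last) might be read from the parent links, and how compaction might replace a prefix of the list with a single full snapshot. Creating and uploading the actual snapshot blobs is out of scope here.

```python
import sqlite3

# Hypothetical schema: only the table name `snapshots` and the self-referencing
# "parent" relation come from the proposal; the columns are assumptions.
SCHEMA = """
CREATE TABLE snapshots (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER UNIQUE REFERENCES snapshots(id), -- NULL marks a full snapshot
    blob_url  TEXT NOT NULL UNIQUE                     -- immutable object storage URL
);
"""


def add_snapshot(db, blob_url, parent_id=None):
    """Record a snapshot; without a parent it is a full snapshot."""
    cur = db.execute(
        "INSERT INTO snapshots (parent_id, blob_url) VALUES (?, ?)",
        (parent_id, blob_url),
    )
    return cur.lastrowid


def chain(db, head_id):
    """Follow parent links from head_id back to the full snapshot.

    Returns (id, blob_url) rows ordered oldest first.
    """
    return db.execute(
        """
        WITH RECURSIVE chain(id, parent_id, blob_url, depth) AS (
            SELECT id, parent_id, blob_url, 0 FROM snapshots WHERE id = ?
            UNION ALL
            SELECT s.id, s.parent_id, s.blob_url, c.depth + 1
            FROM snapshots s JOIN chain c ON s.id = c.parent_id
        )
        SELECT id, blob_url FROM chain ORDER BY depth DESC
        """,
        (head_id,),
    ).fetchall()


def restore_order(db, head_id):
    """Blob URLs in the order their packfiles would be installed."""
    return [url for _, url in chain(db, head_id)]


def compact(db, upto_id, full_blob_url):
    """Replace all snapshots up to and including upto_id with one full snapshot.

    full_blob_url is assumed to point at a full snapshot that was already
    built from a point-in-time restore and uploaded by the caller.
    """
    old_ids = [snapshot_id for snapshot_id, _ in chain(db, upto_id)]
    with db:  # commit all three statements together, or roll them back
        new_id = add_snapshot(db, full_blob_url)
        # Re-point the snapshot that followed upto_id (if any) at the new root.
        db.execute(
            "UPDATE snapshots SET parent_id = ? WHERE parent_id = ?",
            (new_id, upto_id),
        )
        marks = ",".join("?" * len(old_ids))
        db.execute(f"DELETE FROM snapshots WHERE id IN ({marks})", old_ids)
    return new_id


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    full = add_snapshot(db, "https://objects.example/repo-1/full-1")
    inc1 = add_snapshot(db, "https://objects.example/repo-1/inc-1", full)
    head = add_snapshot(db, "https://objects.example/repo-1/inc-2", inc1)
    print(restore_order(db, head))
    compact(db, inc1, "https://objects.example/repo-1/full-2")
    print(restore_order(db, head))
```

The `UNIQUE` constraint on `parent_id` is one way to express the 1-to-1 parent relation: each snapshot can be the parent of at most one other snapshot, so the rows for a repository form a list rather than a tree. If `compact` is called on the head of the list, the caller would also need to move the project's head reference to the returned id.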