Welcome to mirror list, hosted at ThFree Co, Russian Federation.

gitlab.com/gitlab-org/gitaly.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorZeger-Jan van de Weg <git@zjvandeweg.nl>2019-08-19 11:22:23 +0300
committerZeger-Jan van de Weg <git@zjvandeweg.nl>2019-08-19 11:22:23 +0300
commit53218e3384567f6b23a7dffac1683173ab0c2be0 (patch)
treececacadbaaa522990696f46053711eea279db2a0
parentd1a53c93536bcf0672594ffabd0fb019c135074d (diff)
parent4f9d7cb6a17269ff1888ba5b974b94cbe10afd38 (diff)
Merge branch 'jv-proposal-snapgraph' into 'master'
Snapshot storage for repositories See merge request gitlab-org/gitaly!1279
-rw-r--r--doc/README.md6
-rw-r--r--doc/proposals/snapshot-storage.md97
2 files changed, 102 insertions, 1 deletions
diff --git a/doc/README.md b/doc/README.md
index 88f223f7c..39e60caf8 100644
--- a/doc/README.md
+++ b/doc/README.md
@@ -26,4 +26,8 @@ For configuration please read [praefects configuration documentation](doc/config
#### Technical explanations
-- [Delta Islands](doc/delta_islands.md)
+- [Delta Islands](delta_islands.md)
+
+#### Proposals
+
+- [Snapshot storage](proposals/snapshot-storage.md)
diff --git a/doc/proposals/snapshot-storage.md b/doc/proposals/snapshot-storage.md
new file mode 100644
index 000000000..a504c9877
--- /dev/null
+++ b/doc/proposals/snapshot-storage.md
@@ -0,0 +1,97 @@
+# Proposal: snapshot storage for Git repositories
+
+## High level summary
+
+Gitaly as it exists today is a service that stores all its state on a
+local filesystem. This filesystem must therefore be durable. In this
+document we describe an alternative storage system which can store
+repository snapshots in SQL and object storage (e.g. S3).
+
+Key properties:
+
+- Use a SQL database as a catalogue of the repositories in snapshot storage
+- Git data (objects and refs) is stored as cold "snapshots" in object
+ storage
+- snapshots can have a "parent", so a repository is stored as a linked
+ list of snapshots. The linked list relation is stored in SQL.
+- to use the repository it must first be copied down to a local
+ filesystem
+
+Possible applications:
+
+- incremental repository backups
+- cold storage for repositories
+
+## Primitives
+
+### Git repository snapshots
+
+In [MR 1244](https://gitlab.com/gitlab-org/gitaly/merge_requests/1244)
+we have an example of how we can use Git plumbing commands to
+efficiently create incremental snapshots of an entire Git repository,
+where each snapshot may be stored as a single blob. We do this by
+combining a full dump of the ref database of the repository,
+concatenated with either a full or incremental Git packfile.
+
+A snapshot can either be full (no "parent") or it can be incremental
+relative to a previous snapshots (its parent). The snapshots are
+incremental in exactly the same way that `git fetch` is incremental.
+
+### Snapshot list
+
+Once we can make full and incremental snapshots of a repository, we can
+represent that repository as a linked list of snapshots where the first
+element must be a full snapshot, and each later element is incremental.
+
+Within this snapshot list, we can think of a project as a reference to
+its latest snapshot: it is the head of the list.
+
+### Rebuilding a repository from its snapshots
+
+To rebuild a repository from its snapshots we must "install" all
+packfiles in its list on the Gitaly server we are using. This means more
+than just downloading, because a snapshot only contains the data that
+goes in `.pack` files, and this data is useless without a corresponding
+`.idx`. This works just the same as `git clone` and `git fetch`, where
+it is up to the client (the user) to have their local computer compute
+`.idx` files. Once all the packfiles in the graph of the repository have
+been instantiated along with their `.idx` companions, we bulk-import the
+ref database from the most recent snapshot.
+
+After this it is possible that we have a lot of packfiles, which is not
+good for performance. We also won't have a `.bitmap` file. So a final
+`git repack -adb` will be needed for performance reasons.
+
+### Compacting a snapshot list
+
+The only reason we represent a repository as a list of multiple
+snapshots is that this makes it faster to make new snapshots. For faster
+restores, and to keep the total list size in check, we can collapse
+multiple snapshots into one. This comes down to restoring the repository
+in a temporary directory, up to a known snapshot. Then we make a new
+full (i.e. non-incremental) snapshot from that point-in-time copy, and
+replace all snapshots up to and including that point with a single
+(full) snapshot.
+
+### Snapshot graph representation
+
+We could represent snapshots lists with a SQL table `snapshots` with a
+1-to-1 relation mapping back into itself (the "parent" relation).
+
+Each record in the `snapshots` table would have a corresponding object
+storage blob at some immutable URL.
+
+We need this SQL table as a catalogue of our object storage objects.
+
+## Where to build this
+
+Considering that Praefect will have a SQL database tracking all its
+repositories, and that Praefect is aware of when repositories change and
+a new snapshot is warranted, it would be a candidate for managing
+snapshots.
+
+However, we could also build this in gitlab-rails. That should work fine
+for periodic snapshots, where we take snapshots regardless of whether we
+know/think there was a change in the repository.
+
+We probably don't want to build this in Gitaly itself.