From 2621fe80ce42c0dec90be7ba99a4a2432bdcdc88 Mon Sep 17 00:00:00 2001 From: Jacob Vosmaer Date: Mon, 23 Mar 2020 03:10:39 +0000 Subject: Add developer documentation about Git object quarantine --- doc/object_quarantine.md | 147 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 147 insertions(+) create mode 100644 doc/object_quarantine.md (limited to 'doc/object_quarantine.md') diff --git a/doc/object_quarantine.md b/doc/object_quarantine.md new file mode 100644 index 000000000..46447c376 --- /dev/null +++ b/doc/object_quarantine.md @@ -0,0 +1,147 @@ +# Git object quarantine during git push + +While receiving a Git push, GitLab can reject pushes using the +`pre-receive` Git hook. Git has a special "object quarantine" +mechanism that allows it to eagerly delete rejected Git objects. + +In this document we will explain how Git object quarantine works, and +how GitLab is able to see quarantined objects. + +## Git object quarantine + +Git object quarantine was introduced in Git 2.11.0 via +https://gitlab.com/gitlab-org/git/-/commit/25ab004c53cdcfea485e5bf437aeaa74df47196d. +To understand what it does we need to know how Git receives pushes on +the server. + +### How Git receives a push + +On a Git server, a push goes into `git receive-pack`. This process does the following things: + +1. receive the Git objects pushed by the client and write them to disk +1. receive the ref update commands from the client and keep them in memory +1. check connectivity (no missing objects) +1. run `pre-receive` and feed it the intended ref update commands +1. if `pre-receive` rejects the push, clean up and stop +1. apply ref update commands one by one. For each command, run the `update` hook which can reject the ref update. +1. after all ref updates have been applied run the `post-receive` hook +1. report success to the client and end the session + +Object quarantine exists for the sake of the cleanup that happens when +`pre-receive` rejects the push (step 5 above). It changes the _timing_ of the +cleanup. Without object quarantine, objects that were part of a +rejected push would sit around until `git gc` would judge them as both +unused and "old". How long that takes depends on how often `git gc` +runs (or `git prune`), and on the configuration of when objects are +"old". Because of object quarantine, rejected objects can be deleted +immediately: Git can just `rm -rf` the quarantine directory and +they're gone. + +### Git implementation + +The Git implementation of this mechanism rests on two things. + +#### 1. Alternate object directories + +The objects in a Git repository can be stored across multiple +directories: 1 main directory, usually `/objects`, and 0 or more +alternate directories. Together these act like a search path: when +looking for an object Git first checks the main directory, then each +alternate, until it finds the object. + +#### 2. Config overrides via environment variables + +Git can inject custom config into subprocesses via environment +variables. In the case of Git object directories, these are +`GIT_OBJECT_DIRECTORY` (the main object directory) and +`GIT_ALTERNATE_OBJECT_DIRECTORIES` (a search path of `:`-separated +alternate object directories). + +#### Putting it all together + +1. `git receive-pack` receives a push +1. `git receive-pack` [creates a quarantine directory `objects/incoming-$RANDOM`](https://gitlab.com/gitlab-org/git/-/blob/v2.24.0/builtin/receive-pack.c#L1715) +1. `git receive-pack` [configures the unpack process](https://gitlab.com/gitlab-org/git/-/blob/v2.24.0/builtin/receive-pack.c#L1721) to write objects into the quarantine directory +1. `git receive-pack` unpacks the objects into the quarantine directory +1. `git receive-pack` [runs the `pre-receive` hook](https://gitlab.com/gitlab-org/git/-/blob/v2.24.0/builtin/receive-pack.c#L1498) with special `GIT_OBJECT_DIRECTORY` and `GIT_ALTERNATE_OBJECT_DIRECTORIES` environment variables that add the quarantine directory to the search path +1. If the `pre-receive` hook rejects the push, `git receive-pack` removes the quarantine directory and its contents. The push is aborted. +1. If the `pre-receive` hook passes, `git receive-pack` [merges the quarantine directory into the main object directory](https://gitlab.com/gitlab-org/git/-/blob/v2.24.0/builtin/receive-pack.c#L1510). +1. `git receive-pack` enters the ref update transaction + +Note that by the time the `update` hook runs, the quarantine directory +has already been merged into the main object directory so it no longer +matters. The same goes for the `post-receive` hook which runs even +later. + +Because `pre-receive` has the special quarantine configuration data in +environment variables, any `git` process spawned by `pre-receive` will +inherit the quarantine config and will be able to see the objects that +are being pushed. + +## GitLab and Git object quarantine + +### Why does all this matter to GitLab + +GitLab uses Git hooks, among other things, to implement features that +can reject Git pushes. For example, you can mark a branch as +"protected" in the GitLab web UI, and then certain types of users can +no longer push to that branch. That feature is implemented via the [Git +`pre-receive` hook](https://gitlab.com/gitlab-org/gitaly/-/blob/969bac80e2f246867c1a976864bd1f5b34ee43dd/ruby/gitlab-shell/hooks/pre-receive). + +As mentioned above, Git object quarantine normally works more or less +automatically because `git` commands spawned by the `pre-receive` hook +inherit the special environment variables that contain the path to the +quarantine directory. In the case of GitLab's hooks we have a problem, +however, because the GitLab hooks are "dumb". All the GitLab hooks do +is take the inputs of the hook executable (the list of ref update +commands) and send them to the GitLab Rails internal API via a POST +request. The application logic that decides whether the push is +allowed resides in Rails. The hook just waits and reports back result +of the POST API request to GitLab. + +During the POST, the internal GitLab API makes Gitaly calls back into the repo to +examine the objects being pushed. For example, if force pushes are not +allowed, GitLab will call the IsAncestor RPC. That RPC call then wants +to look at a commit that is in the process of being pushed. But +because that commit is in quarantine, the RPC will fail because the +commit cannot be found. + +### How GitLab passes the object quarantine information around + +To overcome this problem, the GitLab `pre-receive` hook [reads the +object directory configuration from its +environment](https://gitlab.com/gitlab-org/gitaly/-/blob/969bac80e2f246867c1a976864bd1f5b34ee43dd/ruby/gitlab-shell/lib/object_dirs_helper.rb#L3), +and passes this information [along with the HTTP API +call](https://gitlab.com/gitlab-org/gitaly/-/blob/969bac80e2f246867c1a976864bd1f5b34ee43dd/ruby/gitlab-shell/lib/gitlab_access.rb#L24). +On the Rails side, we then [put the object directory information in +the "request +store"](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/api/internal/base.rb#L43) +(i.e., request-scoped thread-local storage). And then during that +Rails request, when Rails makes Gitaly requests on this repo, we send +back the quarantine information [in the Gitaly `Repository` +struct](https://gitlab.com/gitlab-org/gitlab/-/blob/f81f30c29a0edce20f6737fdccc3315c8baab9d1/lib/gitlab/gitaly_client/util.rb#L8-17). +And finally, inside Gitaly, when we spawn a Git process, we [re-create +the environment +variables](https://gitlab.com/gitlab-org/gitaly/-/blob/969bac80e2f246867c1a976864bd1f5b34ee43dd/internal/git/alternates/alternates.go#L21-34) +that were present on the `pre-receive` hook, so that we can see the +quarantined objects. We do the same when we [instantiate a +Gitlab::Git::Repository in +gitaly-ruby](https://gitlab.com/gitlab-org/gitaly/-/blob/969bac80e2f246867c1a976864bd1f5b34ee43dd/ruby/lib/gitlab/git/repository.rb#L44). + +### Relative paths + +During the Gitaly migration we had to handle a complication with the +object quarantine information: Git uses absolute paths for this. These +paths get generated wherever `git receive-pack` runs, i.e., on the +Gitaly server. During the migration, the repositories were also +accessible via NFS at the Rails side, but at a different path. That +meant that the absolute paths supplied by Git would be invalid part of +the time. + +To work around this, the GitLab `pre-receive` hook [converts the +absolute paths from Git into relative +paths](https://gitlab.com/gitlab-org/gitaly/-/blob/969bac80e2f246867c1a976864bd1f5b34ee43dd/ruby/gitlab-shell/lib/object_dirs_helper.rb#L16), +relative to the repository directory. These relative paths then get +passed around inside GitLab. At the time Gitaly recreates the object +directory variables, it [converts the paths back from relative to +absolute](https://gitlab.com/gitlab-org/gitaly/-/blob/969bac80e2f246867c1a976864bd1f5b34ee43dd/internal/git/alternates/alternates.go#L23). -- cgit v1.2.3