Welcome to mirror list, hosted at ThFree Co, Russian Federation.

index.md « bundle_uri « blueprints « architecture « doc - gitlab.com/gitlab-org/gitlab-foss.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
blob: a056649a79872d93a602fd48350990207557b33a (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
---
status: proposed
creation-date: "2023-08-04"
authors: [ "@toon" ]
coach: ""
approvers: [ "@mjwood", "@jcaigitlab" ]
owning-stage: "~devops::systems"
participating-stages: []
---

<!-- Blueprints often contain forward-looking statements -->
<!-- vale gitlab.FutureTense = NO -->

# Utilize bundle-uri to reduce Gitaly CPU load

## Summary

[bundle-URI](https://git-scm.com/docs/bundle-uri) is a fairly new concept
in Git that allows the client to download one or more bundles in order to
bootstrap the object database in advance of fetching the remaining objects from
a remote. By having the client download static files from a simple HTTP(S)
server in advance, the work that needs to be done on the remote side is reduced.

Git bundles are files that store a packfile along with some extra metadata,
including a set of refs and a (possibly empty) set of necessary commits. When a
user clones a repository, the server can advertise one or more URIs that serve
these bundles. The client can download these to populate the Git object
database. After it has done this, the negotiation process between server and
client start to see which objects need be fetched. When the client pre-populated
the database with some data from the bundles, the negotiation and transfer of
objects from the server is reduced, putting less load on the server's CPU.

## Motivation

When a user pushes changes, it usually kicks off a CI pipeline with
a bunch of jobs. When the CI runners all clone the repository from scratch,
if they use [`git clone`](/ee/ci/pipelines/settings.md#choose-the-default-git-strategy),
they all start negotiating with the server what they need to clone. This is
really CPU intensive for the server.

Some time ago we've introduced the
[pack-objects](/ee/administration/gitaly/configure_gitaly.md#pack-objects-cache),
but it has some pitfalls. When the tip of a branch changes, a new packfile needs
to be calculated, and the cache needs to be refreshed.

Git bundles are more flexible. It's not a big issue if the bundle doesn't have
all the most recent objects. When it contains a fairly recent state, but is
missing the latest refs, the client (that is, the CI runner) will do a "catch up" and
fetch additional objects after applying the bundle. The set of objects it has to
fetch from will Gitaly be a lot smaller.

### Goals

Reduce the work that needs to be done on the Gitaly servers when a client clones
a repository. This is particularly useful for CI build farms, which generate a
lot of traffic on each commit that's pushed to the server.

With the use bundles, the server has to craft a smaller delta packfiles
compared to the pack files that contain all the objects when no bundles are
used. This reduces the load on the CPU of the server. This has a benefit on the
packfile cache as well, because now the packfiles are smaller and faster to
generate, reducing the chances on cache misses.

### Non-Goals

Using bundle-URIs will **not** reduce the size of repositories stored on disk.
This feature will not be used to offload repositories, neither fully nor
partially, from the Gitaly node to some cloud storage. In contrary, because
bundles are stored elsewhere, some data is duplicated, and will cause increased
storage costs.

In this phase it's not the goal to boost performance for incremental
fetches. When the client has already cloned the repository, bundles won't be
used to optimize fetches new data.

Currently bundle-URI is not fully compatible with shallow clones, therefore
we'll leave that out of scope. More info about that in
[Git issue #170](https://gitlab.com/gitlab-org/git/-/issues/170).

## Proposal

When a client clones a repository, Gitaly advertises a bundle URI. This URI
points to a bundle that's refreshed on a regular interval, for example during
housekeeping. For each repository only one bundle will exist, so when a new one
is created, the old one is invalidated.

The bundles will be stored on a cloud Object Storage. To use bundles, the
administrator should configure this in Gitaly.

## Design and implementation details

When a client initiates a `git clone`, on the server-side Gitaly spawns a
`git upload-pack` process. Gitaly can pass along additional Git
configuration. To make `git upload-pack` advertise bundle URIs, it should pass
the following configuration:

- `uploadpack.advertiseBundleURIs` :: This should be set to `true` to enable to
  use of advertised bundles.
- `bundle.version` :: At the moment only `1` is accepted.
- `bundle.mode` :: This can be either `any` or `all`. Since we only want to use
  bundles for the initial clone, `any` is advised.
- `bundle.<id>.uri` :: This is the actual URI of the bundle identified with
  `<id>`. Initially we will only have one bundle per repository.

### Enable the use of advertised bundles on the client-side

The current version of Git does not use the advertised bundles by default when
cloning or fetching from a remote.
Luckily, we control most of the CI runners ourself. So to use bundle URI, we can
modify the Git configuration used by the runners and set
`transfer.bundleURI=true`.

### Access control

We don't want to leak data from private repositories through public HTTP(S)
hosts. There are a few options for how we can overcome this:

- Only activate the use of bundle-URI on public repositories.
- Use a solution like [signed-URLs](https://cloud.google.com/cdn/docs/using-signed-urls).

#### Public repositories only

Gitaly itself does not know if a project, and its repository, is public, so to
determine whether bundles can be used, GitLab Rails has to tell Gitaly. It's
complex to pass this information to Gitaly, and using this approach will make
the feature only available for public projects, so we will not proceed with this
solution.

#### Signed URLs

The use of [signed-URLs](https://cloud.google.com/cdn/docs/using-signed-urls) is
another option to control access to the bundles. This feature, provided by
Google Cloud, allows Gitaly to create a URI that has a short lifetime.

The downside to this approach is it depends on a feature that is
cloud-specific, so each cloud provider might provide such feature slightly
different, or not have it. But we want to roll this feature out on GitLab.com
first, which is hosted on Google Cloud, so for a first iteration we will use
this.

### Bundle creation

#### Use server-side backups

At the moment Gitaly knows how to back up repositories into bundles onto cloud
storage. The [documentation](https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/gitaly-backup.md#user-content-server-side-backups)
describes how to use it.

For the initial implementation of bundle-URI we can piggy-back onto this
feature. An admin should create backups for the repositories they want to use
bundle-URI. With the existing configuration for backups, Gitaly can access cloud
storage.

#### As part of housekeeping

Gitaly has a housekeeping worker that daily looks for repositories to optimize.
Ideally we create a bundle right after the housekeeping (that is, garbage collection
and repacking) is done. This ensures the most optimal bundle file.

There are a few things to keep in mind when automatically creating bundles:

- **Does the bundle need to be recreated?** When there wasn't much activity on
  the repository it's probably not needed to create a new bundle file, as the
  client can fetch missing object directly from Gitaly anyway. The housekeeping
  tasks uses various heuristics to determine which strategy is taken for the
  housekeeping job, we can reuse parts of this logic in the creation of bundles.
- **Is it even needed to create a bundle?** Some repositories might be very
  small, or see very little activity. Creating a bundle for these, and
  duplicating it's data to object storage doesn't provide much value and only
  generates cost and maintenance.

#### Controlled by GitLab Rails

Because bundles increase the cost on storage, we eventually want to give the
GitLab administrator full control over the creation of bundles. To achieve this,
bundle-URI settings will be available on the GitLab admin interface. Here the
admin can configure per project which have bundle-URI enabled.

### Configuration

To use this feature, Gitaly needs to be configured. For this we'll add the
following settings to Gitaly's configuration file:

- `bundle_uri.strategy` :: This indicates which strategy should be used to
  create and serve bundle-URIs. At the moment the only supported value is
  "backups". When this setting to that value, Gitaly checks if a server-side
  backup is available and use that.
- `bundle_uri.sign_urls` :: When set to true, the cloud storage URLs are not
  passed to the client as-is, but are transformed into a signed URL. This
  setting is optional and only support Google Cloud Storage (for now).

The credentials to access cloud storage are reused as described in the Gitaly
Backups documentation.

### Storing metadata

For now all metadata needed to store bundles on the cloud is managed by Gitaly
server-side backups.

### Bundle cleanup

At some point the admin might decide to cleanup bundles for one or more
repositories, an admin command should be added for this. Because we're now only
using bundles created by `gitaly-backup`, we leave this out of scope.

### Gitaly Cluster compatibility

Creating server-side backups doesn't happen through Praefect at the moment. It's
up to the admin to address the nodes where they want to create backups from. If
they make sure the node is up-to-date, all nodes will have access to up-to-date
bundles and can pass proper bundle-URI parameters to the client. So no extra
work is needed to reuse server-side backup bundles with bundle-URI.

## Alternative Solutions

No alternative solutions are suggested at the moment.