---
stage: Systems
group: Gitaly
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---

# Managing monorepos

Monorepos have become a regular part of development team workflows. While they have many advantages, monorepos can present performance challenges
when using them in GitLab. Therefore, you should know:

- What repository characteristics can impact performance.
- Some tools and steps to optimize monorepos.

## Impact on performance

Because GitLab is a Git-based system, it is subject to similar performance
constraints as Git when it comes to large repositories that are gigabytes in
size.

Monorepos can be large for [many reasons](https://about.gitlab.com/blog/2022/09/06/speed-up-your-monorepo-workflow-in-git/#characteristics-of-monorepos).

Large repositories pose a performance risk when used in GitLab, especially when a monorepo receives many clones or pushes each day, which is common.

Git itself has performance limitations when it comes to handling
monorepos.

Monorepos can also strain hardware, in some cases hitting limits on vertical scaling and on network or disk bandwidth.

[Gitaly](https://gitlab.com/gitlab-org/gitaly) is our Git storage service built
on top of [Git](https://git-scm.com/). This means that any limitations of
Git are experienced in Gitaly, and in turn by end users of GitLab.

## Optimize GitLab settings

You should use as many of the following strategies as possible to minimize
fetches on the Gitaly server.

### Rationale

The most resource-intensive operation in Git is the
[`git-pack-objects`](https://git-scm.com/docs/git-pack-objects) process. It is
responsible for figuring out all of the commit history and files to send back to
the client.

The larger the repository (the more commits, files, branches, and tags it has),
the more expensive this operation is. Both memory and CPU are heavily utilized
during this operation.

Most `git clone` or `git fetch` traffic (which starts a `git-pack-objects` process on the server) comes from automated
continuous integration systems such as GitLab CI/CD or other CI/CD systems.
If there is a high amount of such traffic, hitting a Gitaly server with many
clones for a large repository is likely to put the server under significant
strain.

### Gitaly pack-objects cache

Turn on the [Gitaly pack-objects cache](../../../administration/gitaly/configure_gitaly.md#pack-objects-cache),
which reduces the work that the server has to do for clones and fetches.

#### Rationale

The [pack-objects cache](../../../administration/gitaly/configure_gitaly.md#pack-objects-cache)
caches the data that the `git-pack-objects` process produces. This response
is sent back to the Git client initiating the clone or fetch. If several
fetches are requesting the same set of refs, Git on the Gitaly server doesn't have
to regenerate the response data with each clone or fetch call, but instead serves
that data from a cache that Gitaly maintains.

This can help immensely in the presence of a high rate of clones for a single
repository.

For more information, see [Pack-objects cache](../../../administration/gitaly/configure_gitaly.md#pack-objects-cache).

### Reduce concurrent clones in CI/CD

CI/CD loads tend to be concurrent because pipelines are scheduled during set times.
As a result, Git requests against the repositories can spike notably during
these times and reduce performance for both CI/CD and users.

Reduce CI/CD pipeline concurrency by staggering pipelines to run at different times.
For example, run one set of pipelines at one time and another set several minutes
later.

### Shallow cloning

In your CI/CD systems, set the
[`--depth`](https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---depthltdepthgt)
option in the `git clone` or `git fetch` call.

GitLab and GitLab Runner perform a [shallow clone](../../../ci/pipelines/settings.md#limit-the-number-of-changes-fetched-during-clone)
by default.

If possible, set the clone depth to a small number like 10. Shallow clones make Git request only
the latest set of changes for a given branch, up to the desired number of commits.

This significantly speeds up fetching changes from Git repositories,
especially if the repository has a very long history containing a number
of big files, because it reduces the amount of data transferred.

The following GitLab CI/CD pipeline configuration example sets `GIT_DEPTH` to `10`:

```yaml
variables:
  GIT_DEPTH: 10

test:
  script:
    - ls -al
```

### Git strategy

Use `git fetch` instead of `git clone` on CI/CD systems if it's possible to keep
a working copy of the repository.

By default, GitLab is configured to use the [`fetch` Git strategy](../../../ci/runners/configure_runners.md#git-strategy),
which is recommended for large repositories.

#### Rationale

`git clone` gets the entire repository from scratch, whereas `git fetch` only
asks the server for references that do not already exist in the repository.
Naturally, `git fetch` causes the server to do less work. `git-pack-objects`
doesn't have to go through all branches and tags and roll everything up into a
response that gets sent over. Instead, it only has to worry about a subset of
references to pack up. This strategy also reduces the amount of data to transfer.
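
As a minimal sketch, a pipeline could select this strategy explicitly by setting the `GIT_STRATEGY` variable. The `build` job name and its script are placeholders for illustration. Because `fetch` is already the default, setting it might only be needed when the default has been overridden elsewhere.

```yaml
variables:
  # Reuse the existing working copy and fetch only what is missing.
  GIT_STRATEGY: fetch

build:
  script:
    - ls -al
```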

### Git clone path

[`GIT_CLONE_PATH`](../../../ci/runners/configure_runners.md#custom-build-directories) allows you to
control where you clone your repositories. This can have implications if you
heavily use big repositories with a fork-based workflow.

A fork, from the perspective of GitLab Runner, is stored as a separate repository
with a separate worktree. This means GitLab Runner cannot optimize the usage
of worktrees on its own, and you might have to instruct it to do so by setting
`GIT_CLONE_PATH`.

In such cases, ideally dedicate the GitLab Runner executor to the given project
instead of sharing it across different projects, which makes this process more
efficient.

The [`GIT_CLONE_PATH`](../../../ci/runners/configure_runners.md#custom-build-directories) must be
in the directory set in `$CI_BUILDS_DIR`. You can't pick an arbitrary path on disk.
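
For illustration, a pipeline might pin the clone location as follows. This is a sketch that assumes custom build directories are enabled on the runner. The path layout and the `build` job are hypothetical examples, not a recommended structure.

```yaml
variables:
  # Must resolve to a path inside $CI_BUILDS_DIR.
  # $CI_CONCURRENT_ID keeps concurrent jobs on the same runner in separate worktrees.
  GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME

build:
  script:
    - pwd
```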

### Git clean flags

[`GIT_CLEAN_FLAGS`](../../../ci/runners/configure_runners.md#git-clean-flags) allows you to control
whether or not you require the `git clean` command to be executed for each CI/CD
job. By default, GitLab ensures that:

- You have your worktree on the given SHA.
- Your repository is clean.

[`GIT_CLEAN_FLAGS`](../../../ci/runners/configure_runners.md#git-clean-flags) is disabled when set
to `none`. On very big repositories, this might be desired because `git clean`
is disk I/O intensive. Setting `GIT_CLEAN_FLAGS: -ffdx -e .build/` (for example)
keeps some directories in the worktree between subsequent runs, which can
speed up incremental builds. This has the biggest effect if you reuse existing
machines and have an existing worktree that you can reuse for builds.

For exact parameters accepted by
[`GIT_CLEAN_FLAGS`](../../../ci/runners/configure_runners.md#git-clean-flags), see the documentation
for [`git clean`](https://git-scm.com/docs/git-clean). The available parameters
are dependent on the Git version.
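
A minimal sketch of setting these flags in a pipeline is shown here. The `.build/` directory and the `build` job are assumed examples. Substitute whichever directories your builds can safely keep between runs.

```yaml
variables:
  # Remove untracked and ignored files and directories,
  # but exclude .build/ so it survives between runs.
  GIT_CLEAN_FLAGS: -ffdx -e .build/

build:
  script:
    - ls -al
```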

### Git fetch extra flags

[`GIT_FETCH_EXTRA_FLAGS`](../../../ci/runners/configure_runners.md#git-fetch-extra-flags) allows you
to modify `git fetch` behavior by passing extra flags.

For example, if your project contains a large number of tags that your CI/CD jobs don't rely on,
you could add [`--no-tags`](https://git-scm.com/docs/git-fetch#Documentation/git-fetch.txt---no-tags)
to the extra flags to make your fetches faster and more compact.

Even if your repository does _not_ contain a lot of
tags, `--no-tags` can [make a big difference in some cases](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/746).
If your CI/CD builds do not depend on Git tags, setting `--no-tags` is worth trying.

For more information, see the [`GIT_FETCH_EXTRA_FLAGS` documentation](../../../ci/runners/configure_runners.md#git-fetch-extra-flags).
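
As a sketch, skipping tags in a pipeline could look like the following. It assumes that setting the variable replaces the runner's default extra flags, so `--prune` is listed again here. Check the linked documentation for the defaults that apply to your runner version.

```yaml
variables:
  # Assumption: this value replaces the runner's default extra flags,
  # so --prune is repeated to keep pruning stale remote-tracking refs.
  GIT_FETCH_EXTRA_FLAGS: --prune --no-tags
```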

### Configure Gitaly negotiation timeouts

You might experience a `fatal: the remote end hung up unexpectedly` error when attempting to fetch or archive:

- Large repositories.
- Many repositories in parallel.
- The same large repository in parallel.

You can attempt to mitigate this issue by increasing the default negotiation timeout values. For more information, see
[Configure negotiation timeouts](../../../administration/gitaly/configure_gitaly.md#configure-negotiation-timeouts).

## Optimize your repository

Another avenue to keeping GitLab scalable with your monorepo is to optimize the
repository itself.

### Profiling repositories

Large repositories generally experience performance issues in Git. Knowing why
your repository is large can help you develop mitigation strategies to avoid
performance problems.

You can use [`git-sizer`](https://github.com/github/git-sizer) to get a snapshot
of repository characteristics and discover problem aspects of your monorepo.

For example, running `git-sizer --verbose` in a large repository produces output like:

```shell
Processing blobs: 1652370
Processing trees: 3396199
Processing commits: 722647
Matching commits to trees: 722647
Processing annotated tags: 534
Processing references: 539
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |   723 k   | *                              |
|   * Total size               |   525 MiB | **                             |
| * Trees                      |           |                                |
|   * Count                    |  3.40 M   | **                             |
|   * Total size               |  9.00 GiB | ****                           |
|   * Total tree entries       |   264 M   | *****                          |
| * Blobs                      |           |                                |
|   * Count                    |  1.65 M   | *                              |
|   * Total size               |  55.8 GiB | *****                          |
| * Annotated tags             |           |                                |
|   * Count                    |   534     |                                |
| * References                 |           |                                |
|   * Count                    |   539     |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  72.7 KiB | *                              |
|   * Maximum parents      [2] |    66     | ******                         |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |  1.68 k   | *                              |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |  13.5 MiB | *                              |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |   136 k   |                                |
| * Maximum tag depth      [5] |     1     |                                |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [6] |  4.38 k   | **                             |
| * Maximum path depth     [7] |    13     | *                              |
| * Maximum path length    [8] |   134 B   | *                              |
| * Number of files        [9] |  62.3 k   | *                              |
| * Total size of files    [9] |   747 MiB |                                |
| * Number of symlinks    [10] |    40     |                                |
| * Number of submodules       |     0     |                                |
```

In this example, a few items are raised with a high level of concern. See the
following sections for information on solving:

- A large number of references.
- Large blobs.

### Large number of references

A reference in Git (a branch or tag) is used to refer to a commit. If you are
curious, you can go to any `.git` directory and look under the `refs` directory.

A large number of references can cause performance problems because, with more
references, object walks that Git does are larger for various operations such as
clones, pushes, and housekeeping tasks.

#### Mitigation strategies

To mitigate the effects of a large number of references in a monorepo:

- Create an automated process for cleaning up old branches.
- If certain references don't need to be visible to the client, hide them using the
  [`transfer.hideRefs`](https://git-scm.com/docs/git-config#Documentation/git-config.txt-transferhideRefs)
  configuration setting. Because Gitaly ignores any on-server Git configuration, you must change the Gitaly configuration
  itself in `/etc/gitlab/gitlab.rb`:

  ```ruby
  gitaly['configuration'] = {
    # ...
    git: {
      # ...
      config: [
        # ...
        { key: "transfer.hideRefs", value: "refs/namespace_to_hide" },
      ],
    },
  }
  ```

In Git 2.42.0 and later, different Git operations can skip over hidden references
when doing an object graph walk.

### Large blobs

Large files (called blobs in Git) can be problematic for Git
because it does not handle large binary files efficiently. If the `git-sizer`
output shows blobs over 10 MB, for instance, this probably means there is binary
data in your repository.

#### Use LFS for large blobs

Store binary or blob files (for example, packages, audio, video, or graphics)
as Large File Storage (LFS) objects. With LFS, the objects are stored externally, such as in Object
Storage, which reduces the number and size of objects in the repository. Storing
objects in external Object Storage can improve performance.

For more information, refer to the [Git LFS documentation](../../../topics/git/lfs/index.md).

### Reference architectures

Large repositories tend to be found in larger organizations with many users. The
GitLab Quality Engineering and Support teams provide several [reference architectures](../../../administration/reference_architectures/index.md) that
are the recommended way to deploy GitLab at scale.

In these types of setups, the GitLab environment used should match a reference
architecture to improve performance.