gitlab.com/gitlab-org/gitlab-foss.git
Diffstat (limited to 'doc/user/project/repository/monorepos/index.md')
-rw-r--r-- doc/user/project/repository/monorepos/index.md 356
1 files changed, 356 insertions, 0 deletions
diff --git a/doc/user/project/repository/monorepos/index.md b/doc/user/project/repository/monorepos/index.md
new file mode 100644
index 00000000000..144f46cd7d5
--- /dev/null
+++ b/doc/user/project/repository/monorepos/index.md
@@ -0,0 +1,356 @@
+---
+stage: Systems
+group: Gitaly
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
+---
+
+# Managing monorepos
+
+Monorepos have become a regular part of development team workflows. While they have many advantages, monorepos can present performance challenges
+when using them in GitLab. Therefore, you should know:
+
+- What repository characteristics can impact performance.
+- Some tools and steps to optimize monorepos.
+
+## Impact on performance
+
+Because GitLab is a Git-based system, it is subject to similar performance
+constraints as Git when it comes to large repositories that are gigabytes in
+size.
+
+Monorepos can be large for [many reasons](https://about.gitlab.com/blog/2022/09/06/speed-up-your-monorepo-workflow-in-git/#characteristics-of-monorepos).
+
+Large repositories pose a performance risk when used in GitLab, especially if a large monorepo receives many clones or pushes a day, which is common for monorepos.
+
+Git itself has performance limitations when it comes to handling
+monorepos.
+
+Monorepos can also have a notable impact on hardware, in some cases hitting limits on vertical scaling and on network or disk bandwidth.
+
+[Gitaly](https://gitlab.com/gitlab-org/gitaly) is our Git storage service built
+on top of [Git](https://git-scm.com/). This means that any limitations of
+Git are experienced in Gitaly, and in turn by end users of GitLab.
+
+## Optimize GitLab settings
+
+You should use as many of the following strategies as possible to minimize
+fetches on the Gitaly server.
+
+### Rationale
+
+The most resource intensive operation in Git is the
+[`git-pack-objects`](https://git-scm.com/docs/git-pack-objects) process. It is
+responsible for figuring out all of the commit history and files to send back to
+the client.
+
+The more commits, files, branches, and tags a repository has, the more expensive
+this operation is. Both memory and CPU are heavily utilized during this operation.
+
+Most `git clone` or `git fetch` traffic (which starts a `git-pack-objects` process on the server) comes from automated
+continuous integration systems such as GitLab CI/CD or other CI/CD systems.
+If there is a high volume of such traffic, hitting a Gitaly server with many
+clones for a large repository is likely to put the server under significant
+strain.
+
+### Gitaly pack-objects cache
+
+Turn on the [Gitaly pack-objects cache](../../../../administration/gitaly/configure_gitaly.md#pack-objects-cache),
+which reduces the work that the server has to do for clones and fetches.
+
+#### Rationale
+
+The [pack-objects cache](../../../../administration/gitaly/configure_gitaly.md#pack-objects-cache)
+caches the data that the `git-pack-objects` process produces. This response
+is sent back to the Git client initiating the clone or fetch. If several
+fetches are requesting the same set of refs, Git on the Gitaly server doesn't have
+to regenerate the response data with each clone or fetch call, but instead serves
+that data from a cache that Gitaly maintains.
+
+This can help immensely in the presence of a high rate of clones for a single
+repository.
+
+For more information, see [Pack-objects cache](../../../../administration/gitaly/configure_gitaly.md#pack-objects-cache).
+
+### Reduce concurrent clones in CI/CD
+
+CI/CD loads tend to be concurrent because pipelines are scheduled during set times.
+As a result, the Git requests against the repositories can spike notably during
+these times and lead to reduced performance for both CI/CD and users.
+
+Reduce CI/CD pipeline concurrency by staggering pipelines to run at different times.
+For example, run one set of pipelines at one time and another set several minutes
+later.
+
+### Shallow cloning
+
+In your CI/CD systems, set the
+[`--depth`](https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---depthltdepthgt)
+option in the `git clone` or `git fetch` call.
+
+GitLab and GitLab Runner perform a [shallow clone](../../../../ci/pipelines/settings.md#limit-the-number-of-changes-fetched-during-clone)
+by default.
+
+If possible, set the clone depth to a small number, such as 10. Shallow clones make Git request only
+the latest set of changes for a given branch, up to the desired number of commits.
+
+This significantly speeds up fetching of changes from Git repositories,
+especially if the repository has a very long history containing many
+big files, because it effectively reduces the amount of data transferred.
+
+The following GitLab CI/CD pipeline configuration example sets the `GIT_DEPTH`.
+
+```yaml
+variables:
+  GIT_DEPTH: 10
+
+test:
+  script:
+    - ls -al
+```
+
+### Git strategy
+
+Use `git fetch` instead of `git clone` on CI/CD systems if it's possible to keep
+a working copy of the repository.
+
+By default, GitLab is configured to use the [`fetch` Git strategy](../../../../ci/runners/configure_runners.md#git-strategy),
+which is recommended for large repositories.
+
+#### Rationale
+
+`git clone` gets the entire repository from scratch, whereas `git fetch` only
+asks the server for references that do not already exist in the repository.
+Naturally, `git fetch` causes the server to do less work. `git-pack-objects`
+doesn't have to go through all branches and tags and roll everything up into a
+response that gets sent over. Instead, it only has to worry about a subset of
+references to pack up. This strategy also reduces the amount of data to transfer.
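+
+The strategy can also be set explicitly in `.gitlab-ci.yml`. The following is a minimal sketch:
+`GIT_STRATEGY` is the runner variable described on the linked page, while the job name `test`
+and its script are placeholders chosen for illustration.
+
+```yaml
+variables:
+  # Reuse the existing working copy and fetch only what is missing.
+  GIT_STRATEGY: fetch
+
+test:
+  script:
+    - ls -al
+```
+
+Because `fetch` depends on a working copy surviving between jobs, it is likely to help most on
+runners that reuse their build directories.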
+
+### Git clone path
+
+[`GIT_CLONE_PATH`](../../../../ci/runners/configure_runners.md#custom-build-directories) allows you to
+control where you clone your repositories. This can have implications if you
+heavily use big repositories with a fork-based workflow.
+
+A fork, from the perspective of GitLab Runner, is stored as a separate repository
+with a separate worktree. That means that GitLab Runner cannot optimize the usage
+of worktrees on its own, and you might have to instruct it to do so with `GIT_CLONE_PATH`.
+
+In such cases, ideally dedicate the GitLab Runner executor to the given project
+instead of sharing it across different projects, which makes this process more efficient.
+
+The [`GIT_CLONE_PATH`](../../../../ci/runners/configure_runners.md#custom-build-directories) must be
+in the directory set in `$CI_BUILDS_DIR`. You can't pick any path from disk.
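+
+As a rough sketch of the constraint above, the following configuration keeps the clone path
+under `$CI_BUILDS_DIR`. It assumes that custom build directories are enabled on the runner.
+The `$CI_CONCURRENT_ID/$CI_PROJECT_NAME` layout is only one possible choice, and the job name
+`test` is a placeholder.
+
+```yaml
+variables:
+  # Must resolve to a path inside $CI_BUILDS_DIR.
+  GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME
+
+test:
+  script:
+    - pwd
+```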
+
+### Git clean flags
+
+[`GIT_CLEAN_FLAGS`](../../../../ci/runners/configure_runners.md#git-clean-flags) allows you to control
+whether or not you require the `git clean` command to be executed for each CI/CD
+job. By default, GitLab ensures that:
+
+- You have your worktree on the given SHA.
+- Your repository is clean.
+
+Setting [`GIT_CLEAN_FLAGS`](../../../../ci/runners/configure_runners.md#git-clean-flags) to `none`
+disables `git clean`. On very big repositories, this might be desired because
+`git clean` is disk I/O intensive. Alternatively, a value such as
+`GIT_CLEAN_FLAGS: -ffdx -e .build/` excludes some directories in the worktree
+from removal between subsequent runs, which can speed up
+incremental builds. This has the biggest effect if you reuse existing
+machines and have an existing worktree that you can reuse for builds.
+
+For exact parameters accepted by
+[`GIT_CLEAN_FLAGS`](../../../../ci/runners/configure_runners.md#git-clean-flags), see the documentation
+for [`git clean`](https://git-scm.com/docs/git-clean). The available parameters
+are dependent on the Git version.
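+
+In a pipeline definition, the flags from the example above could be set as follows. This is a
+sketch only: `.build/` stands for whatever directory your builds want to keep, and the job
+name `build` and its `make` script are hypothetical.
+
+```yaml
+variables:
+  # Clean the worktree, but keep the .build/ directory between runs.
+  # Set the value to "none" to skip `git clean` entirely.
+  GIT_CLEAN_FLAGS: -ffdx -e .build/
+
+build:
+  script:
+    - make
+```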
+
+### Git fetch extra flags
+
+[`GIT_FETCH_EXTRA_FLAGS`](../../../../ci/runners/configure_runners.md#git-fetch-extra-flags) allows you
+to modify `git fetch` behavior by passing extra flags.
+
+For example, if your project contains a large number of tags that your CI/CD jobs don't rely on,
+you could add [`--no-tags`](https://git-scm.com/docs/git-fetch#Documentation/git-fetch.txt---no-tags)
+to the extra flags to make your fetches faster and more compact.
+
+Even if your repository does _not_ contain a lot of
+tags, `--no-tags` can [make a big difference in some cases](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/746).
+If your CI/CD builds do not depend on Git tags, setting `--no-tags` is worth trying.
+
+For more information, see the [`GIT_FETCH_EXTRA_FLAGS` documentation](../../../../ci/runners/configure_runners.md#git-fetch-extra-flags).
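+
+A minimal sketch of passing `--no-tags` through this variable follows. Keeping `--prune` in
+the value is an assumption that setting the variable replaces the runner's default extra
+flags. Verify this against the linked documentation for your GitLab Runner version. The job
+name `test` is a placeholder.
+
+```yaml
+variables:
+  # Skip fetching tags. --prune is kept on the assumption that this
+  # variable overrides the runner's default extra fetch flags.
+  GIT_FETCH_EXTRA_FLAGS: --prune --no-tags
+
+test:
+  script:
+    - ls -al
+```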
+
+### Configure Gitaly negotiation timeouts
+
+You might experience a `fatal: the remote end hung up unexpectedly` error when attempting to fetch or archive:
+
+- Large repositories.
+- Many repositories in parallel.
+- The same large repository in parallel.
+
+You can attempt to mitigate this issue by increasing the default negotiation timeout values. For more information, see
+[Configure negotiation timeouts](../../../../administration/gitaly/configure_gitaly.md#configure-negotiation-timeouts).
+
+## Optimize your repository
+
+Another way to keep GitLab scalable with your monorepo is to optimize the
+repository itself.
+
+### Profiling repositories
+
+Large repositories generally experience performance issues in Git. Knowing why
+your repository is large can help you develop mitigation strategies to avoid
+performance problems.
+
+You can use [`git-sizer`](https://github.com/github/git-sizer) to get a snapshot
+of repository characteristics and discover problem aspects of your monorepo.
+
+For example:
+
+```shell
+Processing blobs: 1652370
+Processing trees: 3396199
+Processing commits: 722647
+Matching commits to trees: 722647
+Processing annotated tags: 534
+Processing references: 539
+| Name | Value | Level of concern |
+| ---------------------------- | --------- | ------------------------------ |
+| Overall repository size | | |
+| * Commits | | |
+| * Count | 723 k | * |
+| * Total size | 525 MiB | ** |
+| * Trees | | |
+| * Count | 3.40 M | ** |
+| * Total size | 9.00 GiB | **** |
+| * Total tree entries | 264 M | ***** |
+| * Blobs | | |
+| * Count | 1.65 M | * |
+| * Total size | 55.8 GiB | ***** |
+| * Annotated tags | | |
+| * Count | 534 | |
+| * References | | |
+| * Count | 539 | |
+| | | |
+| Biggest objects | | |
+| * Commits | | |
+| * Maximum size [1] | 72.7 KiB | * |
+| * Maximum parents [2] | 66 | ****** |
+| * Trees | | |
+| * Maximum entries [3] | 1.68 k | * |
+| * Blobs | | |
+| * Maximum size [4] | 13.5 MiB | * |
+| | | |
+| History structure | | |
+| * Maximum history depth | 136 k | |
+| * Maximum tag depth [5] | 1 | |
+| | | |
+| Biggest checkouts | | |
+| * Number of directories [6] | 4.38 k | ** |
+| * Maximum path depth [7] | 13 | * |
+| * Maximum path length [8] | 134 B | * |
+| * Number of files [9] | 62.3 k | * |
+| * Total size of files [9] | 747 MiB | |
+| * Number of symlinks [10] | 40 | |
+| * Number of submodules | 0 | |
+```
+
+In this example, a few items are raised with a high level of concern. See the
+following sections for information on solving:
+
+- A large number of references.
+- Large blobs.
+
+### Large number of references
+
+[References in Git](https://git-scm.com/book/en/v2/Git-Internals-Git-References)
+are branch and tag names that point to a particular commit. You can use the
+`git for-each-ref` command to list all references present in a repository. A large
+number of references in a repository can have a detrimental impact on the command's
+performance. To understand why, we need to understand how Git stores references
+and uses them.
+
+In general, Git stores all references as loose files in the `.git/refs` folder of
+the repository. As the number of references grows, the seek time to find a
+particular reference in the folder also increases. Therefore, every time Git has
+to parse a reference, there is an increased latency due to the added seek time
+of the file system.
+
+To resolve this issue, Git uses [pack-refs](https://git-scm.com/docs/git-pack-refs). In short, instead of storing each
+reference in a single file, Git creates a single `.git/packed-refs` file that
+contains all the references for that repository. This file reduces storage space
+while also increasing performance because seeking within a single file is faster
+than seeking a file within a directory. However, new and updated references
+are still written as loose files and are not added to the `packed-refs` file. To
+recreate the `packed-refs` file, run `git pack-refs`.
+
+Gitaly runs `git pack-refs` during [housekeeping](../../../../administration/housekeeping.md#heuristical-housekeeping)
+to move loose references into `packed-refs` files. While this is very beneficial
+for most repositories, write-heavy repositories still have the problem that:
+
+- Creating or updating references creates new loose files.
+- Deleting references involves modifying the existing `packed-refs` file
+ altogether to remove the existing reference.
+
+These problems still cause the same performance issues.
+
+In addition, fetches and clones from repositories include the transfer
+of missing objects from the server to the client. When there are numerous
+references, Git iterates over all references and walks the internal graph
+structure for each reference to find the missing objects to transfer to
+the client. Iteration and walking are CPU-intensive operations that increase
+the latency of these commands.
+
+In repositories with a lot of activity, this often causes a domino effect because
+every operation is slower and each operation stalls subsequent operations.
+
+#### Mitigation strategies
+
+To mitigate the effects of a large number of references in a monorepo:
+
+- Create an automated process for cleaning up old branches.
+- If certain references don't need to be visible to the client, hide them using the
+  [`transfer.hideRefs`](https://git-scm.com/docs/git-config#Documentation/git-config.txt-transferhideRefs)
+  configuration setting. Because Gitaly ignores any on-server Git configuration, you must change the Gitaly configuration
+  itself in `/etc/gitlab/gitlab.rb`:
+
+  ```ruby
+  gitaly['configuration'] = {
+    # ...
+    git: {
+      # ...
+      config: [
+        # ...
+        { key: "transfer.hideRefs", value: "refs/namespace_to_hide" },
+      ],
+    },
+  }
+  ```
+
+In Git 2.42.0 and later, different Git operations can skip over hidden references
+when doing an object graph walk.
+
+### Large blobs
+
+The presence of large files (called blobs in Git) can be problematic for Git
+because it does not handle large binary files efficiently. If the `git-sizer`
+output shows blobs over 10 MB, for instance, this probably means there is binary
+data in your repository.
+
+#### Use LFS for large blobs
+
+Store binary or blob files (for example, packages, audio, video, or graphics)
+as Large File Storage (LFS) objects. With LFS, the objects are stored externally, such as in Object
+Storage, which reduces the number and size of objects in the repository. Storing
+objects in external Object Storage can improve performance.
+
+For more information, refer to the [Git LFS documentation](../../../../topics/git/lfs/index.md).
+
+### Reference architectures
+
+Large repositories tend to be found in larger organizations with many users. The
+GitLab Quality Engineering and Support teams provide several [reference architectures](../../../../administration/reference_architectures/index.md) that
+are the recommended way to deploy GitLab at scale.
+
+In these types of setups, the GitLab environment used should match a reference
+architecture to improve performance.