1 files changed, 356 insertions, 0 deletions
diff --git a/doc/user/project/repository/monorepos/index.md b/doc/user/project/repository/monorepos/index.md
new file mode 100644
index 00000000000..144f46cd7d5
--- /dev/null
+++ b/doc/user/project/repository/monorepos/index.md
@@ -0,0 +1,356 @@
+---
+stage: Systems
+group: Gitaly
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
+---
+
+# Managing monorepos
+
+Monorepos have become a regular part of development team workflows. While they have many advantages, monorepos can present performance challenges
+when using them in GitLab. Therefore, you should know:
+
+- What repository characteristics can impact performance.
+- Some tools and steps to optimize monorepos.
+
+## Impact on performance
+
+Because GitLab is a Git-based system, it is subject to similar performance
+constraints as Git when it comes to large repositories that are gigabytes in
+size.
+
+Monorepos can be large for [many reasons](https://about.gitlab.com/blog/2022/09/06/speed-up-your-monorepo-workflow-in-git/#characteristics-of-monorepos).
+
+Large repositories pose a performance risk when used in GitLab, especially if a large monorepo receives many clones or pushes a day, which is common for them.
+
+Git itself has performance limitations when it comes to handling
+monorepos.
+
+Monorepos can also have a notable impact on hardware, in some cases hitting the limits of vertical scaling and of network or disk bandwidth.
+
+[Gitaly](https://gitlab.com/gitlab-org/gitaly) is our Git storage service built
+on top of [Git](https://git-scm.com/). This means that any limitations of
+Git are experienced in Gitaly, and in turn by end users of GitLab.
+
+## Optimize GitLab settings
+
+You should use as many of the following strategies as possible to minimize
+fetches on the Gitaly server.
+
+### Rationale
+
+The most resource intensive operation in Git is the
+[`git-pack-objects`](https://git-scm.com/docs/git-pack-objects) process. It is
+responsible for figuring out all of the commit history and files to send back to
+the client.
+
+The larger the repository (the more commits, files, branches, and tags it has),
+the more expensive this operation is. Both memory and CPU
+are heavily utilized during this operation.
+
+Most `git clone` or `git fetch` traffic (which results in starting a `git-pack-objects` process on the server) comes from automated
+continuous integration systems such as GitLab CI/CD or other CI/CD systems.
+If there is a high amount of such traffic, hitting a Gitaly server with many
+clones for a large repository is likely to put the server under significant
+strain.
+
+### Gitaly pack-objects cache
+
+Turn on the [Gitaly pack-objects cache](../../../../administration/gitaly/configure_gitaly.md#pack-objects-cache),
+which reduces the work that the server has to do for clones and fetches.
+
+#### Rationale
+
+The [pack-objects cache](../../../../administration/gitaly/configure_gitaly.md#pack-objects-cache)
+caches the data that the `git-pack-objects` process produces. This response
+is sent back to the Git client initiating the clone or fetch. If several
+fetches are requesting the same set of refs, Git on the Gitaly server doesn't have
+to regenerate the response data with each clone or fetch call, but instead serves
+that data from a cache that Gitaly maintains.
+
+This can help immensely in the presence of a high rate of clones for a single
+repository.
+
+For more information, see [Pack-objects cache](../../../../administration/gitaly/configure_gitaly.md#pack-objects-cache).
+
+### Reduce concurrent clones in CI/CD
+
+CI/CD loads tend to be concurrent because pipelines are scheduled at set times.
+As a result, the Git requests against the repositories can spike notably during
+these times and lead to reduced performance for both CI/CD and users.
+
+Reduce CI/CD pipeline concurrency by staggering pipelines to run at different times.
+For example, run one set at one time and another set several minutes
+later.
+
+### Shallow cloning
+
+In your CI/CD systems, set the
+[`--depth`](https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---depthltdepthgt)
+option in the `git clone` or `git fetch` call.
+
+GitLab and GitLab Runner perform a [shallow clone](../../../../ci/pipelines/settings.md#limit-the-number-of-changes-fetched-during-clone)
+by default.
+
+If possible, set the clone depth to a small number like 10. Shallow clones make Git request only
+the latest set of changes for a given branch, up to the desired number of commits.
+
+This significantly speeds up fetching of changes from Git repositories,
+especially if the repository has a very long backlog consisting of a number
+of big files, because it reduces the amount of data transferred.
+
+The following GitLab CI/CD pipeline configuration example sets `GIT_DEPTH` to 10:
+
+```yaml
+variables:
+  GIT_DEPTH: 10
+
+test:
+  script:
+    - ls -al
+```
+
+### Git strategy
+
+Use `git fetch` instead of `git clone` on CI/CD systems if it's possible to keep
+a working copy of the repository.
+
+By default, GitLab is configured to use the [`fetch` Git strategy](../../../../ci/runners/configure_runners.md#git-strategy),
+which is recommended for large repositories.
+
+#### Rationale
+
+`git clone` gets the entire repository from scratch, whereas `git fetch` only
+asks the server for references that do not already exist in the repository.
+Naturally, `git fetch` causes the server to do less work. `git-pack-objects`
+doesn't have to go through all branches and tags and roll everything up into a
+response that gets sent over. Instead, it only has to worry about a subset of
+references to pack up. This strategy also reduces the amount of data to transfer.
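+
+As a minimal sketch, assuming you want to pin the strategy explicitly in `.gitlab-ci.yml` rather than rely on the default, you can set the `GIT_STRATEGY` variable. The `test` job and its script are placeholders:
+
+```yaml
+variables:
+  GIT_STRATEGY: fetch  # reuse the existing working copy instead of cloning from scratch
+
+test:
+  script:
+    - ls -al
+```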
+
+### Git clone path
+
+[`GIT_CLONE_PATH`](../../../../ci/runners/configure_runners.md#custom-build-directories) allows you to
+control where you clone your repositories. This can have implications if you
+heavily use big repositories with a fork-based workflow.
+
+A fork, from the perspective of GitLab Runner, is stored as a separate repository
+with a separate worktree. That means that GitLab Runner cannot optimize the usage
+of worktrees, and you might have to set `GIT_CLONE_PATH` to tell GitLab Runner which directory to use.
+
+In such cases, ideally dedicate the GitLab Runner executor to the given project
+rather than sharing it across different projects, to make this
+process more efficient.
+
+The [`GIT_CLONE_PATH`](../../../../ci/runners/configure_runners.md#custom-build-directories) must be
+in the directory set in `$CI_BUILDS_DIR`. You can't pick an arbitrary path on disk.
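+
+The following sketch shows one possible layout, assuming custom build directories are enabled in the runner configuration. The per-concurrency subdirectory built from `$CI_CONCURRENT_ID` and `$CI_PROJECT_NAME` is an illustrative choice, not a requirement:
+
+```yaml
+variables:
+  # Must resolve to a location inside $CI_BUILDS_DIR.
+  GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME
+
+test:
+  script:
+    - pwd
+```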
+
+### Git clean flags
+
+[`GIT_CLEAN_FLAGS`](../../../../ci/runners/configure_runners.md#git-clean-flags) allows you to control
+whether or not you require the `git clean` command to be executed for each CI/CD
+job. By default, GitLab ensures that:
+
+- You have your worktree on the given SHA.
+- Your repository is clean.
+
+[`GIT_CLEAN_FLAGS`](../../../../ci/runners/configure_runners.md#git-clean-flags) is disabled when set
+to `none`. On very big repositories, this might be desired because
+`git clean` is disk I/O intensive. Setting `GIT_CLEAN_FLAGS: -ffdx -e .build/`
+(for example) lets you exclude some directories in the worktree from removal
+between subsequent runs, which can speed up incremental builds. This has the
+biggest effect if you reuse existing machines and have an existing worktree
+that you can reuse for builds.
+
+For exact parameters accepted by
+[`GIT_CLEAN_FLAGS`](../../../../ci/runners/configure_runners.md#git-clean-flags), see the documentation
+for [`git clean`](https://git-scm.com/docs/git-clean). The available parameters
+are dependent on the Git version.
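+
+As a sketch of the example flags mentioned above, the following configuration keeps a `.build/` directory between runs. The directory name and the `build` job are placeholders for whatever your project caches in the worktree:
+
+```yaml
+variables:
+  # Clean untracked and ignored files, but exclude .build/ from removal.
+  GIT_CLEAN_FLAGS: -ffdx -e .build/
+
+build:
+  script:
+    - ls -al
+```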
+
+### Git fetch extra flags
+
+[`GIT_FETCH_EXTRA_FLAGS`](../../../../ci/runners/configure_runners.md#git-fetch-extra-flags) allows you
+to modify `git fetch` behavior by passing extra flags.
+
+For example, if your project contains a large number of tags that your CI/CD jobs don't rely on,
+you could add [`--no-tags`](https://git-scm.com/docs/git-fetch#Documentation/git-fetch.txt---no-tags)
+to the extra flags to make your fetches faster and more compact.
+
+Even if your repository does _not_ contain a lot of
+tags, `--no-tags` can [make a big difference in some cases](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/746).
+If your CI/CD builds do not depend on Git tags, setting `--no-tags` is worth trying.
+
+For more information, see the [`GIT_FETCH_EXTRA_FLAGS` documentation](../../../../ci/runners/configure_runners.md#git-fetch-extra-flags).
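+
+A minimal sketch of skipping tags during fetches follows. It also passes `--prune`, on the assumption that you still want stale remote-tracking references removed when you override the default flags. The `test` job is a placeholder:
+
+```yaml
+variables:
+  # Assumes no job in this pipeline depends on Git tags.
+  GIT_FETCH_EXTRA_FLAGS: --prune --no-tags
+
+test:
+  script:
+    - ls -al
+```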
+
+### Configure Gitaly negotiation timeouts
+
+You might experience a `fatal: the remote end hung up unexpectedly` error when attempting to fetch or archive:
+
+- Large repositories.
+- Many repositories in parallel.
+- The same large repository in parallel.
+
+You can attempt to mitigate this issue by increasing the default negotiation timeout values. For more information, see
+[Configure negotiation timeouts](../../../../administration/gitaly/configure_gitaly.md#configure-negotiation-timeouts).
+
+## Optimize your repository
+
+Another way to keep GitLab scalable with your monorepo is to optimize the
+repository itself.
+
+### Profiling repositories
+
+Large repositories generally experience performance issues in Git. Knowing why
+your repository is large can help you develop mitigation strategies to avoid
+performance problems.
+
+You can use [`git-sizer`](https://github.com/github/git-sizer) to get a snapshot
+of repository characteristics and discover problem aspects of your monorepo.
+
+For example:
+
+```shell
+Processing blobs: 1652370
+Processing trees: 3396199
+Processing commits: 722647
+Matching commits to trees: 722647
+Processing annotated tags: 534
+Processing references: 539
+| Name                         | Value     | Level of concern               |
+| ---------------------------- | --------- | ------------------------------ |
+| Overall repository size      |           |                                |
+| * Commits                    |           |                                |
+|   * Count                    |   723 k   | *                              |
+|   * Total size               |   525 MiB | **                             |
+| * Trees                      |           |                                |
+|   * Count                    |  3.40 M   | **                             |
+|   * Total size               |  9.00 GiB | ****                           |
+|   * Total tree entries       |   264 M   | *****                          |
+| * Blobs                      |           |                                |
+|   * Count                    |  1.65 M   | *                              |
+|   * Total size               |  55.8 GiB | *****                          |
+| * Annotated tags             |           |                                |
+|   * Count                    |   534     |                                |
+| * References                 |           |                                |
+|   * Count                    |   539     |                                |
+|                              |           |                                |
+| Biggest objects              |           |                                |
+| * Commits                    |           |                                |
+|   * Maximum size         [1] |  72.7 KiB | *                              |
+|   * Maximum parents      [2] |    66     | ******                         |
+| * Trees                      |           |                                |
+|   * Maximum entries      [3] |  1.68 k   | *                              |
+| * Blobs                      |           |                                |
+|   * Maximum size         [4] |  13.5 MiB | *                              |
+|                              |           |                                |
+| History structure            |           |                                |
+| * Maximum history depth      |   136 k   |                                |
+| * Maximum tag depth      [5] |     1     |                                |
+|                              |           |                                |
+| Biggest checkouts            |           |                                |
+| * Number of directories  [6] |  4.38 k   | **                             |
+| * Maximum path depth     [7] |    13     | *                              |
+| * Maximum path length    [8] |   134 B   | *                              |
+| * Number of files        [9] |  62.3 k   | *                              |
+| * Total size of files    [9] |   747 MiB |                                |
+| * Number of symlinks    [10] |    40     |                                |
+| * Number of submodules       |     0     |                                |
+```
+
+In this example, a few items are raised with a high level of concern. See the
+following sections for information on solving:
+
+- A large number of references.
+- Large blobs.
+
+### Large number of references
+
+[References in Git](https://git-scm.com/book/en/v2/Git-Internals-Git-References)
+are branch and tag names that point to a particular commit. You can use the
+`git for-each-ref` command to list all references present in a repository. A large
+number of references in a repository can have a detrimental impact on the command's
+performance. To understand why, we need to understand how Git stores references
+and uses them.
+
+In general, Git stores all references as loose files in the `.git/refs` folder of
+the repository. As the number of references grows, the seek time to find a
+particular reference in the folder also increases. Therefore, every time Git has
+to parse a reference, there is an increased latency due to the added seek time
+of the file system.
+
+To resolve this issue, Git uses [pack-refs](https://git-scm.com/docs/git-pack-refs). In short, instead of storing each
+reference in its own file, Git creates a single `.git/packed-refs` file that
+contains all the references for that repository. This file reduces storage space
+while also increasing performance because seeking within a single file is faster
+than seeking a file within a directory. However, new and updated references
+are still written as loose files and are not added to the `packed-refs` file. To
+recreate the `packed-refs` file, run `git pack-refs`.
+
+Gitaly runs `git pack-refs` during [housekeeping](../../../../administration/housekeeping.md#heuristical-housekeeping)
+to move loose references into `packed-refs` files. While this is very beneficial
+for most repositories, write-heavy repositories still have the problem that:
+
+- Creating or updating references creates new loose files.
+- Deleting references involves rewriting the entire `packed-refs` file
+  to remove the existing reference.
+
+These problems still cause the same performance issues.
+
+In addition, fetches and clones from repositories include the transfer
+of missing objects from the server to the client. When there are numerous
+references, Git iterates over all references and walks the internal graph
+structure for each reference to find the missing objects to transfer to
+the client. Iteration and walking are CPU-intensive operations that increase
+the latency of these commands.
+
+In repositories with a lot of activity, this often causes a domino effect because
+every operation is slower and each operation stalls subsequent operations.
+
+#### Mitigation strategies
+
+To mitigate the effects of a large number of references in a monorepo:
+
+- Create an automated process for cleaning up old branches.
+- If certain references don't need to be visible to the client, hide them using the
+  [`transfer.hideRefs`](https://git-scm.com/docs/git-config#Documentation/git-config.txt-transferhideRefs)
+  configuration setting. Because Gitaly ignores any on-server Git configuration, you must change the Gitaly configuration
+  itself in `/etc/gitlab/gitlab.rb`:
+
+  ```ruby
+  gitaly['configuration'] = {
+    # ...
+    git: {
+      # ...
+      config: [
+        # ...
+        { key: "transfer.hideRefs", value: "refs/namespace_to_hide" },
+      ],
+    },
+  }
+  ```
+
+In Git 2.42.0 and later, different Git operations can skip over hidden references
+when doing an object graph walk.
+
+### Large blobs
+
+The presence of large files (called blobs in Git) can be problematic for Git
+because it does not handle large binary files efficiently. If there are blobs over
+10 MB, for instance, in the `git-sizer` output, this probably means there is binary
+data in your repository.
+
+#### Use LFS for large blobs
+
+Store binary or blob files (for example, packages, audio, video, or graphics)
+as Large File Storage (LFS) objects. With LFS, the objects are stored externally, such as in Object
+Storage, which reduces the number and size of objects in the repository. Storing
+objects in external Object Storage can improve performance.
+
+For more information, refer to the [Git LFS documentation](../../../../topics/git/lfs/index.md).
+
+### Reference architectures
+
+Large repositories tend to be found in larger organizations with many users. The
+GitLab Quality Engineering and Support teams provide several [reference architectures](../../../../administration/reference_architectures/index.md) that
+are the recommended way to deploy GitLab at scale.
+
+In these types of setups, the GitLab environment used should match a reference
+architecture to improve performance.