Welcome to mirror list, hosted at ThFree Co, Russian Federation.

gitlab.com/gitlab-org/gitlab-foss.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
Diffstat (limited to 'doc/administration/housekeeping.md')
-rw-r--r--doc/administration/housekeeping.md229
1 files changed, 190 insertions, 39 deletions
diff --git a/doc/administration/housekeeping.md b/doc/administration/housekeeping.md
index 15287b917e7..2d3e937e047 100644
--- a/doc/administration/housekeeping.md
+++ b/doc/administration/housekeeping.md
@@ -1,43 +1,144 @@
---
stage: Systems
-group: Distribution
+group: Gitaly
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---
# Housekeeping **(FREE SELF)**
-GitLab supports and automates housekeeping tasks in your current repository such as:
+GitLab supports and automates housekeeping tasks in Git repositories to ensure
+that they can be served as efficiently as possible. Housekeeping tasks include:
-- Compressing Git objects.
+- Compressing Git objects and revisions.
- Removing unreachable objects.
+- Removing stale data like lock files.
+- Maintaining data structures that improve performance.
+- Updating object pools to improve object deduplication across forks.
-## Configure housekeeping
+WARNING:
+Do not manually execute Git commands to perform housekeeping in Git
+repositories that are controlled by GitLab. Doing so may lead to corrupt
+repositories and data loss.
+
+## Housekeeping strategy
+
+Gitaly can perform housekeeping tasks in a Git repository in two ways:
+
+- [Eager housekeeping](#eager-housekeeping) executes specific housekeeping tasks
+ independent of the state a repository is in.
+- [Heuristical housekeeping](#heuristical-housekeeping) executes housekeeping
+ tasks based on a set of heuristics that determine what housekeeping tasks need
+ to be executed based on the repository state.
+
+### Eager housekeeping
+
+The "eager" housekeeping strategy executes housekeeping tasks in a repository
+independent of the repository state. This is the default strategy as used by the
+[manual trigger](#manual-trigger) and the [push-based trigger](#push-based-trigger).
+
+The eager housekeeping strategy is controlled by the GitLab application.
+Depending on the trigger that caused the housekeeping job to run, GitLab asks
+Gitaly to perform specific housekeeping tasks. Gitaly performs these tasks even
+if the repository is in an optimized state. As a result, this strategy can be
+inefficient in large repositories where performing the housekeeping tasks may
+be slow.
+
+### Heuristical housekeeping
+
+> - [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/2634) in GitLab 14.9 for the [manual trigger](#manual-trigger) and the [push-based trigger](#push-based-trigger) [with a flag](feature_flags.md) named `optimized_housekeeping`. Disabled by default.
+> - [Enabled on GitLab.com](https://gitlab.com/gitlab-org/gitlab/-/issues/353607) in GitLab 14.10.
+
+FLAG:
+On self-managed GitLab, by default this feature is not available for the [manual trigger](#manual-trigger) and the [push-based trigger](#push-based-trigger).
+To make it available, ask an administrator to [enable the feature flag](feature_flags.md) named `optimized_housekeeping`.
+
+The heuristical (or "opportunistic") housekeeping strategy analyzes the
+repository's state and executes housekeeping tasks only when it finds one or
+more data structures are insufficiently optimized. This is the strategy used by
+[scheduled housekeeping](#scheduled-housekeeping). It can optionally be enabled
+for the [manual trigger](#manual-trigger) and the [push-based trigger](#push-based-trigger)
+by enabling the `optimized_housekeeping` feature flag.
+
+Heuristical housekeeping uses the following information to decide on the tasks
+it needs to run:
+
+- The number of loose and stale objects.
+- The number of packfiles that contain already-compressed objects.
+- The number of loose references.
+- The presence of a commit-graph.
+
+The decision whether any of the analyzed data structures need to be optimized is
+based on the size of the repository:
+
+- Objects are repacked frequently the bigger the total size of all objects.
+- References are repacked less frequently the more references there are in
+ total.
+
+Gitaly does this to offset the fact that optimizing those data structures takes
+more time the bigger they get. It is especially important in large
+monorepositories (which receive a lot of traffic) to avoid optimizing them too
+frequently.
+
+## Running housekeeping tasks
+
+There are different ways in which GitLab runs housekeeping tasks:
-GitLab automatically runs `git gc` and `git repack` on repositories after Git pushes:
+- A project's administrator can [manually trigger](#manual-trigger) repository
+ housekeeping tasks.
+- GitLab can automatically schedule housekeeping tasks [after a number of Git pushes](#push-based-trigger).
+- GitLab can [schedule a job](#scheduled-housekeeping) that runs housekeeping
+ tasks for all repositories in a configurable time frame.
+
+### Manual trigger
+
+Administrators of repositories can manually trigger housekeeping tasks in a
+repository. In general this is not required as GitLab knows to automatically run
+housekeeping tasks. The manual trigger can be useful when either:
+
+- A repository is known to require housekeeping.
+- Automated push-based scheduling of housekeeping tasks has been disabled.
+
+To trigger housekeeping tasks manually:
+
+1. On the top bar, select **Main menu > Projects** and find your project.
+1. On the left sidebar, select **Settings > General**.
+1. Expand **Advanced**.
+1. Select **Run housekeeping**.
+
+This starts an asynchronous background worker for the project's repository. The
+background worker executes `git gc`, which performs a number of optimizations.
+
+### Push-based trigger
+
+GitLab automatically runs repository housekeeping tasks after a configured
+number of pushes:
- [`git gc`](https://git-scm.com/docs/git-gc) runs a number of housekeeping tasks such as:
- Compressing Git objects to reduce disk space and increase performance.
- Removing unreachable objects that may have been created from changes to the repository, like force-overwriting branches.
- [`git repack`](https://git-scm.com/docs/git-repack) either:
- - Runs an incremental repack, according to a [configured period](#housekeeping-options). This
+ - Runs an incremental repack, according to a [configured period](#configure-push-based-maintenance). This
packs all loose objects into a new packfile and prunes the now-redundant loose objects.
- - Runs a full repack, according to a [configured period](#housekeeping-options). This repacks all
+ - Runs a full repack, according to a [configured period](#configure-push-based-maintenance). This repacks all
packfiles and loose objects into a single new packfile, and deletes the old now-redundant loose
objects and packfiles. It also optionally creates bitmaps for the new packfile.
+- [`git pack-refs`](https://git-scm.com/docs/git-pack-refs) compresses references
+ stored as loose files into a single file.
+
+#### Configure push-based maintenance
-You can change how often this happens or turn it off:
+You can change how often these tasks run when pushes occur, or you can turn
+them off entirely:
1. On the top bar, select **Main menu > Admin**.
1. On the left sidebar, select **Settings > Repository**.
1. Expand **Repository maintenance**.
-1. In the **Housekeeping** section, configure the [housekeeping options](#housekeeping-options).
+1. In the **Housekeeping** section, configure the housekeeping options.
1. Select **Save changes**.
-### Housekeeping options
-
The following housekeeping options are available:
-- **Enable automatic repository housekeeping**: Regularly run `git repack` and `git gc`. If you
+- **Enable automatic repository housekeeping**: Regularly run housekeeping tasks. If you
keep this setting disabled for a long time, Git repository access on your GitLab server becomes
slower and your repositories use more disk space.
- **Incremental repack period**: Number of Git pushes after which an incremental `git repack` is
@@ -60,30 +161,80 @@ Housekeeping also [removes unreferenced LFS files](../raketasks/cleanup.md#remov
from your project on the same schedule as the `git gc` operation, freeing up storage space for your
project.
-WARNING:
-Running `git gc` or `git repack` commands manually in the
-[repository folder](repository_storage_types.md#from-project-name-to-hashed-path)
-is discouraged. If the created pack files get incorrect access rights (that is, owned by the wrong user)
-browsing to the project page might result in `404` and `503` errors.
-
-## How housekeeping handles pool repositories
-
-Housekeeping for pool repositories is handled differently from standard repositories. It is
-ultimately performed by the Gitaly RPC `FetchIntoObjectPool`.
-
-This is the current call stack by which it is invoked:
-
-1. `Repositories::HousekeepingService#execute_gitlab_shell_gc`
-1. `Projects::GitGarbageCollectWorker#perform`
-1. `Projects::GitDeduplicationService#fetch_from_source`
-1. `ObjectPool#fetch`
-1. `ObjectPoolService#fetch`
-1. `Gitaly::FetchIntoObjectPoolRequest`
-
-To manually invoke it from a [Rails console](operations/rails_console.md) if needed, you can call
-`project.pool_repository.object_pool.fetch`. This is a potentially long-running task, though Gitaly
-times out in about 8 hours.
-
-WARNING:
-Do not run `git prune` or `git gc` in pool repositories! This can cause data loss in "real"
-repositories that depend on the pool in question.
+### Scheduled housekeeping
+
+While GitLab automatically performs housekeeping tasks based on the number of
+pushes, it does not maintain repositories that don't receive any pushes at all.
+As a result, inactive repositories or repositories that are only getting read
+requests may not benefit from improvements in the repository housekeeping
+strategy.
+
+Administrators can enable a background job that performs housekeeping in all
+repositories at a customizable interval to remedy this situation. This
+background job processes all repositories hosted by a Gitaly node in a random
+order and eagerly performs housekeeping tasks on them. The Gitaly node will stop
+processing repositories if it takes longer than the configured interval.
+
+#### Configure scheduled housekeeping
+
+Background maintenance of Git repositories is configured in Gitaly. By default,
+Gitaly performs background repository maintenance every day at 12:00 noon for a
+duration of 10 minutes.
+
+You can change this default in Gitaly configuration. The following snippet
+enables daily background repository maintenance starting at 23:00 for 1 hour
+for the `default` storage:
+
+```toml
+[daily_maintenance]
+start_hour = 23
+start_minute = 00
+duration = 1h
+storages = ["default"]
+```
+
+Use the following snippet to completely disable background repository
+maintenance:
+
+```toml
+[daily_maintenance]
+disabled = true
+```
+
+## Object pool repositories
+
+Object pool repositories are used by GitLab to deduplicate objects across forks
+of a repository. When creating the first fork, we:
+
+1. Create an object pool repository that contains all objects of the repository
+ that is about to be forked.
+1. Link the repository to this new object pool via Git's altenates mechanism.
+1. Repack the repository so that it uses objects from the object pool. It thus
+ can drop its own copy of the objects.
+
+Any forks of this repository can now link against the object pool and thus only
+have to keep objects that diverge from the primary repository.
+
+GitLab needs to perform special housekeeping operations in object pools:
+
+- Gitaly cannot ever delete unreachable objects from object pools because they
+ might be used by any of the forks that are connected to it.
+- Gitaly must keep all objects reachable due to the same reason. Object pools
+ thus maintain references to unreachable "dangling" objects so that they don't
+ ever get deleted.
+- GitLab must update object pools regularly to pull in new objects that have
+ been added in the primary repository. Otherwise, an object pool will become
+ increasingly inefficient at deduplicating objects.
+
+These housekeeping operations are performed by the specialized
+`FetchIntoObjectPool` RPC that handles all of these special tasks while also
+executing the regular housekeeping tasks we execute for normal Git
+repositories.
+
+Object pools are getting optimized automatically whenever the primary member is
+getting garbage collected. Therefore, the cadence can be configured using the
+same Git GC period in that project.
+
+If you need to manually invoke the RPC from a [Rails console](operations/rails_console.md),
+you can call `project.pool_repository.object_pool.fetch`. This is a potentially
+long-running task, though Gitaly times out after about 8 hours.