Welcome to mirror list, hosted at ThFree Co, Russian Federation.

housekeeping.md « administration « doc - gitlab.com/gitlab-org/gitlab-foss.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
blob: bf28e169d78f2ba8260ef1be2230ba7f422578ed (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
---
stage: Systems
group: Gitaly
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---

# Housekeeping **(FREE SELF)**

GitLab supports and automates housekeeping tasks in Git repositories to ensure
that they can be served as efficiently as possible. Housekeeping tasks include:

- Compressing Git objects and revisions.
- Removing unreachable objects.
- Removing stale data like lock files.
- Maintaining data structures that improve performance.
- Updating object pools to improve object deduplication across forks.

WARNING:
Do not manually execute Git commands to perform housekeeping in Git
repositories that are controlled by GitLab. Doing so may lead to corrupt
repositories and data loss.

## Housekeeping strategy

Gitaly can perform housekeeping tasks in a Git repository in two ways:

- [Eager housekeeping](#eager-housekeeping) executes specific housekeeping tasks
  independent of the state a repository is in.
- [Heuristical housekeeping](#heuristical-housekeeping) executes housekeeping
  tasks based on a set of heuristics that determine what housekeeping tasks need
  to be executed based on the repository state.

### Eager housekeeping

The "eager" housekeeping strategy executes housekeeping tasks in a repository
independent of the repository state. This is the default strategy as used by the
[manual trigger](#manual-trigger) and the push-based trigger.

The eager housekeeping strategy is controlled by the GitLab application.
Depending on the trigger that caused the housekeeping job to run, GitLab asks
Gitaly to perform specific housekeeping tasks. Gitaly performs these tasks even
if the repository is in an optimized state. As a result, this strategy can be
inefficient in large repositories where performing the housekeeping tasks may
be slow.

### Heuristical housekeeping

> - [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/2634) in GitLab 14.9 for the [manual trigger](#manual-trigger) and the push-based trigger [with a flag](feature_flags.md) named `optimized_housekeeping`. Enabled by default.
> - [Enabled on GitLab.com](https://gitlab.com/gitlab-org/gitlab/-/issues/353607) in GitLab 14.10.
> - [Generally available](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/107661) in GitLab 15.8. Feature flag `optimized_housekeeping` removed.

The heuristical (or "opportunistic") housekeeping strategy analyzes the
repository's state and executes housekeeping tasks only when it finds one or
more data structures are insufficiently optimized. This is the strategy used by
[scheduled housekeeping](#scheduled-housekeeping).

Heuristical housekeeping uses the following information to decide on the tasks
it needs to run:

- The number of loose and stale objects.
- The number of packfiles that contain already-compressed objects.
- The number of loose references.
- The presence of a commit-graph.

The decision whether any of the analyzed data structures need to be optimized is
based on the size of the repository:

- Objects are repacked frequently the bigger the total size of all objects.
- References are repacked less frequently the more references there are in
  total.

Gitaly does this to offset the fact that optimizing those data structures takes
more time the bigger they get. It is especially important in large
monorepos (which receive a lot of traffic) to avoid optimizing them too
frequently.

You can change how often Gitaly is asked to optimize a repository.

1. On the left sidebar, at the bottom, select **Admin Area**.
1. On the left sidebar, select **Settings > Repository**.
1. Expand **Repository maintenance**.
1. In the **Housekeeping** section, configure the housekeeping options.
1. Select **Save changes**.

- **Enable automatic repository housekeeping**: Regularly ask Gitaly to run repository optimization. If you
  keep this setting disabled for a long time, Git repository access on your GitLab server becomes
  slower and your repositories use more disk space.
- **Optimize repository period**: Number of Git pushes after which Gitaly is asked to optimize a repository.

## Running housekeeping tasks

There are different ways in which GitLab runs housekeeping tasks:

- A project's administrator can [manually trigger](#manual-trigger) repository
  housekeeping tasks.
- GitLab can automatically schedule housekeeping tasks after a number of Git pushes.
- GitLab can [schedule a job](#scheduled-housekeeping) that runs housekeeping
  tasks for all repositories in a configurable time frame.

### Manual trigger

Administrators of repositories can manually trigger housekeeping tasks in a
repository. In general this is not required as GitLab knows to automatically run
housekeeping tasks. The manual trigger can be useful when either:

- A repository is known to require housekeeping.
- Automated push-based scheduling of housekeeping tasks has been disabled.

To trigger housekeeping tasks manually:

1. On the left sidebar, select **Search or go to** and find your project.
1. On the left sidebar, select **Settings > General**.
1. Expand **Advanced**.
1. Select **Run housekeeping**.

This starts an asynchronous background worker for the project's repository. The
background worker asks Gitaly to perform a number of optimizations.

Housekeeping also [removes unreferenced LFS files](../raketasks/cleanup.md#remove-unreferenced-lfs-files)
from your project every `200` push, freeing up storage space for your project.

### Prune unreachable objects

Unreachable objects are pruned as part of scheduled housekeeping. However,
you can trigger manual pruning as well. An example: removing commits that contain sensitive
information. Triggering housekeeping prunes unreachable objects with a grace period of
two weeks. When you manually trigger the pruning of unreachable objects, the grace period
is reduced to 30 minutes.

WARNING:
If a concurrent process (like `git push`) has created an object but hasn't created
a reference to the object yet, your repository can become corrupted if a reference
to the object is added after the object is deleted. The grace period exists to
reduce the likelihood of such race conditions.

To trigger a manual prune of unreachable objects:

1. On the left sidebar, select **Search or go to** and find your project.
1. On the left sidebar, select **Settings > General**.
1. Expand **Advanced**.
1. Select **Run housekeeping**.
1. Wait 30 minutes for the operation to complete.
1. Return to the page where you selected **Run housekeeping**, and select **Prune unreachable objects**.

### Scheduled housekeeping

While GitLab automatically performs housekeeping tasks based on the number of
pushes, it does not maintain repositories that don't receive any pushes at all.
As a result, inactive repositories or repositories that are only getting read
requests may not benefit from improvements in the repository housekeeping
strategy.

Administrators can enable a background job that performs housekeeping in all
repositories at a customizable interval to remedy this situation. This
background job processes all repositories hosted by a Gitaly node in a random
order and eagerly performs housekeeping tasks on them. The Gitaly node stops
processing repositories if it takes longer than the configured interval.

#### Configure scheduled housekeeping

Background maintenance of Git repositories is configured in Gitaly. By default,
Gitaly performs background repository maintenance every day at 12:00 noon for a
duration of 10 minutes.

You can change this default in Gitaly configuration.

For environments with Gitaly Cluster, the scheduled housekeeping start time can be
staggered across Gitaly nodes so the scheduled housekeeping is not running
simultaneously on multiple nodes.

If a scheduled housekeeping run reaches the `duration` specified, the running tasks are
gracefully cancelled. On subsequent scheduled housekeeping runs, Gitaly randomly shuffles
the repository list to process.

The following snippet enables daily background repository maintenance starting at
23:00 for 1 hour for the `default` storage:

::Tabs

:::TabTitle Self-compiled (source)

```toml
[daily_maintenance]
start_hour = 23
start_minute = 00
duration = 1h
storages = ["default"]
```

Use the following snippet to completely disable background repository
maintenance:

```toml
[daily_maintenance]
disabled = true
```

:::TabTitle Linux package (Omnibus)

```ruby
gitaly['configuration'] = {
  daily_maintenance: {
    disabled: false,
    start_hour: 23,
    start_minute: 00,
    duration: '1h',
    storages: ['default'],
  },
}
```

Use the following snippet to completely disable background repository
maintenance:

```ruby
gitaly['configuration'] = {
  daily_maintenance: {
    disabled: true,
  },
}
```

::EndTabs

When the scheduled housekeeping is executed, you can see the following entries in
your [Gitaly log](logs/index.md#gitaly-logs):

```json
# When the scheduled housekeeping starts
{"level":"info","msg":"maintenance: daily scheduled","pid":197260,"scheduled":"2023-09-27T13:10:00+13:00","time":"2023-09-27T00:08:31.624Z"}

# When the scheduled housekeeping completes
{"actual_duration":321181874818,"error":null,"level":"info","max_duration":"1h0m0s","msg":"maintenance: daily completed","pid":197260,"time":"2023-09-27T00:15:21.182Z"}
```

The `actual_duration` (in nanoseconds) indicates how long the scheduled maintenance
took to execute. In the example above, the scheduled housekeeping completed
in just over 5 minutes.

## Object pool repositories

Object pool repositories are used by GitLab to deduplicate objects across forks
of a repository. When creating the first fork, we:

1. Create an object pool repository that contains all objects of the repository
   that is about to be forked.
1. Link the repository to this new object pool via the alternates mechanism of Git.
1. Repack the repository so that it uses objects from the object pool. It thus
   can drop its own copy of the objects.

Any forks of this repository can now link against the object pool and thus only
have to keep objects that diverge from the primary repository.

GitLab needs to perform special housekeeping operations in object pools:

- Gitaly cannot ever delete unreachable objects from object pools because they
  might be used by any of the forks that are connected to it.
- Gitaly must keep all objects reachable due to the same reason. Object pools
  thus maintain references to unreachable "dangling" objects so that they don't
  ever get deleted.
- GitLab must update object pools regularly to pull in new objects that have
  been added in the primary repository. Otherwise, an object pool becomes
  increasingly inefficient at deduplicating objects.

These housekeeping operations are performed by the specialized
`FetchIntoObjectPool` RPC that handles all of these special tasks while also
executing the regular housekeeping tasks we execute for standard Git
repositories.

Object pools are getting optimized automatically whenever the primary member is
getting garbage collected. Therefore, the cadence can be configured using the
same Git GC period in that project.

If you need to manually invoke the RPC from a [Rails console](operations/rails_console.md),
you can call `project.pool_repository.object_pool.fetch`. This is a potentially
long-running task, though Gitaly times out after about 8 hours.