Age | Commit message (Collapse) | Author |
|
We'd like to be able to tell if a command was run in a cgroup based on
the log entry. This change adds a field in the Command struct so
whenever the Cmd finishes, a cgroup field gets logged.
Changelog: added
|
|
operations: Fix missing votes on squashed commits
Closes #4109
See merge request gitlab-org/gitaly!4417
|
|
housekeeping: Track total number of optimization calls
See merge request gitlab-org/gitaly!4418
|
|
The UserSquash RPC computes a squashed commit by first rebasing a range
of commits on top of another commit, and then collapsing these into a
single commit. This RPC is notably different from almost all of our
other RPCs because it never writes any references to disk, and neither
does it ever execute any access checks as the other User RPCs do. This
design is quite weird:
- There is a known race window where the new objects are not
referenced, so they could be pruned by maintenance calls.
- We accept objects into the repository which may not be sanctioned
by our access checks.
- Replication jobs cannot replicate the squashed commit because they
aren't referenced.
- We never perform transactional voting because no references are
updated.
Together these problems show that the RPC call is misdesigned, but
fixing this design would require a bigger refactoring to make it work
alright in Rails.
In this commit we fix the last bullet point though: because this RPC
never performs transactional voting, we're always creating replication
jobs after the call finishes because Praefect didn't observe any
transactional votes. As mentioned though, we don't have any reference to
vote on, so the best we can do is vote on the commit ID of the newly
written squash commit to make sure that it is the same across nodes.
This commit does so by introducing a quarantine directory that is used
to stage all new objects first before they're migrated to the final
repository. We then vote on the object ID of the staged squash commit.
Only if this vote is successful will we successfully commit the object
to disk.
This commit thus solves two things: first it fixes the missing
transactional voting. And second it causes us to discard all objects in
case the RPC errors.
Changelog: fixed
|
|
The housekeeping manager is hosting a latency metric which tracks how
long specific tasks of the housekeeping manager take. But while latency
metrics should typically use the GRPC latency buckets as configured in
the Gitaly configuration, we accidentally didn't in the context of the
housekeeping manager.
Fix this bug by passing in the Prometheus configuration.
|
|
While we already track how often we perform specific optimization tasks,
we do not have any numbers around how often we perform optimizations in
total. As a consequence it's harder than necessary to compute the
relative frequency of each task.
Fix this by adding another counter which tracks the total count of times
optimizations were performed.
|
|
Add two new helpers `helper.ErrAborted()` and `helper.ErrAbortedf()` to
generate gRPC errors with an `codes.Aborted` error code.
|
|
Update Danger config to use new type & subtype labels
See merge request gitlab-org/gitaly!4387
|
|
Fix error handling in GetTreeEntries
See merge request gitlab-org/gitaly!4414
|
|
|
|
housekeeping: Improve visibility into performed maintenance tasks
Closes #4103
See merge request gitlab-org/gitaly!4406
|
|
[ci skip]
|
|
[ci skip]
|
|
Makefile: Add more patches to speed up git-fetch(1) to v2.35.1.gl1
Closes #4104
See merge request gitlab-org/gitaly!4408
|
|
CreateRepositoryFromURL switch to mirror flag
See merge request gitlab-org/gitaly!4378
|
|
Changelog: added
|
|
Signed-off-by: Rémy Coutable <remy@rymai.me>
|
|
This rule was introduced in
https://gitlab.com/gitlab-org/ruby/gems/gitlab-dangerfiles/-/releases/v2.1.0
Signed-off-by: Rémy Coutable <remy@rymai.me>
|
|
Signed-off-by: Rémy Coutable <remy@rymai.me>
|
|
The rule was introduced in
https://gitlab.com/gitlab-org/ruby/gems/gitlab-dangerfiles/-/releases/v2.9.0
Signed-off-by: Rémy Coutable <remy@rymai.me>
|
|
Use features introduced in
https://gitlab.com/gitlab-org/ruby/gems/gitlab-dangerfiles/-/releases/v2.6.0
Signed-off-by: Rémy Coutable <remy@rymai.me>
|
|
Signed-off-by: Rémy Coutable <remy@rymai.me>
|
|
housekeeping: Don't prune recent objects
See merge request gitlab-org/gitaly!4410
|
|
When running git-gc(1), Git automatically executes git-prune(1) with a
grace period of two weeks: only when objects are older than that grace
period will they be pruned. This is really important to fix a race
condition with concurrent commands, which all will first write new
objects into the repository before making them reachable via a ref
update. This means that there is a brief period between writing the new
objects and making them reachable. If objects would be pruned without a
grace period, it could kick in right before updating the references and
thus cause repository corruption. The two-weeks grace period bridges
this gap by giving enough head room to finish the transaction.
With the recent introduction of the heuristics-based OptimizeRepository
RPC we have also started to use git-prune(1). The assumption was that
the two-weeks grace period was the default of git-prune(1), but as it
turns out it is only the default of git-gc(1). So because we are now
running git-prune(1) without any parameters, we accidentally started to
delete all unreachable objects, even if they had just been created.
Curiously, this bug only started to surface as we were migrating Rails
to use OptimizeRepository. Seemingly, because we previously only were
executing the RPC as part of our nightly job, the logic didn't trigger
in the right moment and thus never caused any (known) problems. Still,
this is a serious bug.
Fix this issue by passing `--expire=two.weeks.ago` to git-prune(1).
Changelog: fixed
|
|
repository: Fix indeterministic voting when creating new repos
Closes #4100
See merge request gitlab-org/gitaly!4402
|
|
into 'master'
operations: Fix wrong error code when UserSquash conflicts
See merge request gitlab-org/gitaly!4403
|
|
The rollout of Git v2.35.1.gl1 is currently blocked in production
because we have discovered various repositories where broken references
cause errors with the new version. So right now, the new Git version is
still disabled in all deployments, also because the feature flag is
default-disabled, as well.
This blockade gives us a last chance to sneak in some patches for Git
v2.35.1.gl1 which have landed in `next` meanwhile. Missing this chance
would mean that we'd have to wait a few more releases until we can land
another revision, or otherwise we'd be running with three different Git
versions in production at the same time: the current Git v2.33.1.gl3,
then Git v2.35.1.gl1, and finally the new version we'd be about to
introduce. So let's take this chance and slightly bend the rollout
process of new Git versions.
Apply commits from 60aae8731c (Merge branch 'ps/fetch-mirror-optim' into
next, 2022-03-08) into Git v2.35.1.gl1. These patches optimize fetches
by:
- Patch 1/5 starts to look up up "want" lines via the commit-graph,
resulting in a 7% speedup.
- Patch 2/5 avoids repeated lookups of commits that were only
required when writing to FETCH_HEAD, providing a 8% speedup.
- Patches 3-5/5 start to skip reading packed-refs when searching for
symbolic refs in the context of a fetch. This provides a 13%
speedup.
All benchmarks have been performed in www-gitlab-com.
Changelog: performance
|
|
With OptimizeRepository becoming our new RPC which performs all
repository maintenance in one place it's harder to tell what exactly is
going on without proper metrics. One bit of information that gets lost
is whether repacks are incremental or full repacks, which is quite an
important distinction due to the impact on latency.
Expose this information both via logs and via Prometheus to keep better
track of it. While at it, alos exxpose information about whether we are
writing bitmaps or not.
Changelog: added
|
|
Expose the number of stale files we have pruned via a Prometheus metric
so that it's easier to see what kinds of garbage we frequently need to
clean up.
Changelog: added
|
|
We expose a Prometheus counter that is counting how many empty
directories we have pruned. This is a broader concept in our
housekeeping code, where we also prune other kinds of stale files.
Generalize the counter into a counter vector such that we can reuse the
same counter for all the different types of data we prune. While this
breaks the metric in case it was used anywhere, there are no references
to this counter across the complete GitLab group. Furthermore, we
haven't ever guaranteed backwards compatibility for metrics anyway.
Changelog: changed
|
|
While we have a central manager component which is supposed to hold all
state related to housekeeping, we missed to migrate one of our metrics
into it.
Migrate it to get rid of one more global variable.
|
|
Because we may be deleting files which cause reference directories to
become empty, pruning of empty reference directories needs to happen
last in our housekeeping tasks. Because of this we also don't log info
about pruned references in the log message which reports all the other
removals.
Refactor the code to return the number of pruned empty reference dirs
from `removeRefEmptyDirs()` such that we can log them in the calling
function with all the other cleanups.
|
|
The test which verifies that we correctly prune specific files is using
a subfunction to execute subtests, which makes the test hard to extend.
Refactor it by inlining the subtests.
|
|
In 889450266 (ci: Run tests as unprivileged user, 2022-01-13) we have
converted tests to run as unprivileged user. Back then we forgot to also
adjust the Coverage job though, which is still running as a privileged
user.
Convert the job to also run tests unprivileged. This fixes an upcoming
test failure we're about to introduce where housekeeping tasks remove a
file that it shouldn't be able to because of a lack of permissions.
|
|
When creating repositories we use transactional voting to determine that
the repositories have been created the same on all nodes part of the
transaction. This voting happens after we have seeded the repository,
and the vote is computed by walking through the repository's directory
and hashing all its files. We need to be careful though to skip files
which we know to be indeterministic:
- FETCH_HEAD may contain URLs which are different for each of the
nodes.
- Object packfiles contained in the object database are not
deterministic, mostly because it may use multiple threads to
compute deltas.
Luckily, we do not have to rely on either of both types of files in
order to ensure that the user-visible state of the repository is the
same, so we can indeed just skip them.
While we already have the logic to skip these files, this logic didn't
work alright because we embarassingly forgot to actually return
`fs.SkipDir` in case we see the object directory. So even though we
thought we skipped these files, in reality we didn't.
This bug has been manifesting in production in form of CreateFork, which
regularly fails to reach quorum at random on a subset of nodes. The root
cause here is that we use git-clone(1) to seed repository contents of
the fork, which triggers exactly the case of indeterministic packfiles
noted above. So any successful CreateFork RPC call really only succeeded
by pure luck.
Fix this issue by correctly skipping over "object" directories. While at
it, fix how we skip over FETCH_HEAD by returning `nil`: it's a file and
not a directory, so it doesn't make much sense to return `fs.SkipDir`.
Changelog: fixed
|
|
repository: Add updateHeadFromBundle in CreateRepositoryFromBundle
Closes #4086
See merge request gitlab-org/gitaly!4401
|
|
|
|
[ci skip]
|
|
Extend invalid metadata deletion logic to repos existin on target
Closes #4083
See merge request gitlab-org/gitaly!4396
|
|
With the new improved error hadnling in UserSquash we're now returning
errors in some cases where we previously didn't. One of those cases is
when the rebase performed during the squash results in a merge conflict.
While it is correct to return an error in this case, we're using an
Internal error code for this case, which indicates that Gitaly is to
blame instead of the parameters which have been passed by the user.
Fix the error code to instead be FailedPrecondition. This error code is
special-cased by our monitoring infrastructure to not raise any alerts.
Note that this change is only fixing issues with monitoring: Rails
handles the error alright by inspecting the error details instead of the
error code.
Changelog: fixed
|
|
[ci skip]
|
|
proto: Add structured error types for UserRebaseCofirmable
See merge request gitlab-org/gitaly!4382
|
|
This commit introduces these changes by creating a new
UserRebaseConfirmableError Protobuf message which contains all potential
structured errors we want to return from the UserSquash RPC.
Changelog: added
|
|
Disable implicit pool creation on link behind a feature flag
See merge request gitlab-org/gitaly!4397
|
|
[ci skip]
|
|
doc: Document supported Git execution envinronments
See merge request gitlab-org/gitaly!4392
|
|
operations: Implement structured errors for UserSquash
See merge request gitlab-org/gitaly!4374
|
|
Document the different ways to access Git installations supported by
Gitaly. Most importantly, this also documents the way our new bundled
Git binaries work and why they were introduced.
|
|
[ci skip]
|
|
Add squash parameter to git2go merge
See merge request gitlab-org/gitaly!4241
|