Age | Commit message (Collapse) | Author |
|
Since the maintenance routing feature flag has been running in
production without issue, we can now remove it.
Changelog: changed
|
|
Handle DeleteObjectPool calls in Praefect
Closes #3742 and #4078
See merge request gitlab-org/gitaly!4395
|
|
Remove implicit pool creation on link behavior
See merge request gitlab-org/gitaly!4455
|
|
featureflag: Remove TransactionalSymbolicRefUpdates featureflag
See merge request gitlab-org/gitaly!4467
|
|
Add RateLimiting
Closes #4026
See merge request gitlab-org/gitaly!4427
|
|
Now that we are adding a second limit handler, adjust the code to allow
for multiple limit handlers to be passed into a server invocation.
|
|
This feature flag has been set to default on, and deleted from
production. There have been no observable issues, so it's now safe to
fully remove the feature flag.
Changelog: changed
|
|
When pushing commits to a repository, access checks are run. In order to
use the quarantine directory, we need a way to filter out revisions that
a repository already has in the case that a packfile sends over objects
that already exist on the server. In this case, we don't need to check
the access.
Add an RPC that when given a list of revisions, returns the ones that
already exist in the repository, and the ones that do not exist in the
repository.
Changelog: added
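A minimal sketch of the revision-filtering logic such an RPC provides; the function name and the `exists` callback are illustrative assumptions, not Gitaly's actual API:

```go
package main

import "fmt"

// partitionRevisions splits the given revisions into those that already
// exist in the repository and those that don't. The exists callback stands
// in for a real object-existence check (e.g. via git-cat-file(1) with
// --batch-check).
func partitionRevisions(revisions []string, exists func(string) bool) (present, missing []string) {
	for _, revision := range revisions {
		if exists(revision) {
			present = append(present, revision)
		} else {
			missing = append(missing, revision)
		}
	}
	return present, missing
}

func main() {
	known := map[string]bool{"refs/heads/main": true}
	present, missing := partitionRevisions(
		[]string{"refs/heads/main", "refs/heads/feature"},
		func(revision string) bool { return known[revision] },
	)
	fmt.Println(present, missing) // [refs/heads/main] [refs/heads/feature]
}
```

With such a partition, access checks only need to run for the missing set.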
|
|
A future commit will add a new middleware that will limit based on the
rate rather than concurrent calls. There is a good amount of logic
currently used by the concurrency limiter that can be reused since a
rate limiter is also operating on incoming requests based on RPC name.
To make it easier to add this new limiter type in the future, refactor
the code by introducing abstractions that another type of limiter can
plug into.
|
|
Git supports a wide array of formats for this range option, but for now
we restrict the formats we support to the very simplest, which is of
the form "1,100".
We can use this to implement limits on how many lines of the blame we
retrieve GitLab-side, which should help to reduce the resource usage
of this particular RPC.
Changelog: added
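A sketch of parsing the restricted "start,end" format described above; the function name is illustrative, not the actual Gitaly code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBlameRange parses the simple "start,end" line-range format, e.g.
// "1,100". Anything else that git-blame(1) would otherwise accept
// (open-ended ranges, /regex/ anchors, +offsets) is rejected.
func parseBlameRange(s string) (start, end uint64, err error) {
	parts := strings.SplitN(s, ",", 2)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("range %q is not of the form \"start,end\"", s)
	}
	if start, err = strconv.ParseUint(parts[0], 10, 64); err != nil {
		return 0, 0, fmt.Errorf("invalid start of range: %w", err)
	}
	if end, err = strconv.ParseUint(parts[1], 10, 64); err != nil {
		return 0, 0, fmt.Errorf("invalid end of range: %w", err)
	}
	if start == 0 || start > end {
		return 0, 0, fmt.Errorf("range %q is empty or inverted", s)
	}
	return start, end, nil
}

func main() {
	start, end, err := parseBlameRange("1,100")
	fmt.Println(start, end, err) // 1 100 <nil>
}
```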
|
|
Use rev-list --all --objects --disk-usage to calculate repository usage
See merge request gitlab-org/gitaly!4430
|
|
In the previous commit, we added a Size() method in localrepo that calls
git rev-list --all --objects --disk-usage to calculate the size of the
repository.
Add a feature flag that, when toggled on, will cause RepositorySize to
use this new way of calculating repository size.
Changelog: added
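The size calculation boils down to running one Git command and parsing a single number. A sketch under assumed names (the helpers below are illustrative, not the localrepo API):

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// parseDiskUsage parses the single byte count that
// git rev-list --disk-usage prints on stdout.
func parseDiskUsage(output string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(output), 10, 64)
}

// repositorySize shells out to git to compute the on-disk size of all
// objects reachable from any reference in the repository at repoPath.
func repositorySize(repoPath string) (uint64, error) {
	cmd := exec.Command("git", "-C", repoPath,
		"rev-list", "--all", "--objects", "--disk-usage")
	var out bytes.Buffer
	cmd.Stdout = &out
	if err := cmd.Run(); err != nil {
		return 0, fmt.Errorf("running git rev-list: %w", err)
	}
	return parseDiskUsage(out.String())
}

func main() {
	size, err := parseDiskUsage("4096\n")
	fmt.Println(size, err) // 4096 <nil>
}
```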
|
|
When the pre-receive hook fails, git-receive-pack(1) exits with code 0.
It's arguable whether this is the expected behavior, but anyhow it means
cmd.Wait() did not error out. On the other hand, the gitaly-hooks
command did stop the transaction upon failure. So this final vote fails.
To avoid this error being presented to the end user, ignore it when the
transaction was stopped.
Issue: https://gitlab.com/gitlab-org/gitaly/-/issues/3636
Changelog: fixed
|
|
Add test to check if the error message from a failing custom pre-receive
hook is presented to the user.
|
|
In a future commit we'd like to process the stdout and stderr output
from the ssh push command separately. With this commit the setting up of
the command is extracted into a function so in the future it's possible
to specify where to write stdout/stderr to.
|
|
Praefect is currently disabled in RemoveRepository tests which allows
for Praefect's RemoveRepository behavior to deviate from Gitaly's.
This commit enables Praefect for these tests so we can be sure we catch
behavior deviations between the two handlers.
|
|
Praefect currently proxies DeleteObjectPool calls to Gitalys like
any other mutating RPC. This has the same problem as RemoveRepository
previously had: a pool can be deleted from disk while its metadata
record is still left in the database. This can then cause problems as
Praefect believes a pool replica still exists where none does.
Praefect doesn't even treat DeleteObjectPool as a repository removing
RPC. This means the deletions have never been replicated to the
secondaries and the pool metadata records have been left in place. This
then causes the reconciler to repeatedly attempt to reconcile from a
non-existing primary pool replica to the secondaries.
This commit fixes both issues by handling pool deletions in Praefect.
Similarly to RemoveRepository, the metadata record of the pool is first
deleted and only then is the pool removed from the Gitalys that have
it. This ensures atomicity of the removals when Praefect is
rewriting the replica paths.
With the behavior fixed, the Praefect-specific test assertions can be
removed as the behavior now matches how a plain Gitaly would behave.
The handler in Praefect is tested via the same tests that test the
handler in Gitaly. Adding separate tests doesn't make sense as the
external behavior of the handlers should match, and the handler
shouldn't exist in Praefect if it is removed from Gitaly.
Changelog: fixed
|
|
DeleteObjectPool's tests do not currently test what happens if the
request is missing the pool repository. This commit adds the missing
test case. While at it, it exports the functionality to extract a pool
from the request so Praefect's DeleteObjectPool handler can reuse it
later and pass the same test.
|
|
TestDelete does not verify whether the object pool exists or not
after the call to DeleteObjectPool. This is a somewhat important
assertion to make to ensure the deletion logic is working correctly.
This commit adds that missing assertion.
|
|
smarthttp: Fix test race where response buffer may not be filled
See merge request gitlab-org/gitaly!4447
|
|
If a repository is being linked to an object pool that does not exist,
the current behavior is to create the pool implicitly. This interacts
badly with Praefect, as it doesn't know whether or not a pool was
created and thus can't create a metadata record for the pool. The
functionality was removed behind a feature flag. It doesn't seem like
anything relies on the implicit pool creation, so this commit removes
the feature flag and returns an error when linking a repository to a
non-existent pool.
Changelog: changed
|
|
proto: Introduce new "maintenance" RPC type
Closes #4079
See merge request gitlab-org/gitaly!4399
|
|
git: Globally limit number of threads
Closes #4120
See merge request gitlab-org/gitaly!4443
|
|
repocleaner: Only log repositories that have a mod time older than 24 hours and are not in the Praefect DB
See merge request gitlab-org/gitaly!4449
|
|
In order to make a decision on which repositories to clean up,
Praefect's repocleaner will match repository paths with what it sees in
the database. If it doesn't see a record in the database, that means
Praefect doesn't know about this repository.
However, it could be the case that a repository has just been created
and the record for it doesn't yet exist in the database.
In order for repocleaner to know, it needs the information to be
returned from Gitaly.
Add a field in the RPC response, and return the last modified date on the
refs/ dir so we can use it as a proxy for when the repository was
created.
|
|
We've got all the infrastructure in place now to handle maintenance
operations specially in our infrastructure. Mark all RPCs which are
maintenance-related as maintenance operations to switch them to use the
new infrastructure.
Note that we've still got a feature flag which toggles between old and
new routing behaviour for these RPCs.
Changelog: changed
|
|
When a repository does not have any bitmaps or in case it is missing
bloom filters in the commit graph we always try to do a full repack in
order to generate these data structures. This logic is required because:
- Bitmaps are only generated by git-repack(1) for full repacks. This
limitation exists because there can only be one bitmap per
repository.
- In case commit-graphs exist but are missing bloom filters, we need to
completely rewrite them in order to enable this extension.
While the logic makes sense, it also causes us to repack repositories
which are empty: they don't contain either of these data structures, and
consequentially we think we need a full repack. While this is not a huge
problem because git-repack(1) would finish fast anyway in case the repo
doesn't contain any objects, we still needlessly spawn a process. Also,
our metrics report that we're doing a lot of full repacks, which can be
confusing.
Improve our heuristics by taking into account whether the repository has
objects at all. If there aren't any, we can avoid all this busywork and
just skip repacking altogether.
Changelog: changed
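The improved heuristic can be sketched as a pure function; the struct and names below are illustrative, not Gitaly's actual housekeeping types:

```go
package main

import "fmt"

// repositoryInfo captures the bits of state the heuristic looks at. In a
// real implementation these would be derived from the repository on disk.
type repositoryInfo struct {
	hasObjects      bool
	hasBitmap       bool
	hasBloomFilters bool
}

// needsFullRepack reports whether a full repack is required to generate
// missing bitmaps or commit-graph bloom filters. An empty repository never
// needs one: there is nothing to pack, so generating these data structures
// would be pure busywork.
func needsFullRepack(info repositoryInfo) bool {
	if !info.hasObjects {
		return false
	}
	return !info.hasBitmap || !info.hasBloomFilters
}

func main() {
	fmt.Println(needsFullRepack(repositoryInfo{}))                 // false
	fmt.Println(needsFullRepack(repositoryInfo{hasObjects: true})) // true
}
```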
|
|
We have observed a racy test in production where fetching a hidden ref
via `PostUploadPack` isn't failing in the way we expect it to. While the
error code is there and the RPC call itself has failed, the response
buffer we see is completely empty.
This race can easily be caused by our testing infrastructure though: we
use `makePostUploadPackWithSidechannelRequest()` to set up a sidechannel
and then copy the sidechannel data via a separate Goroutine. In case we
see an error though while handling the sidechannel (which is the case:
we expect to fail fetching the hidden ref), we don't synchronize with
the Goroutine copying the response.
Fix this race by properly synchronizing the Goroutine via a wait group.
|
|
One of our tests is verifying that the `GitConfigOptions` which can be
passed to us via the `PostUploadPackRequest` is applied as expected.
This is done by creating a reference and then injecting config which
tells git-upload-pack(1) to hide this reference. If this config is
applied as expected, then we should see that the server refuses to give
us the object pointed to by the hidden reference.
Refactor this test to be less reliant on the repository state, which is
kind of confusing. The object pointed to by the hidden reference should
in fact already be pointed to by another visible reference, which is
quite puzzling. Furthermore, we also hard-code how many objects we
expect to receive, which is one more thing that depends on this specific
repository.
Convert the test to use an empty repository instead and create two
commits ad-hoc.
|
|
sidechannel: Convert to use runtime directory to store sockets
See merge request gitlab-org/gitaly!4428
|
|
By default, git-pack-objects(1) will spawn as many threads as there are
logical CPU cores on the machine. While this typically makes sense on a
client's machine because a client will want the command to finish as
fast as possible, on the server-side this is less of a good idea because
it can easily impact other, concurrently running Git commands.
This is why we have started to limit the number of threads this command
may use when performing repository maintenance. Packing of objects is in
no way limited to maintenance though: it's also executed e.g. when
serving packfiles.
Let's globally limit the number of threads git-pack-objects(1) uses by
injecting the `pack.threads` configuration into all commands which may
generate packfiles.
Changelog: performance
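Injecting the limit amounts to prepending a config flag to every packfile-generating invocation. A sketch mirroring plain git(1) usage (`git -c pack.threads=N …`), not Gitaly's internal command construction:

```go
package main

import "fmt"

// withPackThreads prepends a pack.threads config entry to a Git
// invocation, capping the number of threads git-pack-objects(1) may
// spawn on the server.
func withPackThreads(args []string, threads int) []string {
	return append([]string{"-c", fmt.Sprintf("pack.threads=%d", threads)}, args...)
}

func main() {
	fmt.Println(withPackThreads([]string{"upload-pack", "/repo"}, 1))
	// [-c pack.threads=1 upload-pack /repo]
}
```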
|
|
When setting up a push for our tests, we also have the ability to inject
feature flags by setting the `featureFlags` field. This has a major
downside though: the feature flags injected via the context and the
feature flags executed at a later point when actually running the
command may easily drift apart. This inconsistency is something that may
break assumptions on our side though, where some parts may now evaluate
a flag to `true` while others evaluate it to `false`.
Fix this inconsistency by instead accepting feature flags via a context.
This both simplifies the code and fixes such an upcoming incompatibility
when we're about to default-enable Git v2.35.1.gl1.
|
|
Our SSH tests need to set up a repository with some new objects so that
it can verify that these objects end up on the remote side after the
push. To create those objects, we currently use a worktree and then
execute git-commit(1). This usage pattern is outdated though: we have
a `gittest.WriteCommit()` helper which does not require a worktree.
Refactor tests to use this new helper so that we can get rid of using a
worktree.
|
|
Upgrade gofumpt to v0.3.1. Besides a very small number of new rules, the
most important benefit is that the new version parallelizes formatting
of files. It is thus much faster, also for our codebase:
$ hyperfine --warmup=1 'make format' 'make format GOFUMPT_VERSION=0.3.0'
Benchmark #1: make format
Time (mean ± σ): 1.238 s ± 0.060 s [User: 1.337 s, System: 0.084 s]
Range (min … max): 1.125 s … 1.312 s 10 runs
Benchmark #2: make format GOFUMPT_VERSION=0.3.0
Time (mean ± σ): 239.5 ms ± 26.0 ms [User: 1.804 s, System: 0.109 s]
Range (min … max): 180.6 ms … 284.7 ms 14 runs
Summary
'make format GOFUMPT_VERSION=0.3.0' ran
5.17 ± 0.61 times faster than 'make format'
Reformat our code with the new version.
|
|
GetArchive: log request hash
See merge request gitlab-org/gitaly!4425
|
|
repository: Structured errors for UserRebaseConfirmable
See merge request gitlab-org/gitaly!4419
|
|
The sidechannel code is used to create sidechannels that circumvent gRPC
for efficiency's sake. The socket backing a sidechannel is currently
created in the system's temporary directory as returned by
`os.MkdirTemp()`. This has the very
real downside that it easily leaks created files and directories in case
we have a logic bug anywhere.
We have recently introduced a new runtime directory that can help us in
this situation: the runtime directory is created once at a central place
and will be cleaned up when shutting down Gitaly. Consequentially, even
if we leaked any sidechannel files, we'd still remove them on shutdown.
And even if we didn't: the runtime directory is designed so that we can
check whether it's in use because it has the current process's PID as
one of its path components. So if a runtime directory exists whose PID doesn't
refer to any existing process it's safe to remove. While we don't have
any such logic yet, it can easily be added at a later point and have all
code which started to use the runtime directory benefit at the same
time.
Migrate the sidechannel code to create sockets in a subdirectory within
the runtime directory called "sidechannel.d" if the runtime directory is
set via the hooks payload.
Changelog: changed
|
|
Currently, UserRebaseConfirmable will return gRPC OK even when there is
an error during rebase or if there is an error during PreReceive when it
calls the GitLab API for an access check.
This change modifies the error handling to return a detailed gRPC
structured error in either of these failure modes.
Changelog: changed
|
|
This will allow us to determine in the future if it is worth adding a
cache for GetArchive.
|
|
With d88d29185 (operations: Return error from UserSquash when resolving
revisions fails, 2022-03-01), we have converted UserSquash to return
structured errors in some cases where it previously returned no errors
at all. Instead, the RPC used to embed the error in the response struct
via the `GitError` field. This has made us blind to actual issues which
happen in production, and furthermore it trips Praefect's logic which
decides whether we need to create replication jobs or not.
Rails has been adapted already in c7bcaadf193 (gitaly_client: Handle
detailed errors for UserSquash, 2022-03-02), and the changes have been
rolled out to production. While we required one follow-up fix to the new
error codes we returned in 6c63998ad (operations: Fix wrong error code
when UserSquash conflicts, 2022-03-11), the rollout has otherwise been
successful.
Remove the feature flag altogether and deprecate the `GitError` field,
which isn't used anymore.
Changelog: changed
|
|
In order to be able to serve Git repositories via a dumb HTTP server,
Git needs to create a bunch of metadata that tells clients what they
need to fetch. This metadata is generated via git-update-server-info(1),
which can be called automatically via both git-receive-pack(1) and
git-repack(1). While the former doesn't do so by default, the latter
does.
For us this is a waste of resources because we don't ever serve repos
via the dumb HTTP protocol. Generating this information thus wastes
both precious CPU cycles and disk space for data that is ultimately
never used by anything. The waste of disk space is even more pronounced
because git-repack(1) doesn't always clean up the temporary files it
uses to atomically update the final files. So when the command gets
killed, we may accumulate more and more temporary files. In extreme
cases we have seen in production, a repository whose on-disk size of
actual data was less than 5GB had accumulated about 35GB of these
temporary files.
Stop generating this information in git-repack(1) completely. Ideally,
we'd do so by injecting configuration into all repack commands, but
there is no such config option right now. Instead, we need to pass the
`-n` flag everywhere we execute git-repack(1).
Note that this doesn't stop generating the data in all places: commands
like git-gc(1) invoke git-repack(1), but we have no ability to tell it
to pass `-n` to git-repack(1). Neither is there a config option which
would allow us to globally disable the generation. The current approach
is thus only a best-effort one as a stop-gap solution while we're in the
process of upstreaming patches which introduce such a config option.
The cleanup logic for existing repositories is added in a later step in
this patch series.
Changelog: performance
|
|
operations: Fix missing votes on squashed commits
Closes #4109
See merge request gitlab-org/gitaly!4417
|
|
The UserSquash RPC computes a squashed commit by first rebasing a range
of commits on top of another commit, and then collapsing these into a
single commit. This RPC is notably different from almost all of our
other RPCs because it never writes any references to disk, and neither
does it ever execute any access checks as the other User RPCs do. This
design is quite weird:
- There is a known race window where the new objects are not
referenced, so they could be pruned by maintenance calls.
- We accept objects into the repository which may not be sanctioned
by our access checks.
- Replication jobs cannot replicate the squashed commit because they
aren't referenced.
- We never perform transactional voting because no references are
updated.
Together these problems show that the RPC call is misdesigned, but
fixing this design would require a bigger refactoring to make it work
alright in Rails.
In this commit we fix the last bullet point though: because this RPC
never performs transactional voting, we're always creating replication
jobs after the call finishes because Praefect didn't observe any
transactional votes. As mentioned though, we don't have any reference to
vote on, so the best we can do is vote on the commit ID of the newly
written squash commit to make sure that it is the same across nodes.
This commit does so by introducing a quarantine directory that is used
to stage all new objects first before they're migrated to the final
repository. We then vote on the object ID of the staged squash commit.
Only if this vote is successful will we successfully commit the object
to disk.
This commit thus solves two things: first it fixes the missing
transactional voting. And second it causes us to discard all objects in
case the RPC errors.
Changelog: fixed
|
|
The housekeeping manager is hosting a latency metric which tracks how
long specific tasks of the housekeeping manager take. But while latency
metrics should typically use the GRPC latency buckets as configured in
the Gitaly configuration, we accidentally didn't in the context of the
housekeeping manager.
Fix this bug by passing in the Prometheus configuration.
|
|
Fix error handling in GetTreeEntries
See merge request gitlab-org/gitaly!4414
|
|
|
|
Changelog: added
|
|
repository: Fix indeterministic voting when creating new repos
Closes #4100
See merge request gitlab-org/gitaly!4402
|
|
operations: Fix wrong error code when UserSquash conflicts
See merge request gitlab-org/gitaly!4403
|
|
When creating repositories we use transactional voting to determine that
the repositories have been created the same on all nodes part of the
transaction. This voting happens after we have seeded the repository,
and the vote is computed by walking through the repository's directory
and hashing all its files. We need to be careful though to skip files
which we know to be indeterministic:
- FETCH_HEAD may contain URLs which are different for each of the
nodes.
- Object packfiles contained in the object database are not
deterministic, mostly because git-pack-objects(1) may use multiple
threads to compute deltas.
Luckily, we do not have to rely on either of these types of files in
order to ensure that the user-visible state of the repository is the
same, so we can indeed just skip them.
While we already have the logic to skip these files, this logic didn't
work alright because we embarrassingly forgot to actually return
`fs.SkipDir` in case we see the object directory. So even though we
thought we skipped these files, in reality we didn't.
This bug has been manifesting in production in the form of CreateFork, which
regularly fails to reach quorum at random on a subset of nodes. The root
cause here is that we use git-clone(1) to seed repository contents of
the fork, which triggers exactly the case of indeterministic packfiles
noted above. So any successful CreateFork RPC call really only succeeded
by pure luck.
Fix this issue by correctly skipping over "object" directories. While at
it, fix how we skip over FETCH_HEAD by returning `nil`: it's a file and
not a directory, so it doesn't make much sense to return `fs.SkipDir`.
Changelog: fixed
|