Age | Commit message (Collapse) | Author |
|
If a repository is being linked to an object pool that does not exist,
the current behavior is to create the pool implicitly. This doesn't
bode well with Praefect as it doesn't know whether a pool was or wasn't
created and can't thus create a metadata record for the pool. The
functionality was removed behind a feature flag. It doesn't seem like
anything relies on the implicit pool creation, so this commit removes
the feature flag and returns an error when linking a repository to a
non-existent pool.
Changelog: changed
|
|
proto: Introduce new "maintenance" RPC type
Closes #4079
See merge request gitlab-org/gitaly!4399
|
|
git: Globally limit number of threads
Closes #4120
See merge request gitlab-org/gitaly!4443
|
|
repocleaner: Only log repositories that have a mod time older than 24 hours and are not in the Praefect DB
See merge request gitlab-org/gitaly!4449
|
|
In order to make a decision on which repositories to clean up,
Praefect's repocleaner will match repository paths with what it sees in
the database. If it doesn't see a record in the database, that means
Praefect doesn't know about this repository.
However, it could be the case that a repository has just been created
and the record for it doesn't yet exist in the database.
In order for repocleaner to know, it needs the information to be
returned from Gitaly.
Add a field in the RPC response, and return the last modified date on the
refs/ dir so we can use it as a proxy for when the repository was
created.
|
|
We've got all the infrastructure in place now to handle maintenance
operations specially in our infrastructure. Mark all RPCs which are
maintenance-related as maintenance operations to switch them to use the
new infrastucture.
Note that we've still got a feature flag which toggles between old and
new routing behaviour for these RPCs.
Changelog: changed
|
|
When a repository does not have any bitmaps or in case it is missing
bloom filters in the commit graph we always try to do a full repack in
order to generate these data structures. This logic is required because:
- Bitmaps are only generated by git-repack(1) for full repacks. This
limitation exists because there can only be one bitmap per
repository.
- In case commit-graphs exist but missing bloom filters we need to
completely rewrite them in order to enable this extension.
While the logic makes sense, it also causes us to repack repositories
which are empty: they don't contain either of these data structures, and
consequentially we think we need a full repack. While this is not a huge
problem because git-repack(1) would finish fast anyway in case the repo
doesn't contain any objects, we still needlessly spawn a process. Also,
our metrics report that we're doing a lot of full repacks, which can be
confusing.
Improve our heuristics by taking into account whether the repository has
objects at all. If there aren't any, we can avoid all this busywork and
just skip repacking altogether.
Changelog: changed
|
|
sidechannel: Convert to use runtime directory to store sockets
See merge request gitlab-org/gitaly!4428
|
|
By default, git-pack-objects(1) will spawn as many threads as there are
logical CPU cores on the machine. While this typically makes sense on a
client's machine because a client will want the command to finish as
fast as possible, on the server-side this is less of a good idea because
it can easily impact other, concurrently running Git commands.
This is why we have started to limit the number of threads this command
may use when performing repository maintenance. Packing of objects is in
no way limited to maintenance though, but it's also executed e.g. when
serving packfiles.
Let's globally limit the number of threads git-pack-objects(1) uses by
injecting the `pack.threads` configuration into all commands which may
generate packfiles.
Changelog: performance
|
|
When setting up a push for our tests, we also have the ability to inject
feature flags by setting the `featureFlags` field. This has a major
downside though: the feature flags injected via the context and the
feature flags executed at a later point when actually running the
command may easily drift apart. This inconsistency is something that may
break assumptions on our side though, where some parts may now evaluate
a flag to `true` while others evaluate it to `false`.
Fix this inconsistency by instead accepting feature flags via a context.
This both simplifies the code and fixes such an upcoming incompatibility
when we're about to default-enable Git v2.35.1.gl1.
|
|
Our SSH tests need to set up a repository with some new objects so that
it can verify that these objects end up on the remote side after the
push. To create those objects, we currently use a worktree and then
execute git-commit(1). This usage pattern is outdated though: we have
a `gittest.WriteCommit()` helper which does not require a worktree.
Refactor tests to use this new helper so that we can get rid of using a
worktree.
|
|
Upgrade gofumpt to v0.3.1. Besides a very small number of new rules, the
most important benefit is that the new version parallelizes formatting
of files. It is thus much faster, also for our codebase:
$ hyperfine --warmup=1 'make format' 'make format GOFUMPT_VERSION=0.3.0'
Benchmark #1: make format
Time (mean ± σ): 1.238 s ± 0.060 s [User: 1.337 s, System: 0.084 s]
Range (min … max): 1.125 s … 1.312 s 10 runs
Benchmark #2: make format GOFUMPT_VERSION=0.3.0
Time (mean ± σ): 239.5 ms ± 26.0 ms [User: 1.804 s, System: 0.109 s]
Range (min … max): 180.6 ms … 284.7 ms 14 runs
Summary
'make format GOFUMPT_VERSION=0.3.0' ran
5.17 ± 0.61 times faster than 'make format'
Reformat our code with the new version.
|
|
GetArchive: log request hash
See merge request gitlab-org/gitaly!4425
|
|
repository: Structured errors for UserRebaseConfirmable
See merge request gitlab-org/gitaly!4419
|
|
The sidechannel code is used to create sidechannels that circumvent gRPC
for efficiency's sake. This socket is currently created in the system's
temporary directory as returned by `os.MkdirTemp()`. This has the very
real downside that it easily leaks created files and directories in case
we have a logic bug anywhere.
We have recently introduced a new runtime directory that can help us in
this situation: the runtime directory is created once at a central place
and will be cleaned up when shutting down Gitaly. Consequentially, even
if we leaked and sidechannel files, we'd still remove them on shutdown.
And even if we didn't: the runtime directory is designed so that we can
check whether it's used because it has the current process's PID as part
of its component. So if a runtime directory exists whose PID doesn't
refer to any existing process it's safe to remove. While we don't have
any such logic yet, it can easily be added at a later point and have all
code which started to use the runtime directory benefit at the same
time.
Migrate the sidechannel code to create sockets in a subdirectory within
the runtime directory called "sidechannel.d" if the runtime directory is
set via the hooks payload.
Changelog: changed
|
|
Currently, UserRebaseConfirmable will return gRPC OK even when there is
an error during rebase or if there is an error during PreReceive when it
calls the GitLab API for an access check.
This change modifies the error handling to return a detailed gRPC
structured error in either of these failure modes.
Changelog: changed
|
|
This will allow us to determine in the future if it is worth adding a
cache for GetArchive.
|
|
With d88d29185 (operations: Return error from UserSquash when resolving
revisions fails, 2022-03-01), we have converted UserSquash to return
structured errors in some cases where it previously returned no errors
at all. Instead, the RPC used to embed the error in the response struct
via the `GitError` field. This has made us blind for actual issues which
happen in production, and furthermore it trips Praefect's logic which
decides whether we need to create repliction jobs or not.
Rails has been adapted already in c7bcaadf193 (gitaly_client: Handle
detailed errors for UserSquash, 2022-03-02), and the changes have been
rolled out to production. While we required one follow-up fix to the new
error codes we returned in 6c63998ad (operations: Fix wrong error code
when UserSquash conflicts, 2022-03-11), the rollout has otherwise been
successful.
Remove the feature flag altogether and deprecated the `GitError` field,
which isn't used anymore.
Changelog: changed
|
|
In order to be able to serve Git repositories via a dumb HTTP server,
Git needs to create a bunch of metadata that tells clients what they
need to fetch. This metadata is generated via git-update-server-info(1),
which is can be automatically called via both git-receive-pack(1) and
git-repack(1). While the former doesn't do so by default, the latter
does.
For us this is a waste of resources because we don't ever serve repos
via the dumb HTTP protocol. Generating this information thus wastes
both precious CPU cycles and disk space for data that is ultimately
never used by anything. The waste of disk space is even more pronounced
because git-repack(1) doesn't always clean up the temporary files it
uses to atomically update the final files. So when the command gets
killed, we may accumulate more and more temporary files. In extreme
cases we have seen in production, a repository whose on-disk size of
actual data was less than 5GB had accumulated about 35GB of these
temporary files.
Stop generating this information in git-repack(1) completely. Ideally,
we'd do so by injecting configuration into all repack commands, but
there is no such config option right now. Instead, we need to pass the
`-n` flag everywhere we execute git-repack(1).
Note that this doesn't stop generating the data in all places: commands
like git-gc(1) invoke git-repack(1), but we have no ability to tell it
to pass `-n` to git-repack(1). Neither is there a config option which
would allow us to globally disable the generation. The current approach
is thus only a best-effort one as a stop-gap solution while we're in the
process of upstreaming patches which introduce such a config option.
The cleanup logic for existing repositories is added in a later step in
this patch series.
Changelog: performance
|
|
operations: Fix missing votes on squashed commits
Closes #4109
See merge request gitlab-org/gitaly!4417
|
|
The UserSquash RPC computes a squashed commit by first rebasing a range
of commits on top of another commit, and then collapsing these into a
single commit. This RPC is notably different from almost all of our
other RPCs because it never writes any references to disk, and neither
does it ever execute any access checks as the other User RPCs do. This
design is quite weird:
- There is a known race window where the new objects are not
referenced, so they could be pruned by maintenance calls.
- We accept objects into the repository which may not be sanctioned
by our access checks.
- Replication jobs cannot replicate the squashed commit because they
aren't referenced.
- We never perform transactional voting because no references are
updated.
Together these problems show that the RPC call is misdesigned, but
fixing this design would require a bigger refactoring to make it work
alright in Rails.
In this commit we fix the last bullet point though: because this RPC
never performs transactional voting, we're always creating replication
jobs after the call finishes because Praefect didn't observe any
transactional votes. As mentioned though, we don't have any reference to
vote on, so the best we can do is vote on the commit ID of the newly
written squash commit to make sure that it is the same across nodes.
This commit does so by introducing a quarantine directory that is used
to stage all new objects first before they're migrated to the final
repository. We then vote on the object ID of the staged squash commit.
Only if this vote is successful will we successfully commit the object
to disk.
This commit thus solves two things: first it fixes the missing
transactional voting. And second it causes us to discard all objects in
case the RPC errors.
Changelog: fixed
|
|
The housekeeping manager is hosting a latency metric which tracks how
long specific tasks of the housekeeping manager take. But while latency
metrics should typically use the GRPC latency buckets as configured in
the Gitaly configuration, we accidentally didn't in the context of the
housekeeping manager.
Fix this bug by passing in the Prometheus configuration.
|
|
Fix error handling in GetTreeEntries
See merge request gitlab-org/gitaly!4414
|
|
|
|
Changelog: added
|
|
repository: Fix indeterministic voting when creating new repos
Closes #4100
See merge request gitlab-org/gitaly!4402
|
|
into 'master'
operations: Fix wrong error code when UserSquash conflicts
See merge request gitlab-org/gitaly!4403
|
|
When creating repositories we use transactional voting to determine that
the repositories have been created the same on all nodes part of the
transaction. This voting happens after we have seeded the repository,
and the vote is computed by walking through the repository's directory
and hashing all its files. We need to be careful though to skip files
which we know to be indeterministic:
- FETCH_HEAD may contain URLs which are different for each of the
nodes.
- Object packfiles contained in the object database are not
deterministic, mostly because it may use multiple threads to
compute deltas.
Luckily, we do not have to rely on either of both types of files in
order to ensure that the user-visible state of the repository is the
same, so we can indeed just skip them.
While we already have the logic to skip these files, this logic didn't
work alright because we embarassingly forgot to actually return
`fs.SkipDir` in case we see the object directory. So even though we
thought we skipped these files, in reality we didn't.
This bug has been manifesting in production in form of CreateFork, which
regularly fails to reach quorum at random on a subset of nodes. The root
cause here is that we use git-clone(1) to seed repository contents of
the fork, which triggers exactly the case of indeterministic packfiles
noted above. So any successful CreateFork RPC call really only succeeded
by pure luck.
Fix this issue by correctly skipping over "object" directories. While at
it, fix how we skip over FETCH_HEAD by returning `nil`: it's a file and
not a directory, so it doesn't make much sense to return `fs.SkipDir`.
Changelog: fixed
|
|
repository: Add updateHeadFromBundle in CreateRepositoryFromBundle
Closes #4086
See merge request gitlab-org/gitaly!4401
|
|
|
|
Extend invalid metadata deletion logic to repos existin on target
Closes #4083
See merge request gitlab-org/gitaly!4396
|
|
With the new improved error hadnling in UserSquash we're now returning
errors in some cases where we previously didn't. One of those cases is
when the rebase performed during the squash results in a merge conflict.
While it is correct to return an error in this case, we're using an
Internal error code for this case, which indicates that Gitaly is to
blame instead of the parameters which have been passed by the user.
Fix the error code to instead be FailedPrecondition. This error code is
special-cased by our monitoring infrastructure to not raise any alerts.
Note that this change is only fixing issues with monitoring: Rails
handles the error alright by inspecting the error details instead of the
error code.
Changelog: fixed
|
|
Disable implicit pool creation on link behind a feature flag
See merge request gitlab-org/gitaly!4397
|
|
operations: Implement structured errors for UserSquash
See merge request gitlab-org/gitaly!4374
|
|
This commit adds a feature flag to disable the implicit pool creation
when joining a repository to a non-existent object pool. It looks like
this behavior is not needed given Rails should create the object pool
prior to considering it ready for joining. Disabling this behavior makes
it easier to handle pools in Praefect as Praefect needs to create metadata
records for all of the repositories. This implicit behavior might create
a repository without Praefect's knowledge. Special handling would be
required for this particular RPC only, so let's see if we can remove the
behavior rather than implement it in Praefect.
|
|
Praefect contains logic to delete metadata record of a non-existent
source repository during replication. Praefect checks for the
ErrInvalidSourceRepository error, and if it receives it, it deletes
the metadata record of the source repository. Right now, the error
is only returned from ReplicateRepository if the source repository
didn't exist when the target is creating the repository for the
first time. This means Praefect's clean up logic doesn't run if the
target already has the repository on the disk but the source repository
doesn't exist. This can lead to reconciliation loops where Praefect
keeps trying to replicate from the non-existent source repository
over and over again as all of the replications fail. This commit
changes ReplicateRepository to return the error when the source
doesn't exist but the target repository does.
This logic was initially implemented to break the reonciliation loops
that occurred due to Praefect creating invalid metadata records when
a repository creation failed. As DeleteObjectPool RPC was not handled
as a repository deleting RPC in Praefect, the metadata records were
left in place which is currently causing reconciliation loops. Changing
the error return allows for Praefect to clean up those invalid metadata
records. In the long run, Praefect's background metadata verifier would
capture such issues and this logic could be removed.
Changelog: fixed
|
|
In GitLab rails, there is code to ensure that the default branch on a
new repository is set. This is currently done with an extra WriteRef
call. This is done in a racy way, which causes problems with transaction
voting. To fix this, we can simplify the logic altogether and avoid
adding another roundtrip, add a parameter in CreateRepositoryRequest
to pass a default_branch.
Changelog: changed
|
|
The PreReceiveError field in the UserMergeToRefResponse isn't ever set
because the RPC call doesn't in fact perform any access checks at all.
Digging back through history this doesn't seem to be an oversight in the
Go port, but was the case in the Ruby implementation already. While the
initial implementation of UserMergeToRef in 8fbc337b8 (Allow merging to
a ref without changing the target branch, 2019-01-25) still used to call
hooks, this was explicitly changed with 19fdfa12a (Skip hooks for
UserMergeToRef RPC, 2019-06-14) to not do so anymore. The field thus
hasn't been set anymore for more than two years by now.
Deprecate the PreReceiveError field so that we can eventually remove it.
Changelog: deprecated
|
|
Add skip_content param to wiki find page call
See merge request gitlab-org/gitaly!4373
|
|
|
|
We do not return an error in case creating the squashed merge commit in
UserSquash call fails. This makes us blind for the error conditions in
this RPC, which is in turn creating problems in Praefect setups where we
have to rely on errors in order to decide whether we have to create
replication jobs or not.
Return an error in case the merge fails. Note that we don't return a
structured error in this case: the range of commits we're about to
squash-merge has already been rebased, so we do not expect to ever see a
merge conflict here. As a consequence, a failure to create the merge
commit is unexpected and thus doesn't warrant its own error type.
Changelog: changed
|
|
We do not return an error in case resolving rebasing commits in
UserSquash call fails. This makes us blind for the error conditions in
this RPC, which is in turn creating problems in Praefect setups where we
have to rely on errors in order to decide whether we have to create
replication jobs or not.
Return structured errors in case the error condition triggers such that
callers can tell what kind of error they have been hitting.
Changelog: changed
|
|
We do not return an error in case resolving either the start or end
revision of a UserSquash call fails. This makes us blind for the error
conditions in this RPC, which is in turn creating problems in Praefect
setups where we have to rely on errors in order to decide whether we
have to create replication jobs or not.
Return structured errors in case the error condition triggers such that
callers can tell what kind of error they have been hitting.
Changelog: changed
|
|
Add a short documentation snippet to UserSquash so that we can get rid
of the previous linting exception.
|
|
localrepo: Remove flag to switch to sidechannels for internal fetches
Closes #4065
See merge request gitlab-org/gitaly!4375
|
|
Enable CloneRepositoryFromURL to authenticate with custom token
See merge request gitlab-org/gitaly!4239
|
|
In commit 8e5cb6419 (git: Convert internal fetches to use sidechannel,
2022-02-16), we have converted internalf teches to use the sidechannel.
With this change, communication exchanged between git-upload-pack(1) and
git-fetch(1) don't use gRPC encoding anymore, but instead use pkt-line
encoding such that we can avoid having to serialize and deserialize lots
of gRPC messages. This avoids the overhead induced by grpc-go and should
thus reduce the load while serving an internal fetch.
This change has been rolled out to production on February 23rd without
any issues observed. Remove the feature flag to unconditionally enable
use of the sidechannel.
Changelog: changed
|
|
SetDefaultBranch sets HEAD to a reference using git-symbolic-ref.
However, git-symbolic-ref does not call the reference-transaction hook
so we have no way of voting. Praefect will detect that the vote failed
and schedule a replication job each time. This leads to unnecessary
replication jobs being scheduled.
Modify SetDefaultBranch to do its own operation using a safe locking
file writer which supports transaction voting instead of calling
git-symbolic-ref.
Update all the callsites to call SetDefaultBranch with a transaction
manager.
Protect this change behind a feature flag write_ref_manual_voting.
Changelog: changed
|
|
CheckRefFormat calls git-check-ref-format(1) and if there is an error
from the command, returns a separate CheckRefFormat error. This doesn't
help much however. It swallows up the context of the original error. But
also, the implementation of git-check-ref-format(1) is such that if the
format is not valid, it will return with an exit code of 1.
Clean up the interface of this function by instead, checking if the
exit code of git-check-ref-format(1) is 1. If it is, then we know the
format is invalid. If the exit code is anything else, something went
wrong so we bubble up the error to be handled by the caller. We can also
get rid of CheckRefFormatError.
Update the one callsite that checks for CheckRefFormatError
|
|
Quarantine directories are set in Gitaly to receive objects into a
separate directory from the repository's main object database. The
tests are not currently account for relative path rewriting that
Praefect would do. To do so, this commit changes the tests to
set the quarantine directories based on the rewritten relative path.
The API still needs to be called with the relative path so Praefect
can route the request correctly.
|