Age | Commit message (Collapse) | Author |
|
When creating a new cached catfile process we need to decorrelate it
from the parent context. For one this means that we need to remove the
correlation ID and any tracing spans from the context. And second it
means we need to suppress cancellation of the parent so that our cached
process doesn't get killed when the RPC context is cancelled.
The way we do this right now is to use `context.Background()`: this just
creates a completely new background. Funny enough, we still try to strip
information from this new context, which doesn't make any sense given
that it's empty already. But this hints at an actual bug: the intent is
not to get a completely new context, but by using `context.Background()`
we also lose information such as feature flags here.
Fix this bug by not creating a completely new context. Instead, we just
suppress context cancellation via `helper.SuppressCancellation()` and
then continue to decorrelate it like we always used to do. Only that now
it actually does something.
Changelog: fixed
|
|
The catfile cache is using Goroutines to clean up processes. This is not
deterministic and thus frequently causes our tests to flake when we lose
one of the races caused by this. Furthermore, the asynchronicity makes
the code quite hard to understand.
Refactor the code to use synchronous cancellation. Instead of waiting
for the context to get cancelled, it is now the caller that is
responsible for killing the process. The caller transparently either
gets a cancellation function that
- kills the process directly in case it is an uncachable process.
- or returns the process to the cache in case it is a cachable
process.
This significantly simplifies the code and also makes it deterministic,
which fixes the flaky tests we have observed. Furthermore, this also
allows us to get rid of one more Goroutine per cached catfile process.
Note that this allows us to get rid of the signalling logic we had in
place previous to this change: there is no need to wait for processes to
return to the cache in our tests anymore because this is now guaranteed
after the cancellation function was called.
|
|
Both the `CatfileReader` and `CatfileInfoReader` have the same logic to
tear down the process when the context gets cancelled. This logic can
easily be unified into the cache: it already got a Goroutine that waits
for the context to get cancelled to decrement the process counter, so it
is a perfect fit to also host the other logic.
Refactor the code to do this. While this reduces complexity, it also
cuts the number of Goroutines in half.
|
|
The catfile cache still mentions "batch" in a lot of places, even though
what it's actually handling is processes that may or may not be a batch
process.
Adjust the variable names to instead say "process".
|
|
operations: Remove feature flag for transactional voting in UserSquash
Closes #4119
See merge request gitlab-org/gitaly!4496
|
|
Ignore verification columns for read-only cache updates
Closes #4159
See merge request gitlab-org/gitaly!4468
|
|
sidechannel: lower yamux close timeout
See merge request gitlab-org/gitaly!4491
|
|
With af4ea3258 (operations: Fix missing votes on squashed commits,
2022-03-18) we have introduced the use of quarantine directories in the
`UserSquash()` RPC to enable transactional voting. This change in
behaviour has been rolled out to production, and the change was enabled
by default in e8413304d (operations: Always use structured errors for
UserSquash, 2022-03-21).
We haven't seen any issues with this change. Remove the feature flag to
always enable it.
Changelog: removed
|
|
repository: RestoreCustomHooks to do transaction voting
Closes #4081
See merge request gitlab-org/gitaly!4481
|
|
[ci skip]
|
|
Lower the yamux StreamCloseTimeout setting from 5 minutes to 1 second
for sidechannel connections.
The motivation for changing this is to make it easier to test client
side cancelation in Workhorse tests.
|
|
Fix typo in documentation of UserCommitFilesRequestHeader
See merge request gitlab-org/gitaly!4422
|
|
Remove Maintenance routing feature flag
See merge request gitlab-org/gitaly!4486
|
|
[ci skip]
|
|
Release expired verification leases periodically
See merge request gitlab-org/gitaly!4478
|
|
[ci skip]
|
|
RestoreCustomHooks changes data in a repository, so it should do voting.
Otherwise, each time we create a replication job for it, which is
useless because we don't replicate hooks.
Add transactional voting to the RestoreCustomHooks RPC by integrating
with the new LockingDirectory.
Changelog: changed
|
|
Sometimes it's useful to be able to lock a directory before modifying
it. This is similar to LockingFileWriter, except it just gives the
ability to lock a directory so another process cannot also lock the
directory.
This will be immediately useful in RestoreCustomHooks where we are about
ot add transactional voting to it. With transactional voting, the
semantics are that the `Prepared` vote is made once the data that is
about to be changed is locked.
This will enable us to lock the custom hooks directory.
Changelog: added
|
|
Since the maintenance routing feature flag has been running in
production without issue, we can now remove it.
Changelog: changed
|
|
The background verifier sets a lease time on a replica when it picks
it up for verification. If the worker dies for some reason, the lease
will remain in place and no other worker will pick up the replica for
verification again until the lease is cleared. The lease itself tells
the maximum time the worker itself would be working on the replica.
After it has been passed, it would be safe for another worker to pick
up the replica for verification again. This commit adds a background
goroutine that periodically releases expired leases so other workers
can take up the work if the original worker failed and did not release
the lease. The 'verificaton_leases' index is added so the query can
efficiently find the replicas with leases acquired to find the stale
ones.
|
|
Update version of danger-files dependency
See merge request gitlab-org/gitaly!4485
|
|
|
|
docs: Document Gitaly backpressure
See merge request gitlab-org/gitaly!4469
|
|
There are a number of knobs in Gitaly to tune backpressure Gitaly can impose on services that call it. This commit documents these.
|
|
Add support for FIPS encryption
See merge request gitlab-org/gitaly!4482
|
|
repocleaner: Allow NewWalker to receive grace period parameter
Closes #4164
See merge request gitlab-org/gitaly!4474
|
|
A default grace period of 6 hours is sufficient for the subcommand.
|
|
Implement 'praefect verify' subcommand
Closes #4091
See merge request gitlab-org/gitaly!4463
|
|
Praefect periodically verifies the repository metadata in the background.
The interval may be too long especially if there was an incident, say
a disk failure. After recovering the Gitaly node, the disk may be completely
empty or may contains an older snapshot which does not contain all expected
repositories. In such cases, it would be great if Praefect could be manually
instructed to verify the storage again as soon as possible rather than waiting
for the next scheduled verification interval to pass. This commit adds the
'praefect verify' subcommand that allows for doing that. It takes in either a
repository id, a virtual storage or a (virtual storage, storage) tuple and
marks all replicas matching the selector as being unverified. Praefect's
metadata verifier will then prioritize verifying these repositories over other
repositories pending verification. This allows administrators to speed up
the verification process and thus recovery.
Changelog: added
|
|
With the introduction of metadata verification, Praefect needs a tool
to manually mark a repository as needing verification immediately rather
than after the specified verification interval has passed. That tool will
require a new RPC that it can call achieve its goal. This commit adds the
proto definitions for MarkUnverified RPC which can be called to either
mark a single repository by ID, a whole virtual storage, or a whole storage
as needing verification.
Changelog: added
|
|
proto: Add LimitError as a structured error
See merge request gitlab-org/gitaly!4476
|
|
This commit adds support of using a FIPS-validated SSL library with
compiled Go executables when `FIPS_MODE=1 make` is run. A Go compiler
that supports BoringSSL either directly (e.g. the `dev.boringcrypto`
branch) or with a dynamically linked OpenSSL
(e.g. https://github.com/golang-fips/go) is required.
This is similar to the changes to support FIPS in GitLab Runner and in
GitLab Pages:
https://gitlab.com/gitlab-org/gitlab-pages/-/merge_requests/716
Changelog: added
|
|
Makefile: Make GITALY_EXECUTABLES deferred again
See merge request gitlab-org/gitaly!4477
|
|
Add new examples for concurrency and rate limiters
See merge request gitlab-org/gitaly!4472
|
|
Recently, in b5c9c7efc (Makefile: Rename find_commands to
GITALY_EXECUTABLES, 2022-03-25), we've changed the variable that holds
the names of all Gitaly executable to be an immediate variable. While
this is a good idea in general, it causes trouble in CI. In CI the
compiled executables are put in cache, but the source files are not. So
when files are pulled from cache, and any make target is built, it will
expand GITALY_EXECUTABLES. Now source files are not pulled from the
cache, so the `cmd` directory is missing. And therefore we revert it
back to be a deferred variable.
|
|
When Gitaly enforces a limit, either due rate limiting or concurrency
limiting, it needs to be able to return an error to its clients to
provide context into why it failed so that clients can then inform its
callers of why the call failed.
Changelog: added
|
|
Expose last verification time in 'praefect metadata'
Closes #4092
See merge request gitlab-org/gitaly!4466
|
|
Read-only cache receives invalidations on record updates via triggers
in Postgres. Currently the notifications are sent for any modification
to the records. The verification related columns are not relevant to
the operation of the cache so this commit ignores the changes to the
columns in the triggers.
Changelog: changed
|
|
Administrator's may want to know when Praefect has last verified a
replica. This commit exposes that information via the 'praefect metadata'
command.
Changelog: changed
|
|
GetRepositoryMetadata fetches a repository's metadata from the
database. This commit expands the query to also fetch the newly added
verified_at column so we can expose it in the 'praefect metadata'
command to the admins.
|
|
Administrators may want to know when a replica has been last verified
by Praefect. GetRepositoryMetadata RPC is called by the 'metadata'
sub-command to retrieve infromation about a repository and its
replicas from Praefect's database. This commit adds the proto
definitions for exposing the last verification time of replicas to
the metadata sub-command.
Changelog: changed
|
|
Initial implementation of a metadata verifier
See merge request gitlab-org/gitaly!4459
|
|
This commit wires the metadata verifier in Praefect's main so it can
actually be configured for use. It's default disabled still as it still
is missing some functionality that should be in place before generally
enabling it, for example tooling like metrics, integration in to the
'praefect metadata' tool and a background routine to release stale leases.
Changelog: added
|
|
This commit adds an initial implementation of a metadata verifier
to Praefect.
Praefect stores metadata of the repositories stored on the cluster in
Postgres. These metadata records may become out of sync with the disks
if changes occur on the disks without going through Praefect, for example
due to disk failures or manual modifications. Right now, Praefect only
contains some temporary logic to clean up invalid metadata records when
replication is attempted using a non-existent source repository. This was
mostly put in place to stop reconciliation loops where Praefect keeps
scheduling replication jobs from the non-existent repository that will
never succeed. While this performs some clean up, it's not sufficient to
catch cases where something happens in the background without prompting
replication.
The metadata verifier introduced in this commit aims to catch these issues
by verifying the metadata eveynow and then in the background with the
state on the disks. For now, only the existence of the replica is verified,
not the actual contents by checksumming.
Each replica contains a 'verified_at' timestamp in the database that tells
Praefect when the metadata record was last verified. If it exceeds a configurable
threshold, the replica is considered to be due for reverification. Praefect
then asks the Gitaly hosting the replica whether the replica still exists.
If it doesn't the invalid metadata record is deleted and the removal is logged.
To avoid multiple Praefects verifying the same replica concurrently, Praefect
acquires the verification lease on the replica in the database prior to
verifying the existence of the repository.
The scheduling is fairly simplistic at the moment with each Praefect acquiring
a batch of work every two seconds. This also serves as a crude way to rate
limit the background verification work rather to avoid consuming too many
resources while doing it. This should be sufficient for now althoug could later
be improved.
Praefect leaves the repository's record in place even if all of its replicas
have been lost. This ensures no data loss goes unnoticed and that the loss
needs to be acknowledged by removing the repository manually.
Changelog: added
|
|
This commit adds the necessary schema changes for the metadata
background verification. Each replica receives two new columns:
1. 'verified_at' which contains the timestamp of the last successful
verification of the replica. This effectively allows for identifying
replicas that are in need of reverification.
2. 'verification_leased_until' which contains a timestamp until which
a worker has acquired a lease to reverify the repository. This prevents
multiple workers from picking the same repository for reverification at
the same time.
'verification_queue' index is added to index replicas which have not been
acquired by any worker. This allows for efficientl querying replicas that
are in need of reverification later.
Changelog: other
|
|
[ci skip]
|
|
README: Add link to backpressure video
See merge request gitlab-org/gitaly!4475
|
|
Add a link in the presentations section of the README to a video
explaining the different ways to configure request limits (or
backpressure) in Gitaly.
|
|
The Walker is used by both the periodic job that checks for untracked
repositories as well as the list-untracked-repositories subcommand. It
is useful to be able to give it a different grace period value. Add this
parameter to the signature.
|
|
Expose command stats (rusage) metrics via prometheus
See merge request gitlab-org/gitaly!4464
|