Age | Commit message (Collapse) | Author |
|
In the past, we had several incidents where repositories ended up with
corrupted loose references immediately after a server was experiencing a
hard shut-down. The root cause of this is that Git historically didn't
flush the contents of loose references to disk before renaming them into
place. This usage is explicitly documented to be invalid by various
filesystems in the Linux kernel. Most importantly, documentation of the
ext4 filesystem says that:
Many broken applications don't use fsync() when replacing existing
files via patterns such as fd = open("foo.new")/write(fd,..)/close(fd)/
rename("foo.new", "foo"), or worse yet, fd = open("foo",
O_TRUNC)/write(fd,..)/close(fd).
With Git v2.36.0 new infrastructure was introduced that provides finer
grained control of exactly what data should be flushed to disk via the
new `core.fsync` option. As part of that effort we have also upstreamed
patches which give us the ability to also flush references to disk
before they're renamed into place.
So let's finally fix this source of corruption on newer Git versions by
enabling fsync(3P) for references.
Changelog: fixed
|
|
Git v2.36.0 has introduced new infrastructure to allow administrators
more fine-grained control over what files Git will synchronize to disks.
This new infrastructure replaces the old `core.fsyncObjectFiles` config,
which we are currently injecting unconditionally. Unfortunately, the new
Git version has started to emit a warning in that case, which would both
be presented to the user and furthermore it would break a bunch of our
tests.
We're thus forced to conditionally inject either the old or the new
config depending on which Git version we're running. Let's do so by
injecting `core.fsync=objects,derived-metadata` instead when running Git
v2.36.0 or newer. This has the same semantics as what we had previously
with `core.fsyncObjectFiles=true`:
- `default` is what Git did by default previously, and expands to
`objects,derived-metadata,-loose-object`. Note that the current
documentation in gitconfig(1) is wrong as it pretends that it
instead expands to `committed,-loose-object`.
- Because we want to also synchronize loose objects though we can
drop the `-loose-object` part.
This fixes compatibility witih v2.36.0 while continuing to do the same
as we did before.
Changelog: fixed
|
|
Git v2.36.0 has introduced the new `core.fsync` infrastructure, which
allows the user more fine-grained control over what exactly should be
synchronized to disk. Add a function which allows us to check whether
this new infrastructure is supported by the current Git version.
|
|
Inline the global Git configuration. This is done so that we can only
conditionally add `core.fsyncObjectFiles`.
|
|
We're about to introduce a version check whenever the Git command
factory creates a new command, which means that the factory will start
to call `GitVersion()` on itself. This breaks the intercepting command
factory though, which will erroneously try to parse the version by using
the output of the intercepting script.
Fix this issue by writing a second wrapper script that detects the
version check in case the version check is not to be intercepted. If so,
then the script will call the real Git binary to determine the version.
|
|
The intercepting Git command factory is currently always using the
intercepting script to derive the current Git version. This only works
alright because we typically don't check for the Git version in most
callsites, and thus the intercepting script's output is never used to
determine the version except for some limited cases. This is about to
change though given that we'll introduce a Git version check into the
command factory itself, which will then break all tests which use the
intercepting command factory.
Add a new option to the intercepting command factory that allows the
caller to only optionally intercept the Git version. By default, we'll
now use the real Git version so that users will continue to work.
|
|
It's currently only possible to pass options intended for the real Git
command factory when constructing an intercepting factory. We're about
to add another option though that changes behaviour of the intercepting
factory itself.
Introduce a new `InterceptingCommandFactoryOption` type to make this
more flexible and allow for the new usecase.
|
|
Due to a new incompatibility in Git with regards to the newly introduced
`core.fsync` option we'll have to inject different global options based
on the Git version. Consequentially, we are forced to detect the Git
version when spawning any Git commands. This creates a chicken-and-egg
problem though: we need to spawn a command via `newCommand()`, which in
turn tries to detect the Git version by running `newCommand()` to run
git-version(1).
Prepare for this upcoming version check by open-coding `GitVersion()` so
that it doesn't use `newCommand()` anymore. Instead, we simply assemble
the arguments ourselves and then use `command.New()`. This gets rid of
the cyclic dependency and paves the path for the new version check.
|
|
operations: Remove feature flag for transactional voting in UserSquash
Closes #4119
See merge request gitlab-org/gitaly!4496
|
|
Ignore verification columns for read-only cache updates
Closes #4159
See merge request gitlab-org/gitaly!4468
|
|
sidechannel: lower yamux close timeout
See merge request gitlab-org/gitaly!4491
|
|
With af4ea3258 (operations: Fix missing votes on squashed commits,
2022-03-18) we have introduced the use of quarantine directories in the
`UserSquash()` RPC to enable transactional voting. This change in
behaviour has been rolled out to production, and the change was enabled
by default in e8413304d (operations: Always use structured errors for
UserSquash, 2022-03-21).
We haven't seen any issues with this change. Remove the feature flag to
always enable it.
Changelog: removed
|
|
repository: RestoreCustomHooks to do transaction voting
Closes #4081
See merge request gitlab-org/gitaly!4481
|
|
[ci skip]
|
|
Lower the yamux StreamCloseTimeout setting from 5 minutes to 1 second
for sidechannel connections.
The motivation for changing this is to make it easier to test client
side cancelation in Workhorse tests.
|
|
Fix typo in documentation of UserCommitFilesRequestHeader
See merge request gitlab-org/gitaly!4422
|
|
Remove Maintenance routing feature flag
See merge request gitlab-org/gitaly!4486
|
|
[ci skip]
|
|
Release expired verification leases periodically
See merge request gitlab-org/gitaly!4478
|
|
[ci skip]
|
|
RestoreCustomHooks changes data in a repository, so it should do voting.
Otherwise, each time we create a replication job for it, which is
useless because we don't replicate hooks.
Add transactional voting to the RestoreCustomHooks RPC by integrating
with the new LockingDirectory.
Changelog: changed
|
|
Sometimes it's useful to be able to lock a directory before modifying
it. This is similar to LockingFileWriter, except it just gives the
ability to lock a directory so another process cannot also lock the
directory.
This will be immediately useful in RestoreCustomHooks where we are about
ot add transactional voting to it. With transactional voting, the
semantics are that the `Prepared` vote is made once the data that is
about to be changed is locked.
This will enable us to lock the custom hooks directory.
Changelog: added
|
|
Since the maintenance routing feature flag has been running in
production without issue, we can now remove it.
Changelog: changed
|
|
The background verifier sets a lease time on a replica when it picks
it up for verification. If the worker dies for some reason, the lease
will remain in place and no other worker will pick up the replica for
verification again until the lease is cleared. The lease itself tells
the maximum time the worker itself would be working on the replica.
After it has been passed, it would be safe for another worker to pick
up the replica for verification again. This commit adds a background
goroutine that periodically releases expired leases so other workers
can take up the work if the original worker failed and did not release
the lease. The 'verificaton_leases' index is added so the query can
efficiently find the replicas with leases acquired to find the stale
ones.
|
|
Update version of danger-files dependency
See merge request gitlab-org/gitaly!4485
|
|
|
|
docs: Document Gitaly backpressure
See merge request gitlab-org/gitaly!4469
|
|
There are a number of knobs in Gitaly to tune backpressure Gitaly can impose on services that call it. This commit documents these.
|
|
Add support for FIPS encryption
See merge request gitlab-org/gitaly!4482
|
|
repocleaner: Allow NewWalker to receive grace period parameter
Closes #4164
See merge request gitlab-org/gitaly!4474
|
|
A default grace period of 6 hours is sufficient for the subcommand.
|
|
Implement 'praefect verify' subcommand
Closes #4091
See merge request gitlab-org/gitaly!4463
|
|
Praefect periodically verifies the repository metadata in the background.
The interval may be too long especially if there was an incident, say
a disk failure. After recovering the Gitaly node, the disk may be completely
empty or may contains an older snapshot which does not contain all expected
repositories. In such cases, it would be great if Praefect could be manually
instructed to verify the storage again as soon as possible rather than waiting
for the next scheduled verification interval to pass. This commit adds the
'praefect verify' subcommand that allows for doing that. It takes in either a
repository id, a virtual storage or a (virtual storage, storage) tuple and
marks all replicas matching the selector as being unverified. Praefect's
metadata verifier will then prioritize verifying these repositories over other
repositories pending verification. This allows administrators to speed up
the verification process and thus recovery.
Changelog: added
|
|
With the introduction of metadata verification, Praefect needs a tool
to manually mark a repository as needing verification immediately rather
than after the specified verification interval has passed. That tool will
require a new RPC that it can call achieve its goal. This commit adds the
proto definitions for MarkUnverified RPC which can be called to either
mark a single repository by ID, a whole virtual storage, or a whole storage
as needing verification.
Changelog: added
|
|
proto: Add LimitError as a structured error
See merge request gitlab-org/gitaly!4476
|
|
This commit adds support of using a FIPS-validated SSL library with
compiled Go executables when `FIPS_MODE=1 make` is run. A Go compiler
that supports BoringSSL either directly (e.g. the `dev.boringcrypto`
branch) or with a dynamically linked OpenSSL
(e.g. https://github.com/golang-fips/go) is required.
This is similar to the changes to support FIPS in GitLab Runner and in
GitLab Pages:
https://gitlab.com/gitlab-org/gitlab-pages/-/merge_requests/716
Changelog: added
|
|
Makefile: Make GITALY_EXECUTABLES deferred again
See merge request gitlab-org/gitaly!4477
|
|
Add new examples for concurrency and rate limiters
See merge request gitlab-org/gitaly!4472
|
|
Recently, in b5c9c7efc (Makefile: Rename find_commands to
GITALY_EXECUTABLES, 2022-03-25), we've changed the variable that holds
the names of all Gitaly executable to be an immediate variable. While
this is a good idea in general, it causes trouble in CI. In CI the
compiled executables are put in cache, but the source files are not. So
when files are pulled from cache, and any make target is built, it will
expand GITALY_EXECUTABLES. Now source files are not pulled from the
cache, so the `cmd` directory is missing. And therefore we revert it
back to be a deferred variable.
|
|
When Gitaly enforces a limit, either due rate limiting or concurrency
limiting, it needs to be able to return an error to its clients to
provide context into why it failed so that clients can then inform its
callers of why the call failed.
Changelog: added
|
|
Expose last verification time in 'praefect metadata'
Closes #4092
See merge request gitlab-org/gitaly!4466
|
|
Read-only cache receives invalidations on record updates via triggers
in Postgres. Currently the notifications are sent for any modification
to the records. The verification related columns are not relevant to
the operation of the cache so this commit ignores the changes to the
columns in the triggers.
Changelog: changed
|
|
Administrator's may want to know when Praefect has last verified a
replica. This commit exposes that information via the 'praefect metadata'
command.
Changelog: changed
|
|
GetRepositoryMetadata fetches a repository's metadata from the
database. This commit expands the query to also fetch the newly added
verified_at column so we can expose it in the 'praefect metadata'
command to the admins.
|
|
Administrators may want to know when a replica has been last verified
by Praefect. GetRepositoryMetadata RPC is called by the 'metadata'
sub-command to retrieve infromation about a repository and its
replicas from Praefect's database. This commit adds the proto
definitions for exposing the last verification time of replicas to
the metadata sub-command.
Changelog: changed
|
|
Initial implementation of a metadata verifier
See merge request gitlab-org/gitaly!4459
|
|
This commit wires the metadata verifier in Praefect's main so it can
actually be configured for use. It's default disabled still as it still
is missing some functionality that should be in place before generally
enabling it, for example tooling like metrics, integration in to the
'praefect metadata' tool and a background routine to release stale leases.
Changelog: added
|
|
This commit adds an initial implementation of a metadata verifier
to Praefect.
Praefect stores metadata of the repositories stored on the cluster in
Postgres. These metadata records may become out of sync with the disks
if changes occur on the disks without going through Praefect, for example
due to disk failures or manual modifications. Right now, Praefect only
contains some temporary logic to clean up invalid metadata records when
replication is attempted using a non-existent source repository. This was
mostly put in place to stop reconciliation loops where Praefect keeps
scheduling replication jobs from the non-existent repository that will
never succeed. While this performs some clean up, it's not sufficient to
catch cases where something happens in the background without prompting
replication.
The metadata verifier introduced in this commit aims to catch these issues
by verifying the metadata eveynow and then in the background with the
state on the disks. For now, only the existence of the replica is verified,
not the actual contents by checksumming.
Each replica contains a 'verified_at' timestamp in the database that tells
Praefect when the metadata record was last verified. If it exceeds a configurable
threshold, the replica is considered to be due for reverification. Praefect
then asks the Gitaly hosting the replica whether the replica still exists.
If it doesn't the invalid metadata record is deleted and the removal is logged.
To avoid multiple Praefects verifying the same replica concurrently, Praefect
acquires the verification lease on the replica in the database prior to
verifying the existence of the repository.
The scheduling is fairly simplistic at the moment with each Praefect acquiring
a batch of work every two seconds. This also serves as a crude way to rate
limit the background verification work rather to avoid consuming too many
resources while doing it. This should be sufficient for now althoug could later
be improved.
Praefect leaves the repository's record in place even if all of its replicas
have been lost. This ensures no data loss goes unnoticed and that the loss
needs to be acknowledged by removing the repository manually.
Changelog: added
|
|
This commit adds the necessary schema changes for the metadata
background verification. Each replica receives two new columns:
1. 'verified_at' which contains the timestamp of the last successful
verification of the replica. This effectively allows for identifying
replicas that are in need of reverification.
2. 'verification_leased_until' which contains a timestamp until which
a worker has acquired a lease to reverify the repository. This prevents
multiple workers from picking the same repository for reverification at
the same time.
'verification_queue' index is added to index replicas which have not been
acquired by any worker. This allows for efficientl querying replicas that
are in need of reverification later.
Changelog: other
|
|
[ci skip]
|