Age | Commit message | Author |
|
The post-receive hook run by git-receive-pack(1) is special compared to
all the other hooks because any errors returned by it have no impact on
the end result of git-receive-pack(1). This is because the hook runs
after references have been updated already, so it cannot have any impact
on the end result anymore.
Our hooks updater, which emulates roughly the same semantics for hooks
execution as git-receive-pack(1), behaves differently though and will in
fact report any custom-hook errors to the caller. As a result, the
impact of an error code depends on how the post-receive hook is
executed.
Align the behaviour of the hooks updater to match what Git is doing and
ignore errors returned by custom hooks. Any custom hooks which relied on
this exact behaviour were broken already because the error code wouldn't
have changed the outcome, and neither would the hook work correctly when
executed by git-receive-pack(1). This thus shouldn't be a breaking
change, but at most fixes behaviour in the context of misbehaving post-receive
hook scripts.
Changelog: fixed
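The semantics described above can be sketched as follows. This is a minimal illustration, not Gitaly's actual API: `runHook` and `dispatchHook` are hypothetical names standing in for the hooks updater.

```go
package main

import (
	"errors"
	"fmt"
)

// runHook is a stand-in for executing a custom hook script. To simulate a
// misbehaving hook, it always fails.
func runHook(name string) error {
	return errors.New(name + " exited with status 1")
}

// dispatchHook mirrors git-receive-pack(1)'s semantics: post-receive runs
// after references have already been updated, so its error cannot change
// the outcome and is swallowed. Errors of all other hooks propagate.
func dispatchHook(name string) error {
	err := runHook(name)
	if name == "post-receive" {
		// References are updated already; ignore the hook's error.
		return nil
	}
	return err
}

func main() {
	fmt.Println(dispatchHook("post-receive")) // <nil>
	fmt.Println(dispatchHook("pre-receive") != nil)
}
```

Under these assumed semantics, only the pre-receive failure can abort the push.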
|
|
When we get reports from customers about issues with running hooks it's
always hard to figure out what exactly is happening. Part of the problem
is that the errors we return from the `UpdaterWithHooks` aren't properly
wrapped and don't have any breadcrumbs. As a result, it's hard to see
without digging deep what code path has actually caused an error.
Improve the situation by properly wrapping errors. While this changes
the error messages, callers shouldn't rely on them anymore anyway and
instead only ever access the structured errors, which remain unchanged.
|
|
Custom hook errors must be returned to the caller without modification
given that they may contain special prefixes that are parsed by the
Rails application. We have thus introduced the CustomHookError struct so
that we can unwrap that kind of error and then pass the error message to
the caller directly. But while we properly unwrap the error everywhere,
we don't in fact bubble up the contained error as expected.
Fix this and bubble up the unwrapped error's message. This allows us to
change error messages in those code paths to properly leave breadcrumbs
of the error trace.
|
|
go: Update gitlab-shell dependency to v14.8.0
See merge request gitlab-org/gitaly!4690
|
|
testserver: Detect when Praefect is dying while waiting for it
See merge request gitlab-org/gitaly!4710
|
|
Pin auxiliary binaries the main Gitaly binary invokes
Closes #3253 and #4303
See merge request gitlab-org/gitaly!4670
|
|
Our makefile forcefully rebuilds Gitaly binaries on every run. This
causes the prepare-tests target to always rebuild the packed binaries.
While the packed executables are a dependency that should be built prior
to running tests, they can't be built in CI due to the CI running
without privileges. Skip rebuilding the packed executables as part of
prepare-tests if we are running in CI. SKIP_RSPEC_BUILD is used for the
same reason to skip rspec build. This commit renames the variable to a
more general name and skips the packed binary building in CI during
tests.
|
|
We have to maintain backwards compatibility between Gitaly and its
auxiliary binaries as they may originate from different builds during
an upgrade. This has caused incidents in the past and slows down
iteration velocity. To avoid the incidents and the burden of backwards
compatibility, it would be great if we could guarantee Gitaly
only ever invokes the auxiliary binaries from the same build. We now
have the infrastructure in place to locate binaries on disk and to
pack the auxiliary binaries in the main Gitaly binary to ship them
as a single deployment. This commit wires up the final changes to
ensure the correct binaries always get invoked.
The Gitaly binary contains the auxiliary binaries packed into itself. On
start up, Gitaly will unpack the binaries into the runtime directory.
The BinaryPath helper used to locate binaries is updated to correctly locate
the unpacked binaries from the runtime directory. Since each Gitaly process
has its own runtime directory, the upgrades do not affect the binaries
unpacked there. Replacing the gitaly binary on the disk itself is atomic.
The combination ensures that the correct binaries are always invoked.
The testing infrastructure is updated to also locate the packed binaries
in the runtime directory. The auxiliary binaries are not always built prior
to tests but as a part of them. Most of the tests thus do not exercise
the packing or unpacking but only the fact that certain binaries are used
from the runtime directory.
The packed binaries are still installed so they'll be present during the
upgrade if the old Gitaly uses them. We can stop installing them in the
next release.
Changelog: fixed
|
|
Gitaly needs to carefully maintain backwards compatibility with its
auxiliary commands to avoid issues during upgrades. Omnibus and source
deployments upgrade the binaries in place which can cause a running
Gitaly to invoke a binary from the newer release. This has been the
root cause in multiple production issues. To avoid having to maintain
backwards compatibility between Gitaly and its auxiliary binaries, we
can pin the binaries in a manner that Gitaly only ever invokes the
auxiliary binaries from the same build as the main binary. To do so,
we can pack the auxiliary binaries into the main Gitaly binary and
unpack them into a temporary directory when starting up. This way the
upgrades happen atomically, as replacing the gitaly binary is an
atomic operation and it contains the auxiliary binaries as well.
This commit adds the functionality to pack and unpack binaries in the
main Gitaly binary. When building the Gitaly binary, the auxiliary
binaries are taken from _build/bin and embedded in the main binary.
UnpackAuxiliaryBinaries can then be called on starting Gitaly to unpack
the binaries into a temporary directory where they can be invoked from.
In order to embed the binaries in Gitaly during the build, they have to
be built prior to building Gitaly. Makefile is updated to set the packed
binaries as dependencies of Gitaly's build. Gitaly depends on the fully
built binaries so the GNU build-id remains intact in the packed binaries.
The packed binaries are also set as dependencies of the prepare-tests target
so that they exist for the tests which exercise the packing-related code.
In a similar fashion, the lint and notice targets require the packed
binaries to be present, as the build would otherwise fail when the
embedding cannot find them; the packed executables are thus added as
dependencies of both targets.
|
|
Gitaly needs to locate auxiliary binaries in various places to pass
them as arguments and call them. Currently, the logic to do so is
spread around the code base where the cfg.BinDir is joined with the
binary name. Soon we'll have to make a distinction between binaries
that are in the BinDir and binaries that are unpacked into a runtime
directory on start up. To support this, this commit centralizes the
binary finding logic to BinaryPath helper added to Gitaly's config.
For now, it simply joins BinDir with the binary name as each of the
call sites were doing. In a later commit, it will also help with
locating the unpacked binaries correctly.
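A minimal sketch of the helper, assuming a pared-down config type; Gitaly's actual `Cfg` carries many more fields:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Cfg is a pared-down stand-in for Gitaly's config type.
type Cfg struct {
	BinDir string
}

// BinaryPath centralizes binary lookup. For now it only joins BinDir with
// the binary's name, exactly as the scattered call sites did; later it can
// route unpacked binaries to the runtime directory instead.
func (cfg Cfg) BinaryPath(name string) string {
	return filepath.Join(cfg.BinDir, name)
}

func main() {
	cfg := Cfg{BinDir: "/usr/local/bin"}
	fmt.Println(cfg.BinaryPath("gitaly-hooks")) // /usr/local/bin/gitaly-hooks
}
```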
|
|
The git2go package dynamically adds the module version to the
BinaryName variable. Only production builds contain the module
version though, leaving test builds with the binary name
'gitaly-git2go-'. This makes it more difficult to centralize the
binary location resolving as we'd have to differentiate this case.
For now, let's just hard code the v15 suffix to the variable. We
can soon drop the suffix entirely when the backwards compatibility
concerns are dealt with by pinning the auxiliary binaries a given
Gitaly can invoke.
|
|
Repository service's server type contains an unused field called
binDir. This commit removes it.
|
|
objectpool: Switch to use OptimizeRepository for pool repos
Closes #4338
See merge request gitlab-org/gitaly!4708
|
|
The equivalent to a housekeeping job for object pools is our
FetchIntoObjectPool RPC: it fetches all changes from the primary pool
member and then optimizes the repository. Its implementation is separate
from `housekeeping.OptimizeRepository()` though which creates a world
where we have two parallel implementations of repository housekeeping,
which goes against our principle of a single source of truth.
More importantly though, pool repositories may not only be optimized via
FetchIntoObjectPool, but also via our nightly repository housekeeping
job. So we already use both implementations to pack a repository. And
sure enough, they had some subtle differences already like delta islands
not being the same. Furthermore, OptimizeRepository is doing some things
like keeping commit-graphs up to date that FetchIntoObjectPool doesn't.
Unify the infrastructure so that we only have one maintenance strategy
to worry about: OptimizeRepository. This has multiple benefits:
- We avoid any future divergence in behaviour.
- We have less code to worry about.
- Nightly maintenance and on-demand maintenance use the same code
paths.
- Commit-graphs are kept up to date again, which is something that
has been broken in object pools for quite a while.
- We avoid doing needless work when there is not much to be done
thanks to the heuristics of OptimizeRepository.
- We avoid running repository maintenance concurrently on the same
pool repository given that OptimizeRepository exits early when it
sees a concurrent call to the same repository.
The new strategy is guarded by a feature flag for the time being.
Changelog: fixed
|
|
One of our test repositories contains the "spooky commit" ba3343b (Weird
commit date, 292278994-08-17). This commit is rather far in the future,
and this is in fact corrupting commit-graphs:
commit date for commit ba3343bc4fa403a8dfbfcab7fc1a8c29ee34bd69 in commit-graph is 15668040695 != 9223372036854775
We cannot really avoid this corruption for now, unfortunately, so we'll
just have to live with it for the time being until a fix to this bug
lands upstream.
This bug unfortunately does break tests which use this repository and
which compute and use commit-graphs. Right now there seemingly aren't
any, but we're about to add some that surface this breakage.
Work around this bug by deleting the reference to this broken commit and
pruning the commit as required.
|
|
When packing objects we make sure to create delta islands. These tell
Git that it may only create deltas against objects that are reachable by
references part of the same island, which can help speed up the creation
of packfiles because it can reuse existing on-disk deltas better.
We use different delta islands for normal and pooled repositories:
- A normal repository has `refs/heads/` and `refs/tags/` as delta
islands given that those are the most frequently used references.
- A pool repository has `refs/remotes/origin/heads/` and
`refs/remotes/origin/tags` as delta islands given that those
contain the most frequently used references of the primary pool
member.
While `FetchIntoObjectPool()` knows to set up the latter delta
islands, `OptimizeRepository()` doesn't distinguish normal and pooled
repositories at all and thus uses the same delta islands as a normal
repository would. So when our nightly maintenance decides to walk over
any pool repository, it may silently screw up the deltas.
Fix this bug by properly distinguishing between pool repositories and
non-pool repositories.
Changelog: fixed
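The distinction above can be sketched as a helper that builds `git -c pack.island=...` options; `deltaIslands` is an illustrative name, and the exact patterns Gitaly passes may differ in detail.

```go
package main

import "fmt"

// deltaIslands returns git config arguments naming the delta islands to use
// when repacking: normal repositories delta against their own heads and
// tags, while pool repositories use the mirrored references of the primary
// pool member under refs/remotes/origin/.
func deltaIslands(isPool bool) []string {
	if isPool {
		return []string{
			"-c", "pack.island=refs/remotes/origin/heads",
			"-c", "pack.island=refs/remotes/origin/tags",
		}
	}
	return []string{
		"-c", "pack.island=refs/heads",
		"-c", "pack.island=refs/tags",
	}
}

func main() {
	fmt.Println(deltaIslands(true)[1])  // pack.island=refs/remotes/origin/heads
	fmt.Println(deltaIslands(false)[1]) // pack.island=refs/heads
}
```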
|
|
We have a test in our objectpool package that ought to verify whether we
set up delta islands correctly when fetching into an object pool. This
test is completely broken though: we don't check delta islands in the
pool, but instead in its primary pool member by accident.
Fix this bug and check delta islands in the pool itself. Furthermore,
stop repacking the pool objects: it defeats the purpose of the test, as
we want to test exactly how objects are packed after fetching.
While at it, write a commit into the source repository before performing
the fetch so that there is actually data that has changed. This isn't
necessary right now, but is required as soon as we migrate maintenance
of pool repositories to use `OptimizeRepository()`.
|
|
We have shared testing infrastructure to exercise whether functions
that repack repositories set up delta islands correctly. This function
is not exactly trivial and a bit hard to understand.
Refactor the function a bit and extend its documentation.
|
|
Object pools have their references in `refs/remotes/origin` as compared
to using their own reference namespace. We require this constant in the
housekeeping package to fix our usage of delta islands when repacking
object pools.
Move the constant into the `git` package and make it public.
|
|
Improve test naming to match modern best practices and add missing calls
to `t.Parallel()`.
|
|
Improve error handling in `FetchFromOrigin()` to provide proper
breadcrumbs so that it becomes easier to see where a specific error has
been raised. Furthermore, given that this is an internal package and not
an RPC's implementation, we shouldn't wrap errors with error codes but
instead leave this to the RPC implementations.
|
|
Object pools and normal repositories currently share the same heuristics
to decide whether to repack objects or not. While this strategy works,
it is not ideal for pool repositories because they may be reused across
many forks. It is thus desirable for us to keep them as efficient as
reasonably possible.
Tweak the heuristics to repack object pools more aggressively. Right now
this shouldn't make much of a difference: object pools are only updated
via `FetchIntoObjectPool()`, which knows to do a full repack when it has
fetched objects. As a result, no object pool should ever have more than
a single packfile in general.
This will make a difference though when we migrate object pools to not
do their own housekeeping anymore, but to use `OptimizeRepository()`
instead: we'll want to keep similar semantics as we have right now and
aggressively repack.
Note that we also need to fix some of our tests which used git-repack(1)
with the `-A` flag to bring repositories into the state we want to test
repository maintenance on. This will leave behind the old, superseded
packfile though and thus we end up with two packfiles even though the
intent of the test was to only test with a single packfile. This didn't
matter before where we required at least five packfiles before we pack
objects, but does now where the limit for object pools is at two packs.
|
|
The lower limit of packfiles required to exist in order to trigger a
repack has been picked as five packfiles. It is clearly documented as such,
and tests try to exercise this. There has been an off-by-one bug though,
which causes us to set the lower limit to six packfiles. And the tests
that intend to verify this behaviour have the same off-by-one bug.
Fix this off-by-one and make the heuristic work as advertised.
Changelog: fixed
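The shape of the off-by-one can be shown with a stripped-down sketch of the heuristic; `needsRepack` is an illustrative name, and the real heuristic considers more than the raw packfile count.

```go
package main

import "fmt"

// needsRepack sketches the documented lower limit: a repack is triggered
// once at least five packfiles exist. The off-by-one variant compared with
// `>` instead of `>=`, silently raising the effective threshold to six.
func needsRepack(packfileCount int) bool {
	const lowerLimit = 5
	return packfileCount >= lowerLimit // buggy version: packfileCount > lowerLimit
}

func main() {
	fmt.Println(needsRepack(4)) // false
	fmt.Println(needsRepack(5)) // true
}
```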
|
|
We already have some differences here because we never prune objects in
pool repositories, and we're about to tweak pool maintenance a bit more.
Extend tests for OptimizeRepository and related functionality to
explicitly verify behaviour with object pools.
|
|
Some of our tests in the housekeeping package need to verify whether we
correctly compute the cutoff point when we want to do a full repack
based on the size of the largest packfile. We thus have to write such a
packfile into our temporary directory, which we're currently doing with
an implementation that uses `io.Reader`. This is inefficient, and
the tests spend a significant amount of time just writing these files.
Simplify the code and speed it up by just using truncate(3p) instead.
This speeds up the test from around 1s to 0.35s on my computer.
|
|
The `IsPoolPath()` function can be used to check whether a given path
belongs to a pool repository or not. It is not quite clear though what
kind of parameter the function expects: is it a relative path, or is it
the full path to the object pool? This is easy to get wrong and may have
fatal consequences.
Change the interface to instead accept a `repository.GitRepo` to clarify
this.
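The interface change can be sketched as follows; the `GitRepo` stand-in, the `repo` type, and the `@pools/` prefix check are illustrative assumptions, not Gitaly's exact definitions.

```go
package main

import (
	"fmt"
	"strings"
)

// GitRepo is a stand-in for repository.GitRepo; only the relative path is
// needed here.
type GitRepo interface {
	GetRelativePath() string
}

type repo struct{ relativePath string }

func (r repo) GetRelativePath() string { return r.relativePath }

// IsPoolRepository takes the repository itself rather than a bare string,
// so callers can no longer confuse relative and absolute paths.
func IsPoolRepository(r GitRepo) bool {
	return strings.HasPrefix(r.GetRelativePath(), "@pools/")
}

func main() {
	fmt.Println(IsPoolRepository(repo{relativePath: "@pools/aa/bb/pool.git"}))   // true
	fmt.Println(IsPoolRepository(repo{relativePath: "@hashed/aa/bb/repo.git"})) // false
}
```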
|
|
hook: Add concurrency tracker to track concurrency of pack object
See merge request gitlab-org/gitaly!4596
|
|
maintenance: Remove per-repository timeouts in nightly maintenance
Closes #4345
See merge request gitlab-org/gitaly!4711
|
|
Prune leftover runtime directories on startup
See merge request gitlab-org/gitaly!4700
|
|
Our nightly maintenance job is running in a time-limited window of 30
minutes by default. Despite this timeout we have in place for the
complete maintenance job we also have a per-repository timeout of 5
minutes that cannot be altered at all. This timeout exists so that as
many repositories as possible have the chance to get optimized.
On balance though, this per-repository timeout does more harm than
good. While it may in fact cause us to optimize more repositories, it
also has the effect that we may consistently fail to optimize any repos
which are bigger than a certain threshold because they take more than 5
minutes to optimize. So this timeout prioritizes small repositories and
penalizes huge repositories.
This is not what we want. In fact, I'd even argue that maintaining
repositories that are this huge is a lot more important than maintaining
the smaller ones given that they're so much more expensive to serve.
Drop the per-repository timeout to stop penalizing large repositories.
Optimizing them will still be cancelled when the overall timeframe for
the nightly maintenance job is over.
Changelog: fixed
|
|
Gitaly can leave around stale runtime directories if it fails to run
its clean up when terminating. These runtime directories can pile up
as there's nothing cleaning them up later. This commit adds a pruning
logic where Gitaly will remove runtime directories that belonged to
processes that no longer exist. The process that created the runtime
directory is identified by the process id suffixed to the directory
name. The pruning is not perfect and can leave some directories
uncleaned if another process now has the same pid as the process that
created the directory.
Changelog: fixed
|
|
Makefile: Update libgit2 to v1.3.2
See merge request gitlab-org/gitaly!4705
|
|
When booting up Praefect and connecting to Postgres, we check its
version so that we can be sure it's in fact running a supported version.
This check uses a timeout of 100 milliseconds. While this probably
tends to work in most setups, this deadline can easily be busted when we
connect to a very busy server. And indeed, we frequently see this fail
on a particularly busy system, which is our CI.
Ideally, we'd just get rid of the timeout altogether to not run into
this scenario at all anymore. On the other hand it might be sensible to
fail on the startup of Praefect when we fail to connect to Postgres in a
reasonable amount of time so that it doesn't hang indefinitely. The real
question is what "reasonable amount of time" means though.
For now, let's pick a much more conservative timeout of five seconds.
This should hopefully be plenty of time for our CI, and still causes us
to fail comparatively fast on production systems in case anything goes
wrong.
Changelog: fixed
|
|
When running the `test-with-praefect` target, we frequently have flakes
where waiting for Praefect to come up eventually times out. We have thus
recently introduced logic to capture stderr of the Praefect process we
spawn in case we fail to connect via 484cdfb1b (testserver: Capture
Praefect logs if dialing fails, 2022-07-07) to hopefully get a better
understanding of what's happening.
And indeed, this change has been fruitful. Recent failures now show the
following fatal error logged by Praefect:
{"level":"fatal","msg":"get postgres server version: context deadline exceeded","pid":526384,"time":"2022-07-14T07:05:34.819Z"}
We fail to connect to Praefect because Praefect itself fails to connect
to Postgres in time. It then dies without us noticing, so we keep on
waiting for Praefect to come up.
This highlights two different problems: the fact that Praefect cannot
connect to the Postgres server, and the fact that we don't notice that
Praefect died and continue to wait for it.
Fix the second bug by spawning a Goroutine that waits for the process to
exit. If it does exit while waiting for it to become healthy we now
cause the test to fail immediately instead of waiting for the context to
time out. This results in the following example error:
=== FAIL: internal/gitaly/service/smarthttp TestServer_PostUploadPackWithSidechannel_suppressDeepenExitError/no_config_options (1.41s)
gitaly.go:133:
Error Trace: /home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/gitaly.go:133
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/praefect.go:126
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/gitaly.go:73
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/gitaly.go:62
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/testhelper_test.go:31
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/testhelper_test.go:55
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/testhelper_test.go:59
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/upload_pack_test.go:243
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/upload_pack_test.go:48
Error: Received unexpected error:
failed to dial "unix:///tmp/gitaly-1383785959/gitaly.socket.30475510" connection: context canceled
Test: TestServer_PostUploadPackWithSidechannel_suppressDeepenExitError/no_config_options
praefect.go:111:
Error Trace: /home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/praefect.go:111
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/panic.go:482
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/testing.go:864
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/gitaly.go:133
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/praefect.go:126
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/gitaly.go:73
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/gitaly.go:62
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/testhelper_test.go:31
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/testhelper_test.go:55
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/testhelper_test.go:59
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/upload_pack_test.go:243
/home/pks/Development/gitlab/gdk/gitaly/internal/gitaly/service/smarthttp/upload_pack_test.go:48
Error: Praefect has died
Test: TestServer_PostUploadPackWithSidechannel_suppressDeepenExitError/no_config_options
Messages: {"level":"info","msg":"Starting praefect","pid":526384,"time":"2022-07-14T07:05:34.817Z","version":"Praefect, version "}
{"level":"info","msg":"establishing database connection to /home/pks/Development/gitlab/gdk/postgresql:5432 ...","pid":526384,"time":"2022-07-14T07:05:34.817Z"}
{"level":"fatal","msg":"get postgres server version: context deadline exceeded","pid":526384,"time":"2022-07-14T07:05:34.819Z"}
--- FAIL: TestServer_PostUploadPackWithSidechannel_suppressDeepenExitError/no_config_options (1.41s)
While this error is quite verbose, it exactly pinpoints what has
happened: first we fail to connect to Praefect ourselves, and second
Praefect has died under us with a specific error message.
|
|
Makefile: Update Git to v2.37.1
See merge request gitlab-org/gitaly!4706
|
|
go: Update github.com/hashicorp/yamux digest to 0bc27b2
See merge request gitlab-org/gitaly!4704
|
|
In order to gain insight into concurrent calls of pack objects, call the
ConcurrencyTracker each time we spawn a pack objects process. Since we
want to know both how many concurrent pack object processes are spawned
per user as well as per repository, we have a call for each.
|
|
Instantiate a ConcurrencyTracker in the main goroutine and pass it in
through service dependencies.
|
|
Add ConcurrencyTracker as a service dependency and set up the
boilerplate code that includes it in the NewServer invocation. A future
commit will actually instantiate it in the main goroutine and pass it
in.
|
|
Add a concurrency tracker whose job it is to keep track of concurrent
calls based on some key type and key. This will be used, for example, to
track the number of concurrent calls to pack objects based on user_id as
well as repository.
The LogConcurrency function returns a cleanup function,
which should get called when the caller finishes its process. When the
count of concurrent processes for a certain key drops to 0, the element
will get removed from the map to save memory.
There are also two metrics that will give us increased insight to the
traffic patterns. The first metric gives us the total active callers
broken up by segment (repository, user_id, or even something else in the
future). This answers the question: how many unique user_ids or
repositories have active pack object processes?
The second metric gives us a histogram of concurrent pack object
processes so we can answer the question "what's the p99 of concurrent
pack object processes for users? repositories?"
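A minimal sketch of the tracker described above; the names are illustrative, and Gitaly's actual type additionally feeds the two Prometheus metrics.

```go
package main

import (
	"fmt"
	"sync"
)

// concurrencyTracker counts in-flight calls per (keyType, key) and drops
// entries that reach zero so the map does not grow without bound.
type concurrencyTracker struct {
	mu     sync.Mutex
	counts map[string]int
}

func newConcurrencyTracker() *concurrencyTracker {
	return &concurrencyTracker{counts: map[string]int{}}
}

// LogConcurrency bumps the counter and returns a cleanup function which
// the caller invokes when its process finishes.
func (t *concurrencyTracker) LogConcurrency(keyType, key string) func() {
	mapKey := keyType + ":" + key
	t.mu.Lock()
	t.counts[mapKey]++
	t.mu.Unlock()

	return func() {
		t.mu.Lock()
		defer t.mu.Unlock()
		t.counts[mapKey]--
		if t.counts[mapKey] == 0 {
			delete(t.counts, mapKey)
		}
	}
}

func main() {
	tracker := newConcurrencyTracker()
	done := tracker.LogConcurrency("user_id", "user-123")
	fmt.Println(len(tracker.counts)) // 1
	done()
	fmt.Println(len(tracker.counts)) // 0
}
```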
|
|
Update libgit2 to v1.3.2. This release contains fixes to both
CVE-2022-24765 and CVE-2022-29187, both of which relate to opening
repositories owned by a user different to the current one that may lead
to privilege escalation.
While libgit2 itself is not affected by these vulnerabilities, the
upgrade brings it in line with what Git is doing. Most notably, libgit2
will now refuse to open repositories which are owned by a different
user. This should theoretically not make much of a difference for Gitaly
given that it is expected that all repositories are typically owned by
the same user as the one we're executing as.
Also note that this upgrade does not plug a known vulnerability in
Gitaly itself, but is rather done as a precaution and to not be put in a
position to argue whether we are or aren't susceptible to these CVEs.
|
|
go: Update module github.com/cloudflare/tableflip to v1.2.3
See merge request gitlab-org/gitaly!4696
|
|
Add onboarding issue template for Gitaly team.
See merge request gitlab-org/gitaly!4701
|
|
helper: Fix error messages for wrapped gRPC errors
See merge request gitlab-org/gitaly!4692
|
|
testserver: Capture Praefect logs if dialing fails
See merge request gitlab-org/gitaly!4691
|
|
golangci-lint: Increase timeout to 10 minutes
See merge request gitlab-org/gitaly!4707
|
|
updateref: Skip known-flaky test with Git v2.33.0
See merge request gitlab-org/gitaly!4699
|