Age | Commit message (Collapse) | Author |
|
The `errRepositoryNotFound` error is only used by tests now. Convert
them to use `datastore.ErrRepositoryNotFound` and get rid of it.
|
|
The datastore is reusing errors from the storage layer even though these
are conceptually orthogonal to each other. Furthermore, given that we
sometimes don't have the storage name and relative path available but
just the repository ID, we need to use different error types depending
on the callsite.
Refactor the package to instead use its own ErrRepositoryNotFound error.
This also allows us to get rid of the `RepositoryNotFoundError` type,
where we instead inject error code and metadata in a central place in
the coordinator given that it has more context.
|
|
It is not easily possible to explicitly handle errors for when resolving
additional repositories fails. Introduce an explicit error so that we
can easily handle them specially and introduce error handling in the
coordinator that will inject error metadata for these conditions.
|
|
Parameters ctx and relativePath are not used in the method.
We are safe to drop them to simplify the code.
|
|
Same as the preceding commit, this commit moves the knowledge around
pool paths from the stats package into the storage package. This ensures
that all knowledge around repository paths is neatly contained in a
single file.
|
|
The praefectutil package contains a set of functions that compute and
classify repository paths for Praefect's rewritten replica paths. They
had to be split out into a separate package given that they're also used
by Gitaly code in order to figure out whether a repository is an object
pool or not. Furthermore, the code to figure out whether a repository is
an object pool or not is also partially contained in the stats package,
which adds even more to the confusion as it is now split up across
multiple packages.
Move the computations for Praefect paths into our storage package. This
makes for a better fit as this package should be the only source of
truth for where repositories are located anyway, and is already used by
both Praefect and Gitaly.
The next commit will also move the code from the stats package in here
to make the knowledge neatly contained in a single place.
|
|
When rewriting the target repository, the rewritten relative path will
be specific to the node's storage to which the rewritten request is
being routed to. Consequentially, if we're also rewriting an additional
repository, it follows that both repositories must be hosted by the same
storage, or otherwise one of both repositories wouldn't actually be
resolveable on the local storage.
This property wasn't followed when routing repository creations that
have an additional repository. This can lead to a case where we resolve
the additional repository first, but ultimately assign the newly created
repository to nodes that don't actually have the additional repository.
The result is breakage.
We have fixed this bug in 41c10e01f (praefect/router: Fix creation of
repos with additional referenced repo, 2023-05-10), where we started to
route repository creating RPCs to the same nodes as the additional
repository is assigned to. This change was rolled out more than a week
ago into production systems without any observed issues.
Remove the feature flag guarding the code so that we unconditionally
route such repository creating requests to the correct node.
|
|
The feature flag logic is currently hosted inside of
`internal/metadata/featureflag`. And while they do have something to do
with metadata, they really are a standalone concept.
Move the package to `internal/featureflag` to reflect this.
|
|
In order for Praefect to figure out where a specific request needs to be
routed to we have introduced the `target_repository` gRPC extension. If
a field is tagged with this extension, then it indicates to Praefect
that it shall contain the repository to which the request needs to be
routed. Furthermore, it also acts as a mechanism to support our repo
rewriting mechanism where Praefect intercepts and rewrites the target
repository so that the target storages may have a unique replica path.
The `additional_repository` extension is in the same spirit. But as
requests can only be routed to a single node at once, its only intent is
to indicate to Praefect that the contained repository path shall be
rewritten to the replica path. Due to a set of limitations in Praefect,
the target repository and additional repository _must_ always be on the
same storage as the translated replica paths are indeed storage-local.
So if there were different target storages, then the target node would
likely not even be able to handle them.
But while this restriction exists, we don't actually honor it ourselves
when creating repositories via a request that has an additional repo
set. Theoretically-speaking, it is thus possible that the new repository
will be created on a set of storages that is partially or completely
disjunct from the set of storages that host the preexisting additional
repository.
There is only one such request now that is impacted by this, which is
the CreateObjectPool RPC. So if the above bug is hit, then creation of
the object pool will fail because Gitaly cannot locate the origin repo
on its storage.
While theoretically a large issue, in practice it is likely not going to
matter all that much because most setups use a replication factor that
is in fact equal to the number of Gitaly nodes. Which means that we'd
always create the object pool on the same set of nodes that already have
the origin repository anyway.
But there's one more subtle issues here: we also didn't pay attention to
the repository generation. So we'd be totally fine to create an object
pool from a source repository that is known to be out-of-date on any of
the secondaries. The end result could be inconsistently created oject
pools.
Fix these bugs by implementing special routing for all repository
creating RPCs that have an additional repository field. Like this, we
will always use the same set of storages as the additional repository is
assigned to. Furthermore, this also allows us to take generation numbers
of the additional repository into account.
Changelog: fixed
|
|
When routnig repository creations we have somewhat different logic based
on the replication factor:
- If it is set to `1` then we only need to pick a primary node.
- If it is set to `0` then we have to assign to all nodes.
- If it is set to a value `n >= 2` then we need to pick `n - 1`
secondaries.
The first case is handled via an explicit early return while the latter
two cases are mostly using the same codepath. The early return makes it
a bit harder to follow the logic though and also doesn't make it obvious
what input exactly is required for each of the different cases.
Furthermore, we're about to add another case that needs special
consideration. And while that case could also be handled via an early
return, it would only decrease readability even further.
Refactor the code to use an explicit switch over the different cases so
that it's easy to see which code paths belong together.
|
|
The RouteRepositoryCreation function handles many concerns at once:
- It resolves the additional repository path.
- It assigns the nodes on which the repository shall be created.
- It derives the replica path that shall be used on the target
nodes.
- It reserves the repository ID of the new repository.
While most of the functionality is straight-forward, the logic to assign
a repository to its nodes is comparatively involved. This makes it hard
to figure out which parts belong together and ultimately makes the
function harder to extend.
Refactor the code by pulling out a function `assignRepositoryToNodes()`
that encapsulates the assignment logic. This makes the functions easier
to understand and extend.
|
|
We're about to release Gitaly v16.0. As we've landed a bunch of
previously announced removals it's thus time to bump our Go module
version from v15 to v16.
|
|
The repository store interface uses plain maps as sets. Refactor them to
instead use our new set type. This reduces the mental overhead it takes
to first understand what the map is actually representing.
|
|
The `housekeeping` package currently hosts the logic to decide whether a
repository is a pool repository or not. This is not a particularly good
fit, but was done in the past to avoid an import cycle. There is another
import cycle coming up though between `git/stats` and `housekeeping`, as
we want to migrate some the information into the `RepositoryInfo`
struct.
Move the functionality into `git/stats` to avoid this import cycle. It's
a much better fit anyway given that this package is responsible for
reporting various different bits about repositories already. So adding
information about whether a repository is an object pool or not feels
natural.
|
|
Praefect-generated replica paths have been enabled by default since
15.3 without issues, so this commit removes the feature flag and
related code.
Changelog: removed
|
|
With 7e87131ed (golangci-lint: Allow `testing.T` as first parameter,
2022-08-10) we set `context-as-argument` the sole argument to
`revive:rules`. This causes golangci-lint to disable all revive lints
that were previously enabled by default.
Re-enable the default revive lints by explicitly listing them in our
config. This triggers a large number of lint errors, resolve these as
well in this commit.
|
|
We currently exclude a revive rule that `context.Context` should be the
first parameter for our test sources. This can be handled better though
because golangci-lint allows us to exclude certain types from this rule.
Adapt the rule to allow `testing.T` et al before `context.Context` and
remove the excluded rule. Interestingly, this now surfaces a whole bunch
of `nolint: revive` annotations that aren't needed anymore, so we fix
them in the same commit.
|
|
The `IsPoolPath()` function can be used to check whether a given path
belongs to a pool repository or not. It is not quite clear though what
kind of parameter the function expects: is it a relative path, or is it
the full path to the object pool? This is easy to get wrong and may have
fatal consequences.
Change the interface to instead accept a `repository.GitRepo` to clarify
this.
|
|
This commit changes the major version in the package name from v14 to
v15
Updating go.mod & go.sum with new module name v15
Update Makefile to bump major version to v15
Update the gitaly package name in the Makefile. Also update
gitaly-git2go-v14 -> gitaly-git2go-v15. We need to keep
gitaly-git2go-v14 for a release however, for zero downtime upgrades.
This pulls directly from a sha that is v14.
Update package name from v14->v15 for auth, client, cmd, internal packages
This commit changes the package name from v14 to v15 in go and proto
files in the internal, auth, client, cmd packages.
proto: Update major package number in package name
tools: Change major version number in package name from v14 to v15
gitaly-git2go: Change the package name from v14 to v15
update module updater for v15
Update the documentation for the module updater to reflect v15
|
|
Renaming, creating and deleting repositories is racy in Praefect.
They can also partially fail in awkward ways due to Praefect first
applying the operations on the disks of the Gitalys and later updating
its metadata store. In order to make these operations atomic, there's
been ongoing work in the background to make it possible to perform these
in the database in a manner that races are not possible and partial
failures do not end up conflicting with future operations.
Renames can be fixed by doing the rename atomically in the database
without moving any repositories on the disks.
Deletes can be modeled as simply deleting the repository's database record.
Creates are atomic by creating the repository's database record as the
last step in the process.
The last piece of the puzzle is to ensure repositories always land in different
directories on the disk. This ensures that a partial failure doesn't block
a future operation from succeeding. This commit implements that piece by deriving
a unique path for each repository to store their replicas. Existing repositories
stay where they are but newly created repositories will land in unique directories.
Create might succeed on a disk but fail to be persisted in the database. Prior to
this change, attempting to recreate a repository might fail due to the stale state
on the disk. With this in place, the next attempt at creating the repository would
still succeed as the new attempt would land the repository in a different directory
on the disk.
Deletes have the same problem prior to this commit. The repository's metadata may
be successfully deleted but if we fail to delete the repository from the disk, future
creations and renames may fail if they conflict with the stale state. As creates now
always land in a different directory, the stale state no longer causes problems.
Renames will work purely in the database, so any stale state on the disk will not affect
them.
Changelog: fixed
|
|
We have introduced a new "maintenance" operation type for RPCs which
allows us to easily special-case repositories which fall into this
class. One those locations where we do wish to special-case them is in
the coordinator. All RPCs which we're about to convert to be maintenance
operations are currently classified as mutators. As a consequence, we:
- Need to explicitly special-case those RPCs to not have
transactions enabled because they shouldn't change any on-disk
state as visible to the user. Furthermore, the state we're writing
to disk is often not deterministic.
- Because transactions are disabled, maintenance is never performed
on multiple nodes at once. As a consequence, we always create
replication jobs for maintenance jobs, which requires quite some
architecture and persistence of parameters in the database.
- We skip nodes which are not currently consistent with the state
other nodes have. This is not a requirement though: because the
maintenance RPCs shouldn't have an impact on user-visible state
anyway we don't really care about consistency, only about
routability.
Introduce special routing for those new maintenance operations. We're
using a best-effort strategy: every node that is hosting the repository
at hand and which is online will be included. Furthermore, we never
create replication jobs. It should be acceptable to just wait for the
next time a maintenance RPC is scheduled because they shouldn't have an
impact on servicability.
With these changes in place we can eventually get rid of replication for
maintenance-style RPCs, which is a good step towards deprecating the
complete replication queue.
Changelog: changed
|
|
Upgrade gofumpt to v0.3.1. Besides a very small number of new rules, the
most important benefit is that the new version parallelizes formatting
of files. It is thus much faster, also for our codebase:
$ hyperfine --warmup=1 'make format' 'make format GOFUMPT_VERSION=0.3.0'
Benchmark #1: make format
Time (mean ± σ): 1.238 s ± 0.060 s [User: 1.337 s, System: 0.084 s]
Range (min … max): 1.125 s … 1.312 s 10 runs
Benchmark #2: make format GOFUMPT_VERSION=0.3.0
Time (mean ± σ): 239.5 ms ± 26.0 ms [User: 1.804 s, System: 0.109 s]
Range (min … max): 180.6 ms … 284.7 ms 14 runs
Summary
'make format GOFUMPT_VERSION=0.3.0' ran
5.17 ± 0.61 times faster than 'make format'
Reformat our code with the new version.
|
|
The object pool service RPCs carry an additional repository along
with the target repository. This repository lies on the same storage
as the target repository. The additional repository's relative path
needs to also be rewritten by Praefect with the replica path as
it would have been when the additional repository was originally
created.
Praefect currently does so for other mutator RPCs than 'create' type
as there are no 'create' type RPCs yet that contain an additional
repository. CreateObjectPool is incorrectly not marked as 'create'
type call though. In preparation for fixing that, this commit already
introduces relative path rewriting for additional repositories in
'create' type RPCs.
|
|
Most of the linting exception we have are about missing documentation of
public variables or comments. Let's inline all of these -- it's hard to
keep track of exceptions and keep the list up-to-date. By having a
`//nolint` directive at the site where the document is missing it
becomes a breeze to directly clean it up. Furthermore, it also acts as a
reminder that one might want to add a comment when one stumbles over any
of these comments.
|
|
ObjectPoolService operates on multiple repositories in its mutators.
One of these repositories is the repository itself and the second one
is always the object pool. Depending on the operation, either one is
the target repo and the other one is called the additional repository.
As ObjectPool's are not handled in a special manner by Praefect, they
will also get unique relative paths generated for them by Praefect. The
proxied requests need to be then rewritten to use this generated path
instead of the client provided relative path. Prafect is already rewriting
the relative paths of the target repository but not the relative path of
the additional repository. This commit extends the Router to returns the
replica path of the additional repository if one is included in the request.
|
|
This commit returns the replica path with repository accessor routing
decisions. This allows the coordinator to later rewrite the relative path
in the requests to target the correct repository on the disk.
|
|
This commit returns the replica path when routing repository mutators
and creations. This can then later be used to route the requests to the
correct repository on the disks of the storages.
|
|
Praefect will start generating unique relative paths for repositories in
14.6 to prevent stale state conflicting with recreated repositories. To
facilitate this, the 'repositories' table has the 'replica_path' column
to store where the replicas of a repository are stored. While in 14.5
Praefect still doesn't generate the paths, it needs to route the requests
to the paths stored in the database. This allows 14.6 then to start generating
unique paths and be backwards compatible wiht 14.5 as it is already using
the paths from the database.
This commit returns the replica path from GetConsistentStorages. The method
was chosen mostly as its result is cached for accessors, so integrating the
replica path into the cache is effortless. This commit only includes the changes
required to fetch the replica path and cache it. A later commit will update the
routing logic to actually route the requests to the stored path.
|
|
Praefect is now getting the primary of a repository by the repository
ID. Given that, the virtual storage paraemeter was no longer needed and
was removed. There are some tests for RepositoryReplicas in Rails' repo
which use the legacy election strategies. These legacy election strategies
still need the virtual storage parameter to determine the primary node.
In order to keep compatibility with the tests, this commit returns the
parameter to GetPrimary. While this works for now, we should really configure
Praefect properly in the tests to avoid being stuck with legacy implementations.
|
|
Praefect's routing should use repository ID consistently in all of the
queries in order to avoid races where the external key of the repository
changes between the various queries. This commit updates GetHostAssignments
to get assignments by the repository's ID rather than its virtual storage
and relative path.
|
|
This reverts commit 311fbb9789c5310314c8a2c6916bfb331aa4661b.
|
|
This commit updates the PerRepositoryRouter's mutator routing to get
consistent storages by the repository ID to ensure the query refers
to the same repository as the other queries in the method. This avoids
races in cases such as the repository being renamed in between the queries.
|
|
Praefect's routing should use repository ID consistently in all of the
queries in order to avoid races where the external key of the repository
changes between the various queries. This commit updates GetHostAssignments
to get assignments by the repository's ID rather than its virtual storage
and relative path.
|
|
StaticStorageAssignments is not used anywhere in Praefect so this
commit removes it.
|
|
This commit updates GetPrimary to get a repository's primary by its
repository ID. This is the first step to make Praefect's request routing
race free by ensuring all of the various queries operate on the same
repository based on its ID rather than the virtual storage and relative
path which could be changed by concurrent requests in the mean while.
|
|
This commit returns the repository's ID from RouteRepositoryMutator.
The repository ID is needed in the request finalizers later to propagate
the repository id into the replication jobs.
|
|
When Praefect is routing a repository creation, it needs to reserve
a repository ID in order to generate later a relative path for the
repository. This commit changes the PerRepositoryRouter to reserve a
repository ID and return in the route. The ID reservation also acts
as a fail fast check to see whether there already exists a repository
with the given virtual storage and relative path.
|
|
CreateRepository will need to know the relative path of the repository
the client is trying to create in order to check whether a repository with
it already exists or not. This commit pipes the relative path to CreateRepository
to prepare for follow-up changes even though the relative path is not yet used.
|
|
The new "v14" version of the Gitaly module is named to match
the next GitLab release. The module versioning is needed in
order to pull gitaly as a dependency in other projects. The
change updates all imports to include v14 version. The go.mod
file was modified as well after go mod tidy execution. And
the changes in dependency licenses are reflected in the NOTICE
file.
Part of: https://gitlab.com/gitlab-org/gitaly/-/issues/3177
|
|
Neither of the Router implementations currently handle repository
creations transactionally. This is a leftover from past. Repository
creations were originally handled as every other mutator. As the
repository was missing DB records prior to being created, the secondaries
were always considered outdated. When PerRepositoryRouter was implemented,
it's implementation matched the existing behavior to minimize behavior
differences.
When Praefect receives a lot repository creaton calls, the replication
queue comes under pressure as every repository creation needs to be
replicated. This has caused problems when for example seeding repositories
or performing Geo sync operations.
This commit fixes the above problem by handling the repository creations
transactionally. Healthy secondaries are included in the transaction and
unhealthy secondaries are still kept as replication targets. This should
reduce the replication queue pressure with a lot of repository creations.
Changelog: changed
|
|
After a couple of releases with enabled reads distribution
feature we are removing a feature flag completely.
The special test-cases deleted as now the feature is a default
and there is no way to change it.
The change also fixes some comments around reads distribution.
Closes: https://gitlab.com/gitlab-org/gitaly/-/issues/3383
|
|
PerRepositoryRouter does not currently take the reads distribution
feature flag into account. As we are getting ready to use the router
in production, this commit adds the feature flag check in the router.
|
|
This commit passes the ConsistentStoragesGetter configured in main
to PerRepositoryRouter instead of the RepositoryStore. This allows
PerRepositoryRouter to also benefit from the caching used in
NodeManagerRouter.
|
|
Now that routers get a flag telling them whether they should force-route
to the primary or not, we can deduplicate code a bit and move inspection
of the RPC's metadata into the coordinator.
|
|
Implementations of the Router interface are all forced to check
themselves whether they must force-route accessors to the primary node
because the RPC's metadata tells them to. While annoying, this hasn't
been much of a problem until now given there's only two implementations.
But we need to extend the conditions to also take into account what kind
of RPC we are currently routing. Currently, the Routers are designed
RPC-agnostic though except for an RPC's type, and this should stay this
way. Instead, it's the coordinator which should inspect the RPC and tell
the router to force-route to primaries.
As a preparation for this change, this commit adds a new `forcePrimary`
flag to the `Router` interface. If set, it tells the router that it must
always route to the primary node. No change in behaviour is expected
from this change as we always pass `false` for now.
|
|
This commit adds support for selecting host nodes randomly on
repository creation in PerRepositoryRouter. New repositories get
random secondaries assigned as host nodes until the default
replication factor is met. If the virtual storage has no default
replication factor configured, every configured node, except the
randomly picked primary, is included as secondaries. This is
fallback behavior to match Praefect's behavior prior to variable
replication factor.
|
|
With reads distribution being enabled, requests would typically be
routed to a random up-to-date node. There is situations though where not
even up-to-date nodes would be able to serve a request, e.g. when object
quarantine is in use.
This commit implements force-routing to the primary for the
per-repository router. It's implemented a new GRPC metadata header
"gitaly-route-repository-accessor-to-primary": if set, a
repository-scoped accessors will always get routed to the primary.
|
|
NodeManagerRouter routes repository creations as normal mutator calls.
This works as it defaults to routing to the virtual storage's primary
when no database records of a repository exist.
This does not work for the PerRepositoryRouter as there is no concept
of virtual storage's primary. Due to this, repository creations need
to be routed in a special manner.
This commit adds support for routing repository creations with
PerRepositoryRouter. It picks a random healthy node to act as the
repository's primary node and sets other nodes as replication targets.
This behavior matches the existing behavior in NodeManagerRouter, except
for picking a random node to act as the primary instead of the virtual
storage's primary.
The functionality is not yet wired to the rest of the code.
|
|
On each read/write operation praefect requires to know which
gitaly node is a primary. For mutator operations it used to
define from what node the response will be returned back to
the client. For the read operations it is used to redirect request
to or as a fallback option for reads distribution in case it
is enabled. The default strategy for defining the primary is
'sql' which means the primary is tracked inside of the Postgres
database and praefect issues select statement into it each time
it needs to define the current primary. It creates a high load
on the database when there are too many read operations (the
outcome of the performance testing).
To resolve this problem we change the logic of retrieval of
the set of up to date storages to return all storages including
the primary. With it in place we don't need to know the current
primary and use any storage that has latest generation of the
repository to serve the requests. As this information is cached
by the in-memory cache praefect won't create a high load on the
database anymore.
This change also makes check IsLatestGeneration for the primary
node redundant as it won't be present in the set of consistent
storages if its generation not the latest one.
Fix linting issues
Closes: https://gitlab.com/gitlab-org/gitaly/-/issues/3337
|
|
The golangci-lint project has a set of default excludes for linters
which are applied by default. This means that we cannot choose ourselves
which linting messages we want to include, as some of them cannot be
controlled by us.
One noteworthy linting check which is excluded right now checks whether
comments for types and functions are prefixed with their respective
name. This is something which we try to enforce in Gitaly, too, but
naturally there's cases where it slips through review.
So let's disable the default excludes such that we can enable this
linting check. As it turns out, there's quite a lot of cases where we
failed to adhere to this style, which this commit also fixes. One thing
which we don't do though is to have per-package comments, so let's
actively ignore it for now.
|