Welcome to mirror list, hosted at ThFree Co, Russian Federation.

gitlab.com/gitlab-org/gitaly.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-04-14tests: Fix random set of missing error checksPatrick Steinhardt
Various functions return error which we do not check. Fix these by calling `require.NoError()` to catch any unexpected issues.
2021-04-13support muxed connection in DisabledElectorSami Hiltunen
GitLab's tests use Praefect with the DisabledElector. Recently an MR was merged to default enable all Gitaly and Praefect feature flags when the tests are being run. The pipeline was failign as the DisabledElector did not support the connection multiplexing and the feature flags were enabled. Fix this by passing the multiplexed connections in to the DisabledElector as well.
2021-04-08Merge branch 'smh-inject-backchannel-deps' into 'master'James Fargher
Inject backchannel dependencies to components See merge request gitlab-org/gitaly!3339
2021-04-07inject backchannel ClientHandshaker from mainSami Hiltunen
This commit injects the multiplexing handshaker from Praefect's main to the dialing locations. This allows us to later plug in a backchannel server easily. This commit has no changes to the functionality itself.
2021-04-06return NotFound when accessing non-existent repositorySami Hiltunen
The PerRepositoryRouter expects database records to exist for each accessed repository. This is due to two reasons: 1. With each repository having their own primary, we don't have a virtual storage's primary anymore to fallback on when we don't know where the replicas exist. 2. With variable replication factor, we really need to know where to send the request as not every storage contains all repositories anymore. This behavior caused problems for Rails, as Rails sends us requests about repositories which do not exist. This could be due to various race conditions or the repository might still be present in Rails' caches. Rails is catching not found errors returned and handles them gracefully. Prior to this commit, the PerRepositoryRouter was returning 'no suitable node to serve request' errors when we didn't have a node we could serve the request from. This case was also hit when the repository had no database records. To differentiate the repository existing but all host nodes being unaccessible from Rails asking for a non-existent repository, this commit returns a proper status code to indicate the repository did not exist. This worked with virtual storage scoped primaries due to the fallback to primary behavior when no replicas where found from the database. The primary would then return NotFound when trying to access a non-existent repository on the disk. This behavior is still maintained and the case when no database records exist is ignored when virtual storage scoped primaries are used.
2021-03-24Removal of the feature flag: distributed_readsPavlo Strokov
After a couple of releases with enabled reads distribution feature we are removing a feature flag completely. The special test-cases deleted as now the feature is a default and there is no way to change it. The change also fixes some comments around reads distribution. Closes: https://gitlab.com/gitlab-org/gitaly/-/issues/3383
2021-03-04remove DirectStorageProviderSami Hiltunen
This commit removes DirectStorageProvider as it is just a proxy to call GetConsistentStorages on the RepositoryStore.
2021-03-03coordinator: Allow force-routing specific accessors to primaryPatrick Steinhardt
Until now, we've been completely agnostic about which accessor RPC we're routing: they simply all use reads distribution. This works just fine for most of the RPCs, but leads to inconsistent results for some others where it's really hard to fix. Two such RPCs are `RepositorySize` and `GetObjectDiretcorySize`: both of them depend on the on-disk state of the repository. This is nothing we have full control of: first, we do not assure that replicas always get packed at the same point in time. And second, even if we did, we cannot guarantee that git always ends up with the same set of packfiles. As a result, the likelihood that replicas would report different sizes is exceedingly high. We could probably solve this by implementing some custom logic which simply routes the RPC to all replicas and then takes the mean of all returned sizes. But this is needlessly complex, and would require us to put a lot of RPC-specific behaviour into Praefect. This commit instead introduces a best-effort strategy of simply always routing both RPCs to the primary. This may still lead to discrepancies when the primary node is flapping. But that shouldn't happen too frequently, and when it does we probably have other problems than repo sizes.
2021-03-01fix data race in TestCoordinator_grpcErrorHandlingSami Hiltunen
TestCoordinator_grpcErrorHandling has a data race due to sharing test server state between different test cases. If the primary node returns an error, the proxy cancels all secondary streams. This cancel is purely client side though, the proxy does not wait for the server handlers to actually terminate. This causes a race where the primary's error is returned while some secondary servers are still executing the handler. The test goroutine then runs assertions and starts setting up the next test case. The WaitGroup and the errByNode are shared between the test cases, so the loop setting up the WaitGroup and errByNode then races with the secondary server handler still executing from the previous test case. This commit fixes the race by setting up a new server for each test case to avoid sharing the WaitGroup and errByNode between the test cases.
2021-03-01testhelper: Move repository helpers into `gittest`Patrick Steinhardt
It's currently not possible to use our git DSL in the testhelper package because of an import cycle between the testhelper and git package. To fix this import cycle, we thus move test-related git helpers into the gittest package. This commit moves the repository helpers. As we're already touching all sites which use these helpers anyway, it also aligns functions to have more consistent naming.
2021-02-24fix racy test TestCoordinator_grpcErrorHandlingSami Hiltunen
TestCoordinator_grpcErrorHandling has a race where in the "primary error gets forwarded" error case. The mock server does acks a call as done before setting a flag indicating the node was called. This causes a race where the primary might return an error before the secondary calls have been recorded as having arrived. This causes the test to then flake as the primary error is returned and the assertions are done before the secondary calls have been recorded. This commit fixes the issue by recording the call as having arrived prior to marking decremeting the counter of the WorkGroup.
2021-02-23praefect: Ignore proxying failures for secondariesPatrick Steinhardt
When proxying streams in a transaction, we do frame-wise proxying to both primary and all secondaries at once. This is done by kicking off a bunch of Goroutines to forward from client to primary/secondaries, and from primary to client. This brings up some interesting questions about error handling though. While it's clear that we should be aborting the proxying operation if the client errors out, as the channel is now essentially broken, and it's also clear that, if proxying to the primary fails, we need to cancel the stream. But right now, we also cancel streams if proxying to the secondaries fail. This last failure mode is quite debatable, and in fact it can cause weird bugs. This is mostly because while we wait for all secondaries to terminate, even if any of them errors out, the same isn't true for the primary: we'll happily cancel the stream when proxying to secondaries has concluded, if at least one of them has returned an error, even if proxying to the primary is still ongoing. As a mitigation, this commit thus starts to ignore any errors caused by proxying to secondaries -- we already store their errors in a local map anyway and consider them when creating replication jobs. It's not an ideal fix though: when enough secondaries fail, then we know that transactions cannot reach quorum anymore and should thus be able to just abort the transaction. If necessary, we can do this in a follow-up fix.
2021-02-23praefect: Consider RPC errors when creating replication jobsPatrick Steinhardt
Historically, we didn't have any access to error codes of proxying destinations if proxying to more than one node at once. This is mostly because the proxy code just conflated all errors into one, which made proper error handling next to impossible. As a result, the transactional finalizer cannot and thus doesn't take into account whether any of the nodes fails, which may lead to spotty coverage when creating replication jobs. Fix this shortcoming by setting up error handlers for all proxying destinations. Like this, we can easily grab any errors they encounter and put it into a local map, which we can then forward to the transaction finalizer. Now there's three cases to consider: - The primary raised an error. In that case, we cannot help it and need to create replication jobs to all nodes. - A secondary raised an error. In that case, we need to consider it as outdated and need to create a replication job for that specific node. - Neither of them failed, in which case the node is considered to be up-to-date. Note that currently we still pass the error through as-is even for the secondaries. We should instead ignore errors on secondaries, but this is left for a subsequent commit.
2021-02-23praefect: Always create replication jobs for transactional RPCsPatrick Steinhardt
The current implementation of determining updated and outdated secondaries for a transactional RPC is suboptimal as there are several conditions where a node is not considered to be outdated. As a result, we wouldn't schedule replication jobs in these scenarios and potentially the state would be inconsistent across nodes. This commit addresses most of these shortcomings: - When no subtransactions were created, replication targets which didn't take part in the transaction weren't considered outdated. - If the primary failed the vote and didn't reach quorum, then no replication jobs were created. We cannot judge though as there might have been on-disk changes which we must replicate. - If we failed to get the transaction state, then we simply errored and didn't finalize the transaction at all. While this does indicate a real error, we must still finalize the transaction and err on the safe side by creating replication jobs. One shortcoming that isn't yet addressed is if any of the nodes fails the RPC. This change is left for a subsequent commit so that we do the best we can with the info we've currently got available.
2021-02-23praefect: Extract function to determine updated/outdated nodesPatrick Steinhardt
The transactional request finalizer is quite hard to follow. This has led to multiple issues with the current implementation where it doesn't correctly handle some nodes and won't create replication jobs when there's an error. As a preparation for refactorings, this commit pulls out the code to determine outdated and up-to-date nodes into its own standalone function and adds a bunch of tests. No functional change is expected from this commit.
2021-02-23praefect: Implement tests for failing RPC error proxyingPatrick Steinhardt
This commit implements a bunch of tests to assert that we correctly proxy GRPC errors from primary while ignoring any errors from secondaries. One of the tests is currently broken and will be fixed in a subsequent commit. The tests use the MockManager, which was missing two functions to get nodes. These have also been mocked such that tests succeed.
2021-02-08support default replication factor in coordinatorSami Hiltunen
This commit adds support for default replication factor in Praefect's coordinator and hooks up the configuration option with the rest of the code. When a default replication factor is configured for a virtual storage and repository specific primaries are enabled, the coordinator stores the nodes participating in the repository's creation as the assigned nodes.
2021-02-08support storing assignments on repository creationSami Hiltunen
When variable replication factor is enabled, the repository's host nodes should be selected when the repository is being created. The selected assignments should then be persisted so they can be used for future routing decisions. This commit adds support for storing assignments in RepositoryStore's CreateRepository. The functionality is behind a boolean switch as assignments are not supported when repository specific primaries are not used. Separate switch from storing primary is used in order to support repository specific primaries without variable replication factor. This allows for independently rolling out the features to production later.
2021-02-08support default replication factor in PerRepositoryRouterSami Hiltunen
This commit adds support for selecting host nodes randomly on repository creation in PerRepositoryRouter. New repositories get random secondaries assigned as host nodes until the default replication factor is met. If the virtual storage has no default replication factor configured, every configured node, except the randomly picked primary, is included as secondaries. This is fallback behavior to match Praefect's behavior prior to variable replication factor.
2021-02-08add Shuffle method to RandomSami Hiltunen
Adds Shuffle method to Random interface for shuffling a slice. While not in use yet, the method is required for selecting random host nodes when a repository is being created. NewLockedRandom implements a mutex protected wrapper for Random. Currently it was implemented with a function wrapper. As we need more than one method now, this commit changes the implementation in to a struct and adds the Shuffle method on it. The interface is also implemented by mockRandom which is used for mocking out a Random during tests. mockRandom also gains the Shuffle function so it implements the updated interface.
2021-02-04praefect: Implement force-routing to primary for node-manager routerPatrick Steinhardt
With reads distribution being enabled, requests would typically be routed to a random up-to-date node. There is situations though where not even up-to-date nodes would be able to serve a request, e.g. when object quarantine is in use. This commit implements force-routing to the primary for the node-manager router. It's implemented a new GRPC metadata header "gitaly-route-repository-accessor-to-primary": if set, a repository-scoped accessors will always get routed to the primary.
2021-02-02handle repository creations separately in coordinatorSami Hiltunen
This commit wires up repository creation routing to Praefect's coordinator. If the RPC creates a repository, the routing is handled different manner from other repository scoped mutators. This fixes repository creation when repository specific primaries are enabled. Prior to this, repository creation did not work as PerRepositoryRouter needs to know where to route the mutator request. As there were no record of the repository, the router didn't know where to route the request. The router now picks a random healthy node to act as the primary and sets the other nodes as replication targets. It then stores the primary in the database in the request finalizers.
2021-02-02add option for storing primary when creating a repositorySami Hiltunen
Repository creation when using repository specific primaries does not work as the creation goes through the usual repository mutator route which fails due to the repository not having a primary. As a preliminary step towards fixing that, this commit adds an option that can be used to store a repository's primary node in the database when creating its record.
2021-01-12testhelper: Use test context in `GetTemporaryGitalySocketFileName()`Patrick Steinhardt
The `testhelper.GetTemporaryGitalySocketFileName()` function currently returns an error. Given that all callers use `require.NoError()` on that error, let's convert it to instead receive a `testing.TB` and not return an error.
2021-01-04Fix lint issues with c1064ad30Ævar Arnfjörð Bjarmason
In c1064ad30 (Merge branch 'datastore_create_repo' into 'master', 2021-01-04) I merged a commit which ran afoul of the new checks in 20ef26439 (golangci: Enable error and documentation linters, 2020-12-17). I think the CI was passing when I looked at the MR and it was based off an earlier master, but I'm not 100% sure. In any case, this fixes it. See also my 149d56118 (golangci: fix linting issues on "master", 2020-12-24) for previous fights with the linter.
2021-01-04Merge branch 'datastore_create_repo' into 'master'Ævar Arnfjörð Bjarmason
praefect: intercept CreateRepository* RPCs to populate database See merge request gitlab-org/gitaly!2873
2020-12-17Retrieve all consistent storagesPavlo Strokov
On each read/write operation praefect requires to know which gitaly node is a primary. For mutator operations it used to define from what node the response will be returned back to the client. For the read operations it is used to redirect request to or as a fallback option for reads distribution in case it is enabled. The default strategy for defining the primary is 'sql' which means the primary is tracked inside of the Postgres database and praefect issues select statement into it each time it needs to define the current primary. It creates a high load on the database when there are too many read operations (the outcome of the performance testing). To resolve this problem we change the logic of retrieval of the set of up to date storages to return all storages including the primary. With it in place we don't need to know the current primary and use any storage that has latest generation of the repository to serve the requests. As this information is cached by the in-memory cache praefect won't create a high load on the database anymore. This change also makes check IsLatestGeneration for the primary node redundant as it won't be present in the set of consistent storages if its generation not the latest one. Fix linting issues Closes: https://gitlab.com/gitlab-org/gitaly/-/issues/3337
2020-12-17Add coordinator handler for create repositoryJames Fargher
2020-12-14Removal of unused RepositoryStore dependencyPavlo Strokov
As the strategy of interacting with the repository changes there is no more need to provide RepositoryStore into Mgr as a dependency as it has no usage. This commit removes RepositoryStore from the list of the input parameters of the constructor function and from the list of fields of the Mgr struct.
2020-12-10Revert "On each read/write operation praefect requires to know which"Pavlo Strokov
This reverts commit 09c6d25de370446ac855a8241d8f821ed3f1ceec
2020-12-10On each read/write operation praefect requires to know whichPavlo Strokov
gitaly node is a primary. For mutator operations it used to define from what node the response will be returned back to the client. For the read operations it is used to redirect request to or as a fallback option for reads distribution in case it is enabled. The default strategy for defining the primary is an 'sql' which means the primary is tracked inside Postgres database and praefect issues select statement into it each time it needs to define current primary. It creates a high load on the database when there are too many read operations (the outcome of the performance testing). To resolve this problem we change logic of retrieving set of up to date storages to return all storages including primary. So now we don't need to know the current primary and use any storage that has latest generation of the repository to serve the requests. As this information is cached by the in-memory cache praefect won't create a high load on the database anymore. This change also makes check IsLatestGeneration for the primary node redundant as it won't be present in the set of consistent storages if its generation not the latest one. Closes: https://gitlab.com/gitlab-org/gitaly/-/issues/3337
2020-12-07Revert "featureflag: Remove reference transaction feature flag"Patrick Steinhardt
This reverts commit ca8e2de58193775a62599b3a4de682c511046033 and restores the feature flag for enabling/disabling reference transactions. The main reason to revert the change is that there are currently still some problems with TLS setups and injected Praefect server information, where Gitaly may not be able to verify authenticity of Praefect servers and thus fail to establish a connection. This also removes the environment variable to disable transactions, which was only added as a short-term fix to make GitLab setups work in such environments again.
2020-12-04coordinator: Switch tests to use non-transactional RPCsPatrick Steinhardt
The coordinator tests have two tests which verify routing logic for RPCs which don't use transactions. This is currently implemented via `FetchIntoObjectPool`, but we're in fact about to change it to conditionally use transactions. As this would break tests, let's switch to a different RPC whihc we know won't use transactions.
2020-12-02remove usage of MemoryRepositoryStore in testsSami Hiltunen
To prepare for removing the MemoryRepositoryStore, this commit removes its uses in tests. Mostly the tests need something that works, which is when DisableRepositoryStore is used. When a test is testing a particular scenario with the RepositoryStore, a mock is provided instead. Ideally we'd use the Postgres implementation in these cases but hooking it in requires some additional work as the test setup overwrites home directory which breaks the discovery of GDK's Postgres.
2020-12-02use postgres repository store in TestStreamDirectorMutator_TransactionSami Hiltunen
This replaces the in-memory store in Praefect's coordinator's TestStreamDirectorMutator_Transaction tests with Postgres implementation of the store. The test case exercises the whole mutator flow and tests assumptions related to it. Due to this, mocking gets more involved. As the test case covers a lot of ground, it makes sense to use the real implementation here and perform integration testing with it.
2020-11-27transactions: Allow disabling with an env varZeger-Jan van de Weg
Logic-wise this comments is intended to achieve the same as reverting: ca8e2de58193775a62599b3a4de682c511046033: (featureflag: Remove reference transaction feature flag), just without conflicts and a more concentrated diff. The use case is that a customer turned the feature flag for reference transactions off, which wasn't honoured anymore. This change allows for disabling transactions again. The root problem lies in the fact that Gitaly needs to connect with one Praefect that coordinates the current transaction. Given this is an IP based connection, not to a domain name, this violates some security policies.
2020-11-18nodes: Add context parameter to `GetShard()`Patrick Steinhardt
The `nodes.Manager` interface's method `GetShard()` doesn't currently receive a context as parameter, even though some of its implementations perform non-trivial and potentially blocking work like querying the database. This commit thus adds this parameter and adjusts all current callsites.
2020-11-17Introduction of storages provider to get up to date storagesPavlo Strokov
It is a next step in including cached storages provider in order to support reads distribution across gitalies. On each invocation it queries the passed in dependency and combine the result with existing primary. The resulted list is used by the manager to decide where request should be routed for processing. In a follow up MR it will be extended with expiration cache to reduce load on database as accessing it on each read operation is not efficient. Part of: https://gitlab.com/gitlab-org/gitaly/-/issues/3053
2020-10-21transactions: Remove primary-wins feature flagPatrick Steinhardt
In order to ease the migration to reference transactions and reduce the risk, initially all transactions used a primary-wins strategy: as long as the primary node performed a vote, the transaction was guaranteed to succeed. This was thus only a transitional feature flag, and given that we've been running without for many weeks now without any issues we can safely remove it now.
2020-10-16coordinator: Stop raising errors on stopped transactionsPatrick Steinhardt
When a transaction was gracefully stopped, it indicates that any node taking part has hit an expected error case, e.g. the primary has executed hooks which declined a push. The coordinator can't and shouldn't do anything about it as this is an expected error inside of the business logic which the Gitaly nodes should know to handle correctly. Change the logic to reflect this by silently returning in case the primary's transaction has been stopped.
2020-09-21Allow null generations in repositories tableSami Hiltunen
'repositories' table's records are used to keep record of repositories that should exist and as repository specific locks for accessing 'storage_repositories' table. The repository specific locking was implemented by incrementing the generation column before modifying the entries in the 'storage_repositories' table. Furthermore, the generation column is used as an easy way to access the latest generation number. These uses prevent adding other columns to the table, as inserting a record forces one to also set a generation number from unrelated code. To make it viable to add other repository specific columns to the table, this commit stops using the 'repositories.generation' as an easy way to get the current expected generation and instead gets it by getting the maximum value from the 'storage_repositories' table. To make it possible to add other columns without affecting the generation, NOT NULL constraint is dropped from the column and queries have been updated to handle the case. The column is still kept around and updated in the writes to keep backwards compatibility with earlier versions. The column can be dropped later once every repository is guaranteed to have a record in the table by replacing the increment operation with a select for update. Behavior should be semantically the same. Only difference is in GetConsistentSecondaries which was only checking the secondaries match the primary's generation rather than that they are on the latest generation. This was changed as we can't conistency of the repository unless it is on the very latest generation.
2020-09-17extract routing logic in to a separate componentSami Hiltunen
Routing logic is currently not pluggable as it is part of the coordinator and depends on the NodeManager. In preparation for variable replication factor and per repository primary, this commit extracts the routing logic in to a separate component which allows for plugging in alternative implementation. There should be no behavior difference, except for eagerly loading consistent secondaries on repository scoped mutators regardless of whether transactions are enabled or not. This should be fine though as transactions are enabled by default.
2020-09-11remove server scoped handling from coordinatorSami Hiltunen
This commit removes server scoped request handling from the coordinator as there are no server scoped RPCs Praefect should direct to Gitaly nodes.
2020-08-31Use error tracker to determine if node is healthyJohn Cai
Hooks up the error tracker in the node manager so it checks if a certain backend node has reached a threshold of errors. If it has, then it will be deemed unhealthy.
2020-08-25coordinator: Implement majority-wins transaction voting strategyPatrick Steinhardt
We currently have two ways we set up transactions in the coordinator: - With the primary-wins feature flag, the primary will always succeed no matter what the secondaries vote. - Without the flag, all nodes need to agree. The first strategy really is only to ensure smoother transitioning towards using reference transactions and is going away at some point in time. The second voting strategy is the one that's going to stay, but right now it's really pessimistic because it will always require all nodes to agree on the same thing. Let's refine this strategy and move it to the next level, which is a majority-wins approach: instead of requiring all nodes to agree, we will now require a majority of nodes to agree. In this case, we define majority to: - Always include the primary node. This is a hard requirement, as the primary node is the one who's executing hooks. If it failed and the transaction succeeded, we'd have changes persisted to disk but none of the hooks were run, which may lead to inconsistent state. - Require at least half of the secondaries to agree, rounded up. As a special case, if there's only a single secondary, then it's not required to agree with the primary. This would result in the following typical scenarios: - 1 primary, 0 secondaries: Primary always wins. - 1 primary, 1 secondary: Primary always wins. - 1 primary, 2 secondaries: Primary and a single secondary required. - 1 primary, 3 secondaries: Primary and two secondaries required. - 1 primary, 4 secondaries: Primary and two secondaries required. This commit implements the above by changing the default transaction setup in case the primary-wins feature flag is disabled and adds/adjusts some of our tests to match this new strategy.
2020-08-17transactions: Setup metrics in transaction managerPatrick Steinhardt
Right now, setup of metrics used in the transaction manager is split across multiple locations. This makes the process of adding new metrics more involved than it needs to be and is a source of bugs in case any of those locations is not updated. Improve the situation by moving setup of metrics into the transaction manager. Metrics are exposed by implementing the Collector interface and registering the transaction manager itself as a metric.
2020-08-12Praefect: replication processing should omit unhealthy nodesPavlo Strokov
Replication should consider the state of the target nodes and omit dequeueing and processing of replication events for unhealthy nodes. Processing of replication events for unhealthy nodes could lead to outdated repositories as all attempts to process the event would be failed and event will be marked 'dead'. And the repository will remain outdated until the next replication event happen to it. By checking the health state before processing the event we have more chances to apply them without issues, so reduce potential data loss. Closes: https://gitlab.com/gitlab-org/gitaly/-/issues/2922
2020-08-10Local elector that is used in case failover is disabledPavlo Strokov
Local elector doesn't handle retrieval of the primary node properly. As it is not started by the Mgr and there is no health checks executed for each node. In suck case each node considered unhealthy as it has 0 successful health checks. Disabled elector is used in case failover is disabled. It returns a first passed in node as a primary and all the others as secondaries. It also returns `ErrPrimaryNotHealthy` as other electors in case primary is not healthy and can't serve the requests. The health check request starts despite of failover for any type of elector. Each node shows actual health status with `IsHealthy` method. Closes: https://gitlab.com/gitlab-org/gitaly/-/issues/3011
2020-08-06Feature flags enabling for tests.Pavlo Strokov
It is decided to go with all features enabled by default behaviour for tests. This should help us identify potential problems with feature flag combinations enabled in production. To verify implementation without feature flag enabled it should be disabled explicitly in the test. Part of: https://gitlab.com/gitlab-org/gitaly/-/issues/2702
2020-07-31Improve query to identify up to date storages for reads distributionPavlo Strokov
Retrieve up to date storages that can server read operation for the repository in order to distribute reads across all healthy storages of the virtual storage. Closes: https://gitlab.com/gitlab-org/gitaly/-/issues/2944