gitlab.com/gitlab-org/gitaly.git
Age | Commit message | Author
2021-07-21 | Update VERSION files [v14.1.0] | GitLab Release Tools Bot
[ci skip]
2021-07-21 | Update changelog for 14.1.0 | GitLab Release Tools Bot
[ci skip]
2021-07-20 | Update VERSION files [v14.1.0-rc43] | GitLab Release Tools Bot
[ci skip]
2021-07-20 | Update VERSION files [v14.1.0-rc42] | GitLab Release Tools Bot
[ci skip]
2021-07-15 | Merge branch 'sh-update-ffi-gem' into 'master' | Toon Claes
Update ffi gem to 1.15.3. See merge request gitlab-org/gitaly!3664
2021-07-15 | Merge branch 'smh-set-default-grpc-buckets' into 'master' | James Fargher
Set default Prometheus buckets for Gitaly's RPC instrumentation. Closes #3431. See merge request gitlab-org/gitaly!3669
2021-07-14 | Set default Prometheus buckets for Gitaly's RPC instrumentation | Sami Hiltunen
Gitaly doesn't set default buckets for RPC latency instrumentation, which leads to the instrumentation being disabled by default. This commit adds default buckets to the configuration, which are used if the buckets are not explicitly configured. Changelog: changed
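The fallback described above can be pictured roughly as follows. This is a minimal sketch, not Gitaly's code: the `PrometheusConfig` type, its field name, the `withDefaults` helper and the bucket values are all assumptions made for illustration.

```go
package main

import "fmt"

// PrometheusConfig is a hypothetical stand-in for the relevant part of the
// configuration; the field name is an assumption.
type PrometheusConfig struct {
	GRPCLatencyBuckets []float64
}

// defaultLatencyBuckets holds illustrative values, not Gitaly's actual defaults.
var defaultLatencyBuckets = []float64{0.001, 0.005, 0.025, 0.1, 0.5, 1, 10, 30, 60, 300}

// withDefaults returns a copy of cfg in which unset buckets are replaced by
// the defaults, so latency instrumentation is enabled out of the box.
func withDefaults(cfg PrometheusConfig) PrometheusConfig {
	if len(cfg.GRPCLatencyBuckets) == 0 {
		cfg.GRPCLatencyBuckets = append([]float64(nil), defaultLatencyBuckets...)
	}
	return cfg
}

func main() {
	// No buckets configured: the defaults are used.
	fmt.Println(withDefaults(PrometheusConfig{}).GRPCLatencyBuckets)
	// Explicit buckets are left untouched.
	fmt.Println(withDefaults(PrometheusConfig{GRPCLatencyBuckets: []float64{0.5, 2}}).GRPCLatencyBuckets)
}
```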
2021-07-14 | Merge branch 'jv-add-streamrpc' into 'master' | Sami Hiltunen
Add StreamRPC library code. See merge request gitlab-org/gitaly!3601
2021-07-14 | Merge branch 'smh-dataloss-lazy-failovers' into 'master' | Zeger-Jan van de Weg
Support lazy failovers in `praefect dataloss`. See merge request gitlab-org/gitaly!3549
2021-07-13 | Merge branch 'smh-unavailable-repos-metric' into 'master' | Zeger-Jan van de Weg
Update read-only repository count metric to account for lazy failover. See merge request gitlab-org/gitaly!3548
2021-07-13 | Merge branch 'remove_gitaly_fetch_internal_remote_errors' into 'master' | James Fargher
Remove feature gitaly_fetch_internal_remote_errors. Closes #3588. See merge request gitlab-org/gitaly!3647
2021-07-13 | Remove feature gitaly_fetch_internal_remote_errors | James Fargher
Since FetchInternalRemote has been inlined into ReplicateRepository, we no longer need to make this RPC's errors more verbose.
2021-07-12 | Merge branch 'ps-code-style-fix' into 'master' | Zeger-Jan van de Weg
Fix various static lint issues. See merge request gitlab-org/gitaly!3666
2021-07-12 | Merge branch 'smh-perform-lazy-failovers' into 'master' | Zeger-Jan van de Weg
Perform failovers lazily. Closes #3207. See merge request gitlab-org/gitaly!3543
2021-07-12 | Add StreamRPC library code | Jacob Vosmaer
Changelog: other
2021-07-12 | Standardise package aliases | Pavlo Strokov
It is not common to use snake case names for packages and package aliases in Go. The change renames aliases to a preferred single word name. We also use the 'gitaly' prefix for project-defined packages that clash with standard or third-party package names.
2021-07-12 | Remove unused declarations | Pavlo Strokov
Some functions, types, fields and other variables are not used. There is no reason to keep and support them. Some of them were redundant from the moment they were declared and some became so after later code changes.
2021-07-12 | Redundant condition check in the for loop | Pavlo Strokov
The done variable is never assigned any value, so the condition always evaluates to true.
2021-07-12 | Fix instantiation of the structs with fields assignment | Pavlo Strokov
If a struct declares a list of fields, their values can be assigned when the struct is instantiated, either by providing values in the same order as the fields are declared or in any order if field names are used. The preferred way is to assign by field name, as it is less error prone when a new field is defined in the middle of the struct, and more readable because the initialized fields are visible at a glance.
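The two forms of struct instantiation compared above look like this in Go. The `Replica` type and the helper names are hypothetical and exist only for the illustration.

```go
package main

import "fmt"

// Replica is a hypothetical struct used only for this illustration.
type Replica struct {
	Storage    string
	Generation int
	Healthy    bool
}

// positional builds the value with an unkeyed literal: every field must be
// listed in declaration order, so the literal stops compiling when a field
// is added and changes meaning if two same-typed fields are reordered.
func positional() Replica {
	return Replica{"gitaly-1", 3, true}
}

// keyed builds the same value with field names: the order is irrelevant and
// any omitted field keeps its zero value.
func keyed() Replica {
	return Replica{Healthy: true, Storage: "gitaly-1", Generation: 3}
}

func main() {
	fmt.Println(positional() == keyed())
}
```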
2021-07-12 | Export unavailable repositories metric | Sami Hiltunen
The current read-only repository count metric describes unavailable repositories rather than read-only repositories. We have to keep the name for backwards compatibility as some alerting rules and dashboards depend on it. To make it possible to migrate to a more accurate metric later, this commit adds another metric on the side with a more accurate name and description.
2021-07-12 | Update read-only repository count metric to account for lazy failover | Sami Hiltunen
Read-only repository count metric previously reported the number of repositories that were outdated on the primary. As Praefect no longer promotes outdated replicas as primaries, this metric is not really useful anymore. With lazy failover in place, Praefect will failover to an up to date replica as long as there is a healthy one available. The purpose of this metric was to alert when a repository's availability was degraded, mainly the writes being blocked. With lazy failover, we no longer would block the writes as we'd simply promote the up to date node. Praefect hasn't served reads from outdated replicas since 7af9c950. Having no fully up to date healthy replicas means the repository is fully unavailable. There's effectively no more read-only mode. This commit updates the metric to count repositories which are unavailable according to the new failover logic. The old metric name is kept in place though as some alerting depends on it.
2021-07-12 | Remove support for virtual storage scoped primaries in read-only metrics | Sami Hiltunen
This commit removes the support for virtual storage scoped primaries in the read-only repository count metric to make future changes easier. Virtual storage scoped primaries were deprecated in 13.12 and removed in 14.0. Changelog: removed
2021-07-12 | Support lazy failovers in `praefect dataloss` | Sami Hiltunen
With the recent failover changes, the output of `praefect dataloss` is no longer accurate. Previously a repository would have been in read-only mode if the primary of the repository was outdated. With lazy failovers in place, it's no longer sufficient to check only whether the current primary is outdated or not. If the current primary is outdated, Praefect would immediately switch the repository's primary on the next request if there is an up to date replica available. This also means that there is no 'read-only mode' anymore, as we'd simply failover to an up to date node rather than wait for the current primary to be brought up to speed.
This commit updates the dataloss sub-command to take the new changes into account:
1. If there is an up to date, available replica for the repository, it's considered to be available for both reads and writes.
2. If there are no up to date replicas available, the repository is considered unavailable. As it is, Praefect does not distribute writes to outdated replicas.
3. To make it easier to determine why a repository is unavailable, 'unavailable' is printed next to the storages which are considered to be unavailable by the consensus of the Praefect nodes.
Changelog: changed
2021-07-12 | Replace GetPartiallyReplicatedRepositories with GetPartiallyAvailableRepositories | Sami Hiltunen
`praefect dataloss` is using GetPartiallyReplicatedRepositories to get repositories which have assigned replicas that are outdated. Inferring from the returned generations it was also reporting whether the repository was in read-only mode or not. This is not sufficient anymore to determine whether a repository is unavailable or not due to recent changes:
1. Since 7af9c950, Praefect has no longer served reads from outdated replicas.
2. Praefect no longer elects outdated replicas as primaries. Electing an outdated primary does not improve the availability of a repository as it still couldn't accept writes or reads.
3. With the introduction of lazy failovers, there is effectively no read-only mode anymore as Praefect would simply failover to the up to date node immediately if one exists.
With those in mind, the behavior of `praefect dataloss` is not accurate anymore. By default, it attempts to print out repositories which have reduced availability. To reflect the current failover logic, we should instead print out repositories which do not have any up to date, healthy nodes available.
This commit replaces GetPartiallyReplicatedRepositories with GetPartiallyAvailableRepositories. A repository is considered available by the current logic if there exists a replica that could serve as the primary. A replica can serve as the primary if it is fully up to date and healthy. If such a replica exists, the repository is not in read-only mode as we'd simply use the replica as the primary. If no such replicas exist, the repository is unavailable.
The dataloss sub-command also has the `-partially-replicated` flag that prints out repositories which have some assigned replicas that are not fully up to date. That flag is going to be replaced by the `partially-available` flag, which returns repositories which have assigned replicas that are not able to serve requests at the moment. This effectively does the same as the flag did previously but it also considers whether the replicas are healthy. This behavior fits better with variable replication factor: it could be that we have one up to date copy of the replica on an unhealthy node. The previous check would only see that there are no outdated replicas and not return the repository. The repository would be unavailable though, as the only replica is on a node that is unhealthy. To better facilitate debugging these scenarios, the flag is changed to cover replicas on unavailable nodes as well.
This commit covers only the datastore changes. The user facing changes in dataloss are to be done in a follow up commit.
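The availability rule stated above (a repository is available if some replica is both fully up to date and healthy) can be sketched in memory as below. The real check lives in Praefect's datastore; the `Replica` type and the function names here are assumptions made for illustration.

```go
package main

import "fmt"

// Replica is a hypothetical, simplified record of one assigned replica.
type Replica struct {
	Storage    string
	Generation int
	Healthy    bool
}

// latestGeneration returns the highest generation across all replicas.
func latestGeneration(replicas []Replica) int {
	latest := -1
	for _, r := range replicas {
		if r.Generation > latest {
			latest = r.Generation
		}
	}
	return latest
}

// available reports whether at least one replica could serve as the
// primary, that is, whether it is both healthy and fully up to date.
func available(replicas []Replica) bool {
	latest := latestGeneration(replicas)
	for _, r := range replicas {
		if r.Healthy && r.Generation == latest {
			return true
		}
	}
	return false
}

func main() {
	// The only copy is up to date but sits on an unhealthy node: a check
	// based on generations alone sees nothing outdated, yet nothing can
	// serve requests.
	single := []Replica{{Storage: "gitaly-1", Generation: 2, Healthy: false}}
	fmt.Println(available(single))

	// One healthy, up to date replica is enough, even if another is outdated.
	mixed := []Replica{
		{Storage: "gitaly-1", Generation: 2, Healthy: true},
		{Storage: "gitaly-2", Generation: 1, Healthy: true},
	}
	fmt.Println(available(mixed))
}
```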
2021-07-12 | Return more information from GetPartiallyReplicatedRepositories | Sami Hiltunen
GetPartiallyReplicatedRepositories returns information about repositories which have outdated replicas on assigned hosts. The generations returned are used in `praefect dataloss` to determine whether a repository is in read-only mode or not. With lazy failover, there is no read-only mode anymore as Praefect can immediately failover to another valid primary. Praefect doesn't serve reads from outdated replicas, so the repository would effectively be unavailable if there are no up to date and healthy replicas. To prepare for updating `praefect dataloss` to account for lazy failovers, let's return the health status and whether the replica can act as the primary with each of the replicas. We can later use the ValidPrimary field to determine if the repository is available and the health status to ease debugging why a repository may be unavailable. Other than returning the additional fields, this commit makes no other behavior changes yet.
2021-07-12 | Use repository_generations view in GetPartiallyReplicatedRepositories | Sami Hiltunen
GetPartiallyReplicatedRepositories is currently using a window function to get the highest generation from all of the replicas. We've since introduced the repository_generations view which also gets the highest generation across the replicas. Let's simplify the query by reusing the view rather than performing the logic again using the window function.
2021-07-12 | Remove support for virtual storage primaries in `praefect dataloss` | Sami Hiltunen
Starting from 14.0, Praefect only supports repository-specific primaries. This commit removes support for virtual storage scoped primaries in `praefect dataloss` to make future changes easier. Changelog: removed
2021-07-12 | Extract a testhelper for setting healthy nodes in the database | Sami Hiltunen
This commit extracts the setHealthyNodes helper from the tests of PerRepositoryElector so it can be reused in other packages. The helper is used for setting healthy nodes in the database during tests.
2021-07-12 | Use request scoped logger in PerRepositoryElector | Sami Hiltunen
PerRepositoryElector uses its own logger as a remnant from the time it was performing elections in the background. As the elections now happen in the request context, let's switch to using the request context logger. This allows for correlating the primary changes with the request that triggered that failover.
2021-07-12 | Perform failovers lazily | Sami Hiltunen
Praefect's PerRepositoryElector runs elections globally when Praefect launches and when a Gitaly node's health status changes. This approach was originally taken to match global elections done by the sqlElector as well. While the sqlElector runs elections after every health check, by default every 3s, the event driven approach was implemented for the PerRepositoryElector as it has to perform a lot more work every election run compared to the sqlElector. The sqlElector has a single primary for each virtual storage whereas the PerRepositoryElector has a primary record for every repository. While both electors check every repository's generations to pick the best new primary, only the PerRepositoryElector has to write potentially a large number of records as well.
We can do a lot better though:
1. If the primary is unavailable only temporarily, there's a high chance that the repository is not even accessed during the outage. If so, there's no need to eagerly failover as no one would even see the failure.
2. Most of the operations on the repositories are reads. Reads can be served from any up to date replica without needing to have a primary. Only once an RPC that requires the primary arrives do we care about having a healthy primary.
Given the above, this commit implements a lazy approach to failovers. This removes the background election loop entirely and elects a primary if needed when an RPC requires a primary. This happens transparently when getting the primary from the database. This brings multiple benefits:
1. Performance improves as we don't have to perform failovers for repositories which are not written to during the primary's outage. This reduces the time to perform failovers as we are working on records of a single repository as opposed to all of the repositories.
2. Failover code is responsive without having to feed it more and more events. This becomes more relevant as we implement rebalancing features. When moving a repository with a single replica, we may have to demote the primary temporarily and we want it to be re-elected as soon as a request needs it and it's possible. The previous approach would require us to hook more code into the events whereas this lazy approach just works.
3. It's easier to reason about synchronous code rather than asynchronous elections.
4. We can log all the individual changes, as opposed to logging the aggregate stats of demotions and promotions.
Changelog: performance
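A rough in-memory sketch of the lazy approach: the primary is only re-elected inside the lookup that an RPC would trigger. The commit performs this when reading the primary from the database, so the types, the `getPrimary` method and the error value below are illustrative assumptions rather than Praefect's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// replica and repository are hypothetical in-memory stand-ins for the
// database records Praefect works with.
type replica struct {
	storage    string
	generation int
	healthy    bool
}

type repository struct {
	primary  string
	replicas []replica
}

var errNoValidPrimary = errors.New("no healthy, up to date replica available")

// getPrimary returns the current primary if it is still healthy and up to
// date. Otherwise it elects a valid replica on the spot, which is the lazy
// failover: nothing happens until a request actually needs a primary.
func (r *repository) getPrimary() (string, error) {
	latest := -1
	for _, rep := range r.replicas {
		if rep.generation > latest {
			latest = rep.generation
		}
	}
	valid := func(rep replica) bool { return rep.healthy && rep.generation == latest }

	// Keep the current primary when it is still a valid choice.
	for _, rep := range r.replicas {
		if rep.storage == r.primary && valid(rep) {
			return r.primary, nil
		}
	}
	// Otherwise promote the first valid replica.
	for _, rep := range r.replicas {
		if valid(rep) {
			r.primary = rep.storage
			return r.primary, nil
		}
	}
	return "", errNoValidPrimary
}

func main() {
	repo := &repository{
		primary: "gitaly-1",
		replicas: []replica{
			{storage: "gitaly-1", generation: 4, healthy: false},
			{storage: "gitaly-2", generation: 4, healthy: true},
		},
	}
	fmt.Println(repo.getPrimary())
}
```

Because the election runs in the request path, each primary change can be logged alongside the request that caused it.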
2021-07-12 | Merge branch 'pks-tx-coordinator-replication-error-handling' into 'master' | Sami Hiltunen
coordinator: Only schedule replication for differing error states. See merge request gitlab-org/gitaly!3642
2021-07-12 | Merge branch 'pks-ff-receiver' into 'master' | Sami Hiltunen
featureflag: Implement receiver functions on FeatureFlag struct. See merge request gitlab-org/gitaly!3662
2021-07-11 | Update ffi gem to 1.15.3 | Stan Hu
We're shipping three different versions of this gem in Omnibus. Update to the latest to avoid wasting space. https://my.diffend.io/gems/ffi/1.13.1/1.15.3 Changelog: changed
2021-07-09 | Merge branch 'pks-gitpipe-cancellation' into 'master' | Patrick Steinhardt
gitpipe: Prioritize context cancellation. Closes #3693 and #3697. See merge request gitlab-org/gitaly!3658
2021-07-09 | Merge branch 'mk-activesupport-6.1' into 'master' | Patrick Steinhardt
Bump actionpack, actionview, activesupport to 6.1. See merge request gitlab-org/gitaly!3661
2021-07-09 | Merge branch 'pks-ff-lfs-pointer-pipeline-default-enabled' into 'master' | Toon Claes
featureflag: Default-enable LFS pointers pipeline. See merge request gitlab-org/gitaly!3653
2021-07-09 | featureflag: Document OutgoingCtxWithRubyFeatureFlags | Patrick Steinhardt
The OutgoingCtxWithRubyFeatureFlags function is missing documentation. Add it and remove the corresponding linter exemption.
2021-07-09 | featureflag: Internalize computation of header keys | Patrick Steinhardt
The `HeaderKey()` function is used to determine the header key of a feature flag. This function is only used in the featureflag package, so let's make it a private symbol.
2021-07-09 | featureflag: Remove old interface to check for feature flags | Patrick Steinhardt
Now that all callers have been converted to use receiver functions, remove the old way of checking feature flags.
2021-07-09 | global: Convert users to use feature flag receiver | Patrick Steinhardt
Instead of `featureflag.IsEnabled(ctx, featureflag.MyFeatureFlag)`, all callers are now converted to use the new receiver functions on feature flags to avoid stuttering.
2021-07-09 | featureflag: Implement receiver functions on FeatureFlag struct | Patrick Steinhardt
Implement receiver functions `IsEnabled()` and `IsDisabled()` on the FeatureFlag structure. These new functions will replace the old interface of `featureflag.IsEnabled(ctx, featureflag.MyFeatureFlag)`, which stutters.
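A minimal sketch of what receiver functions on a feature flag can look like. The log mentions header keys, which suggests the real package reads flags from request headers; this sketch stores the override in a plain context value instead. The struct fields, `withFlag`, `ctxKey` and `example` are all assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// FeatureFlag is a simplified, hypothetical version of the struct.
type FeatureFlag struct {
	Name        string
	OnByDefault bool
}

// ctxKey is a private key type so flag values cannot collide with other
// context values.
type ctxKey string

// withFlag returns a context carrying an explicit state for ff.
func withFlag(ctx context.Context, ff FeatureFlag, enabled bool) context.Context {
	return context.WithValue(ctx, ctxKey(ff.Name), enabled)
}

// IsEnabled reports the flag's state for this context, falling back to the
// flag's default when the context carries no explicit value.
func (ff FeatureFlag) IsEnabled(ctx context.Context) bool {
	if v, ok := ctx.Value(ctxKey(ff.Name)).(bool); ok {
		return v
	}
	return ff.OnByDefault
}

// IsDisabled is the negation of IsEnabled.
func (ff FeatureFlag) IsDisabled(ctx context.Context) bool {
	return !ff.IsEnabled(ctx)
}

// MyFeatureFlag mirrors the placeholder name used in the commit message.
var MyFeatureFlag = FeatureFlag{Name: "my_feature_flag"}

// example evaluates the flag without and with an explicit override.
func example() (byDefault, overridden bool) {
	ctx := context.Background()
	return MyFeatureFlag.IsEnabled(ctx), MyFeatureFlag.IsEnabled(withFlag(ctx, MyFeatureFlag, true))
}

func main() {
	fmt.Println(example())
}
```

The call site reads `MyFeatureFlag.IsEnabled(ctx)` rather than repeating the package name twice, which is the stutter the commit removes.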
2021-07-09 | Bump actionpack, activesupport to 6.1 | Matthias Kaeppler
This is necessary for us to support Ruby 3. Changelog: changed
2021-07-09 | transactions: Remove `DidCommitAnySubtransactions()` | Patrick Steinhardt
The interface function `DidCommitAnySubtransactions()` isn't used by any callsite anymore. Drop it.
2021-07-09 | coordinator: Create replication jobs if the primary cast a vote | Patrick Steinhardt
Starting with commit d87747c8 (Consider primary modified only if a subtransaction was committed, 2021-05-14), we consider primaries to not have been modified unless at least one subtransaction was committed. The intent of this change is to avoid queueing replication jobs in case an RPC returned an error without having modified any on-disk state.
As it turns out, this optimization had unintended side effects: if an RPC fails on the first vote because of inconsistent state across all nodes, then we wouldn't ever schedule a replication job to fix this inconsistency. In some cases, this will keep us from making any progress at all because we will never converge towards the same state, for example in object pools.
The current condition is clearly insufficient: if the initial vote fails, then we must schedule a replication job because we cannot tell much about the reason of its failure. This commit thus tightens the check: instead of requiring at least one committed subtransaction to consider the primary dirty, we now consider it dirty whenever it did cast a vote.
With this change, we can still avoid replication jobs if secondaries created a subtransaction while the primary dropped out before casting a vote given that secondaries couldn't have reached quorum without the primary. But in all the other cases where the primary did cast a vote, we'll now go through our typical updated-outdated logic and will thus also know to replicate changes in case the first vote fails.
Changelog: fixed
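The difference between the two heuristics can be shown with a small sketch. The `primaryOutcome` type and both function names are hypothetical; they only model the two conditions the commit message compares.

```go
package main

import "fmt"

// primaryOutcome is a hypothetical summary of what the primary did during a
// transactional RPC.
type primaryOutcome struct {
	castVote                 bool
	committedSubtransactions int
}

// dirtyByCommit models the previous heuristic: the primary counts as
// modified only if at least one subtransaction was committed.
func dirtyByCommit(o primaryOutcome) bool {
	return o.committedSubtransactions > 0
}

// dirtyByVote models the new heuristic: the primary counts as modified as
// soon as it cast any vote, even if that vote failed.
func dirtyByVote(o primaryOutcome) bool {
	return o.castVote
}

func main() {
	// The first vote failed because the nodes disagreed: nothing was
	// committed, but replication is still needed to converge.
	failedFirstVote := primaryOutcome{castVote: true, committedSubtransactions: 0}
	// The primary dropped out before voting: no replication is needed.
	droppedOut := primaryOutcome{castVote: false, committedSubtransactions: 0}

	fmt.Println(dirtyByCommit(failedFirstVote), dirtyByVote(failedFirstVote))
	fmt.Println(dirtyByCommit(droppedOut), dirtyByVote(droppedOut))
}
```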
2021-07-09 | transactions: Implement function which tells whether voters did vote | Patrick Steinhardt
It's currently not possible to tell whether a given node has cast any vote or not. Implement it -- we'll need this as a heuristic to determine whether the primary has been dirtied.
2021-07-09 | transactions: Simplify computation of voter state | Patrick Steinhardt
In order to determine the voter state in a transaction, we iterate through all subtransactions and soak up the result in there. But given that subtransactions are always created with all voters of the transaction, we know that on each iteration, we'll override all results of the preceding subtransaction anyway. Ultimately, we thus end up with the state as recorded by the last subtransaction, and it's clear that the iteration is pointless. Simplify the code to just take the state of the last subtransaction.
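The equivalence argued above can be checked with a small sketch. It assumes, as the commit message states, that every subtransaction contains all voters; the types and function names are hypothetical.

```go
package main

import "fmt"

// voteResult and subtransaction are hypothetical simplifications: each
// subtransaction maps every voter of the transaction to a result.
type voteResult int

const (
	undecided voteResult = iota
	committed
	failed
)

type subtransaction map[string]voteResult

// stateByIteration walks all subtransactions; each one overwrites the
// results recorded by the one before it.
func stateByIteration(subs []subtransaction) map[string]voteResult {
	state := map[string]voteResult{}
	for _, sub := range subs {
		for voter, res := range sub {
			state[voter] = res
		}
	}
	return state
}

// stateFromLast takes the results of the last subtransaction only.
func stateFromLast(subs []subtransaction) map[string]voteResult {
	state := map[string]voteResult{}
	if len(subs) == 0 {
		return state
	}
	for voter, res := range subs[len(subs)-1] {
		state[voter] = res
	}
	return state
}

// equal reports whether two voter states hold the same entries.
func equal(a, b map[string]voteResult) bool {
	if len(a) != len(b) {
		return false
	}
	for voter, res := range a {
		if other, ok := b[voter]; !ok || other != res {
			return false
		}
	}
	return true
}

func main() {
	subs := []subtransaction{
		{"primary": committed, "secondary-1": committed, "secondary-2": committed},
		{"primary": committed, "secondary-1": failed, "secondary-2": undecided},
	}
	fmt.Println(equal(stateByIteration(subs), stateFromLast(subs)))
}
```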
2021-07-09 | coordinator: Explicitly exit early if primary is not dirty | Patrick Steinhardt
Under some circumstances, the primary node will not be considered dirty after a transactional mutator. No matter what, we do not have to schedule any replication jobs in these cases given that there shouldn't be any changes anyway. But even though we know early on that the primary is not dirty, we still collect updated and outdated nodes and return them to the caller. This code flow is a bit confusing and hard to reason about. Refactor the code to return early in case the primary is not dirty.
2021-07-09 | coordinator: Combine node state log messages into a single message | Patrick Steinhardt
When determining updated and outdated secondaries for transactional mutators, we write several log messages stating why certain nodes are considered outdated or updated. Having this information split up across multiple messages is quite a pain if one wants to get a quick overview of why nodes are outdated given that one now has to search for multiple log messages. Combine these log messages into a single message which has secondary node states as its metadata.
2021-07-09 | gitpipe: Prioritize context cancellation | Patrick Steinhardt
We're observing flaky tests in our gitpipe code, where the race happens on context cancellation: we'll either see that the pipeline has shut down gracefully, or alternatively we'll see that git-rev-list(1) was killed because of context cancellation. This race can happen because we don't prioritize handling of context cancellations: if git-rev-list(1) was killed via the context, then it must've happened via our context and thus we know that at the point of sending the error down the pipeline, that the context has terminated already. Fix the race by prioritizing context cancellation: before trying to send down any results or errors, we'll first check whether the context was cancelled. This also allows us to get rid of the workaround we had in our pipeline tests where there was a special child context such that we didn't observe killed git-cat-file(1) processes.
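The race exists because a Go `select` with several ready cases picks one at random. A common way to prioritize cancellation is a non-blocking check of the context before the blocking send, sketched below. The `result` type and the function names are assumptions, not the gitpipe code itself.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// result is a hypothetical pipeline item carrying a value or an error.
type result struct {
	value string
	err   error
}

// sendResult delivers res on ch unless ctx has been cancelled. It reports
// whether the result was sent.
func sendResult(ctx context.Context, ch chan<- result, res result) bool {
	// Check cancellation first. Without this step, a cancelled context
	// could lose against a channel that is also ready to receive.
	select {
	case <-ctx.Done():
		return false
	default:
	}

	select {
	case ch <- res:
		return true
	case <-ctx.Done():
		return false
	}
}

// sendAfterCancel tries to send an error after the context was cancelled
// and reports whether it was sent and how many items ended up queued.
func sendAfterCancel() (sent bool, queued int) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	// The buffer has room, so the send case would be ready as well.
	ch := make(chan result, 1)
	sent = sendResult(ctx, ch, result{err: errors.New("signal: killed")})
	return sent, len(ch)
}

// sendWhileLive sends on a context that has not been cancelled.
func sendWhileLive() bool {
	ch := make(chan result, 1)
	return sendResult(context.Background(), ch, result{value: "object"})
}

func main() {
	fmt.Println(sendAfterCancel())
	fmt.Println(sendWhileLive())
}
```

With the check in place, a process killed by the context never surfaces its error downstream, so the caller observes the cancellation alone.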
2021-07-09 | gitpipe: Deduplicate sender of CatfileInfoResults | Patrick Steinhardt
We have multiple ad-hoc implementations of senders for CatfileInfoResults. Deduplicate them into a single function.