diff options
author | Patrick Steinhardt <psteinhardt@gitlab.com> | 2021-02-19 09:49:16 +0300 |
---|---|---|
committer | Patrick Steinhardt <psteinhardt@gitlab.com> | 2021-02-23 12:01:53 +0300 |
commit | 77a0174b88f4232b8b8f81378acdff108f3e078f (patch) | |
tree | 6e8179bd957fe8845c1cd87ffffae8dc2831e03e | |
parent | cd44d9e62a272c71c96b2057c59d6eab45b4d908 (diff) |
praefect: Ignore proxying failures for secondaries
When proxying streams in a transaction, we do frame-wise proxying to
both primary and all secondaries at once. This is done by kicking off a
bunch of Goroutines to forward from client to primary/secondaries, and
from primary to client. This brings up some interesting questions about
error handling though. While it's clear that we should be aborting the
proxying operation if the client errors out, as the channel is now
essentially broken, and it's also clear that, if proxying to the primary
fails, we need to cancel the stream. But right now, we also cancel
streams if proxying to the secondaries fail.
This last failure mode is quite debatable, and in fact it can cause
weird bugs. This is mostly because while we wait for all secondaries to
terminate, even if any of them errors out, the same isn't true for the
primary: we'll happily cancel the stream when proxying to secondaries
has concluded, if at least one of them has returned an error, even if
proxying to the primary is still ongoing.
As a mitigation, this commit thus starts to ignore any errors caused by
proxying to secondaries -- we already store their errors in a local map
anyway and consider them when creating replication jobs. It's not an
ideal fix though: when enough secondaries fail, then we know that
transactions cannot reach quorum anymore and should thus be able to just
abort the transaction. If necessary, we can do this in a follow-up fix.
-rw-r--r-- | internal/praefect/coordinator.go | 10 | ||||
-rw-r--r-- | internal/praefect/coordinator_test.go | 3 |
2 files changed, 10 insertions, 3 deletions
diff --git a/internal/praefect/coordinator.go b/internal/praefect/coordinator.go index f5ffeecdb..71c6b1a95 100644 --- a/internal/praefect/coordinator.go +++ b/internal/praefect/coordinator.go @@ -449,7 +449,15 @@ func (c *Coordinator) mutatorStreamParameters(ctx context.Context, call grpcCall nodeErrors.Lock() defer nodeErrors.Unlock() nodeErrors.errByNode[secondary.Storage] = err - return err + + ctxlogrus.Extract(ctx).WithError(err). + Error("proxying to secondary failed") + + // For now, any errors returned by secondaries are ignored. + // This is mostly so that we do not abort transactions which + // are ongoing and may succeed even with a subset of + // secondaries bailing out. + return nil }, }) } diff --git a/internal/praefect/coordinator_test.go b/internal/praefect/coordinator_test.go index 4e88323af..df6510e17 100644 --- a/internal/praefect/coordinator_test.go +++ b/internal/praefect/coordinator_test.go @@ -1426,11 +1426,10 @@ func TestCoordinator_grpcErrorHandling(t *testing.T) { expectedErr: status.Error(codes.Unknown, "foo"), }, { - desc: "secondary error gets ignored (test is broken)", + desc: "secondary error gets ignored", errByNode: map[string]error{ "secondary-1": errors.New("foo"), }, - expectedErr: status.Error(codes.Internal, "failed proxying to secondary: rpc error: code = Unknown desc = foo"), }, { desc: "primary error has precedence", |