praefect: Ignore proxying failures for secondaries

When proxying streams in a transaction, we do frame-wise proxying to both primary and all secondaries at once. This is done by kicking off a bunch of Goroutines to forward from client to primary/secondaries, and from primary to client. This brings up some interesting questions about error handling though. While it's clear that we should be aborting the proxying operation if the client errors out, as the channel is now essentially broken, and it's also clear that, if proxying to the primary fails, we need to cancel the stream. But right now, we also cancel streams if proxying to the secondaries fail. This last failure mode is quite debatable, and in fact it can cause weird bugs. This is mostly because while we wait for all secondaries to terminate, even if any of them errors out, the same isn't true for the primary: we'll happily cancel the stream when proxying to secondaries has concluded, if at least one of them has returned an error, even if proxying to the primary is still ongoing. As a mitigation, this commit thus starts to ignore any errors caused by proxying to secondaries -- we already store their errors in a local map anyway and consider them when creating replication jobs. It's not an ideal fix though: when enough secondaries fail, then we know that transactions cannot reach quorum anymore and should thus be able to just abort the transaction. If necessary, we can do this in a follow-up fix.
author: Patrick Steinhardt <psteinhardt@gitlab.com> 2021-02-19 09:49:16 +0300
committer: Patrick Steinhardt <psteinhardt@gitlab.com> 2021-02-23 12:01:53 +0300
commit: 77a0174b88f4232b8b8f81378acdff108f3e078f (patch)
tree: 6e8179bd957fe8845c1cd87ffffae8dc2831e03e
parent: cd44d9e62a272c71c96b2057c59d6eab45b4d908 (diff)
2 files changed, 10 insertions, 3 deletions
diff --git a/internal/praefect/coordinator.go b/internal/praefect/coordinator.go
index f5ffeecdb..71c6b1a95 100644
--- a/internal/praefect/coordinator.go
+++ b/internal/praefect/coordinator.go
@@ -449,7 +449,15 @@ func (c *Coordinator) mutatorStreamParameters(ctx context.Context, call grpcCall
 					nodeErrors.Lock()
 					defer nodeErrors.Unlock()
 					nodeErrors.errByNode[secondary.Storage] = err
-					return err
+
+					ctxlogrus.Extract(ctx).WithError(err).
+						Error("proxying to secondary failed")
+
+					// For now, any errors returned by secondaries are ignored.
+					// This is mostly so that we do not abort transactions which
+					// are ongoing and may succeed even with a subset of
+					// secondaries bailing out.
+					return nil
 				},
 			})
 		}
diff --git a/internal/praefect/coordinator_test.go b/internal/praefect/coordinator_test.go
index 4e88323af..df6510e17 100644
--- a/internal/praefect/coordinator_test.go
+++ b/internal/praefect/coordinator_test.go
@@ -1426,11 +1426,10 @@ func TestCoordinator_grpcErrorHandling(t *testing.T) {
 			expectedErr: status.Error(codes.Unknown, "foo"),
 		},
 		{
-			desc: "secondary error gets ignored (test is broken)",
+			desc: "secondary error gets ignored",
 			errByNode: map[string]error{
 				"secondary-1": errors.New("foo"),
 			},
-			expectedErr: status.Error(codes.Internal, "failed proxying to secondary: rpc error: code = Unknown desc = foo"),
 		},
 		{
 			desc: "primary error has precedence",
author	Patrick Steinhardt <psteinhardt@gitlab.com>	2021-02-19 09:49:16 +0300
committer	Patrick Steinhardt <psteinhardt@gitlab.com>	2021-02-23 12:01:53 +0300
commit	77a0174b88f4232b8b8f81378acdff108f3e078f (patch)
tree	6e8179bd957fe8845c1cd87ffffae8dc2831e03e
parent	cd44d9e62a272c71c96b2057c59d6eab45b4d908 (diff)