diff options
author | Will Chandler <wchandler@gitlab.com> | 2022-08-01 17:35:30 +0300 |
---|---|---|
committer | Will Chandler <wchandler@gitlab.com> | 2022-08-02 20:03:14 +0300 |
commit | 6a02746cd6edc7d3e87e8c340ad271e8859b4956 (patch) | |
tree | 119f9bc72f77f8c30a4bba26b44c6c1ce337ebe0 | |
parent | 2667f87af97aaf7aaa5c54bbc7d91472b26dfc33 (diff) |
server: Give clients grace-period for keepaliveswc-reduce-min-keepalive
Currently gRPC clients are configured to send keepalives every 20
seconds, and the servers enforce that as a minimum interval. Requiring
precisely the same interval has caused a number of problems over time.
gGRPC-Core has an issue where keepalives may be sent a few milliseconds
ahead of schedule[0], breaking keepalives from Puma and Sidekiq.
With the introduction of `SSHUploadPackWithSidechannel` in v15.0, the
gRPC connection for Gitlab-Shell and Gitlab-Workhorse is now idle during
lengthy clones, triggering keepalives in this scenario. Previously the
connection was always busy and keepalives were rare.
After upgrading to v15.0 a large customer reported that they have
started to receive intermittent ENHANCE_YOUR_CALM errors on SSH clones,
which indicates that keepalives are being sent to rapidly from the
client. Examining packet captures from their hosts, we found that
Gitlab-Shell was sending keepalive packets roughly 20,001ms apart,
which is correct. However, the Gitaly server was intermittently
receiving these at an interval of 19,998ms, triggering the error. This
appears to be due to the variable connection latency, where typically
there's 50ms of latency between the hosts, but if there's slightly less
latency then the norm the keepalive arrives early from the server's
perspective.
To resolve this, let's give the clients a grace-period of ten seconds by
reducing the minimum keepalive interval to 10 seconds. This way a
keepalive that arrives slightly ahead of schedule is not a critical
error. Given that a Gitaly server will typically have hundreds of
concurrent clients, perhaps thousands on a very busy host, the volume of
keepalives should not add significant load and we can afford to be
forgiving.
[0] https://github.com/grpc/grpc/issues/28830
Changelog: fixed
-rw-r--r-- | internal/gitaly/server/server.go | 6 | ||||
-rw-r--r-- | internal/praefect/server.go | 6 |
2 files changed, 10 insertions, 2 deletions
diff --git a/internal/gitaly/server/server.go b/internal/gitaly/server/server.go index 2aad1ad74..e62b4b99c 100644 --- a/internal/gitaly/server/server.go +++ b/internal/gitaly/server/server.go @@ -156,8 +156,12 @@ func New( grpc.Creds(lm), grpc.StreamInterceptor(grpcmw.ChainStreamServer(streamServerInterceptors...)), grpc.UnaryInterceptor(grpcmw.ChainUnaryServer(unaryServerInterceptors...)), + // We deliberately set the server MinTime to significantly less than the client interval of 20 + // seconds to allow for network jitter. We can afford to be forgiving as the maximum number of + // concurrent clients for a Gitaly server is typically in the hundreds and this volume of + // keepalives won't add significant load. grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{ - MinTime: 20 * time.Second, + MinTime: 10 * time.Second, PermitWithoutStream: true, }), grpc.KeepaliveParams(keepalive.ServerParameters{ diff --git a/internal/praefect/server.go b/internal/praefect/server.go index f2bc6a380..3192ddfdd 100644 --- a/internal/praefect/server.go +++ b/internal/praefect/server.go @@ -128,8 +128,12 @@ func NewGRPCServer( auth.UnaryServerInterceptor(conf.Auth), )..., )), + // We deliberately set the server MinTime to significantly less than the client interval of 20 + // seconds to allow for network jitter. We can afford to be forgiving as the maximum number of + // concurrent clients for a Gitaly server is typically in the hundreds and this volume of + // keepalives won't add significant load. grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{ - MinTime: 20 * time.Second, + MinTime: 10 * time.Second, PermitWithoutStream: true, }), grpc.KeepaliveParams(keepalive.ServerParameters{ |