author     GitLab Bot <gitlab-bot@gitlab.com>  2020-07-29 03:09:37 +0300
committer  GitLab Bot <gitlab-bot@gitlab.com>  2020-07-29 03:09:37 +0300
commit     937f82e11fe1d3970ea3e1f281185e91d8f5102e (patch)
tree       9c69e19144f2f9d7d5119496f468aea4f1538137 /doc/development/multi_version_compatibility.md
parent     583fadea8d738850cbd83dcde1118d3fc3462d61 (diff)

Add latest changes from gitlab-org/gitlab@master

Diffstat (limited to 'doc/development/multi_version_compatibility.md')
 -rw-r--r--  doc/development/multi_version_compatibility.md  50
 1 file changed, 50 insertions(+), 0 deletions(-)
diff --git a/doc/development/multi_version_compatibility.md b/doc/development/multi_version_compatibility.md
index ce6cc6610f4..d9478142cb5 100644
--- a/doc/development/multi_version_compatibility.md
+++ b/doc/development/multi_version_compatibility.md
@@ -20,6 +20,22 @@ but AJAX requests to URLs (like the GraphQL endpoint) won't match the pattern.
With this canary setup, we'd be in this mixed-versions state for an extended period of time until canary is promoted to
production and post-deployment migrations run.
+Also be aware that during a deployment to production, Web, API, and
+Sidekiq nodes are updated in parallel, but they may finish at
+different times. That means there may be a window of time when the
+application code is not in sync across the whole fleet. Changes that
+cut across Sidekiq, Web, and/or the API may [introduce unexpected
+errors until the deployment is complete](#builds-failing-due-to-varying-deployment-times-across-node-types).
+
+One way to handle this is to use a feature flag that is disabled by
+default. The feature flag can be enabled when the deployment is in a
+consistent state. However, this method of synchronization doesn't
+guarantee that customers with on-premise instances can [upgrade with
+zero downtime](https://docs.gitlab.com/omnibus/update/#zero-downtime-updates)
+since point releases bundle many changes together. Minimizing the time
+during which versions are out of sync across the fleet may help mitigate
+errors caused by upgrades.
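+
+On GitLab.com, a minimal sketch of this pattern might look like the
+following. The flag name `:new_cross_service_format` and the methods on
+`record` are hypothetical, used here only for illustration; `Feature.enabled?`
+is the real feature flag check:
+
+```ruby
+# Hypothetical sketch: the new behavior stays dark until the flag is
+# enabled, which happens only after Web, API, and Sidekiq all run the
+# new application code.
+def payload_for(record)
+  if Feature.enabled?(:new_cross_service_format)
+    record.new_payload    # new format, understood only by updated nodes
+  else
+    record.legacy_payload # old format, safe in a mixed-version fleet
+  end
+end
+```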
+
## Examples of previous incidents
### Some links to issues and MRs were broken
@@ -75,3 +91,37 @@ the new application code, hence QA was successful. Unfortunately, the production
instance still uses the older code, so it started failing to insert a new release entry.
For more information, see [this issue related to the Releases API](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/64151).
+
+### Builds failing due to varying deployment times across node types
+
+In [one production issue](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2442),
+CI builds that used the `parallel` keyword and depended on the
+variable `CI_NODE_TOTAL` being an integer failed. This happened because, after a user pushed a commit:
+
+1. New code: Sidekiq created a new pipeline and new build. `build.options[:parallel]` is a `Hash`.
+1. Old code: Runners requested a job from an API node that was still running the previous version.
+1. As a result, the [new code](https://gitlab.com/gitlab-org/gitlab/blob/42b82a9a3ac5a96f9152aad6cbc583c42b9fb082/app/models/concerns/ci/contextable.rb#L104)
+was not run on the API server. The runner's request failed because the
+older API server tried to return the `CI_NODE_TOTAL` CI variable, but
+instead of sending an integer value (for example, 9), it sent a serialized
+`Hash` value (`{:number=>9, :total=>9}`), as sketched below.
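+
+A plain-Ruby sketch of the mismatch follows. The variable names are
+illustrative only; this is not the actual GitLab implementation:
+
+```ruby
+# Sidekiq (new code) persisted the parallel setting as a Hash:
+options = { parallel: { number: 9, total: 9 } }
+
+# An API node still running the old code passed the value through
+# unchanged, so the runner received a serialized Hash:
+legacy_ci_node_total = options[:parallel].to_s
+# => "{:number=>9, :total=>9}"
+
+# The updated API code unwraps the Hash before exposing CI_NODE_TOTAL:
+parallel = options[:parallel]
+ci_node_total = (parallel.is_a?(Hash) ? parallel[:total] : parallel).to_s
+# => "9"
+```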
+
+If you look at the [deployment pipeline](https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/202212),
+you can see that all nodes were updated in parallel:
+
+![GitLab.com deployment pipeline](img/deployment_pipeline_v13_3.png)
+
+However, even though the updates started around the same time, their completion times varied significantly:
+
+|Node type|Duration (min)|
+|---------|--------------|
+|API |54 |
+|Sidekiq |21 |
+|K8S |8 |
+
+Builds that used the `parallel` keyword and depended on `CI_NODE_TOTAL`
+and `CI_NODE_INDEX` would fail during the window after Sidekiq was
+updated but before the API nodes finished. Since Kubernetes (K8S) also
+runs Sidekiq pods, the window could have been as long as 46 minutes
+(54 minus 8) or as short as 33 minutes (54 minus 21). Either way, a
+feature flag that was turned on only after the deployment finished
+would have prevented this from happening.