Diffstat (limited to 'doc/development/redis/new_redis_instance.md')
-rw-r--r-- | doc/development/redis/new_redis_instance.md | 131 |
1 files changed, 131 insertions, 0 deletions
diff --git a/doc/development/redis/new_redis_instance.md b/doc/development/redis/new_redis_instance.md
new file mode 100644
index 00000000000..714936d9a24
--- /dev/null
+++ b/doc/development/redis/new_redis_instance.md
@@ -0,0 +1,131 @@
+---
+stage: none
+group: unassigned
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Add a new Redis instance
+
+GitLab can make use of multiple [Redis instances](../redis.md#redis-instances).
+These instances are functionally partitioned so that, for example, we
+can store [CI trace chunks](../../administration/job_logs.md#incremental-logging-architecture)
+in one Redis instance while storing sessions in another.
+
+From time to time we might want to add a new Redis instance. Typically this will
+be a functional partition split from one of the existing instances such as the
+cache or shared state. This document describes an approach
+for adding a new Redis instance that handles existing data, based on
+prior examples:
+
+- [Dedicated Redis instance for Trace Chunk storage](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/462).
+
+This document does not cover the operational side of preparing and configuring
+the new Redis instance in detail, but the example epics do contain information
+on previous approaches to this.
+
+## Step 1: Support configuring the new instance
+
+Before we can switch any features to using the new instance, we have to support
+configuring it and referring to it in the codebase. We must support the
+main installation types:
+
+- Source installs (including development environments) - [example MR](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/62767)
+- Omnibus - [example MR](https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/5316)
+- Helm charts - [example MR](https://gitlab.com/gitlab-org/charts/gitlab/-/merge_requests/2031)
+
+### Fallback instance
+
+In the application code, we need to define a fallback instance in case the new
+instance is not configured. For example, if a GitLab instance has already
+configured a separate shared state Redis, and we are partitioning data from the
+shared state Redis, our new instance's configuration should default to that of
+the shared state Redis when it's not present. Otherwise we could break instances
+that don't configure the new Redis instance as soon as it's available.
+
+You can [define a `.config_fallback` method](https://gitlab.com/gitlab-org/gitlab/-/blob/a75471dd744678f1a59eeb99f71fca577b155acd/lib/gitlab/redis/wrapper.rb#L69-87)
+in `Gitlab::Redis::Wrapper` (the base class for all Redis instances)
+that defines the instance to be used if this one is not configured. If we were
+adding a `Foo` instance that should fall back to `SharedState`, we can do that
+like this:
+
+```ruby
+module Gitlab
+  module Redis
+    class Foo < ::Gitlab::Redis::Wrapper
+      # The data we store on Foo used to be stored on SharedState.
+      def self.config_fallback
+        SharedState
+      end
+    end
+  end
+end
+```
+
+We should also add specs like those in
+[`trace_chunks_spec.rb`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/spec/lib/gitlab/redis/trace_chunks_spec.rb)
+to ensure that this fallback works correctly.
+
+## Step 2: Support writing to and reading from the new instance
+
+When migrating to the new instance, we must account for cases where data is
+either on:
+
+- The 'old' (original) instance.
+- The new one that we have just added support for.
+
+As a result we may need to support reading from and writing to both
+instances, depending on some condition.
+
+The exact condition to use varies depending on the data to be migrated. For
+the trace chunks case above, there was already a database column indicating where the
+data was stored (as there are other storage options than Redis).
+
+This step may not apply if the data has a very short lifetime (a few minutes at most)
+and is not critical. In that case, we
+may decide that it is OK to incur a small amount of data loss and switch
+over through configuration only.
+
+If there is not a more natural way to mark where the data is stored, using a
+[feature flag](../feature_flags/index.md) may be convenient:
+
+- It does not require an application restart to take effect.
+- It applies to all application instances (Sidekiq, API, web, etc.) at
+  the same time.
+- It supports incremental rollout - ideally by actor (project, group,
+  user, etc.) - so that we can monitor for errors and roll back easily.
+
+## Step 3: Migrate the data
+
+We then need to configure the new instance for GitLab.com's production and
+staging environments. Hopefully it will be possible to test this change
+effectively on staging, to at least make sure that basic usage continues to
+work.
+
+After that is done, we can roll out the change to production. Ideally this would
+be in an incremental fashion, following the
+[standard incremental rollout](../feature_flags/controls.md#rolling-out-changes)
+documentation for feature flags.
+
+When we have been using the new instance 100% of the time in production for a
+while and there are no issues, we can proceed.
+
+## Step 4: Clean up after the migration
+
+<!-- markdownlint-disable MD044 -->
+We may choose to keep the migration paths or remove them, depending on whether
+or not we expect self-managed instances to perform this migration.
+[gitlab-com/gl-infra/scalability#1131](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1131#note_603354746)
+contains a discussion on this topic for the trace chunks feature flag. It may
+be - as in that case - that we decide that the maintenance costs of supporting
+the migration code are higher than the benefits of allowing self-managed
+instances to perform this migration seamlessly, if we expect self-managed
+instances to cope without this functional partition.
+<!-- markdownlint-enable MD044 -->
+
+If we decide to keep the migration code:
+
+- We should document the migration steps.
+- If we used a feature flag, we should ensure it's an [ops type feature
+  flag](../feature_flags/index.md#ops-type), as these are long-lived flags.
+
+Otherwise, we can remove the flags and conclude the project.
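
The dual read and write path described in Step 2 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not GitLab's implementation: `FakeRedis`, `MigratingStore`, and the `flag_enabled` callable are hypothetical names that stand in for the Redis instances and a feature flag check, and the read fallback to the old instance is one possible way to find data written before the switch.

```ruby
# Hypothetical in-memory stand-in for a Redis instance, so the sketch runs
# without a Redis server or the GitLab codebase.
class FakeRedis
  def initialize
    @data = {}
  end

  def set(key, value)
    @data[key] = value
  end

  def get(key)
    @data[key]
  end
end

# Hypothetical store that selects an instance per actor (project, group,
# user, and so on), in the spirit of Step 2.
class MigratingStore
  # flag_enabled: a callable that takes an actor and returns true or false.
  # It stands in for a feature flag check and is not GitLab's actual API.
  def initialize(old_instance:, new_instance:, flag_enabled:)
    @old = old_instance
    @new = new_instance
    @flag_enabled = flag_enabled
  end

  # Write to the instance currently selected for this actor.
  def write(actor, key, value)
    instance_for(actor).set(key, value)
  end

  # Read from the selected instance. When the new instance is selected but
  # has no value, fall back to the old instance (an assumption of this
  # sketch) so that data written before the switch is still found.
  def read(actor, key)
    primary = instance_for(actor)
    value = primary.get(key)
    return value unless value.nil? && primary.equal?(@new)

    @old.get(key)
  end

  private

  def instance_for(actor)
    @flag_enabled.call(actor) ? @new : @old
  end
end
```

Because the condition is evaluated on every call, enabling the check for one actor moves only that actor's writes to the new instance, which matches the incremental rollout described in Step 3. For data that already records where it is stored, as with the trace chunks database column, that record would replace the callable.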