gitlab.com/gitlab-org/gitlab-foss.git
Diffstat (limited to 'doc/administration/geo/disaster_recovery/runbooks/planned_failover_multi_node.md')
-rw-r--r--  doc/administration/geo/disaster_recovery/runbooks/planned_failover_multi_node.md  66
1 file changed, 39 insertions(+), 27 deletions(-)
diff --git a/doc/administration/geo/disaster_recovery/runbooks/planned_failover_multi_node.md b/doc/administration/geo/disaster_recovery/runbooks/planned_failover_multi_node.md
index 3227fafca0f..4cfe781c7a4 100644
--- a/doc/administration/geo/disaster_recovery/runbooks/planned_failover_multi_node.md
+++ b/doc/administration/geo/disaster_recovery/runbooks/planned_failover_multi_node.md
@@ -63,13 +63,16 @@ Before following any of those steps, make sure you have `root` access to the
**secondary** to promote it, since there is no automated way to
promote a Geo replica and perform a failover.
-On the **secondary** node, navigate to the **Admin Area > Geo** dashboard to
-review its status. Replicated objects (shown in green) should be close to 100%,
-and there should be no failures (shown in red). If a large proportion of
-objects aren't yet replicated (shown in gray), consider giving the node more
-time to complete.
+On the **secondary** node:
-![Replication status](../img/replication-status.png)
+1. On the top bar, select **Menu >** **{admin}** **Admin**.
+1. On the left sidebar, select **Geo > Nodes** to see its status.
+ Replicated objects (shown in green) should be close to 100%,
+ and there should be no failures (shown in red). If a large proportion of
+ objects aren't yet replicated (shown in gray), consider giving the node more
+ time to complete.
+
+ ![Replication status](../../replication/img/geo_node_dashboard_v14_0.png)
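If you prefer a terminal check, the same information may be available through the `geo:status` Rake task; a minimal sketch, assuming a Linux package (Omnibus) installation on the **secondary** node:

```shell
# Prints replication and verification percentages, database replication lag,
# and how far behind the Geo log cursor is.
sudo gitlab-rake geo:status
```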
If any objects are failing to replicate, this should be investigated before
scheduling the maintenance window. After a planned failover, anything that
@@ -126,11 +129,14 @@ follow these steps to avoid unnecessary data loss:
existing Git repository with an SSH remote URL. The server should refuse
connection.
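As a quick sanity check that access is really blocked, you can probe the **primary** node from a machine that previously had access; a sketch where `primary.example.com` and the project path are placeholders:

```shell
# While access to the primary node is blocked, this should be refused
# (or hang and eventually time out) rather than list refs.
git ls-remote git@primary.example.com:group/project.git
```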
- 1. On the **primary** node, disable non-Geo periodic background jobs by navigating
- to **Admin Area > Monitoring > Background Jobs > Cron**, clicking `Disable All`,
- and then clicking `Enable` for the `geo_sidekiq_cron_config_worker` cron job.
- This job will re-enable several other cron jobs that are essential for planned
- failover to complete successfully.
+ 1. On the **primary** node:
+ 1. On the top bar, select **Menu >** **{admin}** **Admin**.
+ 1. On the left sidebar, select **Monitoring > Background Jobs**.
+ 1. On the Sidekiq dashboard, select **Cron**.
+ 1. Select `Disable All` to disable any non-Geo periodic background jobs.
+ 1. Select `Enable` for the `geo_sidekiq_cron_config_worker` cron job.
+ This job will re-enable several other cron jobs that are essential for planned
+ failover to complete successfully.
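If you would rather toggle these jobs from a terminal, the same effect can likely be achieved with `gitlab-rails runner`; a hedged sketch that assumes the `Sidekiq::Cron::Job` API from `sidekiq-cron` is available in the GitLab Rails environment:

```shell
# Disable every periodic Sidekiq job, then re-enable only the Geo
# configuration worker, which re-enables the cron jobs Geo needs.
sudo gitlab-rails runner 'Sidekiq::Cron::Job.all.each(&:disable!)'
sudo gitlab-rails runner 'Sidekiq::Cron::Job.find("geo_sidekiq_cron_config_worker").enable!'
```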
1. Finish replicating and verifying all data:
@@ -141,22 +147,28 @@ follow these steps to avoid unnecessary data loss:
1. If you are manually replicating any
[data not managed by Geo](../../replication/datatypes.md#limitations-on-replicationverification),
trigger the final replication process now.
- 1. On the **primary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
- and wait for all queues except those with `geo` in the name to drop to 0.
- These queues contain work that has been submitted by your users; failing over
- before it is completed will cause the work to be lost.
- 1. On the **primary** node, navigate to **Admin Area > Geo** and wait for the
- following conditions to be true of the **secondary** node you are failing over to:
- - All replication meters to each 100% replicated, 0% failures.
- - All verification meters reach 100% verified, 0% failures.
- - Database replication lag is 0ms.
- - The Geo log cursor is up to date (0 events behind).
-
- 1. On the **secondary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
- and wait for all the `geo` queues to drop to 0 queued and 0 running jobs.
- 1. On the **secondary** node, use [these instructions](../../../raketasks/check.md)
- to verify the integrity of CI artifacts, LFS objects, and uploads in file
- storage.
+ 1. On the **primary** node:
+ 1. On the top bar, select **Menu >** **{admin}** **Admin**.
+ 1. On the left sidebar, select **Monitoring > Background Jobs**.
+ 1. On the Sidekiq dashboard, select **Queues**, and wait for all queues except
+ those with `geo` in the name to drop to 0.
+ These queues contain work that has been submitted by your users; failing over
+ before it is completed causes the work to be lost.
+ 1. On the left sidebar, select **Geo > Nodes** and wait for the
+ following conditions to be true of the **secondary** node you are failing over to:
+
+ - All replication meters reach 100% replicated, 0% failures.
+ - All verification meters reach 100% verified, 0% failures.
+ - Database replication lag is 0ms.
+ - The Geo log cursor is up to date (0 events behind).
+
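Queue sizes can also be sampled from a terminal instead of the dashboard, on either node; a minimal sketch using Sidekiq's `Stats` API through `gitlab-rails runner` (re-run it until the relevant queues report 0):

```shell
# Prints each Sidekiq queue and the number of jobs currently enqueued in it.
sudo gitlab-rails runner 'Sidekiq::Stats.new.queues.each { |name, size| puts format("%-45s %d", name, size) }'
```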
+ 1. On the **secondary** node:
+ 1. On the top bar, select **Menu >** **{admin}** **Admin**.
+ 1. On the left sidebar, select **Monitoring > Background Jobs**.
+ 1. On the Sidekiq dashboard, select **Queues**, and wait for all the `geo`
+ queues to drop to 0 queued and 0 running jobs.
+ 1. [Run an integrity check](../../../raketasks/check.md) to verify the integrity
+ of CI artifacts, LFS objects, and uploads in file storage.
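The linked page documents the individual Rake tasks; on a Linux package installation they can be run on the **secondary** node roughly as follows (a sketch, not an exhaustive list):

```shell
# Verify the integrity of CI artifacts, LFS objects, and uploads in file storage.
sudo gitlab-rake gitlab:artifacts:check
sudo gitlab-rake gitlab:lfs:check
sudo gitlab-rake gitlab:uploads:check
```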
At this point, your **secondary** node will contain an up-to-date copy of everything the
**primary** node has, meaning nothing will be lost when you fail over.