Diffstat (limited to 'doc/administration/geo/disaster_recovery/runbooks/planned_failover_multi_node.md')
-rw-r--r-- | doc/administration/geo/disaster_recovery/runbooks/planned_failover_multi_node.md | 66 |
1 file changed, 39 insertions, 27 deletions
diff --git a/doc/administration/geo/disaster_recovery/runbooks/planned_failover_multi_node.md b/doc/administration/geo/disaster_recovery/runbooks/planned_failover_multi_node.md
index 3227fafca0f..4cfe781c7a4 100644
--- a/doc/administration/geo/disaster_recovery/runbooks/planned_failover_multi_node.md
+++ b/doc/administration/geo/disaster_recovery/runbooks/planned_failover_multi_node.md
@@ -63,13 +63,16 @@ Before following any of those steps, make sure you have `root` access to the
 **secondary** to promote it, since there is no automated way provided to
 promote a Geo replica and perform a failover.
 
-On the **secondary** node, navigate to the **Admin Area > Geo** dashboard to
-review its status. Replicated objects (shown in green) should be close to 100%,
-and there should be no failures (shown in red). If a large proportion of
-objects aren't yet replicated (shown in gray), consider giving the node more
-time to complete.
+On the **secondary** node:
 
-![Replication status](../img/replication-status.png)
+1. On the top bar, select **Menu >** **{admin}** **Admin**.
+1. On the left sidebar, select **Geo > Nodes** to see its status.
+   Replicated objects (shown in green) should be close to 100%,
+   and there should be no failures (shown in red). If a large proportion of
+   objects aren't yet replicated (shown in gray), consider giving the node more
+   time to complete.
+
+   ![Replication status](../../replication/img/geo_node_dashboard_v14_0.png)
 
 If any objects are failing to replicate, this should be investigated before
 scheduling the maintenance window. After a planned failover, anything that
@@ -126,11 +129,14 @@ follow these steps to avoid unnecessary data loss:
       existing Git repository with an SSH remote URL. The server should refuse
       connection.
 
-   1. On the **primary** node, disable non-Geo periodic background jobs by navigating
-      to **Admin Area > Monitoring > Background Jobs > Cron**, clicking `Disable All`,
-      and then clicking `Enable` for the `geo_sidekiq_cron_config_worker` cron job.
-      This job will re-enable several other cron jobs that are essential for planned
-      failover to complete successfully.
+   1. On the **primary** node:
+      1. On the top bar, select **Menu >** **{admin}** **Admin**.
+      1. On the left sidebar, select **Monitoring > Background Jobs**.
+      1. On the Sidekiq dashboard, select **Cron**.
+      1. Select `Disable All` to disable any non-Geo periodic background jobs.
+      1. Select `Enable` for the `geo_sidekiq_cron_config_worker` cron job.
+         This job will re-enable several other cron jobs that are essential for planned
+         failover to complete successfully.
 
 1. Finish replicating and verifying all data:
 
@@ -141,22 +147,28 @@ follow these steps to avoid unnecessary data loss:
    1. If you are manually replicating any
      [data not managed by Geo](../../replication/datatypes.md#limitations-on-replicationverification),
      trigger the final replication process now.
-   1. On the **primary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
-      and wait for all queues except those with `geo` in the name to drop to 0.
-      These queues contain work that has been submitted by your users; failing over
-      before it is completed will cause the work to be lost.
-   1. On the **primary** node, navigate to **Admin Area > Geo** and wait for the
-      following conditions to be true of the **secondary** node you are failing over to:
-      - All replication meters to each 100% replicated, 0% failures.
-      - All verification meters reach 100% verified, 0% failures.
-      - Database replication lag is 0ms.
-      - The Geo log cursor is up to date (0 events behind).
-
-   1. On the **secondary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
-      and wait for all the `geo` queues to drop to 0 queued and 0 running jobs.
-   1. On the **secondary** node, use [these instructions](../../../raketasks/check.md)
-      to verify the integrity of CI artifacts, LFS objects, and uploads in file
-      storage.
+   1. On the **primary** node:
+      1. On the top bar, select **Menu >** **{admin}** **Admin**.
+      1. On the left sidebar, select **Monitoring > Background Jobs**.
+      1. On the Sidekiq dashboard, select **Queues**, and wait for all queues except
+         those with `geo` in the name to drop to 0.
+         These queues contain work that has been submitted by your users; failing over
+         before it is completed causes the work to be lost.
+      1. On the left sidebar, select **Geo > Nodes** and wait for the
+         following conditions to be true of the **secondary** node you are failing over to:
+
+         - All replication meters reach 100% replicated, 0% failures.
+         - All verification meters reach 100% verified, 0% failures.
+         - Database replication lag is 0ms.
+         - The Geo log cursor is up to date (0 events behind).
+
+   1. On the **secondary** node:
+      1. On the top bar, select **Menu >** **{admin}** **Admin**.
+      1. On the left sidebar, select **Monitoring > Background Jobs**.
+      1. On the Sidekiq dashboard, select **Queues**, and wait for all the `geo`
+         queues to drop to 0 queued and 0 running jobs.
+      1. [Run an integrity check](../../../raketasks/check.md) to verify the integrity
+         of CI artifacts, LFS objects, and uploads in file storage.
 
 At this point, your **secondary** node will contain an up-to-date copy of
 everything the **primary** node has, meaning nothing will be lost when you fail
 over.
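For readers who prefer to verify these conditions from a terminal rather than the Admin Area, the sketch below shows a rough command-line equivalent of the checks described in the diff. It assumes an Omnibus (Linux package) installation; `geo:status` and the `gitlab:*:check` Rake tasks are documented GitLab tasks, while the Sidekiq queue one-liner is an illustrative use of the standard Sidekiq Ruby API, not a GitLab-provided command.

```shell
# On the secondary node: print replication and verification status.
# These meters should mirror the Geo > Nodes dashboard (100% replicated,
# 0% failures) before the maintenance window is scheduled.
sudo gitlab-rake geo:status

# On either node: spot-check Sidekiq queue depths instead of watching
# Monitoring > Background Jobs > Queues. Illustrative one-liner using
# the standard Sidekiq API via the GitLab Rails runner.
sudo gitlab-rails runner \
  "Sidekiq::Queue.all.each { |q| puts format('%-40s %d', q.name, q.size) }"

# On the secondary node: verify the integrity of CI artifacts, LFS
# objects, and uploads in file storage (the Rake tasks documented in
# the linked raketasks/check.md page).
sudo gitlab-rake gitlab:artifacts:check
sudo gitlab-rake gitlab:lfs:check
sudo gitlab-rake gitlab:uploads:check
```

As with the UI checks, repeat these until the numbers settle at their expected values; any non-zero failure count should be investigated before proceeding with the failover.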