 doc/administration/geo/disaster_recovery/planned_failover.md | 81 +++++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 46 insertions(+), 35 deletions(-)
diff --git a/doc/administration/geo/disaster_recovery/planned_failover.md b/doc/administration/geo/disaster_recovery/planned_failover.md
index d50078da172..5c15523ac78 100644
--- a/doc/administration/geo/disaster_recovery/planned_failover.md
+++ b/doc/administration/geo/disaster_recovery/planned_failover.md
@@ -109,13 +109,16 @@ The maintenance window won't end until Geo replication and verification is
completely finished. To keep the window as short as possible, you should
ensure these processes are as close to 100% as possible during active use.
-Go to the **Admin Area > Geo** dashboard on the **secondary** node to
-review status. Replicated objects (shown in green) should be close to 100%,
-and there should be no failures (shown in red). If a large proportion of
-objects aren't yet replicated (shown in gray), consider giving the node more
-time to complete
+On the **secondary** node:
-![Replication status](img/replication-status.png)
+1. On the top bar, select **Menu >** **{admin}** **Admin**.
+1. On the left sidebar, select **Geo > Nodes**.
+ Replicated objects (shown in green) should be close to 100%,
+ and there should be no failures (shown in red). If a large proportion of
+ objects aren't yet replicated (shown in gray), consider giving the node more
+ time to complete.
+
+ ![Replication status](../replication/img/geo_node_dashboard_v14_0.png)
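+
+If you prefer a terminal view of the same data, a quick spot check is to run
+the Geo Rake task on the **secondary** node (a minimal sketch; the exact
+output varies by GitLab version):
+
+```shell
+# Prints replication and verification status for this secondary node,
+# including sync percentages and failure counts.
+sudo gitlab-rake geo:status
+```
+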
If any objects are failing to replicate, this should be investigated before
scheduling the maintenance window. Following a planned failover, anything that
@@ -134,23 +137,26 @@ This [content was moved to another location](background_verification.md).
### Notify users of scheduled maintenance
-On the **primary** node, navigate to **Admin Area > Messages**, add a broadcast
-message. You can check under **Admin Area > Geo** to estimate how long it
-takes to finish syncing. An example message would be:
+On the **primary** node:
-> A scheduled maintenance takes place at XX:XX UTC. We expect it to take
-> less than 1 hour.
+1. On the top bar, select **Menu >** **{admin}** **Admin**.
+1. On the left sidebar, select **Messages**.
+1. Add a message notifying users of the maintenance window.
+ You can check under **Geo > Nodes** to estimate how long
+ syncing takes to finish.
+1. Select **Add broadcast message**.
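+
+If you script your maintenance procedure, the same announcement can be posted
+with the broadcast messages REST API (a minimal sketch; the token, URL, and
+message text are placeholders, and the request must be made by an
+administrator):
+
+```shell
+curl --request POST \
+  --header "PRIVATE-TOKEN: <your_access_token>" \
+  --data "message=A scheduled maintenance takes place at XX:XX UTC. We expect it to take less than 1 hour." \
+  "https://primary.example.com/api/v4/broadcast_messages"
+```
+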
## Prevent updates to the **primary** node
To ensure that all data is replicated to a secondary site, updates (write requests) need to
-be disabled on the primary site:
-
-1. Enable [maintenance mode](../../maintenance_mode/index.md).
-
-1. Disable non-Geo periodic background jobs on the **primary** node by navigating
- to **Admin Area > Monitoring > Background Jobs > Cron**, pressing `Disable All`,
- and then pressing `Enable` for the `geo_sidekiq_cron_config_worker` cron job.
+be disabled on the **primary** site:
+
+1. Enable [maintenance mode](../../maintenance_mode/index.md) on the **primary** node.
+1. On the top bar, select **Menu >** **{admin}** **Admin**.
+1. On the left sidebar, select **Monitoring > Background Jobs**.
+1. On the Sidekiq dashboard, select **Cron**.
+1. Select `Disable All` to disable non-Geo periodic background jobs.
+1. Select `Enable` for the `geo_sidekiq_cron_config_worker` cron job.
This job re-enables several other cron jobs that are essential for planned
failover to complete successfully.
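+
+The previous steps can also be run from the command line on the **primary**
+node (a minimal sketch, assuming a Linux package installation;
+`Sidekiq::Cron::Job` is the API behind the **Cron** dashboard):
+
+```shell
+# Enable maintenance mode, equivalent to the Admin Area setting.
+sudo gitlab-rails runner '::Gitlab::CurrentSettings.update!(maintenance_mode: true)'
+
+# Disable all periodic background jobs, then re-enable the Geo
+# configuration worker, which restores the cron jobs Geo needs.
+sudo gitlab-rails runner 'Sidekiq::Cron::Job.all.each(&:disable!)'
+sudo gitlab-rails runner 'Sidekiq::Cron::Job.find("geo_sidekiq_cron_config_worker").enable!'
+```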
@@ -158,23 +164,28 @@ be disabled on the primary site:
1. If you are manually replicating any data not managed by Geo, trigger the
final replication process now.
-1. On the **primary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
- and wait for all queues except those with `geo` in the name to drop to 0.
- These queues contain work that has been submitted by your users; failing over
- before it is completed, causes the work to be lost.
-1. On the **primary** node, navigate to **Admin Area > Geo** and wait for the
- following conditions to be true of the **secondary** node you are failing over to:
-
- - All replication meters to each 100% replicated, 0% failures.
- - All verification meters reach 100% verified, 0% failures.
- - Database replication lag is 0ms.
- - The Geo log cursor is up to date (0 events behind).
-
-1. On the **secondary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
- and wait for all the `geo` queues to drop to 0 queued and 0 running jobs.
-1. On the **secondary** node, use [these instructions](../../raketasks/check.md)
- to verify the integrity of CI artifacts, LFS objects, and uploads in file
- storage.
+1. On the **primary** node:
+ 1. On the top bar, select **Menu >** **{admin}** **Admin**.
+ 1. On the left sidebar, select **Monitoring > Background Jobs**.
+ 1. On the Sidekiq dashboard, select **Queues**, and wait for all queues except
+ those with `geo` in the name to drop to 0.
+ These queues contain work that has been submitted by your users; failing over
+ before it is completed causes the work to be lost.
+ 1. On the left sidebar, select **Geo > Nodes** and wait for the
+ following conditions to be true of the **secondary** node you are failing over to:
+
+ - All replication meters reach 100% replicated, 0% failures.
+ - All verification meters reach 100% verified, 0% failures.
+ - Database replication lag is 0ms.
+ - The Geo log cursor is up to date (0 events behind).
+
+1. On the **secondary** node:
+ 1. On the top bar, select **Menu >** **{admin}** **Admin**.
+ 1. On the left sidebar, select **Monitoring > Background Jobs**.
+ 1. On the Sidekiq dashboard, select **Queues**, and wait for all the `geo`
+ queues to drop to 0 queued and 0 running jobs.
+ 1. [Run an integrity check](../../raketasks/check.md) to verify the integrity
+ of CI artifacts, LFS objects, and uploads in file storage.
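+
+A command-line sketch of the same checks (run on the node in question; the
+queue listing uses Sidekiq's standard API, and the Rake tasks are the ones
+described in the linked integrity check documentation):
+
+```shell
+# List Sidekiq queue sizes; wait for the geo queues to report 0.
+sudo gitlab-rails runner 'Sidekiq::Queue.all.each { |q| puts "#{q.name}: #{q.size}" }'
+
+# On the secondary's database, replication lag can be read directly
+# (PostgreSQL; a near-zero interval means fully caught up).
+sudo gitlab-psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
+
+# Verify file integrity of CI artifacts, LFS objects, and uploads.
+sudo gitlab-rake gitlab:artifacts:check
+sudo gitlab-rake gitlab:lfs:check
+sudo gitlab-rake gitlab:uploads:check
+```
+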
At this point, your **secondary** node contains an up-to-date copy of everything the
**primary** node has, meaning nothing is lost when you fail over.