Diffstat (limited to 'doc/administration/geo/disaster_recovery')
-rw-r--r--  doc/administration/geo/disaster_recovery/index.md            | 27
-rw-r--r--  doc/administration/geo/disaster_recovery/planned_failover.md | 24
2 files changed, 26 insertions, 25 deletions
diff --git a/doc/administration/geo/disaster_recovery/index.md b/doc/administration/geo/disaster_recovery/index.md
index 7c6f4a32b57..f6f88e9b193 100644
--- a/doc/administration/geo/disaster_recovery/index.md
+++ b/doc/administration/geo/disaster_recovery/index.md
@@ -7,17 +7,14 @@ type: howto
 
 # Disaster Recovery (Geo) **(PREMIUM SELF)**
 
-Geo replicates your database, your Git repositories, and few other assets.
-We will support and replicate more data in the future, that will enable you to
-failover with minimal effort, in a disaster situation.
-
-See [Geo limitations](../index.md#limitations) for more information.
+Geo replicates your database, your Git repositories, and a few other assets,
+but there are some [limitations](../index.md#limitations).
 
 WARNING:
 Disaster recovery for multi-secondary configurations is in **Alpha**.
 For the latest updates, check the
 [Disaster Recovery epic for complete maturity](https://gitlab.com/groups/gitlab-org/-/epics/3574).
 Multi-secondary configurations require the complete re-synchronization and re-configuration of all non-promoted secondaries and
-will cause downtime.
+cause downtime.
 
 ## Promoting a **secondary** Geo node in single-secondary configurations
 
@@ -91,13 +88,16 @@ Note the following when promoting a secondary:
   before proceeding. If the secondary node
   [has been paused](../../geo/index.md#pausing-and-resuming-replication), the promotion
   performs a point-in-time recovery to the last known state.
-  Data that was created on the primary while the secondary was paused will be lost.
+  Data that was created on the primary while the secondary was paused is lost.
 - A new **secondary** should not be added at this time. If you want to add a new
   **secondary**, do this after you have completed the entire process of promoting
   the **secondary** to the **primary**.
 - If you encounter an `ActiveRecord::RecordInvalid: Validation failed: Name has already been taken`
   error message during this process, for more information, see this
   [troubleshooting advice](../replication/troubleshooting.md#fixing-errors-during-a-failover-or-when-promoting-a-secondary-to-a-primary-node).
+- If you run into errors when using `--force` or `--skip-preflight-checks` before GitLab 13.5
+  during this process, for more information, see this
+  [troubleshooting advice](../replication/troubleshooting.md#errors-when-using---skip-preflight-checks-or---force).
 
 #### Promoting a **secondary** node running on a single machine
 
@@ -243,6 +243,7 @@ required:
   sets the database to read-write. The instructions vary depending on where your database is hosted:
   - [Amazon RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.html#USER_ReadRepl.Promote)
   - [Azure PostgreSQL](https://docs.microsoft.com/en-us/azure/postgresql/howto-read-replicas-portal#stop-replication)
+  - [Google Cloud SQL](https://cloud.google.com/sql/docs/mysql/replication/manage-replicas#promote-replica)
 - For other external PostgreSQL databases, save the following script in your
   secondary node, for example `/tmp/geo_promote.sh`, and modify the connection
   parameters to match your environment. Then, execute it to promote the replica:
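The script body itself falls outside this hunk. As a rough, hypothetical sketch only (not the script shipped in the documentation), promoting a plain PostgreSQL 12 or later streaming replica reduces to a single function call; the connection parameters below are placeholders:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of /tmp/geo_promote.sh: promote a PostgreSQL 12+
# streaming replica to read-write. Adjust the host, user, and database
# to match your environment before running.
PGHOST=replica.example.com PGUSER=gitlab PGDATABASE=gitlabhq_production \
  psql -c "SELECT pg_promote(wait => true);"
```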
@@ -493,7 +494,7 @@ must disable the **primary** site:
 
 WARNING:
 If the secondary site [has been paused](../../geo/index.md#pausing-and-resuming-replication), this performs
 a point-in-time recovery to the last known state.
-Data that was created on the primary while the secondary was paused will be lost.
+Data that was created on the primary while the secondary was paused is lost.
 
 1. SSH in to the database node in the **secondary** and trigger PostgreSQL to
    promote to read-write:
 
@@ -509,7 +510,7 @@ Data that was created on the primary while the secondary was paused will be lost
    `geo_secondary_role`:
 
    NOTE:
-   Depending on your architecture these steps will need to be run on any GitLab node that is external to the **secondary** Kubernetes cluster.
+   Depending on your architecture, these steps need to run on any GitLab node that is external to the **secondary** Kubernetes cluster.
 
    ```ruby
    ## In pre-11.5 documentation, the role was enabled as follows. Remove this line.
@@ -537,13 +538,13 @@ Data that was created on the primary while the secondary was paused will be lost
 
 1. Update the existing cluster configuration.
 
-   You can retrieve the existing config with Helm:
+   You can retrieve the existing configuration with Helm:
 
   ```shell
   helm --namespace gitlab get values gitlab-geo > gitlab.yaml
   ```
 
-   The existing config will contain a section for Geo that should resemble:
+   The existing configuration contains a section for Geo that should resemble:
 
   ```yaml
   geo:
@@ -560,9 +561,9 @@ Data that was created on the primary while the secondary was paused will be lost
    To promote the **secondary** cluster to a **primary** cluster, update
    `role: secondary` to `role: primary`.
 
-   You can remove the entire `psql` section if the cluster will remain as a primary site, this refers to the tracking database and will be ignored whilst the cluster is acting as a primary site.
+   If the cluster remains a primary site, you can remove the entire `psql` section; it refers to the tracking database and is ignored while the cluster is acting as a primary site.
 
-   Update the cluster with the new config:
+   Update the cluster with the new configuration:
 
   ```shell
   helm upgrade --install --version <current Chart version> gitlab-geo gitlab/gitlab --namespace gitlab -f gitlab.yaml
diff --git a/doc/administration/geo/disaster_recovery/planned_failover.md b/doc/administration/geo/disaster_recovery/planned_failover.md
index bd8467f5437..d50078da172 100644
--- a/doc/administration/geo/disaster_recovery/planned_failover.md
+++ b/doc/administration/geo/disaster_recovery/planned_failover.md
@@ -35,7 +35,7 @@ required scheduled maintenance period significantly.
 
 A common strategy for keeping this period as short as possible for data stored
 in files is to use `rsync` to transfer the data. An initial `rsync` can be
 performed ahead of the maintenance window; subsequent `rsync`s (including a
-final transfer inside the maintenance window) will then transfer only the
+final transfer inside the maintenance window) then transfer only the
 *changes* between the **primary** node and the **secondary** nodes.
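As an editorial illustration of this two-pass `rsync` strategy (the path and hostname are examples, not taken from this commit):

```shell
# First pass, run ahead of the maintenance window: bulk-copies everything.
rsync -aHAX --delete /var/opt/gitlab/gitlab-rails/uploads/ \
  secondary.example.com:/var/opt/gitlab/gitlab-rails/uploads/

# Second pass, run inside the window: sends only files changed since the
# first pass, so it completes far faster.
rsync -aHAX --delete /var/opt/gitlab/gitlab-rails/uploads/ \
  secondary.example.com:/var/opt/gitlab/gitlab-rails/uploads/
```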
 Repository-centric strategies for using `rsync` effectively can be found in the
@@ -50,7 +50,7 @@ this command reports `ERROR - Replication is not up-to-date` even if
 replication is actually up-to-date. This bug was fixed in GitLab 13.8 and
 later.
 
-Run this command to list out all preflight checks and automatically check if replication and verification are complete before scheduling a planned failover to ensure the process will go smoothly:
+Run this command to list out all preflight checks and automatically check if replication and verification are complete before scheduling a planned failover to ensure the process goes smoothly:
 
 ```shell
 gitlab-ctl promotion-preflight-checks
 ```
 
@@ -73,7 +73,7 @@ In GitLab 12.4, you can optionally allow GitLab to manage replication of Object
 
 Database settings are automatically replicated to the **secondary** node, but the
 `/etc/gitlab/gitlab.rb` file must be set up manually, and differs between
 nodes. If features such as Mattermost, OAuth or LDAP integration are enabled
-on the **primary** node but not the **secondary** node, they will be lost during failover.
+on the **primary** node but not the **secondary** node, they are lost during failover.
 
 Review the `/etc/gitlab/gitlab.rb` file for both nodes and ensure the **secondary** node
 supports everything the **primary** node does **before** scheduling a planned failover.
 
@@ -119,7 +119,7 @@ time to complete
 
 If any objects are failing to replicate, this should be investigated before
 scheduling the maintenance window. Following a planned failover, anything that
-failed to replicate will be **lost**.
+failed to replicate is **lost**.
 
 You can use the
 [Geo status API](../../../api/geo_nodes.md#retrieve-project-sync-or-verification-failures-that-occurred-on-the-current-node)
 to review failed objects and the reasons for failure.
 
@@ -136,9 +136,9 @@ This [content was moved to another location](background_verification.md).
 
 On the **primary** node, navigate to **Admin Area > Messages**, add a
 broadcast message. You can check under **Admin Area > Geo** to estimate how
-long it will take to finish syncing. An example message would be:
+long it takes to finish syncing. An example message would be:
 
-> A scheduled maintenance will take place at XX:XX UTC. We expect it to take
+> A scheduled maintenance takes place at XX:XX UTC. We expect it to take
 > less than 1 hour.
 
 ## Prevent updates to the **primary** node
 
@@ -151,7 +151,7 @@ be disabled on the primary site:
 
 1. Disable non-Geo periodic background jobs on the **primary** node by navigating
    to **Admin Area > Monitoring > Background Jobs > Cron**, pressing `Disable All`,
    and then pressing `Enable` for the `geo_sidekiq_cron_config_worker` cron job.
-   This job will re-enable several other cron jobs that are essential for planned
+   This job re-enables several other cron jobs that are essential for planned
    failover to complete successfully.
 
 ## Finish replicating and verifying all data
 
@@ -161,7 +161,7 @@ be disabled on the primary site:
 
 1. On the **primary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
    and wait for all queues except those with `geo` in the name to drop to 0.
    These queues contain work that has been submitted by your users; failing over
-   before it is completed will cause the work to be lost.
+   before it is completed causes the work to be lost.
 1. On the **primary** node, navigate to **Admin Area > Geo** and wait for the
    following conditions to be true of the **secondary** node you are failing over to:
 
@@ -176,15 +176,15 @@ be disabled on the primary site:
    to verify the integrity of CI artifacts, LFS objects, and uploads in file storage.
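The verification commands referenced in this step sit outside the hunk; on an Omnibus GitLab node they are Rake tasks along these lines (a reminder only; the linked documentation is authoritative):

```shell
# Verify the integrity of locally stored files before failing over.
sudo gitlab-rake gitlab:artifacts:check
sudo gitlab-rake gitlab:lfs:check
sudo gitlab-rake gitlab:uploads:check
```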
-At this point, your **secondary** node will contain an up-to-date copy of everything the
-**primary** node has, meaning nothing will be lost when you fail over.
+At this point, your **secondary** node contains an up-to-date copy of everything the
+**primary** node has, meaning nothing is lost when you fail over.
 
 ## Promote the **secondary** node
 
 Finally, follow the [Disaster Recovery docs](index.md) to promote the
-**secondary** node to a **primary** node. This process will cause a brief outage on the **secondary** node, and users may need to log in again.
+**secondary** node to a **primary** node. This process causes a brief outage on the **secondary** node, and users may need to log in again.
 
-Once it is completed, the maintenance window is over! Your new **primary** node will now
-begin to diverge from the old one. If problems do arise at this point, failing
+Once it is completed, the maintenance window is over! Your new **primary** node now
+begins to diverge from the old one. If problems do arise at this point, failing
 back to the old **primary** node [is possible](bring_primary_back.md), but likely to
 result in the loss of any data uploaded to the new **primary** in the meantime.