gitlab.com/gitlab-org/gitlab-foss.git

author     GitLab Bot <gitlab-bot@gitlab.com>  2022-11-17 14:33:21 +0300
committer  GitLab Bot <gitlab-bot@gitlab.com>  2022-11-17 14:33:21 +0300
commit     7021455bd1ed7b125c55eb1b33c5a01f2bc55ee0 (patch)
tree       5bdc2229f5198d516781f8d24eace62fc7e589e9 /doc/administration/geo/replication
parent     185b095e93520f96e9cfc31d9c3e69b498cdab7c (diff)

Add latest changes from gitlab-org/gitlab@15-6-stable-ee (v15.6.0-rc42)
Diffstat (limited to 'doc/administration/geo/replication')
-rw-r--r--  doc/administration/geo/replication/configuration.md            |   7
-rw-r--r--  doc/administration/geo/replication/datatypes.md                |   3
-rw-r--r--  doc/administration/geo/replication/disable_geo.md              |   4
-rw-r--r--  doc/administration/geo/replication/docker_registry.md          |  11
-rw-r--r--  doc/administration/geo/replication/faq.md                      |  10
-rw-r--r--  doc/administration/geo/replication/geo_validation_tests.md     |   4
-rw-r--r--  doc/administration/geo/replication/location_aware_git_url.md   |   2
-rw-r--r--  doc/administration/geo/replication/remove_geo_site.md          |   5
-rw-r--r--  doc/administration/geo/replication/security_review.md          |  10
-rw-r--r--  doc/administration/geo/replication/troubleshooting.md          | 429
10 files changed, 298 insertions, 187 deletions
diff --git a/doc/administration/geo/replication/configuration.md b/doc/administration/geo/replication/configuration.md
index fa74f16cdc8..55c5d3784c2 100644
--- a/doc/administration/geo/replication/configuration.md
+++ b/doc/administration/geo/replication/configuration.md
@@ -12,7 +12,7 @@ type: howto
NOTE:
This is the final step in setting up a **secondary** Geo site. Stages of the
setup process must be completed in the documented order.
-If not, [complete all prior stages](../setup/index.md#using-omnibus-gitlab) before procceed.
+If not, [complete all prior stages](../setup/index.md#using-omnibus-gitlab) before proceeding.
Make sure you [set up the database replication](../setup/database.md), and [configured fast lookup of authorized SSH keys](../../operations/fast_ssh_key_lookup.md) in **both primary and secondary sites**.
@@ -239,8 +239,9 @@ keys must be manually replicated to the **secondary** site.
If any of the checks fail, check the [troubleshooting documentation](troubleshooting.md).
-Once added to the Geo administration page and restarted, the **secondary** site automatically starts
-replicating missing data from the **primary** site in a process known as **backfill**.
+After the **secondary** site is added to the Geo administration page and restarted,
+the site automatically starts replicating missing data from the **primary** site
+in a process known as **backfill**.
Meanwhile, the **primary** site starts to notify each **secondary** site of any changes, so
that the **secondary** site can act on those notifications immediately.
diff --git a/doc/administration/geo/replication/datatypes.md b/doc/administration/geo/replication/datatypes.md
index 566df2ee509..0198d2a63e8 100644
--- a/doc/administration/geo/replication/datatypes.md
+++ b/doc/administration/geo/replication/datatypes.md
@@ -201,8 +201,7 @@ successfully, you must replicate their data using some other means.
|[CI job artifacts](../../../ci/pipelines/job_artifacts.md) | **Yes** (10.4) | **Yes** (14.10) | [**Yes** (15.1)](https://gitlab.com/groups/gitlab-org/-/epics/5551) | [No](object_storage.md#verification-of-files-in-object-storage) | Verification is behind the feature flag `geo_job_artifact_replication`, enabled by default in 14.10. |
|[CI Pipeline Artifacts](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/ci/pipeline_artifact.rb) | [**Yes** (13.11)](https://gitlab.com/gitlab-org/gitlab/-/issues/238464) | [**Yes** (13.11)](https://gitlab.com/gitlab-org/gitlab/-/issues/238464) | [**Yes** (15.1)](https://gitlab.com/groups/gitlab-org/-/epics/5551) | [No](object_storage.md#verification-of-files-in-object-storage) | Persists additional artifacts after a pipeline completes. |
|[CI Secure Files](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/ci/secure_file.rb) | [**Yes** (15.3)](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/91430) | [**Yes** (15.3)](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/91430) | [**Yes** (15.3)](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/91430) | [No](object_storage.md#verification-of-files-in-object-storage) | Verification is behind the feature flag `geo_ci_secure_file_replication`, enabled by default in 15.3. |
-|[Container Registry](../../packages/container_registry.md) | **Yes** (12.3)* | No | No | No | Replication is behind feature flag `geo_container_repository_replication`, enabled by default.
-Requires additional configuration. See [instructions](container_registry.md) to set up the Container Registry replication. |
+|[Container Registry](../../packages/container_registry.md) | **Yes** (12.3)* | No | No | No | Replication is behind feature flag `geo_container_repository_replication`, enabled by default. Requires additional configuration. See [instructions](container_registry.md) to set up the Container Registry replication. |
|[Infrastructure Registry](../../../user/packages/infrastructure_registry/index.md) | **Yes** (14.0) | **Yes** (14.0) | [**Yes** (15.1)](https://gitlab.com/groups/gitlab-org/-/epics/5551) | [No](object_storage.md#verification-of-files-in-object-storage) | Behind feature flag `geo_package_file_replication`, enabled by default. |
|[Project designs repository](../../../user/project/issues/design_management.md) | **Yes** (12.7) | [No](https://gitlab.com/gitlab-org/gitlab/-/issues/32467) | N/A | N/A | Designs also require replication of LFS objects and Uploads. |
|[Package Registry](../../../user/packages/package_registry/index.md) | **Yes** (13.2) | **Yes** (13.10) | [**Yes** (15.1)](https://gitlab.com/groups/gitlab-org/-/epics/5551) | [No](object_storage.md#verification-of-files-in-object-storage) | Behind feature flag `geo_package_file_replication`, enabled by default. |
diff --git a/doc/administration/geo/replication/disable_geo.md b/doc/administration/geo/replication/disable_geo.md
index 3230a92136f..c42130a62a7 100644
--- a/doc/administration/geo/replication/disable_geo.md
+++ b/doc/administration/geo/replication/disable_geo.md
@@ -24,8 +24,8 @@ To disable Geo, follow these steps:
## Remove all secondary Geo sites
-To disable Geo, you need to first remove all your secondary Geo sites, which means replication will not happen
-anymore on these sites. You can follow our docs to [remove your secondary Geo sites](remove_geo_site.md).
+To disable Geo, you need to first remove all your secondary Geo sites, which means replication does not happen
+anymore on these sites. You can follow our documentation to [remove your secondary Geo sites](remove_geo_site.md).
If the current site that you want to keep using is a secondary site, you need to first promote it to primary.
You can use our steps on [how to promote a secondary site](../disaster_recovery/index.md#step-3-promoting-a-secondary-site)
diff --git a/doc/administration/geo/replication/docker_registry.md b/doc/administration/geo/replication/docker_registry.md
deleted file mode 100644
index d0af6f2a66f..00000000000
--- a/doc/administration/geo/replication/docker_registry.md
+++ /dev/null
@@ -1,11 +0,0 @@
----
-redirect_to: 'container_registry.md'
-remove_date: '2022-10-29'
----
-
-This document was moved to [another location](container_registry.md).
-
-<!-- This redirect file can be deleted after <2022-10-29>. -->
-<!-- Redirects that point to other docs in the same project expire in three months. -->
-<!-- Redirects that point to docs in a different project or site (link is not relative and starts with `https:`) expire in one year. -->
-<!-- Before deletion, see: https://docs.gitlab.com/ee/development/documentation/redirects.html -->
\ No newline at end of file
diff --git a/doc/administration/geo/replication/faq.md b/doc/administration/geo/replication/faq.md
index 81afcc19bb1..311cdeee5b9 100644
--- a/doc/administration/geo/replication/faq.md
+++ b/doc/administration/geo/replication/faq.md
@@ -22,7 +22,7 @@ For each project to sync:
1. Geo issues a `git fetch geo --mirror` to get the latest information from the **primary** site.
If there are no changes, the sync is fast. Otherwise, it has to pull the latest commits.
-1. The **secondary** site updates the tracking database to store the fact that it has synced projects A, B, C, and so on.
+1. The **secondary** site updates the tracking database to store the fact that it has synced projects by name.
1. Repeat until all projects are synced.
When someone pushes a commit to the **primary** site, it generates an event in the GitLab database that the repository has changed.
@@ -45,8 +45,8 @@ Read the documentation for [Disaster Recovery](../disaster_recovery/index.md).
## What data is replicated to a **secondary** site?
We currently replicate project repositories, LFS objects, generated
-attachments and avatars, and the whole database. This means user accounts,
-issues, merge requests, groups, project data, and so on, are available for
+attachments and avatars, and the whole database. This means information such as user accounts,
+issues, merge requests, groups, and project data is available for
query.
For more details, see the [supported Geo data types](datatypes.md).
@@ -58,8 +58,8 @@ Pushing directly to a **secondary** site (for both HTTP and SSH, including Git L
## How long does it take to have a commit replicated to a **secondary** site?
All replication operations are asynchronous and are queued to be dispatched. Therefore, it depends on a lot of
-factors including the amount of traffic, how big your commit is, the
-connectivity between your sites, your hardware, and so on.
+factors such as the amount of traffic, how big your commit is, the
+connectivity between your sites, and your hardware.
## What if the SSH server runs at a different port?
diff --git a/doc/administration/geo/replication/geo_validation_tests.md b/doc/administration/geo/replication/geo_validation_tests.md
index 8fa5a45b579..f09422d1e26 100644
--- a/doc/administration/geo/replication/geo_validation_tests.md
+++ b/doc/administration/geo/replication/geo_validation_tests.md
@@ -29,7 +29,7 @@ The following are GitLab upgrade validation tests we performed.
[Switch from repmgr to Patroni on a Geo primary site](https://gitlab.com/gitlab-org/gitlab/-/issues/224652):
- Description: Tested switching from repmgr to Patroni on a multi-node Geo primary site. Used [the orchestrator tool](https://gitlab.com/gitlab-org/gitlab-orchestrator) to deploy a Geo installation with 3 database nodes managed by repmgr. With this approach, we were also able to address a related issue for [verifying a Geo installation with Patroni and PostgreSQL 11](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/5113).
-- Outcome: Partial success. We enabled Patroni on the primary site and set up database replication on the secondary site. However, we found that Patroni would delete the secondary site's replication slot whenever Patroni was restarted. Another issue is that when Patroni elects a new leader in the cluster, the secondary site will fail to automatically follow the new leader. Until these issues are resolved, we cannot officially support and recommend Patroni for Geo installations.
+- Outcome: Partial success. We enabled Patroni on the primary site and set up database replication on the secondary site. However, we found that Patroni would delete the secondary site's replication slot whenever Patroni was restarted. Another issue is that when Patroni elects a new leader in the cluster, the secondary site fails to automatically follow the new leader. Until these issues are resolved, we cannot officially support and recommend Patroni for Geo installations.
- Follow up issues/actions:
- [Investigate permanent replication slot for Patroni with Geo single node secondary](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/5528)
@@ -213,7 +213,7 @@ The following are additional validation tests we performed.
[Validate Object storage replication using GCP based object storage](https://gitlab.com/gitlab-org/gitlab/-/issues/351464):
- Description: Tested the average time it takes for a single image to replicate from the primary object storage location to the secondary when using GCP based object storage replication and [GitLab based object storage replication](object_storage.md#enabling-gitlab-managed-object-storage-replication). This was tested by uploading a 1mb image to a project on the primary site every second for 60 seconds. The time was then measured until a image was available on the secondary site. This was achieved using a [Ruby Script](https://gitlab.com/gitlab-org/quality/geo-replication-tester).
-- Outcome: GCP handles replication differently than other Cloud Providers. In GCP, the process is to a create single bucket that is either multi, dual or single region based. This means that the bucket will automatically store replicas in a region based on the option chosen. Even when using multi region, this will still only replicate within a single continent, the options being America, Europe, or Asia. At current there doesn't seem to be any way to replicate objects between continents using GCP based replication. For Geo managed replication the average time when replicating within the same region was 6 seconds, and when replicating cross region this rose to just 9 seconds.
+- Outcome: GCP handles replication differently than other Cloud Providers. In GCP, the process is to create a single bucket that is either multi, dual, or single region based. This means that the bucket automatically stores replicas in a region based on the option chosen. Even when using multi region, this only replicates in a single continent, the options being America, Europe, or Asia. Currently, there doesn't seem to be any way to replicate objects between continents using GCP based replication. For Geo managed replication, the average time when replicating in the same region was 6 seconds, and when replicating cross region this rose to just 9 seconds.
## Other tests
diff --git a/doc/administration/geo/replication/location_aware_git_url.md b/doc/administration/geo/replication/location_aware_git_url.md
index e0e113eebbd..dbe543f5a62 100644
--- a/doc/administration/geo/replication/location_aware_git_url.md
+++ b/doc/administration/geo/replication/location_aware_git_url.md
@@ -31,7 +31,7 @@ In this example, we have already set up:
- `primary.example.com` as a Geo **primary** site.
- `secondary.example.com` as a Geo **secondary** site.
-We will create a `git.example.com` subdomain that will automatically direct
+We create a `git.example.com` subdomain that automatically directs
requests:
- From Europe to the **secondary** site.
diff --git a/doc/administration/geo/replication/remove_geo_site.md b/doc/administration/geo/replication/remove_geo_site.md
index 62b1d9fdf7b..4b9f31dc08c 100644
--- a/doc/administration/geo/replication/remove_geo_site.md
+++ b/doc/administration/geo/replication/remove_geo_site.md
@@ -14,7 +14,8 @@ type: howto
1. Select the **Remove** button for the **secondary** site you want to remove.
1. Confirm by selecting **Remove** when the prompt appears.
-Once removed from the Geo administration page, you must stop and uninstall the **secondary** site. For each node on your secondary Geo site:
+After the **secondary** site is removed from the Geo administration page, you must
+stop and uninstall this site. For each node on your secondary Geo site:
1. Stop GitLab:
@@ -35,7 +36,7 @@ Once removed from the Geo administration page, you must stop and uninstall the *
sudo rpm --erase gitlab-ee
```
-Once GitLab has been uninstalled from each node on the **secondary** site, the replication slot must be dropped from the **primary** site's database as follows:
+When GitLab has been uninstalled from each node on the **secondary** site, the replication slot must be dropped from the **primary** site's database as follows:
1. On the **primary** site's database node, start a PostgreSQL console session:
diff --git a/doc/administration/geo/replication/security_review.md b/doc/administration/geo/replication/security_review.md
index 0231da53b9c..afe831dcb9c 100644
--- a/doc/administration/geo/replication/security_review.md
+++ b/doc/administration/geo/replication/security_review.md
@@ -25,8 +25,8 @@ from [owasp.org](https://owasp.org/).
### What data does the application receive, produce, and process?
- Geo streams almost all data held by a GitLab instance between sites. This
- includes full database replication, most files (user-uploaded attachments,
- and so on) and repository + wiki data. In a typical configuration, this will
+ includes full database replication, most files such as user-uploaded attachments,
+ and repository + wiki data. In a typical configuration, this will
happen across the public Internet, and be TLS-encrypted.
- PostgreSQL replication is TLS-encrypted.
- See also: [only TLSv1.2 should be supported](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/2948)
@@ -37,7 +37,7 @@ from [owasp.org](https://owasp.org/).
private projects. Geo replicates them all indiscriminately. "Selective sync"
exists for files and repositories (but not database content), which would permit
only less-sensitive projects to be replicated to a **secondary** site if desired.
-- See also: [GitLab data classification policy](https://about.gitlab.com/handbook/engineering/security/data-classification-standard.html).
+- See also: [GitLab data classification policy](https://about.gitlab.com/handbook/security/data-classification-standard.html).
### What data backup and retention requirements have been defined for the application?
@@ -59,8 +59,8 @@ from [owasp.org](https://owasp.org/).
(notably a HTTP/HTTPS web application, and HTTP/HTTPS or SSH Git repository
access), but is constrained to read-only activities. The principal use case is
envisioned to be cloning Git repositories from the **secondary** site in favor of the
- **primary** site, but end-users may use the GitLab web interface to view projects,
- issues, merge requests, snippets, and so on.
+ **primary** site, but end-users may use the GitLab web interface to view information like projects,
+ issues, merge requests, and snippets.
### What security expectations do the end-users have?
diff --git a/doc/administration/geo/replication/troubleshooting.md b/doc/administration/geo/replication/troubleshooting.md
index 3f16c1552ad..fa668091c90 100644
--- a/doc/administration/geo/replication/troubleshooting.md
+++ b/doc/administration/geo/replication/troubleshooting.md
@@ -12,8 +12,9 @@ miss a step.
Here is a list of steps you should take to attempt to fix the problem:
1. Perform [basic troubleshooting](#basic-troubleshooting).
-1. Fix any [replication errors](#fixing-replication-errors).
+1. Fix any [PostgreSQL database replication errors](#fixing-postgresql-database-replication-errors).
1. Fix any [common](#fixing-common-errors) errors.
+1. Fix any [non-PostgreSQL replication failures](#fixing-non-postgresql-replication-failures).
## Basic troubleshooting
@@ -131,6 +132,8 @@ http://secondary.example.com/
To find more details about failed items, check
[the `gitlab-rails/geo.log` file](../../logs/log_parsing.md#find-most-common-geo-sync-errors)
+If you notice replication or verification failures, you can try to [resolve them](#fixing-non-postgresql-replication-failures).
+
### Check if PostgreSQL replication is working
To check if PostgreSQL replication is working, check if:
@@ -185,6 +188,41 @@ This machine's Geo node name matches a database record ... no
Learn more about recommended site names in the description of the Name field in
[Geo Admin Area Common Settings](../../../user/admin_area/geo_sites.md#common-settings).
+### Reverify all uploads (or any SSF data type which is verified)
+
+WARNING:
+Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.
+
+1. SSH into a GitLab Rails node in the primary Geo site.
+1. Open [Rails console](../../operations/rails_console.md).
+1. Mark all uploads as "pending verification":
+
+ ```ruby
+ Upload.verification_state_table_class.each_batch do |relation|
+ relation.update_all(verification_state: 0)
+ end
+ ```
+
+1. This causes the primary site to start checksumming all Uploads.
+1. When the primary site successfully checksums a record, all secondary sites recalculate the checksum as well and compare the values.
+
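+If you want to watch reverification progress from the same Rails console, the following is a rough
+sketch, not an official check. It assumes the verification state table exposes the
+`verification_state` column used in the command above, and simply counts records per state:
+
+```ruby
+# Progress check sketch: counts upload verification records per state.
+# Assumes the state table has a `verification_state` column, as used above.
+Upload.verification_state_table_class.group(:verification_state).count
+```
+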
+A similar thing can be done for all Models handled by the [Geo Self-Service Framework](../../../development/geo/framework.md) which have implemented verification:
+
+- `LfsObject`
+- `MergeRequestDiff`
+- `Packages::PackageFile`
+- `Terraform::StateVersion`
+- `SnippetRepository`
+- `Ci::PipelineArtifact`
+- `PagesDeployment`
+- `Upload`
+- `Ci::JobArtifact`
+- `Ci::SecureFile`
+
+NOTE:
+`GroupWikiRepository` is not in the previous list since verification is not implemented.
+There is an [issue to implement this functionality in the Admin Area UI](https://gitlab.com/gitlab-org/gitlab/-/issues/364729).
+
### Message: `WARNING: oldest xmin is far in the past` and `pg_wal` size growing
If a replication slot is inactive,
@@ -311,52 +349,41 @@ sudo gitlab-rake gitlab:geo:check
When performing a PostgreSQL major version (9 > 10) update this is expected. Follow
the [initiate-the-replication-process](../setup/database.md#step-3-initiate-the-replication-process).
-### Repository verification failures
+### Message: Machine clock is synchronized ... Exception
-[Start a Rails console session](../../../administration/operations/rails_console.md#starting-a-rails-console-session)
-to gather the following, basic troubleshooting information.
+The Rake task attempts to verify that the server clock is synchronized with NTP. Synchronized clocks
+are required for Geo to function correctly. As an example, for security, when the server time on the
+primary site and secondary site differ by about a minute or more, requests between Geo sites
+will fail. If this check task fails to complete due to a reason other than mismatching times, it
+does not necessarily mean that Geo will not work.
-WARNING:
-Any command that changes data directly could be damaging if not run correctly, or under the right conditions. We highly recommend running them in a test environment with a backup of the instance ready to be restored, just in case.
+The Ruby gem which performs the check is hard coded with `pool.ntp.org` as its reference time source.
-#### Get the number of verification failed repositories
+- Exception message `Machine clock is synchronized ... Exception: Timeout::Error`
-```ruby
-Geo::ProjectRegistry.verification_failed('repository').count
-```
+ This issue occurs when your server cannot access the host `pool.ntp.org`.
-#### Find the verification failed repositories
+- Exception message `Machine clock is synchronized ... Exception: No route to host - recvfrom(2)`
-```ruby
-Geo::ProjectRegistry.verification_failed('repository')
-```
+ This issue occurs when the hostname `pool.ntp.org` resolves to a server which does not provide a time service.
-#### Find repositories that failed to sync
+There is [an issue open](https://gitlab.com/gitlab-org/gitlab/-/issues/381422) for this dependency on `pool.ntp.org`.
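+
+As a quick diagnostic, you can check what `pool.ntp.org` resolves to from the affected node in a
+Rails console. This is only a sketch using the Ruby standard library, not part of the Rake task
+itself:
+
+```ruby
+# Diagnostic sketch: resolve pool.ntp.org from this node using the Ruby standard library.
+# An empty result suggests a DNS problem; addresses that are unreachable or that do not run
+# an NTP service lead to the timeout or "No route to host" errors described above.
+require 'resolv'
+Resolv.getaddresses('pool.ntp.org')
+```
+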
-```ruby
-Geo::ProjectRegistry.sync_failed('repository')
-```
-
-### Resync repositories
-
-[Start a Rails console session](../../../administration/operations/rails_console.md#starting-a-rails-console-session)
-to enact the following, basic troubleshooting steps.
-
-#### Queue up all repositories for resync. Sidekiq handles each sync
+To work around this, do one of the following:
-```ruby
-Geo::ProjectRegistry.update_all(resync_repository: true, resync_wiki: true)
-```
+- Add entries in `/etc/hosts` for `pool.ntp.org` to direct the request to valid local time servers.
+ This fixes the long timeout and the timeout error.
+- Direct the check to any valid IP address. This resolves the timeout issue, but the check will fail
+ with the `No route to host` error, as noted above.
-#### Sync individual repository now
+[Cloud native GitLab deployments](https://docs.gitlab.com/charts/advanced/geo/#set-the-geo-primary-site)
+generate an error because containers in Kubernetes do not have access to the host clock:
-```ruby
-project = Project.find_by_full_path('<group/project>')
-
-Geo::RepositorySyncService.new(project).execute
+```plaintext
+Machine clock is synchronized ... Exception: getaddrinfo: Servname not supported for ai_socktype
```
-## Fixing replication errors
+## Fixing PostgreSQL database replication errors
The following sections outline troubleshooting steps for fixing replication
error messages (indicated by `Database replication working? ... no` in the
@@ -469,7 +496,7 @@ This happens because the PostgreSQL certificate that the Omnibus GitLab package
the Common Name `PostgreSQL`, but the replication is connecting to a different host and GitLab attempts to use
the `verify-full` SSL mode by default.
-In order to fix this, you can either:
+To fix this issue, you can either:
- Use the `--sslmode=verify-ca` argument with the `replicate-geo-database` command.
- For an already replicated database, change `sslmode=verify-full` to `sslmode=verify-ca`
@@ -837,120 +864,6 @@ This behavior affects only the following data types through GitLab 14.6:
to make Geo visibly surface data loss risks. The sync/verification loop is
therefore short-circuited. `last_sync_failure` is now set to `The file is missing on the Geo primary site`.
-### Blob types
-
-- `Ci::JobArtifact`
-- `Ci::PipelineArtifact`
-- `Ci::SecureFile`
-- `LfsObject`
-- `MergeRequestDiff`
-- `Packages::PackageFile`
-- `PagesDeployment`
-- `Terraform::StateVersion`
-- `Upload`
-
-`Packages::PackageFile` is used in the following
-[Rails console](../../../administration/operations/rails_console.md#starting-a-rails-console-session)
-examples, but things generally work the same for the other types.
-
-WARNING:
-Any command that changes data directly could be damaging if not run correctly, or under the right conditions. We highly recommend running them in a test environment with a backup of the instance ready to be restored, just in case.
-
-#### The Replicator
-
-The main kinds of classes are Registry, Model, and Replicator. If you have an instance of one of these classes, you can get the others. The Registry and Model mostly manage PostgreSQL DB state. The Replicator knows how to replicate/verify (or it can call a service to do it):
-
-```ruby
-model_record = Packages::PackageFile.last
-model_record.replicator.registry.replicator.model_record # just showing that these methods exist
-```
-
-#### Replicate a package file, synchronously, given an ID
-
-```ruby
-model_record = Packages::PackageFile.find(id)
-model_record.replicator.send(:download)
-```
-
-#### Replicate a package file, synchronously, given a registry ID
-
-```ruby
-registry = Geo::PackageFileRegistry.find(registry_id)
-registry.replicator.send(:download)
-```
-
-#### Verify package files on the secondary manually
-
-This iterates over all package files on the secondary, looking at the
-`verification_checksum` stored in the database (which came from the primary)
-and then calculate this value on the secondary to check if they match. This
-does not change anything in the UI:
-
-```ruby
-# Run on secondary
-status = {}
-
-Packages::PackageFile.find_each do |package_file|
- primary_checksum = package_file.verification_checksum
- secondary_checksum = Packages::PackageFile.hexdigest(package_file.file.path)
- verification_status = (primary_checksum == secondary_checksum)
-
- status[verification_status.to_s] ||= []
- status[verification_status.to_s] << package_file.id
-end
-
-# Count how many of each value we get
-status.keys.each {|key| puts "#{key} count: #{status[key].count}"}
-
-# See the output in its entirety
-status
-```
-
-### Repository types newer than project/wiki repositories
-
-- `SnippetRepository`
-- `GroupWikiRepository`
-
-`SnippetRepository` is used in the examples below, but things generally work the same for the other Repository types.
-
-#### The Replicator
-
-The main kinds of classes are Registry, Model, and Replicator. If you have an instance of one of these classes, you can get the others. The Registry and Model mostly manage PostgreSQL DB state. The Replicator knows how to replicate/verify (or it can call a service to do it).
-
-```ruby
-model_record = SnippetRepository.last
-model_record.replicator.registry.replicator.model_record # just showing that these methods exist
-```
-
-#### Replicate a snippet repository, synchronously, given an ID
-
-```ruby
-model_record = SnippetRepository.find(id)
-model_record.replicator.send(:sync_repository)
-```
-
-#### Replicate a snippet repository, synchronously, given a registry ID
-
-```ruby
-registry = Geo::SnippetRepositoryRegistry.find(registry_id)
-registry.replicator.send(:sync_repository)
-```
-
-### Find failed artifacts
-
-[Start a Rails console session](../../../administration/operations/rails_console.md#starting-a-rails-console-session)
-to run the following commands:
-
-```ruby
-Geo::JobArtifactRegistry.failed
-```
-
-#### Find `ID` of synced artifacts that are missing on primary
-
-```ruby
-Geo::JobArtifactRegistry.synced.missing_on_primary.pluck(:artifact_id)
-```
-
#### Failed syncs with GitLab-managed object storage replication
There is [an issue in GitLab 14.2 through 14.7](https://gitlab.com/gitlab-org/gitlab/-/issues/299819#note_822629467)
@@ -1218,7 +1131,8 @@ If you set up a new secondary from scratch, you must also [remove the old site f
The most common problems that prevent the database from replicating correctly are:
-- **Secondary** sites cannot reach the **primary** site. Check credentials, [firewall rules](../index.md#firewall-rules), and so on.
+- **Secondary** sites cannot reach the **primary** site. Check credentials and
+ [firewall rules](../index.md#firewall-rules).
- SSL certificate problems. Make sure you copied `/etc/gitlab/gitlab-secrets.json` from the **primary** site.
- Database storage disk is full.
- Database replication slot is misconfigured.
@@ -1320,6 +1234,217 @@ To fix this issue, set the primary site's internal URL to a URL that is:
GeoNode.where(primary: true).first.update!(internal_url: "https://unique.url.for.primary.site")
```
+## Fixing non-PostgreSQL replication failures
+
+If you notice replication failures in `Admin > Geo > Sites` or the [Sync status Rake task](#sync-status-rake-task), you can try to resolve the failures with the following general steps:
+
+1. Geo will automatically retry failures. If the failures are new and few in number, or if you suspect the root cause is already resolved, then you can wait to see if the failures go away.
+1. If failures were present for a long time, then many retries have already occurred, and the interval between automatic retries has increased to up to 4 hours depending on the type of failure. If you suspect the root cause is already resolved, you can [manually retry replication or verification](#manually-retry-replication-or-verification).
+1. If the failures persist, use the following sections to try to resolve them.
+
+### Manually retry replication or verification
+
+For project Git repositories and project wiki Git repositories, `Admin > Geo > Replication` provides `Resync all` and `Reverify all` actions, as well as `Resync` and `Reverify` for a single resource.
+
+Adding this ability to other data types is proposed in issue [364725](https://gitlab.com/gitlab-org/gitlab/-/issues/364725).
+
+The following sections describe how to use internal application commands in the [Rails console](../../../administration/operations/rails_console.md#starting-a-rails-console-session) to cause replication or verification immediately.
+
+WARNING:
+Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.
+
+### Blob types
+
+- `Ci::JobArtifact`
+- `Ci::PipelineArtifact`
+- `Ci::SecureFile`
+- `LfsObject`
+- `MergeRequestDiff`
+- `Packages::PackageFile`
+- `PagesDeployment`
+- `Terraform::StateVersion`
+- `Upload`
+
+`Packages::PackageFile` is used in the following
+[Rails console](../../../administration/operations/rails_console.md#starting-a-rails-console-session)
+examples, but things generally work the same for the other types.
+
+WARNING:
+Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.
+
+#### The Replicator
+
+The main kinds of classes are Registry, Model, and Replicator. If you have an instance of one of these classes, you can get the others. The Registry and Model mostly manage PostgreSQL DB state. The Replicator knows how to replicate/verify (or it can call a service to do it):
+
+```ruby
+model_record = Packages::PackageFile.last
+model_record.replicator.registry.replicator.model_record # just showing that these methods exist
+```
+
+#### Replicate a package file, synchronously, given an ID
+
+```ruby
+model_record = Packages::PackageFile.find(id)
+model_record.replicator.send(:download)
+```
+
+#### Replicate a package file, synchronously, given a registry ID
+
+```ruby
+registry = Geo::PackageFileRegistry.find(registry_id)
+registry.replicator.send(:download)
+```
+
+#### Verify package files on the secondary manually
+
+This iterates over all package files on the secondary, looking at the
+`verification_checksum` stored in the database (which came from the primary)
+and then calculating this value on the secondary to check if they match. This
+does not change anything in the UI:
+
+```ruby
+# Run on secondary
+status = {}
+
+Packages::PackageFile.find_each do |package_file|
+ primary_checksum = package_file.verification_checksum
+ secondary_checksum = Packages::PackageFile.hexdigest(package_file.file.path)
+ verification_status = (primary_checksum == secondary_checksum)
+
+ status[verification_status.to_s] ||= []
+ status[verification_status.to_s] << package_file.id
+end
+
+# Count how many of each value we get
+status.keys.each {|key| puts "#{key} count: #{status[key].count}"}
+
+# See the output in its entirety
+status
+```
+
+### Reverify all uploads (or any SSF data type which is verified)
+
+1. SSH into a GitLab Rails node in the primary Geo site.
+1. Open [Rails console](../../../administration/operations/rails_console.md#starting-a-rails-console-session).
+1. Mark all uploads as "pending verification":
+
+ ```ruby
+ Upload.verification_state_table_class.each_batch do |relation|
+ relation.update_all(verification_state: 0)
+ end
+ ```
+
+1. This causes the primary site to start checksumming all Uploads.
+1. When the primary site successfully checksums a record, all secondary sites recalculate the checksum as well and compare the values.
+
+For other SSF data types, replace `Upload` in the command above with the desired model class.
+
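+For example, to mark all LFS objects as pending verification instead, the same substitution applies
+(a sketch of the command above with `LfsObject` in place of `Upload`, not a separate API):
+
+```ruby
+# Same pattern as above, with LfsObject substituted for Upload.
+LfsObject.verification_state_table_class.each_batch do |relation|
+  relation.update_all(verification_state: 0)
+end
+```
+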
+NOTE:
+There is an [issue to implement this functionality in the Admin Area UI](https://gitlab.com/gitlab-org/gitlab/-/issues/364729).
+
+### Repository types, except for project or project wiki repositories
+
+- `SnippetRepository`
+- `GroupWikiRepository`
+
+`SnippetRepository` is used in the examples below, but things generally work the same for the other Repository types.
+
+[Start a Rails console session](../../../administration/operations/rails_console.md#starting-a-rails-console-session)
+to run the following basic troubleshooting steps.
+
+WARNING:
+Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.
+
+#### The Replicator
+
+The main kinds of classes are Registry, Model, and Replicator. If you have an instance of one of these classes, you can get the others. The Registry and Model mostly manage PostgreSQL DB state. The Replicator knows how to replicate/verify (or it can call a service to do it).
+
+```ruby
+model_record = SnippetRepository.last
+model_record.replicator.registry.replicator.model_record # just showing that these methods exist
+```
+
+#### Replicate a snippet repository, synchronously, given an ID
+
+```ruby
+model_record = SnippetRepository.find(id)
+model_record.replicator.send(:sync_repository)
+```
+
+#### Replicate a snippet repository, synchronously, given a registry ID
+
+```ruby
+registry = Geo::SnippetRepositoryRegistry.find(registry_id)
+registry.replicator.send(:sync_repository)
+```
+
+### Find failed artifacts
+
+[Start a Rails console session](../../../administration/operations/rails_console.md#starting-a-rails-console-session)
+to run the following commands:
+
+```ruby
+Geo::JobArtifactRegistry.failed
+```
+
+#### Find `ID` of synced artifacts that are missing on primary
+
+```ruby
+Geo::JobArtifactRegistry.synced.missing_on_primary.pluck(:artifact_id)
+```
+
+### Project or project wiki repositories
+
+#### Find repository verification failures
+
+[Start a Rails console session](../../../administration/operations/rails_console.md#starting-a-rails-console-session)
+to gather the following basic troubleshooting information.
+
+WARNING:
+Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.
+
+##### Get the number of verification failed repositories
+
+```ruby
+Geo::ProjectRegistry.verification_failed('repository').count
+```
+
+##### Find the verification failed repositories
+
+```ruby
+Geo::ProjectRegistry.verification_failed('repository')
+```
+
+##### Find repositories that failed to sync
+
+```ruby
+Geo::ProjectRegistry.sync_failed('repository')
+```
+
+#### Resync project and project wiki repositories
+
+[Start a Rails console session](../../../administration/operations/rails_console.md#starting-a-rails-console-session)
+to run the following basic troubleshooting steps.
+
+WARNING:
+Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.
+
+##### Queue up all repositories for resync
+
+When you run this, Sidekiq handles each sync.
+
+```ruby
+Geo::ProjectRegistry.update_all(resync_repository: true, resync_wiki: true)
+```
+
+##### Sync individual repository now
+
+```ruby
+project = Project.find_by_full_path('<group/project>')
+
+Geo::RepositorySyncService.new(project).execute
+```
+
## Fixing client errors
### Authorization errors from LFS HTTP(S) client requests
@@ -1390,10 +1515,6 @@ If the above steps are **not successful**, proceed through the next steps:
1. Verify you can connect to the newly-promoted **primary** site using the URL used previously for the **secondary** site.
1. If successful, the **secondary** site is now promoted to the **primary** site.
-## Additional tools
-
-There are useful snippets for manipulating Geo internals in the [GitLab Rails Cheat Sheet](../../troubleshooting/gitlab_rails_cheat_sheet.md#geo). For example, you can find how to manually sync or verify a replicable in Rails console.
-
## Check OS locale data compatibility
If different operating systems or different operating system versions are deployed across Geo sites, we recommend that you perform a locale data compatibility check before setting up Geo.