Repository: gitlab.com/gitlab-org/gitlab-foss.git
author    GitLab Bot <gitlab-bot@gitlab.com>  2024-01-03 21:12:53 +0300
committer GitLab Bot <gitlab-bot@gitlab.com>  2024-01-03 21:12:53 +0300
commit    b87af16bf2b9b09309deb902889edc1bff05256a (patch)
tree      c1088a650a2c26fb9636ad2d01744697caf80091 /doc/administration/postgresql
parent    5429e3d4e0d5f1601ff8989bcb26bd822b3c7c5a (diff)

Add latest changes from gitlab-org/gitlab@master

Diffstat (limited to 'doc/administration/postgresql'):

 -rw-r--r-- doc/administration/postgresql/replication_and_failover.md                 | 476
 -rw-r--r-- doc/administration/postgresql/replication_and_failover_troubleshooting.md | 472
 2 files changed, 476 insertions, 472 deletions
diff --git a/doc/administration/postgresql/replication_and_failover.md b/doc/administration/postgresql/replication_and_failover.md
index f4ed9d99b45..e93dfc8336e 100644
--- a/doc/administration/postgresql/replication_and_failover.md
+++ b/doc/administration/postgresql/replication_and_failover.md
@@ -563,8 +563,7 @@ gitlab-rake gitlab:db:configure
> **Note**: If you encounter a `rake aborted!` error stating that PgBouncer is failing to connect to PostgreSQL, it may be that your PgBouncer node's IP address is missing from
PostgreSQL's `trust_auth_cidr_addresses` in `gitlab.rb` on your database nodes. See
-[PgBouncer error `ERROR: pgbouncer cannot connect to server`](#pgbouncer-error-error-pgbouncer-cannot-connect-to-server)
-in the Troubleshooting section before proceeding.
+[PgBouncer error `ERROR: pgbouncer cannot connect to server`](../../administration/postgresql/replication_and_failover_troubleshooting.md#pgbouncer-error-error-pgbouncer-cannot-connect-to-server) before you proceed.
### Backups
@@ -575,8 +574,7 @@ Do not backup or restore GitLab through a PgBouncer connection: this causes a Gi
### Ensure GitLab is running
At this point, your GitLab instance should be up and running. Verify you're able
-to sign in, and create issues and merge requests. If you encounter issues, see
-the [Troubleshooting section](#troubleshooting).
+to sign in, and create issues and merge requests. For more information, see [Troubleshooting replication and failover](../../administration/postgresql/replication_and_failover_troubleshooting.md).
## Example configuration
@@ -901,7 +899,7 @@ Stopping or restarting the Patroni service on the leader node triggers an automa
WARNING:
In GitLab 16.5 and earlier, PgBouncer nodes do not automatically fail over alongside
Patroni nodes. PgBouncer services
-[must be restarted manually](#pgbouncer-errors-error-running-command-gitlabctlerrorsexecutionerror-and-error-database-gitlabhq_production-is-not-paused)
+[must be restarted manually](../../administration/postgresql/replication_and_failover_troubleshooting.md#pgbouncer-errors-error-running-command-gitlabctlerrorsexecutionerror-and-error-database-gitlabhq_production-is-not-paused)
for a successful switchover.
While Patroni supports automatic failover, you also have the ability to perform
@@ -1085,8 +1083,7 @@ Considering these, you should carefully plan your PostgreSQL upgrade:
```
If issues are encountered upgrading the replicas,
-[there is a troubleshooting section](#postgresql-major-version-upgrade-fails-on-a-patroni-replica)
-that might be the solution.
+see [the troubleshooting section](../../administration/postgresql/replication_and_failover_troubleshooting.md#postgresql-major-version-upgrade-fails-on-a-patroni-replica) for a possible solution.
NOTE:
Reverting the PostgreSQL upgrade with `gitlab-ctl revert-pg-upgrade` has the same considerations as
@@ -1274,468 +1271,3 @@ After completing these steps, then you can clean up the resources of the old Pat
They are no longer needed. However, before removing the resources, remove the
logical replication subscription on the new leader by running `DROP SUBSCRIPTION patroni_upgrade`
with `gitlab-psql`.
-
-## Troubleshooting
-
-### Consul and PostgreSQL changes not taking effect
-
-Due to the potential impacts, `gitlab-ctl reconfigure` only reloads Consul and PostgreSQL; it does not restart the services. However, not all changes can be activated by reloading.
-
-To restart either service, run `gitlab-ctl restart SERVICE`.
-
-For PostgreSQL, it is usually safe to restart the leader node. Automatic failover defaults to a one-minute timeout, so provided the database returns before then, nothing else needs to be done.
-
-On the Consul server nodes, it is important to [restart the Consul service](../consul.md#restart-consul) in a controlled manner.
-
-### PgBouncer error `ERROR: pgbouncer cannot connect to server`
-
-You may get this error when running `gitlab-rake gitlab:db:configure` or you
-may see the error in the PgBouncer log file.
-
-```plaintext
-PG::ConnectionBad: ERROR: pgbouncer cannot connect to server
-```
-
-The problem may be that your PgBouncer node's IP address is not included in the
-`trust_auth_cidr_addresses` setting in `/etc/gitlab/gitlab.rb` on the database nodes.
-
-You can confirm that this is the issue by checking the PostgreSQL log on the leader
-database node. If you see the following error, then `trust_auth_cidr_addresses`
-is the problem:
-
-```plaintext
-2018-03-29_13:59:12.11776 FATAL: no pg_hba.conf entry for host "123.123.123.123", user "pgbouncer", database "gitlabhq_production", SSL off
-```
-
-To fix the problem, add the PgBouncer node's IP address to `trust_auth_cidr_addresses` in `/etc/gitlab/gitlab.rb`:
-
-```ruby
-postgresql['trust_auth_cidr_addresses'] = %w(123.123.123.123/32 <other_cidrs>)
-```
-
-[Reconfigure GitLab](../restart_gitlab.md#reconfigure-a-linux-package-installation) for the changes to take effect.
-
-### PgBouncer errors `Error running command: GitlabCtl::Errors::ExecutionError` and `ERROR: database gitlabhq_production is not paused`
-
-Due to a [known issue](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/8166) that
-affects versions of GitLab prior to 16.5.0, the automatic failover of PgBouncer nodes does not
-happen after a [Patroni switchover](#manual-failover-procedure-for-patroni). In this
-example, GitLab failed to detect a paused database, then attempted to `RESUME` a
-not-paused database:
-
-```plaintext
-INFO -- : Running: gitlab-ctl pgb-notify --pg-database gitlabhq_production --newhost database7.example.com --user pgbouncer --hostuser gitlab-consul
-ERROR -- : STDERR: Error running command: GitlabCtl::Errors::ExecutionError
-ERROR -- : STDERR: ERROR: ERROR: database gitlabhq_production is not paused
-```
-
-To ensure a [Patroni switchover](#manual-failover-procedure-for-patroni) succeeds,
-you must manually restart the PgBouncer service on all PgBouncer nodes with this command:
-
-```shell
-gitlab-ctl restart pgbouncer
-```
-
-### Reinitialize a replica
-
-If a replica cannot start or rejoin the cluster, or lags behind and cannot catch up, it might be necessary to reinitialize the replica:
-
-1. [Check the replication status](#check-replication-status) to confirm which server
- needs to be reinitialized. For example:
-
- ```plaintext
- + Cluster: postgresql-ha (6970678148837286213) ------+---------+--------------+----+-----------+
- | Member | Host | Role | State | TL | Lag in MB |
- +-------------------------------------+--------------+---------+--------------+----+-----------+
- | gitlab-database-1.example.com | 172.18.0.111 | Replica | running | 55 | 0 |
- | gitlab-database-2.example.com | 172.18.0.112 | Replica | start failed | | unknown |
- | gitlab-database-3.example.com | 172.18.0.113 | Leader | running | 55 | |
- +-------------------------------------+--------------+---------+--------------+----+-----------+
- ```
-
-1. Sign in to the broken server and reinitialize the database and replication. Patroni shuts
- down PostgreSQL on that server, removes the data directory, and reinitializes it from scratch:
-
- ```shell
- sudo gitlab-ctl patroni reinitialize-replica --member gitlab-database-2.example.com
- ```
-
- This can be run on any Patroni node, but be aware that `sudo gitlab-ctl patroni
- reinitialize-replica` without `--member` restarts the server it is run on.
- You should run it locally on the broken server to reduce the risk of
- unintended data loss.
-1. Monitor the logs:
-
- ```shell
- sudo gitlab-ctl tail patroni
- ```
-
-### Reset the Patroni state in Consul
-
-WARNING:
-Resetting the Patroni state in Consul is a potentially destructive process. Make sure that you have a healthy database backup first.
-
-As a last resort you can reset the Patroni state in Consul completely.
-
-This may be required if your Patroni cluster is in an unknown or bad state and no node can start:
-
-```plaintext
-+ Cluster: postgresql-ha (6970678148837286213) ------+---------+---------+----+-----------+
-| Member | Host | Role | State | TL | Lag in MB |
-+-------------------------------------+--------------+---------+---------+----+-----------+
-| gitlab-database-1.example.com | 172.18.0.111 | Replica | stopped | | unknown |
-| gitlab-database-2.example.com | 172.18.0.112 | Replica | stopped | | unknown |
-| gitlab-database-3.example.com | 172.18.0.113 | Replica | stopped | | unknown |
-+-------------------------------------+--------------+---------+---------+----+-----------+
-```
-
-**Before deleting the Patroni state in Consul**,
-[try to resolve the `gitlab-ctl` errors](#errors-running-gitlab-ctl) on the Patroni nodes.
-
-This process results in a reinitialized Patroni cluster when
-the first Patroni node starts.
-
-To reset the Patroni state in Consul:
-
-1. Take note of the Patroni node that was the leader, or the node that the application thinks is
- the current leader if the current state shows more than one leader, or none at all:
- - Look on the PgBouncer nodes in `/var/opt/gitlab/consul/databases.ini`,
- which contains the hostname of the current leader.
- - Look in the Patroni logs `/var/log/gitlab/patroni/current` (or the older rotated and
- compressed logs `/var/log/gitlab/patroni/@40000*`) on **all** database nodes to see
- which server was most recently identified as the leader by the cluster:
-
- ```plaintext
- INFO: no action. I am a secondary (database1.local) and following a leader (database2.local)
- ```
-
-1. Stop Patroni on all nodes:
-
- ```shell
- sudo gitlab-ctl stop patroni
- ```
-
-1. Reset the state in Consul:
-
- ```shell
- /opt/gitlab/embedded/bin/consul kv delete -recurse /service/postgresql-ha/
- ```
-
-1. Start one Patroni node, which initializes the Patroni cluster and elects itself as the leader.
- It's highly recommended to start the previous leader (noted in the first step),
- so as not to lose existing writes that may not have been replicated because
- of the broken cluster state:
-
- ```shell
- sudo gitlab-ctl start patroni
- ```
-
-1. Start all other Patroni nodes that join the Patroni cluster as replicas:
-
- ```shell
- sudo gitlab-ctl start patroni
- ```
-
-If you are still seeing issues, the next step is restoring the last healthy backup.
-
-### Errors in the Patroni log about a `pg_hba.conf` entry for `127.0.0.1`
-
-The following log entry in the Patroni log indicates the replication is not working
-and a configuration change is needed:
-
-```plaintext
-FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "gitlab_replicator"
-```
-
-To fix the problem, ensure the loopback interface is included in the CIDR addresses list:
-
-1. Edit `/etc/gitlab/gitlab.rb`:
-
- ```ruby
- postgresql['trust_auth_cidr_addresses'] = %w(<other_cidrs> 127.0.0.1/32)
- ```
-
-1. [Reconfigure GitLab](../restart_gitlab.md#reconfigure-a-linux-package-installation) for the changes to take effect.
-1. Check that [all the replicas are synchronized](#check-replication-status).
-
-### Errors in Patroni logs: the requested start point is ahead of the Write Ahead Log (WAL) flush position
-
-This error indicates that the database is not replicating:
-
-```plaintext
-FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/5000000 is ahead of the WAL flush position of this server 0/4000388
-```
-
-This example error is from a replica that was initially misconfigured, and had never replicated.
-
-Fix it [by reinitializing the replica](#reinitialize-a-replica).
-
-### Patroni fails to start with `MemoryError`
-
-Patroni may fail to start, logging an error and stack trace:
-
-```plaintext
-MemoryError
-Traceback (most recent call last):
- File "/opt/gitlab/embedded/bin/patroni", line 8, in <module>
- sys.exit(main())
-[..]
- File "/opt/gitlab/embedded/lib/python3.7/ctypes/__init__.py", line 273, in _reset_cache
- CFUNCTYPE(c_int)(lambda: None)
-```
-
-If the stack trace ends with `CFUNCTYPE(c_int)(lambda: None)`, this code triggers `MemoryError`
-if the Linux server has been hardened for security.
-
-The code causes Python to write temporary executable files, and it fails if it cannot find a file system in which to do this. For example, if `noexec` is set on the `/tmp` file system, the result is a `MemoryError` ([read more in the issue](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6184)).
-
-Workarounds:
-
-- Remove `noexec` from the mount options for filesystems like `/tmp` and `/var/tmp`.
-- If set to enforcing, SELinux may also prevent these operations. Verify the issue is fixed by setting
- SELinux to permissive.
-
-Patroni first shipped in the Linux package for GitLab 13.1, along with a build of Python 3.7.
-The code which causes this was removed in Python 3.8: this fix shipped in
-[the Linux package for GitLab 14.3](https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/5547)
-and later, removing the need for a workaround.
-
-### Errors running `gitlab-ctl`
-
-Patroni nodes can get into a state where `gitlab-ctl` commands fail
-and `gitlab-ctl reconfigure` cannot fix the node.
-
-If this coincides with a version upgrade of PostgreSQL, [follow a different procedure](#postgresql-major-version-upgrade-fails-on-a-patroni-replica).
-
-One common symptom is that `gitlab-ctl` cannot determine
-information it needs about the installation if the database server is failing to start:
-
-```plaintext
-Malformed configuration JSON file found at /opt/gitlab/embedded/nodes/<HOSTNAME>.json.
-This usually happens when your last run of `gitlab-ctl reconfigure` didn't complete successfully.
-```
-
-```plaintext
-Error while reinitializing replica on the current node: Attributes not found in
-/opt/gitlab/embedded/nodes/<HOSTNAME>.json, has reconfigure been run yet?
-```
-
-Similarly, the nodes file (`/opt/gitlab/embedded/nodes/<HOSTNAME>.json`) should contain a lot of information,
-but might get created with only:
-
-```json
-{
- "name": "<HOSTNAME>"
-}
-```
-
-The following process for fixing this includes reinitializing this replica;
-the current state of PostgreSQL on this node is discarded:
-
-1. Shut down the Patroni and (if present) PostgreSQL services:
-
- ```shell
- sudo gitlab-ctl status
- sudo gitlab-ctl stop patroni
- sudo gitlab-ctl stop postgresql
- ```
-
-1. Remove `/var/opt/gitlab/postgresql/data` in case its state prevents
- PostgreSQL from starting:
-
- ```shell
- cd /var/opt/gitlab/postgresql
- sudo rm -rf data
- ```
-
- **Take care with this step to avoid data loss**.
- This step can also be achieved by renaming `data/`:
- make sure there's enough free disk space for a new copy of the primary database,
- and remove the extra directory when the replica is fixed.
-
-1. With PostgreSQL not running, the nodes file now gets created successfully:
-
- ```shell
- sudo gitlab-ctl reconfigure
- ```
-
-1. Start Patroni:
-
- ```shell
- sudo gitlab-ctl start patroni
- ```
-
-1. Monitor the logs and check the cluster state:
-
- ```shell
- sudo gitlab-ctl tail patroni
- sudo gitlab-ctl patroni members
- ```
-
-1. Run `reconfigure` again:
-
- ```shell
- sudo gitlab-ctl reconfigure
- ```
-
-1. Reinitialize the replica if `gitlab-ctl patroni members` indicates this is needed:
-
- ```shell
- sudo gitlab-ctl patroni reinitialize-replica
- ```
-
-If this procedure doesn't work **and** the cluster is unable to elect a leader,
-[there is another fix](#reset-the-patroni-state-in-consul), which should only be
-used as a last resort.
-
-### PostgreSQL major version upgrade fails on a Patroni replica
-
-A Patroni **replica** can get stuck in a loop during `gitlab-ctl pg-upgrade`, and
-the upgrade fails.
-
-An example set of symptoms is as follows:
-
-1. A `postgresql` service is defined,
- which shouldn't usually be present on a Patroni node. It is present because
- `gitlab-ctl pg-upgrade` adds it to create a new empty database:
-
- ```plaintext
- run: patroni: (pid 1972) 1919s; run: log: (pid 1971) 1919s
- down: postgresql: 1s, normally up, want up; run: log: (pid 1973) 1919s
- ```
-
-1. PostgreSQL generates `PANIC` log entries in
- `/var/log/gitlab/postgresql/current` as Patroni is removing
- `/var/opt/gitlab/postgresql/data` as part of reinitializing the replica:
-
- ```plaintext
- DETAIL: Could not open file "pg_xact/0000": No such file or directory.
- WARNING: terminating connection because of crash of another server process
- LOG: all server processes terminated; reinitializing
- PANIC: could not open file "global/pg_control": No such file or directory
- ```
-
-1. In `/var/log/gitlab/patroni/current`, Patroni logs the following.
- The local PostgreSQL version is different from the cluster leader's version:
-
- ```plaintext
- INFO: trying to bootstrap from leader 'HOSTNAME'
- pg_basebackup: incompatible server version 12.6
- pg_basebackup: removing data directory "/var/opt/gitlab/postgresql/data"
- ERROR: Error when fetching backup: pg_basebackup exited with code=1
- ```
-
-**Important**: This workaround applies when the Patroni cluster is in the following state:
-
-- The [leader has been successfully upgraded to the new major version](#upgrading-postgresql-major-version-in-a-patroni-cluster).
-- The step to upgrade PostgreSQL on replicas is failing.
-
-This workaround completes the PostgreSQL upgrade on a Patroni replica
-by setting the node to use the new PostgreSQL version, and then reinitializing
-it as a replica in the new cluster that was created
-when the leader was upgraded:
-
-1. Check the cluster status on all nodes to confirm which is the leader
- and what state the replicas are in:
-
- ```shell
- sudo gitlab-ctl patroni members
- ```
-
-1. Replica: check which version of PostgreSQL is active:
-
- ```shell
- sudo ls -al /opt/gitlab/embedded/bin | grep postgres
- ```
-
-1. Replica: ensure the nodes file is correct and `gitlab-ctl` can run. This resolves
- the [errors running `gitlab-ctl`](#errors-running-gitlab-ctl) issue if the replica
- has any of those errors as well:
-
- ```shell
- sudo gitlab-ctl stop patroni
- sudo gitlab-ctl reconfigure
- ```
-
-1. Replica: relink the PostgreSQL binaries to the required version
- to fix the `incompatible server version` error:
-
- 1. Edit `/etc/gitlab/gitlab.rb` and specify the required version:
-
- ```ruby
- postgresql['version'] = 13
- ```
-
- 1. Reconfigure GitLab:
-
- ```shell
- sudo gitlab-ctl reconfigure
- ```
-
- 1. Check that the binaries are relinked. The binaries distributed for
- PostgreSQL vary between major releases, so it's typical to
- have a small number of incorrect symbolic links:
-
- ```shell
- sudo ls -al /opt/gitlab/embedded/bin | grep postgres
- ```
-
-1. Replica: ensure PostgreSQL is fully reinitialized for the specified version:
-
- ```shell
- cd /var/opt/gitlab/postgresql
- sudo rm -rf data
- sudo gitlab-ctl reconfigure
- ```
-
-1. Replica: optionally monitor the database in two additional terminal sessions:
-
- - Disk use increases as `pg_basebackup` runs. Track progress of the
- replica initialization with:
-
- ```shell
- cd /var/opt/gitlab/postgresql
- watch du -sh data
- ```
-
- - Monitor the process in the logs:
-
- ```shell
- sudo gitlab-ctl tail patroni
- ```
-
-1. Replica: Start Patroni to reinitialize the replica:
-
- ```shell
- sudo gitlab-ctl start patroni
- ```
-
-1. Replica: After it completes, remove the hardcoded version from `/etc/gitlab/gitlab.rb`:
-
- 1. Edit `/etc/gitlab/gitlab.rb` and remove `postgresql['version']`.
- 1. Reconfigure GitLab:
-
- ```shell
- sudo gitlab-ctl reconfigure
- ```
-
- 1. Check the correct binaries are linked:
-
- ```shell
- sudo ls -al /opt/gitlab/embedded/bin | grep postgres
- ```
-
-1. Check the cluster status on all nodes:
-
- ```shell
- sudo gitlab-ctl patroni members
- ```
-
-Repeat this procedure on the other replica if required.
-
-### Issues with other components
-
-If you're running into an issue with a component not outlined here, be sure to check the troubleshooting section of its specific documentation page:
-
-- [Consul](../consul.md#troubleshooting-consul)
-- [PostgreSQL](https://docs.gitlab.com/omnibus/settings/database.html#troubleshooting)
diff --git a/doc/administration/postgresql/replication_and_failover_troubleshooting.md b/doc/administration/postgresql/replication_and_failover_troubleshooting.md
new file mode 100644
index 00000000000..1b5aa5fc3b5
--- /dev/null
+++ b/doc/administration/postgresql/replication_and_failover_troubleshooting.md
@@ -0,0 +1,472 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://handbook.gitlab.com/handbook/product/ux/technical-writing/#assignments
+---
+
+# Troubleshooting PostgreSQL replication and failover for Linux package installations **(PREMIUM SELF)**
+
+When working with PostgreSQL replication and failover, you might encounter the following issues.
+
+## Consul and PostgreSQL changes not taking effect
+
+Due to the potential impacts, `gitlab-ctl reconfigure` only reloads Consul and PostgreSQL; it does not restart the services. However, not all changes can be activated by reloading.
+
+To restart either service, run `gitlab-ctl restart SERVICE`.
+
+For PostgreSQL, it is usually safe to restart the leader node. Automatic failover defaults to a one-minute timeout, so provided the database returns before then, nothing else needs to be done.
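+
+To see whether a changed PostgreSQL parameter still needs a full restart rather than a reload, you can query `pg_settings` before restarting anything. This is a quick sketch using standard PostgreSQL tooling, not part of the official procedure:
+
+```shell
+# List settings that were changed but only take effect after a restart.
+sudo gitlab-psql -c "SELECT name, setting FROM pg_settings WHERE pending_restart;"
+
+# If anything is listed, restart the relevant service, for example:
+sudo gitlab-ctl restart postgresql    # on Patroni-managed database nodes, the service is `patroni`
+```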
+
+On the Consul server nodes, it is important to [restart the Consul service](../consul.md#restart-consul) in a controlled manner.
+
+## PgBouncer error `ERROR: pgbouncer cannot connect to server`
+
+You may get this error when running `gitlab-rake gitlab:db:configure` or you
+may see the error in the PgBouncer log file.
+
+```plaintext
+PG::ConnectionBad: ERROR: pgbouncer cannot connect to server
+```
+
+The problem may be that your PgBouncer node's IP address is not included in the
+`trust_auth_cidr_addresses` setting in `/etc/gitlab/gitlab.rb` on the database nodes.
+
+You can confirm that this is the issue by checking the PostgreSQL log on the leader
+database node. If you see the following error, then `trust_auth_cidr_addresses`
+is the problem:
+
+```plaintext
+2018-03-29_13:59:12.11776 FATAL: no pg_hba.conf entry for host "123.123.123.123", user "pgbouncer", database "gitlabhq_production", SSL off
+```
+
+To fix the problem, add the PgBouncer node's IP address to `trust_auth_cidr_addresses` in `/etc/gitlab/gitlab.rb`:
+
+```ruby
+postgresql['trust_auth_cidr_addresses'] = %w(123.123.123.123/32 <other_cidrs>)
+```
+
+[Reconfigure GitLab](../restart_gitlab.md#reconfigure-a-linux-package-installation) for the changes to take effect.
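+
+After reconfiguring the database nodes, you can confirm that PgBouncer reaches PostgreSQL again. A minimal check, assuming the default Linux package setup:
+
+```shell
+# On the PgBouncer node, open the PgBouncer administrative console
+# and run `SHOW SERVERS;` to confirm healthy backend connections.
+sudo gitlab-ctl pgb-console
+
+# On the application node, re-run the step that originally failed.
+sudo gitlab-rake gitlab:db:configure
+```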
+
+## PgBouncer errors `Error running command: GitlabCtl::Errors::ExecutionError` and `ERROR: database gitlabhq_production is not paused`
+
+Due to a [known issue](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/8166) that
+affects versions of GitLab prior to 16.5.0, the automatic failover of PgBouncer nodes does not
+happen after a [Patroni switchover](../../administration/postgresql/replication_and_failover.md#manual-failover-procedure-for-patroni). In this
+example, GitLab failed to detect a paused database, then attempted to `RESUME` a
+not-paused database:
+
+```plaintext
+INFO -- : Running: gitlab-ctl pgb-notify --pg-database gitlabhq_production --newhost database7.example.com --user pgbouncer --hostuser gitlab-consul
+ERROR -- : STDERR: Error running command: GitlabCtl::Errors::ExecutionError
+ERROR -- : STDERR: ERROR: ERROR: database gitlabhq_production is not paused
+```
+
+To ensure a [Patroni switchover](../../administration/postgresql/replication_and_failover.md#manual-failover-procedure-for-patroni) succeeds,
+you must manually restart the PgBouncer service on all PgBouncer nodes with this command:
+
+```shell
+gitlab-ctl restart pgbouncer
+```
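+
+After restarting PgBouncer on every PgBouncer node, it can be worth confirming that the cluster and PgBouncer agree on the current leader. A short check using commands that appear elsewhere in this guide:
+
+```shell
+# On a database node: confirm which member is the leader after the switchover.
+sudo gitlab-ctl patroni members
+
+# On a PgBouncer node: this file contains the hostname of the current leader.
+sudo cat /var/opt/gitlab/consul/databases.ini
+```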
+
+## Reinitialize a replica
+
+If a replica cannot start or rejoin the cluster, or lags behind and cannot catch up, it might be necessary to reinitialize the replica:
+
+1. [Check the replication status](../../administration/postgresql/replication_and_failover.md#check-replication-status) to confirm which server
+ needs to be reinitialized. For example:
+
+ ```plaintext
+ + Cluster: postgresql-ha (6970678148837286213) ------+---------+--------------+----+-----------+
+ | Member | Host | Role | State | TL | Lag in MB |
+ +-------------------------------------+--------------+---------+--------------+----+-----------+
+ | gitlab-database-1.example.com | 172.18.0.111 | Replica | running | 55 | 0 |
+ | gitlab-database-2.example.com | 172.18.0.112 | Replica | start failed | | unknown |
+ | gitlab-database-3.example.com | 172.18.0.113 | Leader | running | 55 | |
+ +-------------------------------------+--------------+---------+--------------+----+-----------+
+ ```
+
+1. Sign in to the broken server and reinitialize the database and replication. Patroni shuts
+ down PostgreSQL on that server, removes the data directory, and reinitializes it from scratch:
+
+ ```shell
+ sudo gitlab-ctl patroni reinitialize-replica --member gitlab-database-2.example.com
+ ```
+
+ This can be run on any Patroni node, but be aware that `sudo gitlab-ctl patroni
+ reinitialize-replica` without `--member` restarts the server it is run on.
+ You should run it locally on the broken server to reduce the risk of
+ unintended data loss.
+1. Monitor the logs:
+
+ ```shell
+ sudo gitlab-ctl tail patroni
+ ```
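+
+Once the reinitialization finishes, a quick way to confirm that the replica rejoined the cluster is to check the member list again. The reinitialized member should report `running` with a low lag value:
+
+```shell
+sudo gitlab-ctl patroni members
+```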
+
+## Reset the Patroni state in Consul
+
+WARNING:
+Resetting the Patroni state in Consul is a potentially destructive process. Make sure that you have a healthy database backup first.
+
+As a last resort you can reset the Patroni state in Consul completely.
+
+This may be required if your Patroni cluster is in an unknown or bad state and no node can start:
+
+```plaintext
++ Cluster: postgresql-ha (6970678148837286213) ------+---------+---------+----+-----------+
+| Member | Host | Role | State | TL | Lag in MB |
++-------------------------------------+--------------+---------+---------+----+-----------+
+| gitlab-database-1.example.com | 172.18.0.111 | Replica | stopped | | unknown |
+| gitlab-database-2.example.com | 172.18.0.112 | Replica | stopped | | unknown |
+| gitlab-database-3.example.com | 172.18.0.113 | Replica | stopped | | unknown |
++-------------------------------------+--------------+---------+---------+----+-----------+
+```
+
+**Before deleting the Patroni state in Consul**,
+[try to resolve the `gitlab-ctl` errors](#errors-running-gitlab-ctl) on the Patroni nodes.
+
+This process results in a reinitialized Patroni cluster when
+the first Patroni node starts.
+
+To reset the Patroni state in Consul:
+
+1. Take note of the Patroni node that was the leader, or the node that the application thinks is
+ the current leader if the current state shows more than one leader, or none at all:
+ - Look on the PgBouncer nodes in `/var/opt/gitlab/consul/databases.ini`,
+ which contains the hostname of the current leader.
+ - Look in the Patroni logs `/var/log/gitlab/patroni/current` (or the older rotated and
+ compressed logs `/var/log/gitlab/patroni/@40000*`) on **all** database nodes to see
+ which server was most recently identified as the leader by the cluster:
+
+ ```plaintext
+ INFO: no action. I am a secondary (database1.local) and following a leader (database2.local)
+ ```
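+
+ For example, to find the most recent leader mentioned in the Patroni log on each database node (the `grep` pattern matches the log line shown above; adjust it as needed):
+
+ ```shell
+ sudo grep 'following a leader' /var/log/gitlab/patroni/current | tail -n 5
+ ```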
+
+1. Stop Patroni on all nodes:
+
+ ```shell
+ sudo gitlab-ctl stop patroni
+ ```
+
+1. Reset the state in Consul:
+
+ ```shell
+ /opt/gitlab/embedded/bin/consul kv delete -recurse /service/postgresql-ha/
+ ```
+
+1. Start one Patroni node, which initializes the Patroni cluster and elects itself as the leader.
+ It's highly recommended to start the previous leader (noted in the first step),
+ so as not to lose existing writes that may not have been replicated because
+ of the broken cluster state:
+
+ ```shell
+ sudo gitlab-ctl start patroni
+ ```
+
+1. Start all other Patroni nodes that join the Patroni cluster as replicas:
+
+ ```shell
+ sudo gitlab-ctl start patroni
+ ```
+
+If you are still seeing issues, the next step is restoring the last healthy backup.
+
+## Errors in the Patroni log about a `pg_hba.conf` entry for `127.0.0.1`
+
+The following log entry in the Patroni log indicates the replication is not working
+and a configuration change is needed:
+
+```plaintext
+FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "gitlab_replicator"
+```
+
+To fix the problem, ensure the loopback interface is included in the CIDR addresses list:
+
+1. Edit `/etc/gitlab/gitlab.rb`:
+
+ ```ruby
+ postgresql['trust_auth_cidr_addresses'] = %w(<other_cidrs> 127.0.0.1/32)
+ ```
+
+1. [Reconfigure GitLab](../restart_gitlab.md#reconfigure-a-linux-package-installation) for the changes to take effect.
+1. Check that [all the replicas are synchronized](../../administration/postgresql/replication_and_failover.md#check-replication-status).
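+
+To confirm that the new entry was applied, you can also check the generated `pg_hba.conf` on the database node. The path below is the default for Linux package installations:
+
+```shell
+sudo grep 127.0.0.1 /var/opt/gitlab/postgresql/data/pg_hba.conf
+```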
+
+## Errors in Patroni logs: the requested start point is ahead of the Write Ahead Log (WAL) flush position
+
+This error indicates that the database is not replicating:
+
+```plaintext
+FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/5000000 is ahead of the WAL flush position of this server 0/4000388
+```
+
+This example error is from a replica that was initially misconfigured, and had never replicated.
+
+Fix it [by reinitializing the replica](#reinitialize-a-replica).
+
+## Patroni fails to start with `MemoryError`
+
+Patroni may fail to start, logging an error and stack trace:
+
+```plaintext
+MemoryError
+Traceback (most recent call last):
+ File "/opt/gitlab/embedded/bin/patroni", line 8, in <module>
+ sys.exit(main())
+[..]
+ File "/opt/gitlab/embedded/lib/python3.7/ctypes/__init__.py", line 273, in _reset_cache
+ CFUNCTYPE(c_int)(lambda: None)
+```
+
+If the stack trace ends with `CFUNCTYPE(c_int)(lambda: None)`, this code triggers `MemoryError`
+if the Linux server has been hardened for security.
+
+The code causes Python to write temporary executable files, and it fails if it cannot find a file system in which to do this. For example, if `noexec` is set on the `/tmp` file system, the result is a `MemoryError` ([read more in the issue](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6184)).
+
+Workarounds:
+
+- Remove `noexec` from the mount options for filesystems like `/tmp` and `/var/tmp`.
+- If set to enforcing, SELinux may also prevent these operations. Verify the issue is fixed by setting
+ SELinux to permissive.
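+
+To check whether either condition applies on a node, something like the following can help. These are standard Linux commands, not a GitLab-specific procedure:
+
+```shell
+# Show the mount options for /tmp; look for `noexec`. Repeat for /var/tmp.
+findmnt -no OPTIONS /tmp
+
+# Show the current SELinux mode, if SELinux is installed.
+getenforce
+```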
+
+Patroni first shipped in the Linux package for GitLab 13.1, along with a build of Python 3.7.
+The code which causes this was removed in Python 3.8: this fix shipped in
+[the Linux package for GitLab 14.3](https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/5547)
+and later, removing the need for a workaround.
+
+## Errors running `gitlab-ctl`
+
+Patroni nodes can get into a state where `gitlab-ctl` commands fail
+and `gitlab-ctl reconfigure` cannot fix the node.
+
+If this coincides with a version upgrade of PostgreSQL, [follow a different procedure](#postgresql-major-version-upgrade-fails-on-a-patroni-replica).
+
+One common symptom is that `gitlab-ctl` cannot determine
+information it needs about the installation if the database server is failing to start:
+
+```plaintext
+Malformed configuration JSON file found at /opt/gitlab/embedded/nodes/<HOSTNAME>.json.
+This usually happens when your last run of `gitlab-ctl reconfigure` didn't complete successfully.
+```
+
+```plaintext
+Error while reinitializing replica on the current node: Attributes not found in
+/opt/gitlab/embedded/nodes/<HOSTNAME>.json, has reconfigure been run yet?
+```
+
+Similarly, the nodes file (`/opt/gitlab/embedded/nodes/<HOSTNAME>.json`) should contain a lot of information,
+but might get created with only:
+
+```json
+{
+ "name": "<HOSTNAME>"
+}
+```
+
+The following process for fixing this includes reinitializing this replica;
+the current state of PostgreSQL on this node is discarded:
+
+1. Shut down the Patroni and (if present) PostgreSQL services:
+
+ ```shell
+ sudo gitlab-ctl status
+ sudo gitlab-ctl stop patroni
+ sudo gitlab-ctl stop postgresql
+ ```
+
+1. Remove `/var/opt/gitlab/postgresql/data` in case its state prevents
+ PostgreSQL from starting:
+
+ ```shell
+ cd /var/opt/gitlab/postgresql
+ sudo rm -rf data
+ ```
+
+ **Take care with this step to avoid data loss**.
+ This step can also be achieved by renaming `data/`:
+ make sure there's enough free disk space for a new copy of the primary database,
+ and remove the extra directory when the replica is fixed.
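+
+ For example, a rename that keeps the old data until the replica is confirmed healthy (the directory name is illustrative):
+
+ ```shell
+ cd /var/opt/gitlab/postgresql
+ sudo mv data data.old
+ ```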
+
+1. With PostgreSQL not running, the nodes file now gets created successfully:
+
+ ```shell
+ sudo gitlab-ctl reconfigure
+ ```
+
+1. Start Patroni:
+
+ ```shell
+ sudo gitlab-ctl start patroni
+ ```
+
+1. Monitor the logs and check the cluster state:
+
+ ```shell
+ sudo gitlab-ctl tail patroni
+ sudo gitlab-ctl patroni members
+ ```
+
+1. Run `reconfigure` again:
+
+ ```shell
+ sudo gitlab-ctl reconfigure
+ ```
+
+1. Reinitialize the replica if `gitlab-ctl patroni members` indicates this is needed:
+
+ ```shell
+ sudo gitlab-ctl patroni reinitialize-replica
+ ```
+
+If this procedure doesn't work **and** the cluster is unable to elect a leader,
+[there is another fix](#reset-the-patroni-state-in-consul), which should only be
+used as a last resort.
+
+## PostgreSQL major version upgrade fails on a Patroni replica
+
+A Patroni **replica** can get stuck in a loop during `gitlab-ctl pg-upgrade`, and
+the upgrade fails.
+
+An example set of symptoms is as follows:
+
+1. A `postgresql` service is defined,
+ which shouldn't usually be present on a Patroni node. It is present because
+ `gitlab-ctl pg-upgrade` adds it to create a new empty database:
+
+ ```plaintext
+ run: patroni: (pid 1972) 1919s; run: log: (pid 1971) 1919s
+ down: postgresql: 1s, normally up, want up; run: log: (pid 1973) 1919s
+ ```
+
+1. PostgreSQL generates `PANIC` log entries in
+ `/var/log/gitlab/postgresql/current` as Patroni is removing
+ `/var/opt/gitlab/postgresql/data` as part of reinitializing the replica:
+
+ ```plaintext
+ DETAIL: Could not open file "pg_xact/0000": No such file or directory.
+ WARNING: terminating connection because of crash of another server process
+ LOG: all server processes terminated; reinitializing
+ PANIC: could not open file "global/pg_control": No such file or directory
+ ```
+
+1. In `/var/log/gitlab/patroni/current`, Patroni logs the following.
+ The local PostgreSQL version is different from the cluster leader's version:
+
+ ```plaintext
+ INFO: trying to bootstrap from leader 'HOSTNAME'
+ pg_basebackup: incompatible server version 12.6
+ pg_basebackup: removing data directory "/var/opt/gitlab/postgresql/data"
+ ERROR: Error when fetching backup: pg_basebackup exited with code=1
+ ```
+
+**Important**: This workaround applies when the Patroni cluster is in the following state:
+
+- The [leader has been successfully upgraded to the new major version](../../administration/postgresql/replication_and_failover.md#upgrading-postgresql-major-version-in-a-patroni-cluster).
+- The step to upgrade PostgreSQL on replicas is failing.
+
+This workaround completes the PostgreSQL upgrade on a Patroni replica
+by setting the node to use the new PostgreSQL version, and then reinitializing
+it as a replica in the new cluster that was created
+when the leader was upgraded:
+
+1. Check the cluster status on all nodes to confirm which is the leader
+ and what state the replicas are in:
+
+ ```shell
+ sudo gitlab-ctl patroni members
+ ```
+
+1. Replica: check which version of PostgreSQL is active:
+
+ ```shell
+ sudo ls -al /opt/gitlab/embedded/bin | grep postgres
+ ```
+
+1. Replica: ensure the nodes file is correct and `gitlab-ctl` can run. This resolves
+ the [errors running `gitlab-ctl`](#errors-running-gitlab-ctl) issue if the replica
+ has any of those errors as well:
+
+ ```shell
+ sudo gitlab-ctl stop patroni
+ sudo gitlab-ctl reconfigure
+ ```
+
+1. Replica: relink the PostgreSQL binaries to the required version
+ to fix the `incompatible server version` error:
+
+ 1. Edit `/etc/gitlab/gitlab.rb` and specify the required version:
+
+ ```ruby
+ postgresql['version'] = 13
+ ```
+
+ 1. Reconfigure GitLab:
+
+ ```shell
+ sudo gitlab-ctl reconfigure
+ ```
+
+ 1. Check that the binaries are relinked. The binaries distributed for
+ PostgreSQL vary between major releases, so it's typical to
+ have a small number of incorrect symbolic links:
+
+ ```shell
+ sudo ls -al /opt/gitlab/embedded/bin | grep postgres
+ ```
+
+1. Replica: ensure PostgreSQL is fully reinitialized for the specified version:
+
+ ```shell
+ cd /var/opt/gitlab/postgresql
+ sudo rm -rf data
+ sudo gitlab-ctl reconfigure
+ ```
+
+1. Replica: optionally monitor the database in two additional terminal sessions:
+
+ - Disk use increases as `pg_basebackup` runs. Track progress of the
+ replica initialization with:
+
+ ```shell
+ cd /var/opt/gitlab/postgresql
+ watch du -sh data
+ ```
+
+ - Monitor the process in the logs:
+
+ ```shell
+ sudo gitlab-ctl tail patroni
+ ```
+
+1. Replica: Start Patroni to reinitialize the replica:
+
+ ```shell
+ sudo gitlab-ctl start patroni
+ ```
+
+1. Replica: After it completes, remove the hardcoded version from `/etc/gitlab/gitlab.rb`:
+
+ 1. Edit `/etc/gitlab/gitlab.rb` and remove `postgresql['version']`.
+ 1. Reconfigure GitLab:
+
+ ```shell
+ sudo gitlab-ctl reconfigure
+ ```
+
+ 1. Check the correct binaries are linked:
+
+ ```shell
+ sudo ls -al /opt/gitlab/embedded/bin | grep postgres
+ ```
+
+1. Check the cluster status on all nodes:
+
+ ```shell
+ sudo gitlab-ctl patroni members
+ ```
+
+Repeat this procedure on the other replica if required.
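+
+When all replicas are done, a final check is to confirm that every member is running and replicating, and that each node reports the expected PostgreSQL major version. A sketch (run the first command on any database node, the second on each node):
+
+```shell
+# All members should be `running`, with one Leader and low replica lag.
+sudo gitlab-ctl patroni members
+
+# Confirm the running server version on each node.
+sudo gitlab-psql -c 'SELECT version();'
+```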
+
+## Issues with other components
+
+If you're running into an issue with a component not outlined here, be sure to check the troubleshooting section of its specific documentation page:
+
+- [Consul](../consul.md#troubleshooting-consul)
+- [PostgreSQL](https://docs.gitlab.com/omnibus/settings/database.html#troubleshooting)