Welcome to mirror list, hosted at ThFree Co, Russian Federation.

gitlab.com/gitlab-org/gitlab-foss.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
Diffstat (limited to 'doc/administration/postgresql/replication_and_failover.md')
-rw-r--r--doc/administration/postgresql/replication_and_failover.md250
1 files changed, 243 insertions, 7 deletions
diff --git a/doc/administration/postgresql/replication_and_failover.md b/doc/administration/postgresql/replication_and_failover.md
index b6d2e36851d..d1dd233f08b 100644
--- a/doc/administration/postgresql/replication_and_failover.md
+++ b/doc/administration/postgresql/replication_and_failover.md
@@ -97,8 +97,8 @@ This is why you will need:
- IP address of each nodes network interface. This can be set to `0.0.0.0` to
listen on all interfaces. It cannot be set to the loopback address `127.0.0.1`.
-- Network Address. This can be in subnet (i.e. `192.168.0.0/255.255.255.0`)
- or CIDR (i.e. `192.168.0.0/24`) form.
+- Network Address. This can be in subnet (that is, `192.168.0.0/255.255.255.0`)
+ or CIDR (that is, `192.168.0.0/24`) form.
#### Consul information
@@ -157,6 +157,13 @@ We will need the following password information for the application's database u
sudo gitlab-ctl pg-password-md5 POSTGRESQL_USERNAME
```
+#### Patroni information
+
+We will need the following password information for the Patroni API:
+
+- `PATRONI_API_USERNAME`. A username for basic auth to the API
+- `PATRONI_API_PASSWORD`. A password for basic auth to the API
+
#### PgBouncer information
When using default setup, minimum configuration requires:
@@ -236,6 +243,11 @@ postgresql['sql_replication_password'] = 'POSTGRESQL_REPLICATION_PASSWORD_HASH'
# Replace POSTGRESQL_PASSWORD_HASH with a generated md5 value
postgresql['sql_user_password'] = 'POSTGRESQL_PASSWORD_HASH'
+# Replace PATRONI_API_USERNAME with a username for Patroni Rest API calls (use the same username in all nodes)
+patroni['username'] = 'PATRONI_API_USERNAME'
+# Replace PATRONI_API_PASSWORD with a password for Patroni Rest API calls (use the same password in all nodes)
+patroni['password'] = 'PATRONI_API_PASSWORD'
+
# Sets `max_replication_slots` to double the number of database nodes.
# Patroni uses one extra slot per node when initiating the replication.
patroni['postgresql']['max_replication_slots'] = X
@@ -246,7 +258,7 @@ patroni['postgresql']['max_replication_slots'] = X
patroni['postgresql']['max_wal_senders'] = X+1
# Replace XXX.XXX.XXX.XXX/YY with Network Address
-postgresql['trust_auth_cidr_addresses'] = %w(XXX.XXX.XXX.XXX/YY)
+postgresql['trust_auth_cidr_addresses'] = %w(XXX.XXX.XXX.XXX/YY 127.0.0.1/32)
# Replace placeholders:
#
@@ -259,8 +271,8 @@ consul['configuration'] = {
# END user configuration
```
-You do not need an additional or different configuration for replica nodes. As a matter of fact, you don't have to have
-a predetermined primary node. Therefore all database nodes use the same configuration.
+All database nodes use the same configuration. The leader node is not determined in configuration,
+and there is no additional or different configuration for either leader or replica nodes.
Once the configuration of a node is done, you must [reconfigure Omnibus GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure)
on each node for the changes to take effect.
@@ -555,10 +567,12 @@ gitlab_rails['auto_migrate'] = false
postgresql['pgbouncer_user_password'] = '771a8625958a529132abe6f1a4acb19c'
postgresql['sql_user_password'] = '450409b85a0223a214b5fb1484f34d0f'
+patroni['username'] = 'PATRONI_API_USERNAME'
+patroni['password'] = 'PATRONI_API_PASSWORD'
patroni['postgresql']['max_replication_slots'] = 6
patroni['postgresql']['max_wal_senders'] = 7
-postgresql['trust_auth_cidr_addresses'] = %w(10.6.0.0/16)
+postgresql['trust_auth_cidr_addresses'] = %w(10.6.0.0/16 127.0.0.1/32)
# Configure the Consul agent
consul['services'] = %w(postgresql)
@@ -642,12 +656,15 @@ postgresql['sql_user_password'] = '450409b85a0223a214b5fb1484f34d0f'
# Patroni uses one extra slot per node when initiating the replication.
patroni['postgresql']['max_replication_slots'] = 6
+patroni['username'] = 'PATRONI_API_USERNAME'
+patroni['password'] = 'PATRONI_API_PASSWORD'
+
# Set `max_wal_senders` to one more than the number of replication slots in the cluster.
# This is used to prevent replication from using up all of the
# available database connections.
patroni['postgresql']['max_wal_senders'] = 7
-postgresql['trust_auth_cidr_addresses'] = %w(10.6.0.0/16)
+postgresql['trust_auth_cidr_addresses'] = %w(10.6.0.0/16 127.0.0.1/32)
consul['configuration'] = {
server: true,
@@ -721,6 +738,97 @@ functional or does not have a leader, Patroni and by extension PostgreSQL will n
API which can be accessed via its [default port](https://docs.gitlab.com/omnibus/package-information/defaults.html#patroni)
on each node.
+### Check replication status
+
+Run `gitlab-ctl patroni members` to query Patroni for a summary of the cluster status:
+
+```plaintext
++ Cluster: postgresql-ha (6970678148837286213) ------+---------+---------+----+-----------+
+| Member | Host | Role | State | TL | Lag in MB |
++-------------------------------------+--------------+---------+---------+----+-----------+
+| gitlab-database-1.example.com | 172.18.0.111 | Replica | running | 5 | 0 |
+| gitlab-database-2.example.com | 172.18.0.112 | Replica | running | 5 | 100 |
+| gitlab-database-3.example.com | 172.18.0.113 | Leader | running | 5 | |
++-------------------------------------+--------------+---------+---------+----+-----------+
+```
+
+To verify the status of replication:
+
+```shell
+echo 'select * from pg_stat_wal_receiver\x\g\x \n select * from pg_stat_replication\x\g\x' | gitlab-psql
+```
+
+The same command can be run on all three database servers, and will return any information
+about replication available depending on the role the server is performing.
+
+The leader should return one record per replica:
+
+```sql
+-[ RECORD 1 ]----+------------------------------
+pid | 371
+usesysid | 16384
+usename | gitlab_replicator
+application_name | gitlab-database-1.example.com
+client_addr | 172.18.0.111
+client_hostname |
+client_port | 42900
+backend_start | 2021-06-14 08:01:59.580341+00
+backend_xmin |
+state | streaming
+sent_lsn | 0/EA13220
+write_lsn | 0/EA13220
+flush_lsn | 0/EA13220
+replay_lsn | 0/EA13220
+write_lag |
+flush_lag |
+replay_lag |
+sync_priority | 0
+sync_state | async
+reply_time | 2021-06-18 19:17:14.915419+00
+```
+
+Investigate further if:
+
+- There are missing or extra records.
+- `reply_time` is not current.
+
+The `lsn` fields relate to which write-ahead-log segments have been replicated.
+Run the following on the leader to find out the current LSN:
+
+```shell
+echo 'SELECT pg_current_wal_lsn();' | gitlab-psql
+```
+
+If a replica is not in sync, `gitlab-ctl patroni members` indicates the volume
+of missing data, and the `lag` fields indicate the elapsed time.
+
+Read more about the data returned by the leader
+[in the PostgreSQL documentation](https://www.postgresql.org/docs/12/monitoring-stats.html#PG-STAT-REPLICATION-VIEW),
+including other values for the `state` field.
+
+The replicas should return:
+
+```sql
+-[ RECORD 1 ]---------+-------------------------------------------------------------------------------------------------
+pid | 391
+status | streaming
+receive_start_lsn | 0/D000000
+receive_start_tli | 5
+received_lsn | 0/EA13220
+received_tli | 5
+last_msg_send_time | 2021-06-18 19:16:54.807375+00
+last_msg_receipt_time | 2021-06-18 19:16:54.807512+00
+latest_end_lsn | 0/EA13220
+latest_end_time | 2021-06-18 19:07:23.844879+00
+slot_name | gitlab-database-1.example.com
+sender_host | 172.18.0.113
+sender_port | 5432
+conninfo | user=gitlab_replicator host=172.18.0.113 port=5432 application_name=gitlab-database-1.example.com
+```
+
+Read more about the data returned by the replica
+[in the PostgreSQL documentation](https://www.postgresql.org/docs/12/monitoring-stats.html#PG-STAT-WAL-RECEIVER-VIEW).
+
### Selecting the appropriate Patroni replication method
[Review the Patroni documentation carefully](https://patroni.readthedocs.io/en/latest/SETTINGS.html#postgresql)
@@ -1017,6 +1125,134 @@ postgresql['trust_auth_cidr_addresses'] = %w(123.123.123.123/32 <other_cidrs>)
[Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes to take effect.
+### Reinitialize a replica
+
+If replication is not occurring, it may be necessary to reinitialize a replica.
+
+1. On any server in the cluster, determine the Cluster and Member names,
+ and check the replication lag by running `gitlab-ctl patroni members`. Here is an example:
+
+ ```plaintext
+ + Cluster: postgresql-ha (6970678148837286213) ------+---------+---------+----+-----------+
+ | Member | Host | Role | State | TL | Lag in MB |
+ +-------------------------------------+--------------+---------+---------+----+-----------+
+ | gitlab-database-1.example.com | 172.18.0.111 | Replica | running | 5 | 0 |
+ | gitlab-database-2.example.com | 172.18.0.112 | Replica | running | 5 | 100 |
+ | gitlab-database-3.example.com | 172.18.0.113 | Leader | running | 5 | |
+ +-------------------------------------+--------------+---------+---------+----+-----------+
+ ```
+
+1. Reinitialize the affected replica server:
+
+ ```plaintext
+ gitlab-ctl patroni reinitialize-replica postgresql-ha gitlab-database-2.example.com
+ ```
+
+### Reset the Patroni state in Consul
+
+WARNING:
+This is a destructive process and may lead the cluster into a bad state. Make sure that you have a healthy backup before running this process.
+
+As a last resort, if your Patroni cluster is in an unknown/bad state and no node can start, you can
+reset the Patroni state in Consul completely, resulting in a reinitialized Patroni cluster when
+the first Patroni node starts.
+
+To reset the Patroni state in Consul:
+
+1. Take note of the Patroni node that was the leader, or that the application thinks is the current leader, if the current state shows more than one, or none. One way to do this is to look on the PgBouncer nodes in `/var/opt/gitlab/consul/databases.ini`, which contains the hostname of the current leader.
+1. Stop Patroni on all nodes:
+
+ ```shell
+ sudo gitlab-ctl stop patroni
+ ```
+
+1. Reset the state in Consul:
+
+ ```shell
+ /opt/gitlab/embedded/bin/consul kv delete -recurse /service/postgresql-ha/
+ ```
+
+1. Start one Patroni node, which will initialize the Patroni cluster and be elected as a leader.
+ It's highly recommended to start the previous leader (noted in the first step),
+ in order to not lose existing writes that may have not been replicated because
+ of the broken cluster state:
+
+ ```shell
+ sudo gitlab-ctl start patroni
+ ```
+
+1. Start all other Patroni nodes that will join the Patroni cluster as replicas:
+
+ ```shell
+ sudo gitlab-ctl start patroni
+ ```
+
+If you are still seeing issues, the next step is restoring the last healthy backup.
+
+### Errors in the Patroni log about a `pg_hba.conf` entry for `127.0.0.1`
+
+The following log entry in the Patroni log indicates the replication is not working
+and a configuration change is needed:
+
+```plaintext
+FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "gitlab_replicator"
+```
+
+To fix the problem, ensure the loopback interface is included in the CIDR addresses list:
+
+1. Edit `/etc/gitlab/gitlab.rb`:
+
+ ```ruby
+ postgresql['trust_auth_cidr_addresses'] = %w(<other_cidrs> 127.0.0.1/32)
+ ```
+
+1. [Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes to take effect.
+1. Check that [all the replicas are synchronized](#check-replication-status)
+
+### Errors in Patroni logs: the requested start point is ahead of the WAL flush position
+
+This error indicates that the database is not replicating:
+
+```plaintext
+FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/5000000 is ahead of the WAL flush position of this server 0/4000388
+```
+
+This example error is from a replica that was initially misconfigured, and had never replicated.
+
+Fix it [by reinitializing the replica](#reinitialize-a-replica).
+
+### Patroni fails to start with `MemoryError`
+
+Patroni may fail to start, logging an error and stack trace:
+
+```plaintext
+MemoryError
+Traceback (most recent call last):
+ File "/opt/gitlab/embedded/bin/patroni", line 8, in <module>
+ sys.exit(main())
+[..]
+ File "/opt/gitlab/embedded/lib/python3.7/ctypes/__init__.py", line 273, in _reset_cache
+ CFUNCTYPE(c_int)(lambda: None)
+```
+
+If the stack trace ends with `CFUNCTYPE(c_int)(lambda: None)`, this code triggers `MemoryError`
+if the Linux server has been hardened for security.
+
+The code causes Python to write temporary executable files, and if it cannot find a filesystem
+in which to do this, for example if `noexec` is set on the `/tmp` filesystem, it fails with
+`MemoryError` ([read more in the issue](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6184)).
+
+Workarounds:
+
+- Remove `noexec` from the mount options for filesystems like `/tmp` and `/var/tmp`.
+- If set to enforcing, SELinux may also prevent these operations. Verify the issue is fixed by setting
+ SELinux to permissive.
+
+Omnibus GitLab has shipped with Patroni since 13.1 along with a build of Python 3.7.
+Workarounds should stop being required when GitLab 14.x starts shipping with
+[a later version of Python](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6164) as
+the code which causes this was removed from Python 3.8.
+
### Issues with other components
If you're running into an issue with a component not outlined here, be sure to check the troubleshooting section of their specific documentation page: