# Gitaly Cluster Failure Mode Analysis

This document describes the current failure modes of Gitaly
Cluster across versions in order to identify areas for improvement.

## Writes

The following sequence diagram shows what happens during a write in
Gitaly Cluster for GitLab 14.0.x:

```mermaid
sequenceDiagram
    autonumber
    Client->>+Workhorse: git push
    Workhorse->>+Praefect: ReceivePack
    Praefect->>+Database: GetPrimary()
    Database->>+Database: SELECT primary FROM repositories
    Database->>+Praefect: Gitaly A
    Praefect->>+Gitaly: ReceivePack
    Gitaly->>+Git: git receive-pack
    Git->>+Hook: update HEAD master
    Hook->>+Praefect: TX: update HEAD master
    Praefect->>+Praefect: collect votes
    Praefect->>+Hook: Commit
    Hook->>+Git: exit 0
    Git->>+Gitaly: exit 0
    Gitaly->>+Praefect: success
    Praefect->>Workhorse: success
    Workhorse->>Client: success
```

In step 3, when Praefect receives the `ReceivePack` RPC from Workhorse,
it calls `GetPrimary()`, which consults the database. Praefect assigns
primaries on a per-repository basis, which means that in a cluster of 3
Gitaly nodes, the primary for any given repository could be any of the
3 nodes. Each row in the `repositories` table pertains to a specific
repository, and its `primary` column denotes the Gitaly node currently
serving as that repository's primary.
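
Conceptually, the lookup behind `GetPrimary()` boils down to a query
along these lines. This is a simplified sketch, not the exact query
Praefect runs; the key columns (`virtual_storage`, `relative_path`) and
the example values are only for illustration:

```sql
-- Simplified sketch of the per-repository primary lookup.
-- Column names and parameters are illustrative, not the exact Praefect query.
SELECT "primary"
FROM repositories
WHERE virtual_storage = $1  -- e.g. 'default'
  AND relative_path   = $2; -- e.g. '@hashed/aa/bb/cc.git'
```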

### Failover

In GitLab 14.0.x, failover happens whenever a majority of Praefect nodes
deem that a Gitaly node is no longer a valid primary:

1. The Gitaly node is no longer in the [`valid_primaries` database view](https://gitlab.com/gitlab-org/gitaly/blob/d0083f4c828772537e6891cae4fe0df1f6b255f4/internal/praefect/datastore/migrations/20210525143540_healthy_storages_view.go#L9-20).
1. `valid_primaries` depends on the [`healthy_storages` database view](https://gitlab.com/gitlab-org/gitaly/blob/d0083f4c828772537e6891cae4fe0df1f6b255f4/internal/praefect/datastore/migrations/20210525143540_healthy_storages_view.go#L9-21).
1. `healthy_storages` depends on the [`node_status` database table](https://gitlab.com/gitlab-org/gitaly/blob/d0083f4c828772537e6891cae4fe0df1f6b255f4/internal/praefect/datastore/migrations/20210525143540_healthy_storages_view.go#L11).
1. Praefect nodes attempt to send `HealthCheck` RPC messages to Gitaly nodes once per second and update the [`node_status` table](https://gitlab.com/gitlab-org/gitaly/blob/098c6dcdbde9824385d61b6cc56c2e10724a104b/internal/praefect/nodes/health_manager.go#L144-154) after each attempt.

A failover is triggered whenever a primary node is no longer in the
`valid_primaries` view. This happens when a majority of Praefect nodes:

1. Have updated the `node_status` table to indicate they attempted to
contact the Gitaly node within the last 60 seconds.
2. Have not recorded a successful response from that node within the
last 10 seconds.
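
These two criteria correspond to the `healthy_storages` view. The
following is a simplified sketch of that quorum check; the `node_status`
column names used here are approximations, and this is not the exact
view definition:

```sql
-- Simplified sketch of the healthy_storages quorum check (not the exact view).
-- A Gitaly node counts as healthy only if a majority of the Praefect nodes
-- that contacted it within the last 60 seconds also saw a successful health
-- check within the last 10 seconds.
SELECT shard_name, node_name
FROM node_status AS ns
WHERE last_seen_active_at >= NOW() - INTERVAL '10 seconds'
GROUP BY shard_name, node_name
HAVING COUNT(praefect_name) >= (
    SELECT CEIL(COUNT(DISTINCT praefect_name) / 2.0)
    FROM node_status
    WHERE shard_name = ns.shard_name
      AND last_contact_attempt_at >= NOW() - INTERVAL '60 seconds'
);
```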

A primary node can also be demoted if the [`replication_queue` for that
node meets certain criteria](https://gitlab.com/gitlab-org/gitaly/blob/371310f8236046666f75710ef02b016011b87deb/internal/praefect/datastore/migrations/20210525173505_valid_primaries_view.go#L22-32).
This can happen if the [Praefect reconciler creates a `delete_replica` job for
that Gitaly node that has not yet been completed](https://gitlab.com/gitlab-org/gitaly/blob/12e0bf3ac80b72bef07a5733a70c270f70771859/internal/praefect/reconciler/reconciler.go#L95-107).
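
To make the criteria concrete, the check is roughly of the following
shape: a storage stops being a valid primary candidate while an
unfinished `delete_replica` job targets one of its replicas. This is an
illustrative sketch only; the JSON keys inside `job` and the exact set
of job states are assumptions and may not match the view definition:

```sql
-- Illustrative sketch, not the actual valid_primaries condition: does an
-- unfinished delete_replica job target this repository on this storage?
-- The job->>'...' keys and the state list are assumptions.
SELECT EXISTS (
    SELECT 1
    FROM replication_queue
    WHERE state NOT IN ('completed', 'dead', 'cancelled')
      AND job->>'change' = 'delete_replica'
      AND job->>'virtual_storage' = 'default'              -- example values
      AND job->>'relative_path' = '@hashed/aa/bb/cc.git'
      AND job->>'target_node_storage' = 'gitaly-1'
) AS has_pending_delete_replica;
```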

The failover itself is kicked off when a Praefect node detects that the
[`healthy_storages` view has changed](https://gitlab.com/gitlab-org/gitaly/blob/098c6dcdbde9824385d61b6cc56c2e10724a104b/internal/praefect/nodes/health_manager.go#L164-189).

In GitLab 14.0.x, when a failover is triggered, Praefect updates **all**
repositories whose primary is the demoted Gitaly node to point to a new
primary. For example, suppose there are three Gitaly nodes: `gitaly-1`,
`gitaly-2`, and `gitaly-3`. If `gitaly-1` has been marked down, Praefect
attempts to update the `repositories.primary` column of every affected
repository to a new primary chosen at random; if `gitaly-2` and
`gitaly-3` are both available, one of them is picked at random.
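
Conceptually, this reassignment is a single bulk update along the
following lines. It is a simplified sketch, assuming the
`valid_primaries` view is keyed by `virtual_storage` and
`relative_path`; it is not the exact statement Praefect runs:

```sql
-- Simplified sketch of the GitLab 14.0.x behaviour (not the exact statement):
-- every repository whose primary is no longer a valid candidate gets a new
-- primary picked at random from its remaining valid candidates.
UPDATE repositories
SET "primary" = (
    SELECT storage
    FROM valid_primaries
    WHERE valid_primaries.virtual_storage = repositories.virtual_storage
      AND valid_primaries.relative_path   = repositories.relative_path
    ORDER BY random()
    LIMIT 1
)
WHERE NOT EXISTS (
    SELECT 1
    FROM valid_primaries
    WHERE valid_primaries.virtual_storage = repositories.virtual_storage
      AND valid_primaries.relative_path   = repositories.relative_path
      AND valid_primaries.storage         = repositories."primary"
);
```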

#### Differences in GitLab 14.1

Starting with GitLab 14.1, Praefect performs the primary election
lazily: a repository's primary is only re-elected when that repository
next needs a primary (for example, for the next write), so a failover no
longer touches **all** repositories pointing to the original node at
once.
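
A minimal sketch of the lazy approach, under the same assumptions as the
previous sketch:

```sql
-- Simplified sketch of lazy election (GitLab 14.1 and later), not the exact
-- query: only the single repository being accessed is considered, and only
-- if its current primary is no longer a valid candidate.
UPDATE repositories
SET "primary" = (
    SELECT storage
    FROM valid_primaries
    WHERE valid_primaries.virtual_storage = repositories.virtual_storage
      AND valid_primaries.relative_path   = repositories.relative_path
    ORDER BY random()
    LIMIT 1
)
WHERE virtual_storage = $1  -- the one repository being accessed
  AND relative_path   = $2
  AND NOT EXISTS (
      SELECT 1
      FROM valid_primaries
      WHERE valid_primaries.virtual_storage = repositories.virtual_storage
        AND valid_primaries.relative_path   = repositories.relative_path
        AND valid_primaries.storage         = repositories."primary"
  );
```

The only difference from the 14.0.x sketch is the scope of the `WHERE`
clause: one repository instead of every repository with an invalid
primary.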

### Failure Mode Analysis

Based on the background above, we can summarize possible failures:

| Failure mode       | Generic root cause                        | Specific root causes                      | Likelihood | Effect level |
|--------------------|-------------------------------------------|-------------------------------------------|------------|--------------|
| Writes not working | Workhorse not relaying ReceivePack        | Network outage                            |            |              |
|                    |                                           | DNS failure                               |            |              |
|                    |                                           | Configuration error                       |            |              |
|                    | Incorrect/invalid Praefect database state | Stalled database queries                  |            |              |
|                    |                                           | Database connection limits                |            |              |
|                    |                                           | Out of disk space                         |            |              |
|                    |                                           | Praefect migrations not applied           |            |              |
|                    |                                           | `delete_replica` job incorrectly inserted |            |              |
|                    |                                           | Single node restored from backup          |            |              |
|                    |                                           | Missing/deleted `repositories` entry      |            |              |
|                    | Node status not properly updating         | Praefect deadlock                         |            |              |
|                    |                                           | Gitaly deadlock (Health RPC OK)           |            |              |
|                    |                                           | Database deadlock                         |            |              |
|                    |                                           | Database table/row locks                  |            |              |
|                    |                                           | Inconsistent network partition            |            |              |
|                    |                                           | Clocks out of sync                        |            |              |
|                    |                                           | Configuration error (e.g. auth)           |            |              |
|                    | Repository corruption                     | Hardware reboots/failures                 |            |              |
|                    |                                           | Split-brain due to improper failover      |            |              |
|                    |                                           | Out of disk space                         |            |              |

### Mitigation Strategies (TODO)

1. Since the `node_status` table is a critical part of failover detection, it should be as robust as possible.
We should [consider a Gossip-based protocol approach](https://gitlab.com/gitlab-org/gitaly/-/issues/3807) to take
the database out of the equation and use a distributed consensus algorithm to obtain a consistent view of the cluster.

1. Check repository checksums with every write.