# RFC: Praefect Dataloss and Recovery Strategy

## Abstract

This RFC proposes a strategy for Praefect to prevent, detect, and recover from dataloss.

## Dataloss Prevention

To limit further dataloss and conflicts, Praefect may keep a repository in read-only mode when a primary-failover results in potential dataloss.
Read-only mode prevents new writes on the remaining replicas from diverging from the data that is unavailable on the previous primary.
Diverging replicas of a repository can lead to split-brain situations that may require manual intervention to repair, so they should be avoided when possible.

A primary-failover happens when a primary node is no longer considered healthy by a Praefect cluster.
The Praefect cluster will then designate a new primary from one of the remaining nodes in order to continue serving requests.

Once a new primary has been designated, a repository is in one of the following possible states:

- **Read-Write**
  - Happens when the new primary replica has the latest changes
  - Praefect will schedule new replications from the primary to any replicas missing the latest changes
  - Praefect will continue to process both mutator and accessor RPCs for the repository
- **Recovery**
  - Happens when the new primary replica is missing the latest changes but the latest changes exist on another available replica
  - Praefect will only process accessor RPCs until the primary receives the latest changes
  - Praefect will attempt to propagate changes from replicas with the latest changes to the replicas missing them
  - Once the primary is repaired, Praefect will start processing mutator RPCs again. This will put the repository into read-write state.
- **Read-only**
  - Happens when none of the remaining replicas have the latest changes
  - Praefect will only process accessor RPCs
  - Read-only state can be overridden manually by an administrator via the forgiveness process
  - Read-only state can be resolved when a node containing the latest changes becomes healthy again. This will put the repository into the recovery state.

### Repository State Summary

| State      | Primary has latest | Secondaries have latest | RPCs Processed     |
| ---------- | ------------------ | ----------------------- | ------------------ |
| Read-Write | Yes                | Irrelevant              | Accessor + Mutator |
| Recovery   | No                 | Yes, at least one does  | Accessor           |
| Read-Only  | No                 | No, none of them do     | Accessor           |
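The table above can be read as a small decision function. The following is a minimal sketch in Go, not Praefect's implementation: the names `RepoState`, `Classify`, and `AcceptsMutators` are hypothetical, and it assumes the caller already knows whether the new primary and any available secondary hold the latest changes.

```go
package main

import "fmt"

// RepoState is a hypothetical name for the three states in the summary table.
type RepoState string

const (
	ReadWrite RepoState = "read-write"
	Recovery  RepoState = "recovery"
	ReadOnly  RepoState = "read-only"
)

// Classify picks the state of a repository once a new primary is designated.
// anySecondaryHasLatest should only count secondaries that are currently available.
func Classify(primaryHasLatest, anySecondaryHasLatest bool) RepoState {
	switch {
	case primaryHasLatest:
		return ReadWrite // the secondaries are irrelevant in this case
	case anySecondaryHasLatest:
		return Recovery
	default:
		return ReadOnly
	}
}

// AcceptsMutators reports whether mutator RPCs are processed in state s.
// Accessor RPCs are processed in every state.
func AcceptsMutators(s RepoState) bool {
	return s == ReadWrite
}

func main() {
	for _, c := range [][2]bool{{true, false}, {false, true}, {false, false}} {
		s := Classify(c[0], c[1])
		fmt.Printf("primary=%t secondary=%t -> %s (mutators: %t)\n", c[0], c[1], s, AcceptsMutators(s))
	}
}
```

Checking the primary first mirrors the "Irrelevant" cell in the table: once the primary has the latest changes, the state of the secondaries does not affect which RPCs are processed.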

### Repository State Diagram

```mermaid
stateDiagram
    FO: "failover"
    RW: "read-write"
    R: "recovery"
    RO: "read-only"
    
    RW --> FO: [3]
    R  --> FO: [3]
    RO --> FO: [3]
    
    FO-->RW: Primary has latest
    FO-->R:  Secondary has latest
    FO-->RO: Nobody has latest
    

    RO-->R:  [1] Latest recovers
    R -->RW: [2] Primary repaired
```

- [1] **Recovery or Forgiveness** - occurs when the missing latest changes for a repository become available again or an admin decides to forgive the dataloss
- [2] **Primary Repaired** - occurs when the primary replica of a repository is repaired by receiving the latest changes from a secondary. Once the repair is complete, the repository will be in read-write mode
- [3] **Failover** - from any state, the primary may fail over
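As an illustration, the arrows in the diagram can be encoded as a lookup table. This is a sketch under assumptions: the `State`, `Event`, and `Next` names are hypothetical, and only the arrows drawn in the diagram are included (the manual forgiveness path is left out because the diagram does not draw a target state for it).

```go
package main

import "fmt"

type State string
type Event string

const (
	Failover  State = "failover"
	ReadWrite State = "read-write"
	Recovery  State = "recovery"
	ReadOnly  State = "read-only"
)

const (
	PrimaryFails       Event = "primary fails" // [3]
	PrimaryHasLatest   Event = "primary has latest"
	SecondaryHasLatest Event = "secondary has latest"
	NobodyHasLatest    Event = "nobody has latest"
	LatestRecovers     Event = "latest recovers"  // [1]
	PrimaryRepaired    Event = "primary repaired" // [2]
)

// edge identifies one arrow: the state it leaves and the event that triggers it.
type edge struct {
	from State
	on   Event
}

// transitions lists exactly the arrows drawn in the diagram.
var transitions = map[edge]State{
	{ReadWrite, PrimaryFails}:      Failover,
	{Recovery, PrimaryFails}:       Failover,
	{ReadOnly, PrimaryFails}:       Failover,
	{Failover, PrimaryHasLatest}:   ReadWrite,
	{Failover, SecondaryHasLatest}: Recovery,
	{Failover, NobodyHasLatest}:    ReadOnly,
	{ReadOnly, LatestRecovers}:     Recovery,
	{Recovery, PrimaryRepaired}:    ReadWrite,
}

// Next returns the state reached from `from` on event `on`.
// ok is false when the diagram has no such arrow.
func Next(from State, on Event) (to State, ok bool) {
	to, ok = transitions[edge{from, on}]
	return to, ok
}

func main() {
	s := ReadWrite
	for _, ev := range []Event{PrimaryFails, NobodyHasLatest, LatestRecovers, PrimaryRepaired} {
		next, ok := Next(s, ev)
		if !ok {
			fmt.Printf("%s: no transition on %q\n", s, ev)
			return
		}
		fmt.Printf("%s --%s--> %s\n", s, ev, next)
		s = next
	}
}
```

The walk in `main` follows one path through the diagram: a failover where nobody has the latest changes, followed by the recovery of a node that does, followed by the repair of the primary.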

## Replication Log

Praefect is capable of detecting dataloss resulting from incomplete replication jobs in the replication log.[^1]
The replication log is a database-persisted log data structure of all open and closed replication jobs in sequential order.
Open jobs are jobs that have not yet been completed or cancelled.
Closed jobs are jobs that Praefect no longer needs to process.
When processing replication jobs, a Praefect instance will process each unique[^2] open job in sequential order.
A Praefect cluster will not process more than one job at a time per repository replica.
The job will be attempted indefinitely until completed successfully or cancelled (via forgiveness or repair).

```
┏━━━Replication Log━━━┓                                        
┃                     ┃                                        
┃ ┌ ─ Closed Jobs ─ ┐ ┃                                        
┃   ┌─────────────┐   ┃                                        
┃ │ │    J[0]     │ │ ┃                                        
┃   ├─────────────┤   ┃                                        
┃ │ │     ...     │ │ ┃                                        
┃   ├─────────────┤   ┃                                        
┃ │ │    J[m]     │ │ ┃                                        
┃   └─────────────┘   ┃                                        
┃ ├ ─ ─Open Jobs─ ─ ┤ ┃                                        
┃   ┌─────────────┐   ┃                                        
┃ │ │   J[m+1]    │◀┼─╋─── Oldest open job processed first     
┃   ├─────────────┤   ┃                                        
┃ │ │     ...     │ │ ┃                                        
┃   ├─────────────┤   ┃                                        
┃ │ │   J[m+n]    │◀┼─╋─── New jobs inserted at end of queue   
┃   └─────────────┘   ┃                                        
┃ └ ─ ─ ─ ─ ─ ─ ─ ─ ┘ ┃                                        
┗━━━━━━━━━━━━━━━━━━━━━┛                                        
```
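The queue behaviour pictured above can be sketched with an in-memory slice. This is only an illustration: the real log is persisted in a database, and the `Job`, `Log`, `Enqueue`, `OldestOpen`, and `Close` names are hypothetical. The sketch assumes one log per repository replica, which is what makes "one job at a time per replica" fall out of always taking the oldest open job.

```go
package main

import "fmt"

type JobState string

const (
	Incomplete JobState = "Incomplete"
	Complete   JobState = "Complete"
	Cancelled  JobState = "Cancelled"
)

// Job is a hypothetical in-memory stand-in for one row of the replication log.
type Job struct {
	ID       int
	Src, Dst string
	State    JobState
}

// Log holds the jobs for one repository replica, oldest first.
type Log struct {
	jobs   []Job
	nextID int
}

// Enqueue appends a new open job at the end of the queue and returns its ID.
func (l *Log) Enqueue(src, dst string) int {
	l.nextID++
	l.jobs = append(l.jobs, Job{ID: l.nextID, Src: src, Dst: dst, State: Incomplete})
	return l.nextID
}

// OldestOpen returns the first job that is still open, if any.
func (l *Log) OldestOpen() (Job, bool) {
	for _, j := range l.jobs {
		if j.State == Incomplete {
			return j, true
		}
	}
	return Job{}, false
}

// Close moves an open job to Complete or Cancelled.
// It reports false if the job is not open or final is not a closed state.
func (l *Log) Close(id int, final JobState) bool {
	if final == Incomplete {
		return false
	}
	for i := range l.jobs {
		if l.jobs[i].ID == id && l.jobs[i].State == Incomplete {
			l.jobs[i].State = final
			return true
		}
	}
	return false
}

func main() {
	var rl Log
	rl.Enqueue("X", "Y")
	rl.Enqueue("X", "Y")
	for {
		j, ok := rl.OldestOpen()
		if !ok {
			break
		}
		fmt.Printf("processing job %d (%s -> %s)\n", j.ID, j.Src, j.Dst)
		rl.Close(j.ID, Complete)
	}
}
```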

Each secondary replica of a repository is considered to have the latest changes when the last replication job for that replica is completed successfully.

When a repository is in read-only mode and a node containing the latest changes becomes available (i.e. healthy), this will signal to Praefect that the primary should be repaired.

### Recovery from Primary Node

Given 3 replicas of a repository (Primary X, Secondary Y, and Secondary Z), consider the following replication logs after the primary is mutated:

| Replica X (Primary) | Replica Y                            | Replica Z                            |
| ---------------- | ------------------------------------ | ------------------------------------ |
| `-- start open jobs--` | `-- start open jobs--`               | `-- start open jobs--`               |
|                  | `ID=1, Src=X, Dst=Y, State=Incomplete` | `ID=1, Src=X, Dst=Z, State=Incomplete` |

Then, Primary X becomes unavailable before the open jobs have a chance to complete. Praefect chooses a new primary (Y). The repo will now be in read-only mode.

If replica X becomes available again, this will trigger Praefect to reattempt the original jobs:

| Replica X | Replica Y (Primary) | Replica Z |
| ---------------- | ------------------------------------ | ------------------------------------ |
| `-- start open jobs--` | `-- start open jobs--`               | `-- start open jobs--`               |
|                  | `ID=1, Src=X, Dst=Y, State=Complete` | `ID=1, Src=X, Dst=Z, State=Complete` |

Now the repository is in read-write mode.

### Recovery from Secondary Node

Given 3 replicas of a repository (Primary X, Secondary Y, and Secondary Z), consider the following replication logs after the primary is mutated:

| Replica X (Primary) | Replica Y | Replica Z                            |
| ---------------- | ---------------- | ------------------- |
| `-- start closed jobs--` | `-- start closed jobs--` | `-- start closed jobs--` |
| | |  `ID=1, Src=X, Dst=Z, State=Complete` |
| `-- start open jobs--` | `-- start open jobs--` | `-- start open jobs--` |
| | `ID=1, Src=X, Dst=Y, State=Incomplete` | |

Replica Z has finished all of its open replication jobs. Then, Primary X and Replica Z become unavailable before Replica Y's job completes. Praefect chooses a new primary (Y). The repo will now be in read-only mode:

| Replica X (OFFLINE) | Replica Y (Primary) | Replica Z (OFFLINE) |
| ---------------- | ---------------- | ------------------- |
| `-- start closed jobs--` | `-- start closed jobs--` | `-- start closed jobs--` |
| | |  `ID=1, Src=X, Dst=Z, State=Complete` |
| `-- start open jobs--` | `-- start open jobs--` | `-- start open jobs--` |
| | `ID=1, Src=X, Dst=Y, State=Incomplete` | |

Then, Z becomes available again. This puts the repository into recovery mode. This will trigger Praefect to cancel the previous replication job for Y and schedule a new one to propagate from Z to Y:

| Replica X (OFFLINE) | Replica Y (Primary) | Replica Z |
| ---------------- | ---------------- | ------------------- |
| `-- start closed jobs--` | `-- start closed jobs--` | `-- start closed jobs--` |
| | `ID=1, Src=X, Dst=Y, State=Cancelled` | `ID=1, Src=X, Dst=Z, State=Complete` |
| `-- start open jobs--` | `-- start open jobs--` | `-- start open jobs--` |
| | `ID=2, Src=Z, Dst=Y, State=Incomplete` | |

Once Y completes all open jobs, it will become a read-write repository again:

| Replica X (OFFLINE) | Replica Y (Primary) | Replica Z |
| ---------------- | ---------------- | ------------------- |
| `-- start closed jobs--` | `-- start closed jobs--` | `-- start closed jobs--` |
| | `ID=1, Src=X, Dst=Y, State=Cancelled` | `ID=1, Src=X, Dst=Z, State=Complete` |
| | `ID=2, Src=Z, Dst=Y, State=Complete` | |
| `-- start open jobs--` | `-- start open jobs--` | `-- start open jobs--` |
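The cancel-and-reschedule step in this walkthrough can be sketched as follows. The `Job`, `Reschedule`, and `HasLatest` names are hypothetical, the log is modelled as a plain slice for a single replica, and treating an empty log as "not latest" is a conservative assumption that the RFC does not state.

```go
package main

import "fmt"

type JobState string

const (
	Incomplete JobState = "Incomplete"
	Complete   JobState = "Complete"
	Cancelled  JobState = "Cancelled"
)

// Job is a hypothetical stand-in for one row of a replica's replication log.
type Job struct {
	ID       int
	Src, Dst string
	State    JobState
}

// Reschedule cancels every open job in a replica's log and appends a new open
// job that replicates from src to dst. The caller's slice is left untouched.
func Reschedule(log []Job, src, dst string) []Job {
	out := make([]Job, 0, len(log)+1)
	nextID := 1
	for _, j := range log {
		if j.State == Incomplete {
			j.State = Cancelled // j is a copy of the element
		}
		if j.ID >= nextID {
			nextID = j.ID + 1
		}
		out = append(out, j)
	}
	return append(out, Job{ID: nextID, Src: src, Dst: dst, State: Incomplete})
}

// HasLatest applies the rule that a replica has the latest changes when the
// last job in its log completed successfully.
// Assumption: an empty log is treated as "not latest".
func HasLatest(log []Job) bool {
	return len(log) > 0 && log[len(log)-1].State == Complete
}

func main() {
	// Replica Y's log when Z becomes available again.
	logY := []Job{{ID: 1, Src: "X", Dst: "Y", State: Incomplete}}

	logY = Reschedule(logY, "Z", "Y")
	fmt.Println(logY, "latest:", HasLatest(logY))

	logY[len(logY)-1].State = Complete // the Z -> Y job succeeds
	fmt.Println(logY, "latest:", HasLatest(logY))
}
```

Under these assumptions, Y's log ends with job 1 cancelled and job 2 complete, which matches the final table above and satisfies the "last job completed successfully" rule.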

## Dataloss Forgiveness

When the only replicas containing the latest changes for a repository are no longer viable, a GitLab administrator may wish to "forgive" the dataloss.

[^1]: This proposal intentionally leaves out dataloss resulting from disk corruption or other physical defects during data transfer and storage that are not detected by Gitaly.

[^2]: Praefect will intelligently collapse jobs of a similar type to reduce the number of replication jobs processed once https://gitlab.com/gitlab-org/gitaly/-/issues/2438 is closed. As part of that issue, Praefect will work on the newest job first, and once successful mark all older contiguous jobs of the same kind with the same status.