Welcome to mirror list, hosted at ThFree Co, Russian Federation.

ci_mirrored_tables.md « database « development « doc - gitlab.com/gitlab-org/gitlab-foss.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
blob: bf3a744b936d31050de6dda0e05c7672a487278e (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
---
stage: Data Stores
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---

# CI mirrored tables

## Problem statement

As part of the database [decomposition work](https://gitlab.com/groups/gitlab-org/-/epics/6168),
which had the goal of splitting the single database GitLab is using, into two databases: `main` and
`ci`, came the big challenge of
[removing all joins between the `main` and the `ci` tables](multiple_databases.md#removing-joins-between-ci-and-non-ci-tables).
That is because PostgreSQL doesn't support joins between tables that belong to different databases.
However, some core application models in the main database are queried very often by the CI side.
For example:

- `Namespace`, in the `namespaces` table.
- `Project`, in the `projects` table.

Not being able to do `joins` on these tables brings a great challenge. The team chose to perform logical
replication of those tables from the main database to the CI database, in the new tables:

- `ci_namespace_mirrors`, as a mirror of the `namespaces` table
- `ci_project_mirrors`, as a mirror of the `projects` table

This logical replication means two things:

1. The `main` database tables can be queried and joined to the `namespaces` and `projects` tables.
1. The `ci` database tables can be joined with the `ci_namespace_mirrors` and `ci_project_mirrors` tables.

```mermaid
graph LR

    subgraph "Main database (tables)"
        A[namespaces] -->|updates| B[namespaces_sync_events]
        A -->|deletes| C[loose_foreign_keys_deleted_records]
        D[projects] -->|deletes| C
        D -->|updates| E[projects_sync_events]
    end

    B --> F
    C --> G
    E --> H

   subgraph "Sidekiq worker jobs"
     F[Namespaces::ProcessSyncEventsWorker]
     G[LooseForeignKeys::CleanupWorker]
     H[Projects::ProcessSyncEventsWorker]
   end

    F -->|do update| I
    G -->|delete records| I
    G -->|delete records| J
    H -->|do update| J

    subgraph "CI database (tables)"
        I[ci_namespace_mirrors]
        J[ci_project_mirrors]
    end
```

This replication was restricted only to a few attributes that are needed from each model:

- From `Namespace` we replicate `traversal_ids`.
- From `Project` we replicate only the `namespace_id`, which represents the group which the project belongs to.

## Keeping the CI mirrored tables in sync with the source tables

We must care about two type 3 events to keep
the source and the target tables in sync:

1. Creation of new namespaces or projects.
1. Updating the namespaces or projects.
1. Deleting namespaces/projects.

```mermaid
graph TD

    subgraph "CI database (tables)"
        E[other CI tables]
        F{queries with joins allowed}
        G[ci_project_mirrors]
        H[ci_namespace_mirrors]

        E---F
        F---G
        F---H
    end

    A---B
    B---C
    B---D

L["⛔ ← Joins are not allowed → ⛔"]

    subgraph "Main database (tables)"
        A[other main tables]
        B{queries with joins allowed}
        C[projects]
        D[namespaces]
    end
```

### Create and update

Syncing the data of newly created or updated namespaces or projects happens in this
order:

1. **On the `main` database**: Any `INSERT` or `UPDATE` on the `namespaces` or `projects` tables
   adds an entry to the tables `namespaces_sync_events`, and `projects_sync_events`. These tables
   also exist on the `main` database. These entries are added by triggers on both of the tables.
1. **On the model level**: After a commit happens on either of the source models `Namespace` or
   `Project`, it schedules the corresponding Sidekiq jobs `Namespaces::ProcessSyncEventsWorker`
   or `Projects::ProcessSyncEventsWorker` to run.
1. These workers then:
   1. Read the entries from the tables `(namespaces/project)_sync_events`
      from the `main` database, to check which namespaces or projects to sync.
   1. Copy the data for any updated records into the target
      tables `ci_namespace_mirrors`, `ci_project_mirrors`.

### Delete

When any of `namespaces` or `projects` are deleted, the target records on the mirrored
CI tables are deleted using the [loose foreign keys](loose_foreign_keys.md) (LFK) mechanism.

By having these items in the `config/gitlab_loose_foreign_keys.yml`, the LFK mechanism
was already working as expected. It deleted any records on the CI mirrored
tables that mapped to deleted `namespaces` or `projects` in the `main` database.

```yaml
ci_namespace_mirrors:
  - table: namespaces
    column: namespace_id
    on_delete: async_delete
ci_project_mirrors:
  - table: projects
    column: project_id
    on_delete: async_delete
```

## Consistency Checking

To make sure that both syncing mechanisms work as expected, we deploy
two extra worker jobs, triggered by cron jobs every few minutes:

1. `Database::CiNamespaceMirrorsConsistencyCheckWorker`
1. `Database::CiProjectMirrorsConsistencyCheckWorker`

These jobs:

1. Scan both of the source tables on the `main` database, using a cursor.
1. Compare the items in the `namespaces` and `projects` with the target tables on the `ci` database.
1. Report the items that are not in sync to Kibana and Prometheus.
1. Corrects any discrepancies.