Welcome to mirror list, hosted at ThFree Co, Russian Federation.

gitlab.com/gitlab-org/gitlab-foss.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
Diffstat (limited to 'doc/development/database/table_partitioning.md')
-rw-r--r--doc/development/database/table_partitioning.md51
1 files changed, 25 insertions, 26 deletions
diff --git a/doc/development/database/table_partitioning.md b/doc/development/database/table_partitioning.md
index 34cb73978bc..582c988bef9 100644
--- a/doc/development/database/table_partitioning.md
+++ b/doc/development/database/table_partitioning.md
@@ -1,5 +1,5 @@
---
-stage: Enablement
+stage: Data Stores
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
@@ -36,23 +36,23 @@ before attempting to leverage this feature.
While partitioning can be very useful when properly applied, it's
imperative to identify if the data and workload of a table naturally fit a
-partitioning scheme. There are a few details you'll have to understand
-in order to decide if partitioning is a good fit for your particular
+partitioning scheme. There are a few details you have to understand
+to decide if partitioning is a good fit for your particular
problem.
First, a table is partitioned on a partition key, which is a column or
-set of columns which determine how the data will be split across the
+set of columns which determine how the data is split across the
partitions. The partition key is used by the database when reading or
-writing data, to decide which partitions need to be accessed. The
+writing data, to decide which partitions must be accessed. The
partition key should be a column that would be included in a `WHERE`
clause on almost all queries accessing that table.
-Second, it's necessary to understand the strategy the database will
-use to split the data across the partitions. The scheme supported by the
+Second, it's necessary to understand the strategy the database uses
+to split the data across the partitions. The scheme supported by the
GitLab migration helpers is date-range partitioning, where each partition
in the table contains data for a single month. In this case, the partitioning
-key would need to be a timestamp or date column. In order for this type of
-partitioning to work well, most queries would need to access data within a
+key must be a timestamp or date column. In order for this type of
+partitioning to work well, most queries must access data in a
certain date range.
For a more concrete example, the `audit_events` table can be used, which
@@ -73,7 +73,7 @@ CREATE TABLE audit_events (
created_at timestamptz NOT NULL);
```
-Now imagine typical queries in the UI would display the data within a
+Now imagine typical queries in the UI would display the data in a
certain date range, like a single week:
```sql
@@ -117,7 +117,7 @@ partition key falls in the specified range. For example, the partition
greater than or equal to `2020-01-01` and less than `2020-02-01`.
Now, if we look at the previous example query again, the database can
-use the `WHERE` to recognize that all matching rows will be in the
+use the `WHERE` to recognize that all matching rows are in the
`audit_events_202001` partition. Rather than searching all of the data
in all of the partitions, it can search only the single month's worth
of data in the appropriate partition. In a large table, this can
@@ -136,11 +136,11 @@ LIMIT 100
In this example, the database can't prune any partitions from the search,
because matching data could exist in any of them. As a result, it has to
query each partition individually, and aggregate the rows into a single result
-set. Since `author_id` would be indexed, the performance impact could
+set. Because `author_id` would be indexed, the performance impact could
likely be acceptable, but on more complex queries the overhead can be
substantial. Partitioning should only be leveraged if the access patterns
-of the data support the partitioning strategy, otherwise performance will
-suffer.
+of the data support the partitioning strategy, otherwise performance
+suffers.
## Partitioning a table
@@ -158,15 +158,15 @@ migration to copy data into the new table. Changes to the original table
schema can be made in parallel with the partitioning migration, but they
must take care to not break the underlying mechanism that makes the migration
work. For example, if a column is added to the table that is being
-partitioned, both the partitioned table and the trigger definition need to
+partitioned, both the partitioned table and the trigger definition must
be updated to match.
### Step 1: Creating the partitioned copy (Release N)
The first step is to add a migration to create the partitioned copy of
-the original table. This migration will also create the appropriate
+the original table. This migration creates the appropriate
partitions based on the data in the original table, and install a
-trigger that will sync writes from the original table into the
+trigger that syncs writes from the original table into the
partitioned copy.
An example migration of partitioning the `audit_events` table by its
@@ -186,15 +186,15 @@ class PartitionAuditEvents < Gitlab::Database::Migration[1.0]
end
```
-Once this has executed, any inserts, updates or deletes in the
-original table will also be duplicated in the new table. For updates and
-deletes, the operation will only have an effect if the corresponding row
+After this has executed, any inserts, updates, or deletes in the
+original table are also duplicated in the new table. For updates and
+deletes, the operation only has an effect if the corresponding row
exists in the partitioned table.
### Step 2: Backfill the partitioned copy (Release N)
-The second step is to add a post-deployment migration that will schedule
-the background jobs that will backfill existing data from the original table
+The second step is to add a post-deployment migration that schedules
+the background jobs that backfill existing data from the original table
into the partitioned copy.
Continuing the above example, the migration would look like:
@@ -225,7 +225,7 @@ partitioning migration.
The third step must occur at least one release after the release that
includes the background migration. This gives time for the background
migration to execute properly in self-managed installations. In this step,
-add another post-deployment migration that will cleanup after the
+add another post-deployment migration that cleans up after the
background migration. This includes forcing any remaining jobs to
execute, and copying data that may have been missed, due to dropped or
failed jobs.
@@ -248,12 +248,11 @@ end
After this migration has completed, the original table and partitioned
table should contain identical data. The trigger installed on the
-original table guarantees that the data will remain in sync going
-forward.
+original table guarantees that the data remains in sync going forward.
### Step 4: Swap the partitioned and non-partitioned tables (Release N+1)
-The final step of the migration will make the partitioned table ready
+The final step of the migration makes the partitioned table ready
for use by the application. This section will be updated when the
migration helper is ready, for now development can be followed in the
[Tracking Issue](https://gitlab.com/gitlab-org/gitlab/-/issues/241267).