Add latest changes from gitlab-org/gitlab@master

author: GitLab Bot <gitlab-bot@gitlab.com> 2024-01-23 06:10:35 +0300
committer: GitLab Bot <gitlab-bot@gitlab.com> 2024-01-23 06:10:35 +0300
commit: 89ff92639b7ad6499fb1e9470e4151fb112a904e (patch)
tree: c03a3e8d5f5b905cd07932e7780e8f0d533e5ec6 /doc/development/database/partitioning/date_range.md
parent: 422294262e50d47bee73c2e85bfbc21473c2508a (diff)
1 files changed, 237 insertions, 0 deletions
diff --git a/doc/development/database/partitioning/date_range.md b/doc/development/database/partitioning/date_range.md
new file mode 100644
index 00000000000..5c68eb3d075
--- /dev/null
+++ b/doc/development/database/partitioning/date_range.md
@@ -0,0 +1,237 @@
+---
+stage: Data Stores
+group: Database
+info: Any user with at least the Maintainer role can merge updates to this content. For details, see https://docs.gitlab.com/ee/development/development_processes.html#development-guidelines-review.
+---
+
+# Date range partitioning
+
+> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/32240) in GitLab 13.1.
+
+## Description
+
+The scheme best supported by the GitLab migration helpers is date-range partitioning,
+where each partition in the table contains data for a single month. In this case,
+the partitioning key must be a timestamp or date column. For this type of
+partitioning to work well, most queries must access data in a
+certain date range.
+
+For a more concrete example, consider using the `audit_events` table.
+It was the first table to be partitioned in the application database
+(scheduled for deployment with the GitLab 13.5 release). This
+table tracks audit entries of security events that happen in the
+application. In almost all cases, users want to see audit activity that
+occurs in a certain time frame. As a result, date-range partitioning
+was a natural fit for how the data would be accessed.
+
+To look at this in more detail, imagine a simplified `audit_events` schema:
+
+```sql
+CREATE TABLE audit_events (
+  id SERIAL NOT NULL PRIMARY KEY,
+  author_id INT NOT NULL,
+  details jsonb NOT NULL,
+  created_at timestamptz NOT NULL);
+```
+
+Now imagine typical queries in the UI would display the data in a
+certain date range, like a single week:
+
+```sql
+SELECT *
+FROM audit_events
+WHERE created_at >= '2020-01-01 00:00:00'
+  AND created_at < '2020-01-08 00:00:00'
+ORDER BY created_at DESC
+LIMIT 100
+```
+
+If the table is partitioned on the `created_at` column the base table would
+look like:
+
+```sql
+CREATE TABLE audit_events (
+  id SERIAL NOT NULL,
+  author_id INT NOT NULL,
+  details jsonb NOT NULL,
+  created_at timestamptz NOT NULL,
+  PRIMARY KEY (id, created_at))
+PARTITION BY RANGE(created_at);
+```
+
+NOTE:
+The primary key of a partitioned table must include the partition key as
+part of the primary key definition.
+
+And we might have a list of partitions for the table, such as:
+
+```sql
+audit_events_202001 FOR VALUES FROM ('2020-01-01') TO ('2020-02-01')
+audit_events_202002 FOR VALUES FROM ('2020-02-01') TO ('2020-03-01')
+audit_events_202003 FOR VALUES FROM ('2020-03-01') TO ('2020-04-01')
+```
+
+Each partition is a separate physical table, with the same structure as
+the base `audit_events` table, but contains only data for rows where the
+partition key falls in the specified range. For example, the partition
+`audit_events_202001` contains rows where the `created_at` column is
+greater than or equal to `2020-01-01` and less than `2020-02-01`.
+
+Now, if we look at the previous example query again, the database can
+use the `WHERE` to recognize that all matching rows are in the
+`audit_events_202001` partition. Rather than searching all of the data
+in all of the partitions, it can search only the single month's worth
+of data in the appropriate partition. In a large table, this can
+dramatically reduce the amount of data the database needs to access.
+However, imagine a query that does not filter based on the partitioning
+key, such as:
+
+```sql
+SELECT *
+FROM audit_events
+WHERE author_id = 123
+ORDER BY created_at DESC
+LIMIT 100
+```
+
+In this example, the database can't prune any partitions from the search,
+because matching data could exist in any of them. As a result, it has to
+query each partition individually, and aggregate the rows into a single result
+set. Because `author_id` would be indexed, the performance impact could
+likely be acceptable, but on more complex queries the overhead can be
+substantial. Partitioning should only be leveraged if the access patterns
+of the data support the partitioning strategy, otherwise performance
+suffers.
+
+## Example
+
+### Step 1: Creating the partitioned copy (Release N)
+
+The first step is to add a migration to create the partitioned copy of
+the original table. This migration creates the appropriate
+partitions based on the data in the original table, and install a
+trigger that syncs writes from the original table into the
+partitioned copy.
+
+An example migration of partitioning the `audit_events` table by its
+`created_at` column would look like:
+
+```ruby
+class PartitionAuditEvents < Gitlab::Database::Migration[2.1]
+  include Gitlab::Database::PartitioningMigrationHelpers
+
+  def up
+    partition_table_by_date :audit_events, :created_at
+  end
+
+  def down
+    drop_partitioned_table_for :audit_events
+  end
+end
+```
+
+After this has executed, any inserts, updates, or deletes in the
+original table are also duplicated in the new table. For updates and
+deletes, the operation only has an effect if the corresponding row
+exists in the partitioned table.
+
+### Step 2: Backfill the partitioned copy (Release N)
+
+The second step is to add a post-deployment migration that schedules
+the background jobs that backfill existing data from the original table
+into the partitioned copy.
+
+Continuing the above example, the migration would look like:
+
+```ruby
+class BackfillPartitionAuditEvents < Gitlab::Database::Migration[2.1]
+  include Gitlab::Database::PartitioningMigrationHelpers
+
+  disable_ddl_transaction!
+
+  restrict_gitlab_migration gitlab_schema: :gitlab_main
+
+  def up
+    enqueue_partitioning_data_migration :audit_events
+  end
+
+  def down
+    cleanup_partitioning_data_migration :audit_events
+  end
+end
+```
+
+This step [queues a batched background migration](../batched_background_migrations.md#enqueue-a-batched-background-migration) internally with BATCH_SIZE and SUB_BATCH_SIZE as `50,000` and `2,500`. Refer [Batched Background migrations guide](../batched_background_migrations.md) for more details.
+
+### Step 3: Post-backfill cleanup (Release N+1)
+
+This step must occur at least one release after the release that
+includes step (2). This gives time for the background
+migration to execute properly in self-managed installations. In this step,
+add another post-deployment migration that cleans up after the
+background migration. This includes forcing any remaining jobs to
+execute, and copying data that may have been missed, due to dropped or
+failed jobs.
+
+WARNING:
+A required stop must occur between steps 2 and 3 to allow the background migration from step 2 to complete successfully.
+
+Once again, continuing the example, this migration would look like:
+
+```ruby
+class CleanupPartitionedAuditEventsBackfill < Gitlab::Database::Migration[2.1]
+  include Gitlab::Database::PartitioningMigrationHelpers
+
+  disable_ddl_transaction!
+
+  restrict_gitlab_migration gitlab_schema: :gitlab_main
+
+  def up
+    finalize_backfilling_partitioned_table :audit_events
+  end
+
+  def down
+    # no op
+  end
+end
+```
+
+After this migration completes, the original table and partitioned
+table should contain identical data. The trigger installed on the
+original table guarantees that the data remains in sync going forward.
+
+### Step 4: Swap the partitioned and non-partitioned tables (Release N+1)
+
+This step replaces the non-partitioned table with its partitioned copy, this should be used only after all other migration steps have completed successfully.
+
+Some limitations to this method MUST be handled before, or during, the swap migration:
+
+- Secondary indexes and foreign keys are not automatically recreated on the partitioned table.
+- Some types of constraints (UNIQUE and EXCLUDE) which rely on indexes, are not automatically recreated
+  on the partitioned table, since the underlying index will not be present.
+- Foreign keys referencing the original non-partitioned table should be updated to reference the
+  partitioned table. This is not supported in PostgreSQL 11.
+- Views referencing the original table are not automatically updated to reference the partitioned table.
+
+```ruby
+# frozen_string_literal: true
+
+class SwapPartitionedAuditEvents < ActiveRecord::Migration[6.0]
+  include Gitlab::Database::PartitioningMigrationHelpers
+
+  def up
+    replace_with_partitioned_table :audit_events
+  end
+
+  def down
+    rollback_replace_with_partitioned_table :audit_events
+  end
+end
+```
+
+After this migration completes:
+
+- The partitioned table replaces the non-partitioned (original) table.
+- The sync trigger created earlier is dropped.
+
+The partitioned table is now ready for use by the application.
author	GitLab Bot <gitlab-bot@gitlab.com>	2024-01-23 06:10:35 +0300
committer	GitLab Bot <gitlab-bot@gitlab.com>	2024-01-23 06:10:35 +0300
commit	89ff92639b7ad6499fb1e9470e4151fb112a904e (patch)
tree	c03a3e8d5f5b905cd07932e7780e8f0d533e5ec6 /doc/development/database/partitioning/date_range.md
parent	422294262e50d47bee73c2e85bfbc21473c2508a (diff)