Diffstat (limited to 'doc/development/database')
-rw-r--r-- | doc/development/database/partitioning/date_range.md | 237
-rw-r--r-- | doc/development/database/partitioning/hash.md | 35
-rw-r--r-- | doc/development/database/partitioning/index.md | 71
-rw-r--r-- | doc/development/database/partitioning/int_range.md | 205
-rw-r--r-- | doc/development/database/partitioning/list.md | 372
-rw-r--r-- | doc/development/database/table_partitioning.md | 707 |
6 files changed, 927 insertions, 700 deletions
diff --git a/doc/development/database/partitioning/date_range.md b/doc/development/database/partitioning/date_range.md
new file mode 100644
index 00000000000..5c68eb3d075
--- /dev/null
+++ b/doc/development/database/partitioning/date_range.md
@@ -0,0 +1,237 @@
---
stage: Data Stores
group: Database
info: Any user with at least the Maintainer role can merge updates to this content. For details, see https://docs.gitlab.com/ee/development/development_processes.html#development-guidelines-review.
---

# Date range partitioning

> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/32240) in GitLab 13.1.

## Description

The scheme best supported by the GitLab migration helpers is date-range partitioning, where each partition in the table contains data for a single month. In this case, the partitioning key must be a timestamp or date column. For this type of partitioning to work well, most queries must access data in a certain date range.

For a more concrete example, consider the `audit_events` table. It was the first table to be partitioned in the application database (scheduled for deployment with the GitLab 13.5 release). This table tracks audit entries of security events that happen in the application. In almost all cases, users want to see audit activity that occurs in a certain time frame. As a result, date-range partitioning was a natural fit for how the data would be accessed.
To look at this in more detail, imagine a simplified `audit_events` schema:

```sql
CREATE TABLE audit_events (
  id SERIAL NOT NULL PRIMARY KEY,
  author_id INT NOT NULL,
  details jsonb NOT NULL,
  created_at timestamptz NOT NULL);
```

Now imagine typical queries in the UI would display the data in a certain date range, like a single week:

```sql
SELECT *
FROM audit_events
WHERE created_at >= '2020-01-01 00:00:00'
  AND created_at < '2020-01-08 00:00:00'
ORDER BY created_at DESC
LIMIT 100
```

If the table is partitioned on the `created_at` column, the base table would look like:

```sql
CREATE TABLE audit_events (
  id SERIAL NOT NULL,
  author_id INT NOT NULL,
  details jsonb NOT NULL,
  created_at timestamptz NOT NULL,
  PRIMARY KEY (id, created_at))
PARTITION BY RANGE(created_at);
```

NOTE:
The primary key of a partitioned table must include the partition key as part of the primary key definition.

And we might have a list of partitions for the table, such as:

```sql
audit_events_202001 FOR VALUES FROM ('2020-01-01') TO ('2020-02-01')
audit_events_202002 FOR VALUES FROM ('2020-02-01') TO ('2020-03-01')
audit_events_202003 FOR VALUES FROM ('2020-03-01') TO ('2020-04-01')
```

Each partition is a separate physical table, with the same structure as the base `audit_events` table, but contains only data for rows where the partition key falls in the specified range. For example, the partition `audit_events_202001` contains rows where the `created_at` column is greater than or equal to `2020-01-01` and less than `2020-02-01`.

Now, if we look at the previous example query again, the database can use the `WHERE` condition to recognize that all matching rows are in the `audit_events_202001` partition. Rather than searching all of the data in all of the partitions, it can search only the single month's worth of data in the appropriate partition.
In a large table, this can dramatically reduce the amount of data the database needs to access. However, imagine a query that does not filter based on the partitioning key, such as:

```sql
SELECT *
FROM audit_events
WHERE author_id = 123
ORDER BY created_at DESC
LIMIT 100
```

In this example, the database can't prune any partitions from the search, because matching data could exist in any of them. As a result, it has to query each partition individually, and aggregate the rows into a single result set. Because `author_id` would be indexed, the performance impact could likely be acceptable, but on more complex queries the overhead can be substantial. Partitioning should only be leveraged if the access patterns of the data support the partitioning strategy; otherwise, performance suffers.

## Example

### Step 1: Creating the partitioned copy (Release N)

The first step is to add a migration to create the partitioned copy of the original table. This migration creates the appropriate partitions based on the data in the original table, and installs a trigger that syncs writes from the original table into the partitioned copy.

An example migration of partitioning the `audit_events` table by its `created_at` column would look like:

```ruby
class PartitionAuditEvents < Gitlab::Database::Migration[2.1]
  include Gitlab::Database::PartitioningMigrationHelpers

  def up
    partition_table_by_date :audit_events, :created_at
  end

  def down
    drop_partitioned_table_for :audit_events
  end
end
```

After this has executed, any inserts, updates, or deletes in the original table are also duplicated in the new table. For updates and deletes, the operation only has an effect if the corresponding row exists in the partitioned table.
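This sync behavior can be pictured with a small plain-Ruby toy model (illustrative only — the real mechanism is a PL/pgSQL trigger installed by the helper, and `SyncedTables` is a made-up name for this sketch):

```ruby
# Toy model of the sync trigger's semantics: inserts always propagate to the
# partitioned copy, but updates and deletes only take effect there when the
# row already exists (that is, once it has been inserted or backfilled).
class SyncedTables
  attr_reader :original, :partitioned

  def initialize
    @original = {}
    @partitioned = {}
  end

  def insert(id, row)
    @original[id] = row
    @partitioned[id] = row # inserts always propagate
  end

  def update(id, row)
    @original[id] = row if @original.key?(id)
    @partitioned[id] = row if @partitioned.key?(id) # no-op when not yet backfilled
  end

  def delete(id)
    @original.delete(id)
    @partitioned.delete(id) # no-op when not yet backfilled
  end
end
```

A row that predates the trigger and has not been backfilled yet is simply skipped by updates and deletes on the copy, which is why the backfill in the next step is needed.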
### Step 2: Backfill the partitioned copy (Release N)

The second step is to add a post-deployment migration that schedules the background jobs that backfill existing data from the original table into the partitioned copy.

Continuing the above example, the migration would look like:

```ruby
class BackfillPartitionAuditEvents < Gitlab::Database::Migration[2.1]
  include Gitlab::Database::PartitioningMigrationHelpers

  disable_ddl_transaction!

  restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    enqueue_partitioning_data_migration :audit_events
  end

  def down
    cleanup_partitioning_data_migration :audit_events
  end
end
```

This step [queues a batched background migration](../batched_background_migrations.md#enqueue-a-batched-background-migration) internally with a `BATCH_SIZE` of `50,000` and a `SUB_BATCH_SIZE` of `2,500`. Refer to the [Batched background migrations guide](../batched_background_migrations.md) for more details.

### Step 3: Post-backfill cleanup (Release N+1)

This step must occur at least one release after the release that includes step (2). This gives time for the background migration to execute properly in self-managed installations. In this step, add another post-deployment migration that cleans up after the background migration. This includes forcing any remaining jobs to execute, and copying data that may have been missed due to dropped or failed jobs.

WARNING:
A required stop must occur between steps 2 and 3 to allow the background migration from step 2 to complete successfully.

Once again, continuing the example, this migration would look like:

```ruby
class CleanupPartitionedAuditEventsBackfill < Gitlab::Database::Migration[2.1]
  include Gitlab::Database::PartitioningMigrationHelpers

  disable_ddl_transaction!
  restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    finalize_backfilling_partitioned_table :audit_events
  end

  def down
    # no op
  end
end
```

After this migration completes, the original table and partitioned table should contain identical data. The trigger installed on the original table guarantees that the data remains in sync going forward.

### Step 4: Swap the partitioned and non-partitioned tables (Release N+1)

This step replaces the non-partitioned table with its partitioned copy. It should be used only after all other migration steps have completed successfully.

Some limitations to this method MUST be handled before, or during, the swap migration:

- Secondary indexes and foreign keys are not automatically recreated on the partitioned table.
- Some types of constraints (`UNIQUE` and `EXCLUDE`) which rely on indexes are not automatically recreated on the partitioned table, because the underlying index is not present.
- Foreign keys referencing the original non-partitioned table should be updated to reference the partitioned table. This is not supported in PostgreSQL 11.
- Views referencing the original table are not automatically updated to reference the partitioned table.

```ruby
# frozen_string_literal: true

class SwapPartitionedAuditEvents < ActiveRecord::Migration[6.0]
  include Gitlab::Database::PartitioningMigrationHelpers

  def up
    replace_with_partitioned_table :audit_events
  end

  def down
    rollback_replace_with_partitioned_table :audit_events
  end
end
```

After this migration completes:

- The partitioned table replaces the non-partitioned (original) table.
- The sync trigger created earlier is dropped.

The partitioned table is now ready for use by the application.
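As an illustration of how rows are routed once the table is partitioned, the monthly partition for a given timestamp can be sketched in plain Ruby (`partition_for` is a hypothetical helper for this sketch, not part of the migration tooling — PostgreSQL does this routing internally):

```ruby
require 'time'

# Maps a row's created_at to the monthly partition that holds it, following
# the naming used earlier (audit_events_202001 covers 2020-01-01 to 2020-02-01).
def partition_for(created_at)
  "audit_events_#{created_at.strftime('%Y%m')}"
end

partition_for(Time.parse('2020-01-15 10:30:00')) # => "audit_events_202001"
```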
diff --git a/doc/development/database/partitioning/hash.md b/doc/development/database/partitioning/hash.md
new file mode 100644
index 00000000000..8852097d89a
--- /dev/null
+++ b/doc/development/database/partitioning/hash.md
@@ -0,0 +1,35 @@
---
stage: Data Stores
group: Database
info: Any user with at least the Maintainer role can merge updates to this content. For details, see https://docs.gitlab.com/ee/development/development_processes.html#development-guidelines-review.
---

# Hash Partitioning

Hash partitioning is a method of dividing a large table into smaller, more manageable partitions based on a hash function applied to a specified column, typically the ID column. It offers unique advantages for certain use cases, but it also comes with limitations.

Key points:

- Data distribution: Rows are assigned to partitions based on the hash value of their ID and a modulus-remainder calculation. For example, if partitioning by `HASH(ID)` with `MODULUS 64` and `REMAINDER 1`, rows with `hash(ID) % 64 == 1` would go into the corresponding partition.

- Query requirements: Hash partitioning works best when most queries include a `WHERE hashed_column = ?` condition, as this allows PostgreSQL to quickly identify the relevant partition.

- ID uniqueness: It's the only partitioning method (aside from complex list partitioning) that can guarantee ID uniqueness across multiple partitions at the database level.

Upfront decisions:

- The number of partitions must be chosen before table creation and cannot be easily changed later. This makes it crucial to anticipate future data growth.

Unsupported query types:

- Range queries (`WHERE id BETWEEN ? AND ?`) and lookups by other keys (`WHERE other_id = ?`) are not directly supported on hash-partitioned tables.

Considerations:

- Choose a large number of partitions to accommodate future growth.
- Ensure application queries align with hash partitioning requirements.
- Evaluate alternatives like range partitioning or list partitioning if range queries or lookups by other keys are essential.

In summary, hash partitioning is a valuable tool for specific scenarios, particularly when ID uniqueness across partitions is crucial. However, it's essential to carefully consider its limitations and query patterns before implementation.

diff --git a/doc/development/database/partitioning/index.md b/doc/development/database/partitioning/index.md
new file mode 100644
index 00000000000..78b0c069705
--- /dev/null
+++ b/doc/development/database/partitioning/index.md
@@ -0,0 +1,71 @@
---
stage: Data Stores
group: Database
info: Any user with at least the Maintainer role can merge updates to this content. For details, see https://docs.gitlab.com/ee/development/development_processes.html#development-guidelines-review.
---

# Database table partitioning

WARNING:
If you have questions not answered below, check for and add them to [this issue](https://gitlab.com/gitlab-org/gitlab/-/issues/398650). Tag `@gitlab-org/database-team/triage` and we'll get back to you with an answer as soon as possible. If you get an answer in Slack, document it on the issue as well so we can update this document in the future.

Table partitioning is a powerful database feature that allows a table's data to be split into smaller physical tables that act as a single large table. If the application is designed to work with partitioning in mind, there can be multiple benefits, such as:

- Query performance can be improved greatly, because the database can cheaply eliminate much of the data from the search space, while still providing full SQL capabilities.

- Bulk deletes can be achieved with minimal impact on the database by dropping entire partitions. This is a natural fit for features that need to periodically delete data that falls outside the retention window.
- Administrative tasks like `VACUUM` and index rebuilds can operate on individual partitions, rather than across a single massive table.

Unfortunately, not all models fit a partitioning scheme, and there are significant drawbacks if implemented incorrectly. Additionally, **tables can only be partitioned at their creation**, making it nontrivial to apply partitioning to a busy database. A suite of migration tools is available to enable backend developers to partition existing tables, but the migration process is rather heavy, taking multiple steps split across several releases. Due to the limitations of partitioning and the related migrations, you should understand how partitioning fits your use case before attempting to leverage this feature.

The partitioning migration helpers work by creating a partitioned duplicate of the original table and using a combination of a trigger and a background migration to copy data into the new table. Changes to the original table schema can be made in parallel with the partitioning migration, but care must be taken not to break the underlying mechanism that makes the migration work. For example, if a column is added to the table that is being partitioned, both the partitioned table and the trigger definition must be updated to match.

## Determine when to use partitioning

While partitioning can be very useful when properly applied, it's imperative to identify if the data and workload of a table naturally fit a partitioning scheme. Understand a few details to decide if partitioning is a good fit for your particular problem:

- **Table partitioning**. A table is partitioned on a partition key, which is a column or set of columns that determines how the data is split across the partitions. The partition key is used by the database when reading or writing data, to decide which partitions must be accessed.
  The partition key should be a column that would be included in a `WHERE` clause on almost all queries accessing that table.

- **How the data is split**. What strategy does the database use to split the data across the partitions? The available choices are `range`, `hash`, and `list`.

## Determine the appropriate partitioning strategy

The available partitioning strategy choices are `date range`, `int range`, `hash`, and `list`.

diff --git a/doc/development/database/partitioning/int_range.md b/doc/development/database/partitioning/int_range.md
new file mode 100644
index 00000000000..7fbdd4da865
--- /dev/null
+++ b/doc/development/database/partitioning/int_range.md
@@ -0,0 +1,205 @@
---
stage: Data Stores
group: Database
info: Any user with at least the Maintainer role can merge updates to this content. For details, see https://docs.gitlab.com/ee/development/development_processes.html#development-guidelines-review.
---

# Int range partitioning

> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/132148) in GitLab 16.8.

## Description

Int range partitioning is a technique for dividing a large table into smaller, more manageable chunks based on an integer column. This can be particularly useful for tables with large numbers of rows, as it can significantly improve query performance, reduce storage requirements, and simplify maintenance tasks. For this type of partitioning to work well, most queries must access data in a certain int range.
To look at this in more detail, imagine a simplified `merge_request_diff_files` schema:

```sql
CREATE TABLE merge_request_diff_files (
  merge_request_diff_id INT NOT NULL,
  relative_order INT NOT NULL,
  PRIMARY KEY (merge_request_diff_id, relative_order));
```

Now imagine typical queries in the UI would display the data in a certain int range:

```sql
SELECT *
FROM merge_request_diff_files
WHERE merge_request_diff_id > 1 AND merge_request_diff_id < 10
LIMIT 100
```

If the table is partitioned on the `merge_request_diff_id` column, the base table would look like:

```sql
CREATE TABLE merge_request_diff_files (
  merge_request_diff_id INT NOT NULL,
  relative_order INT NOT NULL,
  PRIMARY KEY (merge_request_diff_id, relative_order))
PARTITION BY RANGE(merge_request_diff_id);
```

NOTE:
The primary key of a partitioned table must include the partition key as part of the primary key definition.

And we might have a list of partitions for the table, such as:

```sql
merge_request_diff_files_1 FOR VALUES FROM (1) TO (20)
merge_request_diff_files_20 FOR VALUES FROM (20) TO (40)
merge_request_diff_files_40 FOR VALUES FROM (40) TO (60)
```

Each partition is a separate physical table, with the same structure as the base `merge_request_diff_files` table, but contains only data for rows where the partition key falls in the specified range. For example, the partition `merge_request_diff_files_1` contains rows where the `merge_request_diff_id` column is greater than or equal to `1` and less than `20`.

Now, if we look at the previous example query again, the database can use the `WHERE` condition to recognize that all matching rows are in the `merge_request_diff_files_1` partition, rather than searching all of the data in all of the partitions. In a large table, this can dramatically reduce the amount of data the database needs to access.
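To make the routing concrete, the way a partition is picked for a given key can be sketched in plain Ruby (`partition_for` is a hypothetical helper that mirrors the partition bounds listed above; PostgreSQL does this routing internally):

```ruby
PARTITION_SIZE = 20 # matches the FROM/TO bounds in the partition list above

# Returns the name of the partition whose range covers the given key.
# The first partition is special-cased because its range starts at 1, not 0.
def partition_for(merge_request_diff_id)
  lower_bound = (merge_request_diff_id / PARTITION_SIZE) * PARTITION_SIZE
  lower_bound = 1 if lower_bound.zero?
  "merge_request_diff_files_#{lower_bound}"
end

partition_for(9)  # => "merge_request_diff_files_1"
partition_for(25) # => "merge_request_diff_files_20"
```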
## Example

### Step 1: Creating the partitioned copy (Release N)

The first step is to add a migration to create the partitioned copy of the original table. This migration creates the appropriate partitions based on the data in the original table, and installs a trigger that syncs writes from the original table into the partitioned copy.

An example migration of partitioning the `merge_request_diff_commits` table by its `merge_request_diff_id` column would look like:

```ruby
class PartitionMergeRequestDiffCommits < Gitlab::Database::Migration[2.1]
  include Gitlab::Database::PartitioningMigrationHelpers

  def up
    partition_table_by_int_range('merge_request_diff_commits', 'merge_request_diff_id', partition_size: 10000000,
      primary_key: %w[merge_request_diff_id relative_order])
  end

  def down
    drop_partitioned_table_for('merge_request_diff_commits')
  end
end
```

After this has executed, any inserts, updates, or deletes in the original table are also duplicated in the new table. For updates and deletes, the operation only has an effect if the corresponding row exists in the partitioned table.

### Step 2: Backfill the partitioned copy (Release N)

The second step is to add a post-deployment migration that schedules the background jobs that backfill existing data from the original table into the partitioned copy.

Continuing the above example, the migration would look like:

```ruby
class BackfillPartitionMergeRequestDiffCommits < Gitlab::Database::Migration[2.1]
  include Gitlab::Database::PartitioningMigrationHelpers

  disable_ddl_transaction!
  restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    enqueue_partitioning_data_migration :merge_request_diff_commits
  end

  def down
    cleanup_partitioning_data_migration :merge_request_diff_commits
  end
end
```

This step [queues a batched background migration](../batched_background_migrations.md#enqueue-a-batched-background-migration) internally with a `BATCH_SIZE` of `50,000` and a `SUB_BATCH_SIZE` of `2,500`. Refer to the [Batched background migrations guide](../batched_background_migrations.md) for more details.

### Step 3: Post-backfill cleanup (Release N+1)

This step must occur at least one release after the release that includes step (2). This gives time for the background migration to execute properly in self-managed installations. In this step, add another post-deployment migration that cleans up after the background migration. This includes forcing any remaining jobs to execute, and copying data that may have been missed due to dropped or failed jobs.

WARNING:
A required stop must occur between steps 2 and 3 to allow the background migration from step 2 to complete successfully.

Once again, continuing the example, this migration would look like:

```ruby
class CleanupPartitionedMergeRequestDiffCommitsBackfill < Gitlab::Database::Migration[2.1]
  include Gitlab::Database::PartitioningMigrationHelpers

  disable_ddl_transaction!

  restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    finalize_backfilling_partitioned_table :merge_request_diff_commits
  end

  def down
    # no op
  end
end
```

After this migration completes, the original table and partitioned table should contain identical data. The trigger installed on the original table guarantees that the data remains in sync going forward.
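The batched copying performed in steps 2 and 3 follows a simple pattern that can be sketched in plain Ruby (a toy model with small batch sizes; the real work happens inside the batched background migration framework):

```ruby
BATCH_SIZE = 6     # the real migration uses 50,000
SUB_BATCH_SIZE = 2 # the real migration uses 2,500

# Walks the source ids BATCH_SIZE at a time, copying each batch in smaller
# SUB_BATCH_SIZE slices so that every write transaction stays short.
def backfill(source_ids)
  copied = []
  source_ids.each_slice(BATCH_SIZE) do |batch|
    batch.each_slice(SUB_BATCH_SIZE) do |sub_batch|
      copied.concat(sub_batch) # stands in for an INSERT INTO ... SELECT of these ids
    end
  end
  copied
end
```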
### Step 4: Swap the partitioned and non-partitioned tables (Release N+1)

This step replaces the non-partitioned table with its partitioned copy. It should be used only after all other migration steps have completed successfully.

Some limitations to this method MUST be handled before, or during, the swap migration:

- Secondary indexes and foreign keys are not automatically recreated on the partitioned table.
- Some types of constraints (`UNIQUE` and `EXCLUDE`) which rely on indexes are not automatically recreated on the partitioned table, because the underlying index is not present.
- Foreign keys referencing the original non-partitioned table should be updated to reference the partitioned table. This is not supported in PostgreSQL 11.
- Views referencing the original table are not automatically updated to reference the partitioned table.

```ruby
# frozen_string_literal: true

class SwapPartitionedMergeRequestDiffCommits < ActiveRecord::Migration[6.0]
  include Gitlab::Database::PartitioningMigrationHelpers

  def up
    replace_with_partitioned_table :merge_request_diff_commits
  end

  def down
    rollback_replace_with_partitioned_table :merge_request_diff_commits
  end
end
```

After this migration completes:

- The partitioned table replaces the non-partitioned (original) table.
- The sync trigger created earlier is dropped.

The partitioned table is now ready for use by the application.

diff --git a/doc/development/database/partitioning/list.md b/doc/development/database/partitioning/list.md
new file mode 100644
index 00000000000..74ce5a00f6a
--- /dev/null
+++ b/doc/development/database/partitioning/list.md
@@ -0,0 +1,372 @@
---
stage: Data Stores
group: Database
info: Any user with at least the Maintainer role can merge updates to this content. For details, see https://docs.gitlab.com/ee/development/development_processes.html#development-guidelines-review.
---

# List partitioning

> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/96815) in GitLab 15.4.

## Description

Add the partitioning key column to the table you are partitioning. Include the partitioning key in the following constraints:

- The primary key.
- All foreign keys referencing the table to be partitioned.
- All unique constraints.

## Example

### Step 1 - Add partition key

Add the partitioning key column. For example, in a Rails migration:

```ruby
class AddPartitionNumberForPartitioning < Gitlab::Database::Migration[2.1]
  enable_lock_retries!

  TABLE_NAME = :table_name
  COLUMN_NAME = :partition_id
  DEFAULT_VALUE = 100

  def change
    add_column(TABLE_NAME, COLUMN_NAME, :bigint, default: DEFAULT_VALUE)
  end
end
```

### Step 2 - Create required indexes

Add indexes including the partitioning key column. For example, in a Rails migration:

```ruby
class PrepareIndexesForPartitioning < Gitlab::Database::Migration[2.1]
  disable_ddl_transaction!

  TABLE_NAME = :table_name
  INDEX_NAME = :index_name

  def up
    add_concurrent_index(TABLE_NAME, [:id, :partition_id], unique: true, name: INDEX_NAME)
  end

  def down
    remove_concurrent_index_by_name(TABLE_NAME, INDEX_NAME)
  end
end
```

### Step 3 - Enforce unique constraint

Change all unique indexes to include the partitioning key column, including the primary key index. You can start by adding a unique index on `[primary_key_column, :partition_id]`, which will be required for the next two steps. For example, in a Rails migration:

```ruby
class PrepareUniqueConstraintForPartitioning < Gitlab::Database::Migration[2.1]
  disable_ddl_transaction!
  TABLE_NAME = :table_name
  OLD_UNIQUE_INDEX_NAME = :index_name_unique
  NEW_UNIQUE_INDEX_NAME = :new_index_name

  def up
    add_concurrent_index(TABLE_NAME, [:id, :partition_id], unique: true, name: NEW_UNIQUE_INDEX_NAME)

    remove_concurrent_index_by_name(TABLE_NAME, OLD_UNIQUE_INDEX_NAME)
  end

  def down
    add_concurrent_index(TABLE_NAME, :id, unique: true, name: OLD_UNIQUE_INDEX_NAME)

    remove_concurrent_index_by_name(TABLE_NAME, NEW_UNIQUE_INDEX_NAME)
  end
end
```

### Step 4 - Enforce foreign key constraint

Enforce foreign keys including the partitioning key column. For example, in a Rails migration:

```ruby
class PrepareForeignKeyForPartitioning < Gitlab::Database::Migration[2.1]
  disable_ddl_transaction!

  SOURCE_TABLE_NAME = :source_table_name
  TARGET_TABLE_NAME = :target_table_name
  COLUMN = :foreign_key_id
  TARGET_COLUMN = :id
  FK_NAME = :fk_365d1db505_p
  PARTITION_COLUMN = :partition_id

  def up
    add_concurrent_foreign_key(
      SOURCE_TABLE_NAME,
      TARGET_TABLE_NAME,
      column: [PARTITION_COLUMN, COLUMN],
      target_column: [PARTITION_COLUMN, TARGET_COLUMN],
      validate: false,
      on_update: :cascade,
      name: FK_NAME
    )

    # This should be done in a separate post migration when dealing with a high traffic table
    validate_foreign_key(SOURCE_TABLE_NAME, [PARTITION_COLUMN, COLUMN], name: FK_NAME)
  end

  def down
    with_lock_retries do
      remove_foreign_key_if_exists(SOURCE_TABLE_NAME, name: FK_NAME)
    end
  end
end
```

The `on_update: :cascade` option is mandatory if we want the partitioning column to be updated. This cascades the update to all dependent rows. Without it, updating the partition column on the target table would result in a `Key is still referenced from table ...` error, and updating the partition column on the source table would raise a `Key is not present in table ...` error.
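A toy model (plain Ruby, not ActiveRecord or SQL) of why the cascade matters: when the referenced row's `partition_id` changes, the update must propagate to every referencing row so the composite foreign key stays valid. The names here are made up for this sketch:

```ruby
# target: { id => partition_id }. sources: rows referencing target through the
# composite key [partition_id, foreign_key_id]. Mimics ON UPDATE CASCADE for
# the partition column only.
def update_partition!(target, sources, target_id, new_partition_id)
  target[target_id] = new_partition_id
  sources.each do |row|
    row[:partition_id] = new_partition_id if row[:foreign_key_id] == target_id
  end
end

target = { 1 => 100 }
sources = [
  { foreign_key_id: 1, partition_id: 100 },
  { foreign_key_id: 2, partition_id: 100 }
]
update_partition!(target, sources, 1, 101)
# Rows referencing id 1 moved to partition 101; the row referencing id 2 is untouched.
```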
This migration can be automatically generated using:

```shell
./scripts/partitioning/generate-fk --source source_table_name --target target_table_name
```

### Step 5 - Swap primary key

Swap the primary key including the partitioning key column. This can be done only after including the partition key in all referencing foreign keys. For example, in a Rails migration:

```ruby
class PreparePrimaryKeyForPartitioning < Gitlab::Database::Migration[2.1]
  disable_ddl_transaction!

  TABLE_NAME = :table_name
  PRIMARY_KEY = :primary_key
  OLD_INDEX_NAME = :old_index_name
  NEW_INDEX_NAME = :new_index_name

  def up
    swap_primary_key(TABLE_NAME, PRIMARY_KEY, NEW_INDEX_NAME)
  end

  def down
    add_concurrent_index(TABLE_NAME, :id, unique: true, name: OLD_INDEX_NAME)
    add_concurrent_index(TABLE_NAME, [:id, :partition_id], unique: true, name: NEW_INDEX_NAME)

    unswap_primary_key(TABLE_NAME, PRIMARY_KEY, OLD_INDEX_NAME)
  end
end
```

NOTE:
Do not forget to set the primary key explicitly in your model as `ActiveRecord` does not support composite primary keys.

```ruby
class Model < ApplicationRecord
  self.primary_key = :id
end
```

### Step 6 - Create parent table and attach existing table as the initial partition

You can now create the parent table, attaching the existing table as the initial partition, by using the following helpers provided by the database team.

For example, using list partitioning in Rails post migrations:

```ruby
class PrepareTableConstraintsForListPartitioning < Gitlab::Database::Migration[2.1]
  include Gitlab::Database::PartitioningMigrationHelpers::TableManagementHelpers

  disable_ddl_transaction!
  TABLE_NAME = :table_name
  PARENT_TABLE_NAME = :p_table_name
  FIRST_PARTITION = 100
  PARTITION_COLUMN = :partition_id

  def up
    prepare_constraint_for_list_partitioning(
      table_name: TABLE_NAME,
      partitioning_column: PARTITION_COLUMN,
      parent_table_name: PARENT_TABLE_NAME,
      initial_partitioning_value: FIRST_PARTITION
    )
  end

  def down
    revert_preparing_constraint_for_list_partitioning(
      table_name: TABLE_NAME,
      partitioning_column: PARTITION_COLUMN,
      parent_table_name: PARENT_TABLE_NAME,
      initial_partitioning_value: FIRST_PARTITION
    )
  end
end
```

```ruby
class ConvertTableToListPartitioning < Gitlab::Database::Migration[2.1]
  include Gitlab::Database::PartitioningMigrationHelpers::TableManagementHelpers

  disable_ddl_transaction!

  TABLE_NAME = :table_name
  TABLE_FK = :table_references_by_fk
  PARENT_TABLE_NAME = :p_table_name
  FIRST_PARTITION = 100
  PARTITION_COLUMN = :partition_id

  def up
    convert_table_to_first_list_partition(
      table_name: TABLE_NAME,
      partitioning_column: PARTITION_COLUMN,
      parent_table_name: PARENT_TABLE_NAME,
      initial_partitioning_value: FIRST_PARTITION,
      lock_tables: [TABLE_FK, TABLE_NAME]
    )
  end

  def down
    revert_converting_table_to_first_list_partition(
      table_name: TABLE_NAME,
      partitioning_column: PARTITION_COLUMN,
      parent_table_name: PARENT_TABLE_NAME,
      initial_partitioning_value: FIRST_PARTITION
    )
  end
end
```

NOTE:
Do not forget to set the sequence name explicitly in your model because it will be owned by the routing table and `ActiveRecord` can't determine it. This can be cleaned up after the `table_name` is changed to the routing table.
```ruby
class Model < ApplicationRecord
  self.sequence_name = 'model_id_seq'
end
```

If the partitioning constraint migration takes [more than 10 minutes](../../migration_style_guide.md#how-long-a-migration-should-take) to finish, it can be made to run asynchronously to avoid running the post-migration during busy hours.

First run the migration `AsyncPrepareTableConstraintsForListPartitioning` with the `async: true` option. This change marks the partitioning constraint as `NOT VALID` and enqueues a scheduled job to validate the existing data in the table during the weekend.

Then the second post-migration `PrepareTableConstraintsForListPartitioning` only marks the partitioning constraint as validated, because the existing data was already validated during the previous weekend.

For example:

```ruby
class AsyncPrepareTableConstraintsForListPartitioning < Gitlab::Database::Migration[2.1]
  include Gitlab::Database::PartitioningMigrationHelpers::TableManagementHelpers

  disable_ddl_transaction!

  TABLE_NAME = :table_name
  PARENT_TABLE_NAME = :p_table_name
  FIRST_PARTITION = 100
  PARTITION_COLUMN = :partition_id

  def up
    prepare_constraint_for_list_partitioning(
      table_name: TABLE_NAME,
      partitioning_column: PARTITION_COLUMN,
      parent_table_name: PARENT_TABLE_NAME,
      initial_partitioning_value: FIRST_PARTITION,
      async: true
    )
  end

  def down
    revert_preparing_constraint_for_list_partitioning(
      table_name: TABLE_NAME,
      partitioning_column: PARTITION_COLUMN,
      parent_table_name: PARENT_TABLE_NAME,
      initial_partitioning_value: FIRST_PARTITION
    )
  end
end
```

### Step 7 - Re-point foreign keys to parent table

The tables that reference the initial partition must now be updated to point to the parent table. Without this change, records in those tables cannot locate rows in the next partitions, because they look for them only in the initial partition.
+

Steps:

- Add the foreign key to the partitioned table and validate it asynchronously,
  [for example](https://gitlab.com/gitlab-org/gitlab/-/blob/65d63f6a00196c3a7d59f15191920f271ab2b145/db/post_migrate/20230524135543_replace_ci_build_pending_states_foreign_key.rb).
- Validate it synchronously after the asynchronous validation has completed on GitLab.com,
  [for example](https://gitlab.com/gitlab-org/gitlab/-/blob/65d63f6a00196c3a7d59f15191920f271ab2b145/db/post_migrate/20230530140456_validate_fk_ci_build_pending_states_p_ci_builds.rb).
- Remove the old foreign key and rename the new one to the old name,
  [for example](https://gitlab.com/gitlab-org/gitlab/-/blob/65d63f6a00196c3a7d59f15191920f271ab2b145/db/post_migrate/20230615083713_replace_old_fk_ci_build_pending_states_to_builds.rb#L9).

### Step 8 - Ensure ID uniqueness across partitions

Because all uniqueness constraints must include the partitioning key, duplicate
IDs across partitions are possible. To solve this, we enforce that only the
database can set ID values, using a sequence to generate them, because sequences
are guaranteed to generate unique values.

For example:

```ruby
class EnsureIdUniquenessForPCiBuilds < Gitlab::Database::Migration[2.1]
  include Gitlab::Database::PartitioningMigrationHelpers::UniquenessHelpers

  enable_lock_retries!

  TABLE_NAME = :p_ci_builds
  FUNCTION_NAME = :assign_p_ci_builds_id_value

  def up
    ensure_unique_id(TABLE_NAME)
  end

  def down
    execute(<<~SQL.squish)
      ALTER TABLE #{TABLE_NAME}
        ALTER COLUMN id SET DEFAULT nextval('ci_builds_id_seq'::regclass);

      DROP FUNCTION IF EXISTS #{FUNCTION_NAME} CASCADE;
    SQL
  end
end
```

### Step 9 - Analyze the partitioned table and create new partitions

The autovacuum daemon does not process partitioned tables. It is necessary to
periodically run a manual `ANALYZE` to keep the statistics of the table hierarchy
up to date.
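Conceptually, the periodic task just issues `ANALYZE` for the parent table and each
attached partition. A minimal sketch of the statement generation (the table names
are illustrative; in GitLab this is handled by the partitioning framework rather
than hand-written code):

```ruby
# Sketch: build ANALYZE statements for a partitioned table hierarchy.
# Table names here are illustrative only.
def analyze_statements(parent_table, partitions)
  ([parent_table] + partitions).map { |table| "ANALYZE #{table};" }
end

analyze_statements('p_ci_builds', %w[ci_builds gitlab_partitions_dynamic.ci_builds_101])
# => ["ANALYZE p_ci_builds;", "ANALYZE ci_builds;", "ANALYZE gitlab_partitions_dynamic.ci_builds_101;"]
```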
+ +Models that implement `Ci::Partitionable` with `partitioned: true` option are +analyzed by default on a weekly basis. To enable this and create new partitions +you need to register the model in the [PostgreSQL initializer](https://gitlab.com/gitlab-org/gitlab/-/blob/b7f0e3f1bcd2ffc220768bbc373364151775ca8e/config/initializers/postgres_partitioning.rb). + +### Step 10 - Update the application to use the partitioned table + +Now that the parent table is ready, we can update the application to use it: + +```ruby +class Model < ApplicationRecord + self.table_name = :partitioned_table +end +``` + +Depending on the model, it might be safer to use a [change management issue](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16387). diff --git a/doc/development/database/table_partitioning.md b/doc/development/database/table_partitioning.md index cb159a404fd..b82bb11f662 100644 --- a/doc/development/database/table_partitioning.md +++ b/doc/development/database/table_partitioning.md @@ -1,704 +1,11 @@ --- -stage: Data Stores -group: Database -info: Any user with at least the Maintainer role can merge updates to this content. For details, see https://docs.gitlab.com/ee/development/development_processes.html#development-guidelines-review. +redirect_to: 'partitioning/index.md' +remove_date: '2024-04-16' --- -# Database table partitioning +This document was moved to [another location](partitioning/index.md). -WARNING: -If you have questions not answered below, check for and add them -to [this issue](https://gitlab.com/gitlab-org/gitlab/-/issues/398650). -Tag `@gitlab-org/database-team/triage` and we'll get back to you with an -answer as soon as possible. If you get an answer in Slack, document -it on the issue as well so we can update this document in the future. - -Table partitioning is a powerful database feature that allows a table's -data to be split into smaller physical tables that act as a single large -table. 
If the application is designed to work with partitioning in mind,
-there can be multiple benefits, such as:
-
-- Query performance can be improved greatly, because the database can
-  cheaply eliminate much of the data from the search space, while still
-  providing full SQL capabilities.
-
-- Bulk deletes can be achieved with minimal impact on the database by
-  dropping entire partitions. This is a natural fit for features that need
-  to periodically delete data that falls outside the retention window.
-
-- Administrative tasks like `VACUUM` and index rebuilds can operate on
-  individual partitions, rather than across a single massive table.
-
-Unfortunately, not all models fit a partitioning scheme, and there are
-significant drawbacks if implemented incorrectly. Additionally, tables
-can only be partitioned at their creation, making it nontrivial to apply
-partitioning to a busy database. A suite of migration tools is available
-to enable backend developers to partition existing tables, but the
-migration process is rather heavy, taking multiple steps split across
-several releases. Due to the limitations of partitioning and the related
-migrations, you should understand how partitioning fits your use case
-before attempting to leverage this feature.
-
-## Determine when to use partitioning
-
-While partitioning can be very useful when properly applied, it's
-imperative to identify if the data and workload of a table naturally fit a
-partitioning scheme. Understand a few details to decide if partitioning
-is a good fit for your particular problem:
-
-- **Table partitioning**. A table is partitioned on a partition key, which is a
-  column or set of columns that determines how the data is split across the
-  partitions. The partition key is used by the database when reading or
-  writing data, to decide which partitions must be accessed. The
-  partition key should be a column that would be included in a `WHERE`
-  clause on almost all queries accessing that table.
- -- **How the data is split**. What strategy does the database use - to split the data across the partitions? The available choices are `range`, - `hash`, and `list`. - -## Determine the appropriate partitioning strategy - -The available partitioning strategy choices are `range`, `hash`, and `list`. - -### Range partitioning - -The scheme best supported by the GitLab migration helpers is date-range partitioning, -where each partition in the table contains data for a single month. In this case, -the partitioning key must be a timestamp or date column. For this type of -partitioning to work well, most queries must access data in a -certain date range. - -For a more concrete example, consider using the `audit_events` table. -It was the first table to be partitioned in the application database -(scheduled for deployment with the GitLab 13.5 release). This -table tracks audit entries of security events that happen in the -application. In almost all cases, users want to see audit activity that -occurs in a certain time frame. As a result, date-range partitioning -was a natural fit for how the data would be accessed. 
- -To look at this in more detail, imagine a simplified `audit_events` schema: - -```sql -CREATE TABLE audit_events ( - id SERIAL NOT NULL PRIMARY KEY, - author_id INT NOT NULL, - details jsonb NOT NULL, - created_at timestamptz NOT NULL); -``` - -Now imagine typical queries in the UI would display the data in a -certain date range, like a single week: - -```sql -SELECT * -FROM audit_events -WHERE created_at >= '2020-01-01 00:00:00' - AND created_at < '2020-01-08 00:00:00' -ORDER BY created_at DESC -LIMIT 100 -``` - -If the table is partitioned on the `created_at` column the base table would -look like: - -```sql -CREATE TABLE audit_events ( - id SERIAL NOT NULL, - author_id INT NOT NULL, - details jsonb NOT NULL, - created_at timestamptz NOT NULL, - PRIMARY KEY (id, created_at)) -PARTITION BY RANGE(created_at); -``` - -NOTE: -The primary key of a partitioned table must include the partition key as -part of the primary key definition. - -And we might have a list of partitions for the table, such as: - -```sql -audit_events_202001 FOR VALUES FROM ('2020-01-01') TO ('2020-02-01') -audit_events_202002 FOR VALUES FROM ('2020-02-01') TO ('2020-03-01') -audit_events_202003 FOR VALUES FROM ('2020-03-01') TO ('2020-04-01') -``` - -Each partition is a separate physical table, with the same structure as -the base `audit_events` table, but contains only data for rows where the -partition key falls in the specified range. For example, the partition -`audit_events_202001` contains rows where the `created_at` column is -greater than or equal to `2020-01-01` and less than `2020-02-01`. - -Now, if we look at the previous example query again, the database can -use the `WHERE` to recognize that all matching rows are in the -`audit_events_202001` partition. Rather than searching all of the data -in all of the partitions, it can search only the single month's worth -of data in the appropriate partition. 
In a large table, this can -dramatically reduce the amount of data the database needs to access. -However, imagine a query that does not filter based on the partitioning -key, such as: - -```sql -SELECT * -FROM audit_events -WHERE author_id = 123 -ORDER BY created_at DESC -LIMIT 100 -``` - -In this example, the database can't prune any partitions from the search, -because matching data could exist in any of them. As a result, it has to -query each partition individually, and aggregate the rows into a single result -set. Because `author_id` would be indexed, the performance impact could -likely be acceptable, but on more complex queries the overhead can be -substantial. Partitioning should only be leveraged if the access patterns -of the data support the partitioning strategy, otherwise performance -suffers. - -### Hash Partitioning - -Hash partitioning splits a logical table into a series of partitioned -tables. Each partition corresponds to the ID range that matches -a hash and remainder. For example, if partitioning `BY HASH(id)`, rows -with `hash(id) % 64 == 1` would end up in the partition -`WITH (MODULUS 64, REMAINDER 1)`. - -When hash partitioning, you must include a `WHERE hashed_column = ?` condition in -every performance-sensitive query issued by the application. If this is not possible, -hash partitioning may not be the correct fit for your use case. - -Hash partitioning has one main advantage: it is the only type of partitioning that -can enforce uniqueness on a single numeric `id` column. (While also possible with -range partitioning, it's rarely the correct choice). - -Hash partitioning has downsides: - -- The number of partitions must be known up-front. -- It's difficult to move new data to an extra partition if current partitions become too large. -- Range queries, such as `WHERE id BETWEEN ? and ?`, are unsupported. -- Lookups by other keys, such as `WHERE other_id = ?`, are unsupported. 
-
-For this reason, it's often best to choose a large number of hash partitions to accommodate future table growth.
-
-## Partitioning a table (Range)
-
-Unfortunately, tables can only be partitioned at their creation, making
-it nontrivial to apply to a busy database. A suite of migration
-tools has been developed to enable backend developers to partition
-existing tables. This migration process takes multiple steps which must
-be split across several releases.
-
-### Caveats
-
-The partitioning migration helpers work by creating a partitioned duplicate
-of the original table and using a combination of a trigger and a background
-migration to copy data into the new table. Changes to the original table
-schema can be made in parallel with the partitioning migration, but care
-must be taken not to break the underlying mechanism that makes the migration
-work. For example, if a column is added to the table that is being
-partitioned, both the partitioned table and the trigger definition must
-be updated to match.
-
-### Step 1: Creating the partitioned copy (Release N)
-
-The first step is to add a migration to create the partitioned copy of
-the original table. This migration creates the appropriate
-partitions based on the data in the original table, and installs a
-trigger that syncs writes from the original table into the
-partitioned copy.
-
-An example migration of partitioning the `audit_events` table by its
-`created_at` column would look like:
-
-```ruby
-class PartitionAuditEvents < Gitlab::Database::Migration[2.1]
-  include Gitlab::Database::PartitioningMigrationHelpers
-
-  def up
-    partition_table_by_date :audit_events, :created_at
-  end
-
-  def down
-    drop_partitioned_table_for :audit_events
-  end
-end
-```
-
-After this has executed, any inserts, updates, or deletes in the
-original table are also duplicated in the new table. For updates and
-deletes, the operation only has an effect if the corresponding row
-exists in the partitioned table.
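To make these semantics concrete, the trigger's behavior can be simulated with a
toy in-memory model (plain Ruby hashes standing in for the two tables; an
illustration of the sync rules above, not the actual PostgreSQL trigger):

```ruby
# Toy model of the sync trigger: inserts are mirrored into the partitioned
# copy, while updates and deletes only take effect there if the row was
# already copied.
original = {}
partitioned = {}

sync_insert = ->(id, row) { original[id] = row; partitioned[id] = row }
sync_update = ->(id, row) { original[id] = row; partitioned[id] = row if partitioned.key?(id) }
sync_delete = ->(id) { original.delete(id); partitioned.delete(id) }

original[1] = { author_id: 10 }          # pre-existing row, not yet backfilled
sync_update.call(1, { author_id: 11 })   # updates the original only
sync_insert.call(2, { author_id: 20 })   # mirrored into both tables
sync_delete.call(3)                      # deleting a row absent from the copy is a no-op

partitioned.keys # => [2]
```

Rows that predate the trigger (like row `1` above) reach the partitioned copy
only through the backfill described in the next step.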
-
-### Step 2: Backfill the partitioned copy (Release N)
-
-The second step is to add a post-deployment migration that schedules
-the background jobs that backfill existing data from the original table
-into the partitioned copy.
-
-Continuing the above example, the migration would look like:
-
-```ruby
-class BackfillPartitionAuditEvents < Gitlab::Database::Migration[2.1]
-  include Gitlab::Database::PartitioningMigrationHelpers
-
-  disable_ddl_transaction!
-
-  restrict_gitlab_migration gitlab_schema: :gitlab_main
-
-  def up
-    enqueue_partitioning_data_migration :audit_events
-  end
-
-  def down
-    cleanup_partitioning_data_migration :audit_events
-  end
-end
-```
-
-This step [queues a batched background migration](batched_background_migrations.md#enqueue-a-batched-background-migration) internally, with `BATCH_SIZE` and `SUB_BATCH_SIZE` set to `50,000` and `2,500`. Refer to the [batched background migrations guide](batched_background_migrations.md) for more details.
-
-### Step 3: Post-backfill cleanup (Release N+1)
-
-This step must occur at least one release after the release that
-includes step (2). This gives time for the background
-migration to execute properly in self-managed installations. In this step,
-add another post-deployment migration that cleans up after the
-background migration. This includes forcing any remaining jobs to
-execute, and copying any data that may have been missed due to dropped or
-failed jobs.
-
-Once again, continuing the example, this migration would look like:
-
-```ruby
-class CleanupPartitionedAuditEventsBackfill < Gitlab::Database::Migration[2.1]
-  include Gitlab::Database::PartitioningMigrationHelpers
-
-  disable_ddl_transaction!
-
-  restrict_gitlab_migration gitlab_schema: :gitlab_main
-
-  def up
-    finalize_backfilling_partitioned_table :audit_events
-  end
-
-  def down
-    # no op
-  end
-end
-```
-
-After this migration completes, the original table and partitioned
-table should contain identical data.
The trigger installed on the
-original table guarantees that the data remains in sync going forward.
-
-### Step 4: Swap the partitioned and non-partitioned tables (Release N+1)
-
-This step replaces the non-partitioned table with its partitioned copy. It should
-be run only after all other migration steps have completed successfully.
-
-Some limitations to this method MUST be handled before, or during, the swap migration:
-
-- Secondary indexes and foreign keys are not automatically recreated on the partitioned table.
-- Some types of constraints (`UNIQUE` and `EXCLUDE`) that rely on indexes are not automatically recreated
-  on the partitioned table, since the underlying index will not be present.
-- Foreign keys referencing the original non-partitioned table should be updated to reference the
-  partitioned table. This is not supported in PostgreSQL 11.
-- Views referencing the original table are not automatically updated to reference the partitioned table.
-
-```ruby
-# frozen_string_literal: true
-
-class SwapPartitionedAuditEvents < ActiveRecord::Migration[6.0]
-  include Gitlab::Database::PartitioningMigrationHelpers
-
-  def up
-    replace_with_partitioned_table :audit_events
-  end
-
-  def down
-    rollback_replace_with_partitioned_table :audit_events
-  end
-end
-```
-
-After this migration completes:
-
-- The partitioned table replaces the non-partitioned (original) table.
-- The sync trigger created earlier is dropped.
-
-The partitioned table is now ready for use by the application.
-
-## Partitioning a table (Hash)
-
-Hash partitioning divides data into partitions based on a hash of their ID.
-It works well only if most queries against the table include a clause like `WHERE id = ?`,
-so that PostgreSQL can decide which partition to look in based on the ID or IDs being requested.
-
-Another key downside is that hash partitioning does not allow adding additional partitions after table creation.
-The correct number of partitions must be chosen up-front.
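The routing can be sketched in a few lines of Ruby (a simple multiplicative hash
stands in for PostgreSQL's internal hash function, and the partition naming is
hypothetical):

```ruby
MODULUS = 64

# Illustrative stand-in for PostgreSQL's internal hash function.
def hash_remainder(id)
  (id * 2_654_435_761) % (2**32) % MODULUS
end

# A `WHERE id = ?` query is routed to exactly one partition, the one
# created WITH (MODULUS 64, REMAINDER n).
def partition_for(id)
  "hashed_table_#{hash_remainder(id)}"
end

# A range such as `WHERE id BETWEEN 1 AND 5` cannot be pruned: consecutive
# IDs hash to scattered remainders, so every partition may have to be scanned.
(1..5).map { |id| hash_remainder(id) } # => [49, 34, 19, 4, 53]
```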
- -Hash partitioning is the only type of partitioning (aside from some complex uses of list partitioning) that can guarantee -uniqueness of an ID across multiple partitions at the database level. - -## Partitioning a table (List) - -> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/96815) in GitLab 15.4. - -Add the partitioning key column to the table you are partitioning. -Include the partitioning key in the following constraints: - -- The primary key. -- All foreign keys referencing the table to be partitioned. -- All unique constraints. - -### Step 1 - Add partition key - -Add the partitioning key column. For example, in a rails migration: - -```ruby -class AddPartitionNumberForPartitioning < Gitlab::Database::Migration[2.1] - enable_lock_retries! - - TABLE_NAME = :table_name - COLUMN_NAME = :partition_id - DEFAULT_VALUE = 100 - - def change - add_column(TABLE_NAME, COLUMN_NAME, :bigint, default: 100) - end -end -``` - -### Step 2 - Create required indexes - -Add indexes including the partitioning key column. For example, in a rails migration: - -```ruby -class PrepareIndexesForPartitioning < Gitlab::Database::Migration[2.1] - disable_ddl_transaction! - - TABLE_NAME = :table_name - INDEX_NAME = :index_name - - def up - add_concurrent_index(TABLE_NAME, [:id, :partition_id], unique: true, name: INDEX_NAME) - end - - def down - remove_concurrent_index_by_name(TABLE_NAME, INDEX_NAME) - end -end -``` - -### Step 3 - Enforce unique constraint - -Change all unique indexes to include the partitioning key column, -including the primary key index. You can start by adding an unique -index on `[primary_key_column, :partition_id]`, which will be -required for the next two steps. For example, in a rails migration: - -```ruby -class PrepareUniqueContraintForPartitioning < Gitlab::Database::Migration[2.1] - disable_ddl_transaction! 
-
-  TABLE_NAME = :table_name
-  OLD_UNIQUE_INDEX_NAME = :index_name_unique
-  NEW_UNIQUE_INDEX_NAME = :new_index_name
-
-  def up
-    add_concurrent_index(TABLE_NAME, [:id, :partition_id], unique: true, name: NEW_UNIQUE_INDEX_NAME)
-
-    remove_concurrent_index_by_name(TABLE_NAME, OLD_UNIQUE_INDEX_NAME)
-  end
-
-  def down
-    add_concurrent_index(TABLE_NAME, :id, unique: true, name: OLD_UNIQUE_INDEX_NAME)
-
-    remove_concurrent_index_by_name(TABLE_NAME, NEW_UNIQUE_INDEX_NAME)
-  end
-end
-```
-
-### Step 4 - Enforce foreign key constraint
-
-Enforce foreign keys including the partitioning key column. For example, in a rails migration:
-
-```ruby
-class PrepareForeignKeyForPartitioning < Gitlab::Database::Migration[2.1]
-  disable_ddl_transaction!
-
-  SOURCE_TABLE_NAME = :source_table_name
-  TARGET_TABLE_NAME = :target_table_name
-  COLUMN = :foreign_key_id
-  TARGET_COLUMN = :id
-  FK_NAME = :fk_365d1db505_p
-  PARTITION_COLUMN = :partition_id
-
-  def up
-    add_concurrent_foreign_key(
-      SOURCE_TABLE_NAME,
-      TARGET_TABLE_NAME,
-      column: [PARTITION_COLUMN, COLUMN],
-      target_column: [PARTITION_COLUMN, TARGET_COLUMN],
-      validate: false,
-      on_update: :cascade,
-      name: FK_NAME
-    )
-
-    # This should be done in a separate post migration when dealing with a high traffic table
-    validate_foreign_key(SOURCE_TABLE_NAME, [PARTITION_COLUMN, COLUMN], name: FK_NAME)
-  end
-
-  def down
-    with_lock_retries do
-      remove_foreign_key_if_exists(SOURCE_TABLE_NAME, name: FK_NAME)
-    end
-  end
-end
-```
-
-The `on_update: :cascade` option is mandatory if we want the partitioning column
-to be updated. This cascades the update to all dependent rows. Without
-specifying it, updating the partition column on the target table would
-result in a `Key is still referenced from table ...` error, and updating the
-partition column on the source table would raise a
-`Key is not present in table ...` error.
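The cascade behavior can be illustrated with a toy in-memory model (plain Ruby,
not real SQL; rows are hashes and the composite key is `[partition_id, id]`):

```ruby
# Toy model: source rows reference target rows through the composite
# foreign key [partition_id, foreign_key_id].
target = [{ partition_id: 100, id: 1 }]
source = [{ partition_id: 100, foreign_key_id: 1 }]

# Simulates `UPDATE target SET partition_id = 101 WHERE id = 1` under
# ON UPDATE CASCADE: the referencing rows move in the same step, so the
# composite foreign key never points at a missing (101, 1) row.
def cascade_update(target, source, id:, new_partition:)
  target.each { |row| row[:partition_id] = new_partition if row[:id] == id }
  source.each { |row| row[:partition_id] = new_partition if row[:foreign_key_id] == id }
end

cascade_update(target, source, id: 1, new_partition: 101)

source.first # => { partition_id: 101, foreign_key_id: 1 }
```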
-
-This migration can be automatically generated using:
-
-```shell
-./scripts/partitioning/generate-fk --source source_table_name --target target_table_name
-```
-
-### Step 5 - Swap primary key
-
-Swap the primary key including the partitioning key column. This can be done only
-after the partition key has been included in all referencing foreign keys. For
-example, in a rails migration:
-
-```ruby
-class PreparePrimaryKeyForPartitioning < Gitlab::Database::Migration[2.1]
-  disable_ddl_transaction!
-
-  TABLE_NAME = :table_name
-  PRIMARY_KEY = :primary_key
-  OLD_INDEX_NAME = :old_index_name
-  NEW_INDEX_NAME = :new_index_name
-
-  def up
-    swap_primary_key(TABLE_NAME, PRIMARY_KEY, NEW_INDEX_NAME)
-  end
-
-  def down
-    add_concurrent_index(TABLE_NAME, :id, unique: true, name: OLD_INDEX_NAME)
-    add_concurrent_index(TABLE_NAME, [:id, :partition_id], unique: true, name: NEW_INDEX_NAME)
-
-    unswap_primary_key(TABLE_NAME, PRIMARY_KEY, OLD_INDEX_NAME)
-  end
-end
-```
-
-NOTE:
-Do not forget to set the primary key explicitly in your model, as `ActiveRecord` does not support composite primary keys.
-
-```ruby
-class Model < ApplicationRecord
-  self.primary_key = :id
-end
-```
-
-### Step 6 - Create parent table and attach existing table as the initial partition
-
-You can now create the parent table, attaching the existing table as the initial
-partition, by using the following helpers provided by the database team.
-
-For example, using list partitioning in Rails post migrations:
-
-```ruby
-class PrepareTableConstraintsForListPartitioning < Gitlab::Database::Migration[2.1]
-  include Gitlab::Database::PartitioningMigrationHelpers::TableManagementHelpers
-
-  disable_ddl_transaction!
- - TABLE_NAME = :table_name - PARENT_TABLE_NAME = :p_table_name - FIRST_PARTITION = 100 - PARTITION_COLUMN = :partition_id - - def up - prepare_constraint_for_list_partitioning( - table_name: TABLE_NAME, - partitioning_column: PARTITION_COLUMN, - parent_table_name: PARENT_TABLE_NAME, - initial_partitioning_value: FIRST_PARTITION - ) - end - - def down - revert_preparing_constraint_for_list_partitioning( - table_name: TABLE_NAME, - partitioning_column: PARTITION_COLUMN, - parent_table_name: PARENT_TABLE_NAME, - initial_partitioning_value: FIRST_PARTITION - ) - end -end -``` - -```ruby -class ConvertTableToListPartitioning < Gitlab::Database::Migration[2.1] - include Gitlab::Database::PartitioningMigrationHelpers::TableManagementHelpers - - disable_ddl_transaction! - - TABLE_NAME = :table_name - TABLE_FK = :table_references_by_fk - PARENT_TABLE_NAME = :p_table_name - FIRST_PARTITION = 100 - PARTITION_COLUMN = :partition_id - - def up - convert_table_to_first_list_partition( - table_name: TABLE_NAME, - partitioning_column: PARTITION_COLUMN, - parent_table_name: PARENT_TABLE_NAME, - initial_partitioning_value: FIRST_PARTITION, - lock_tables: [TABLE_FK, TABLE_NAME] - ) - end - - def down - revert_converting_table_to_first_list_partition( - table_name: TABLE_NAME, - partitioning_column: PARTITION_COLUMN, - parent_table_name: PARENT_TABLE_NAME, - initial_partitioning_value: FIRST_PARTITION - ) - end -end -``` - -NOTE: -Do not forget to set the sequence name explicitly in your model because it will -be owned by the routing table and `ActiveRecord` can't determine it. This can -be cleaned up after the `table_name` is changed to the routing table. - -```ruby -class Model < ApplicationRecord - self.sequence_name = 'model_id_seq' -end -``` - -If the partitioning constraint migration takes [more than 10 minutes](../migration_style_guide.md#how-long-a-migration-should-take) to finish, -it can be made to run asynchronously to avoid running the post-migration during busy hours. 
- -Prepend the following migration `AsyncPrepareTableConstraintsForListPartitioning` -and use `async: true` option. This change marks the partitioning constraint as `NOT VALID` -and enqueues a scheduled job to validate the existing data in the table during the weekend. - -Then the second post-migration `PrepareTableConstraintsForListPartitioning` only -marks the partitioning constraint as validated, because the existing data is already -tested during the previous weekend. - -For example: - -```ruby -class AsyncPrepareTableConstraintsForListPartitioning < Gitlab::Database::Migration[2.1] - include Gitlab::Database::PartitioningMigrationHelpers::TableManagementHelpers - - disable_ddl_transaction! - - TABLE_NAME = :table_name - PARENT_TABLE_NAME = :p_table_name - FIRST_PARTITION = 100 - PARTITION_COLUMN = :partition_id - - def up - prepare_constraint_for_list_partitioning( - table_name: TABLE_NAME, - partitioning_column: PARTITION_COLUMN, - parent_table_name: PARENT_TABLE_NAME, - initial_partitioning_value: FIRST_PARTITION, - async: true - ) - end - - def down - revert_preparing_constraint_for_list_partitioning( - table_name: TABLE_NAME, - partitioning_column: PARTITION_COLUMN, - parent_table_name: PARENT_TABLE_NAME, - initial_partitioning_value: FIRST_PARTITION - ) - end -end -``` - -### Step 7 - Re-point foreign keys to parent table - -The tables that reference the initial partition must be updated to point to the -parent table now. Without this change, the records from those tables -will not be able to locate the rows in the next partitions because they will look -for them in the initial partition. - -Steps: - -- Add the foreign key to the partitioned table and validate it asynchronously, - [for example](https://gitlab.com/gitlab-org/gitlab/-/blob/65d63f6a00196c3a7d59f15191920f271ab2b145/db/post_migrate/20230524135543_replace_ci_build_pending_states_foreign_key.rb). 
-
-- Validate it synchronously after the asynchronous validation has completed on GitLab.com,
-  [for example](https://gitlab.com/gitlab-org/gitlab/-/blob/65d63f6a00196c3a7d59f15191920f271ab2b145/db/post_migrate/20230530140456_validate_fk_ci_build_pending_states_p_ci_builds.rb).
-- Remove the old foreign key and rename the new one to the old name,
-  [for example](https://gitlab.com/gitlab-org/gitlab/-/blob/65d63f6a00196c3a7d59f15191920f271ab2b145/db/post_migrate/20230615083713_replace_old_fk_ci_build_pending_states_to_builds.rb#L9).
-
-### Step 8 - Ensure ID uniqueness across partitions
-
-Because all uniqueness constraints must include the partitioning key, duplicate
-IDs across partitions are possible. To solve this, we enforce that only the
-database can set ID values, using a sequence to generate them, because sequences
-are guaranteed to generate unique values.
-
-For example:
-
-```ruby
-class EnsureIdUniquenessForPCiBuilds < Gitlab::Database::Migration[2.1]
-  include Gitlab::Database::PartitioningMigrationHelpers::UniquenessHelpers
-
-  enable_lock_retries!
-
-  TABLE_NAME = :p_ci_builds
-  FUNCTION_NAME = :assign_p_ci_builds_id_value
-
-  def up
-    ensure_unique_id(TABLE_NAME)
-  end
-
-  def down
-    execute(<<~SQL.squish)
-      ALTER TABLE #{TABLE_NAME}
-        ALTER COLUMN id SET DEFAULT nextval('ci_builds_id_seq'::regclass);
-
-      DROP FUNCTION IF EXISTS #{FUNCTION_NAME} CASCADE;
-    SQL
-  end
-end
-```
-
-### Step 9 - Analyze the partitioned table and create new partitions
-
-The autovacuum daemon does not process partitioned tables. It is necessary to
-periodically run a manual `ANALYZE` to keep the statistics of the table hierarchy
-up to date.
-
-Models that implement `Ci::Partitionable` with the `partitioned: true` option are
-analyzed by default on a weekly basis.
To enable this and create new partitions -you need to register the model in the [PostgreSQL initializer](https://gitlab.com/gitlab-org/gitlab/-/blob/b7f0e3f1bcd2ffc220768bbc373364151775ca8e/config/initializers/postgres_partitioning.rb). - -### Step 10 - Update the application to use the partitioned table - -Now that the parent table is ready, we can update the application to use it: - -```ruby -class Model < ApplicationRecord - self.table_name = :partitioned_table -end -``` - -Depending on the model, it might be safer to use a [change management issue](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16387). +<!-- This redirect file can be deleted after <2024-04-16>. --> +<!-- Redirects that point to other docs in the same project expire in three months. --> +<!-- Redirects that point to docs in a different project or site (link is not relative and starts with `https:`) expire in one year. --> +<!-- Before deletion, see: https://docs.gitlab.com/ee/development/documentation/redirects.html --> |