Diffstat (limited to 'doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md')
-rw-r--r--  doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md  202
1 file changed, 179 insertions, 23 deletions
diff --git a/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md b/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md
index 868dae4fc6c..baec14e3f0f 100644
--- a/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md
+++ b/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md
@@ -74,7 +74,12 @@ violates our [principle of 100 GB max size](../database_scaling/size-limits.md).
We also want to [build alerting](https://gitlab.com/gitlab-com/gl-infra/tamland/-/issues/5)
to notify us when this number is exceeded.
-We’ve seen numerous S1 and S2 database-related production environment
+Large SQL tables increase index maintenance time, during which freshly deleted tuples
+cannot be cleaned by `autovacuum`. This highlights the need for small tables.
+We will measure how much bloat we accumulate when (re)indexing huge tables. Based on this
+analysis, we will be able to set up an SLO (dead tuples / bloat) associated with (re)indexing.
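+
+To keep an eye on this, we could track dead tuples with a query like the
+one below (a sketch using the standard `pg_stat_user_tables` statistics view):
+
+```sql
+-- Dead tuples awaiting vacuum on one of the large CI tables:
+SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
+FROM pg_stat_user_tables
+WHERE relname = 'ci_builds_metadata';
+```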
+
+We've seen numerous S1 and S2 database-related production environment
incidents over the last couple of months, for example:
- S1: 2022-03-17 [Increase in writes in `ci_builds` table](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6625)
@@ -130,7 +135,7 @@ remaining database tables when it becomes necessary.
It is also important to avoid large data migrations. We store almost 6
terabytes of data in the biggest CI/CD tables, in many different columns and
indexes. Migrating this amount of data might be challenging and could cause
-instability in the production environment. Due to this concern, we’ve developed
+instability in the production environment. Due to this concern, we've developed
a way to attach an existing database table as a partition zero without downtime
and excessive database locking, which has been demonstrated in one of the
[first proofs of concept](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/80186).
@@ -145,7 +150,7 @@ Our plan is to use logical partition IDs. We want to start with the
`ci_pipelines` table and create a `partition_id` column with a `DEFAULT` value
of `100` or `1000`. Using a `DEFAULT` value avoids the challenge of backfilling
this value for every row. Adding a `CHECK` constraint prior to attaching the
-first partition tells PostgreSQL that we’ve already ensured consistency and
+first partition tells PostgreSQL that we've already ensured consistency and
there is no need to check it while holding an exclusive table lock when
attaching this table as a partition to the routing table (partitioned schema
definition). We will increment this value every time we create a new partition
@@ -159,21 +164,32 @@ and artifacts, will share the same value. We want to add the `partition_id`
column into all 6 problematic tables because we can avoid backfilling this data
when we decide it is time to start partitioning them.
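+
+Adding the column with a `DEFAULT` could look like this (a sketch; since
+PostgreSQL 11, adding a column with a constant default does not rewrite the
+table, so existing rows need no backfill):
+
+```sql
+ALTER TABLE ci_pipelines
+  ADD COLUMN partition_id bigint NOT NULL DEFAULT 100;
+```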
-We want to partition CI/CD data iteratively, so we will start with the
-pipelines table, and create at least one, but likely two, partitions. The
-pipelines table will be partitioned using the `LIST` partitioning strategy. It
-is possible that, after some time, `p_ci_pipelines` will store data in two
-partitions with IDs of `100` and `101`. Then we will try partitioning
-`ci_builds`. Therefore we might want to use `RANGE` partitioning in
-`p_ci_builds` with IDs `100` and `101`, because builds for the two logical
-partitions used will still be stored in a single table.
+We want to partition CI/CD data iteratively. We plan to start with the
+`ci_builds_metadata` table, because it is the fastest growing table in the CI
+database and we want to contain this rapid growth. This table also has the
+simplest access patterns - a row is read when a build is exposed to a runner,
+and the other access patterns are relatively simple too. Starting with
+`p_ci_builds_metadata` will allow us to achieve tangible and quantifiable
+results earlier, and will establish a pattern that makes partitioning the
+largest tables possible. We will partition builds metadata using the `LIST`
+partitioning strategy.
+
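+The first iteration could look roughly like this (a minimal sketch with
+simplified columns, not a final schema):
+
+```sql
+-- Routing table, partitioned by the logical partition_id.
+CREATE TABLE p_ci_builds_metadata (
+  id bigint NOT NULL,
+  build_id bigint,
+  partition_id bigint NOT NULL DEFAULT 100
+) PARTITION BY LIST (partition_id);
+
+-- Validating a CHECK constraint ahead of time lets ATTACH PARTITION skip
+-- the full-table scan it would otherwise run under an ACCESS EXCLUSIVE lock.
+ALTER TABLE ci_builds_metadata
+  ADD CONSTRAINT partitioning_check CHECK (partition_id = 100) NOT VALID;
+ALTER TABLE ci_builds_metadata VALIDATE CONSTRAINT partitioning_check;
+
+-- The existing table becomes partition zero of the routing table.
+ALTER TABLE p_ci_builds_metadata
+  ATTACH PARTITION ci_builds_metadata FOR VALUES IN (100);
+```
+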
+Once we have many partitions attached to `p_ci_builds_metadata`, with many
+`partition_ids` in use, we will choose another CI table to partition next. In
+that case we might want to use `RANGE` partitioning for that next table,
+because `p_ci_builds_metadata` will already have many physical partitions, and
+therefore many logical `partition_ids` will be in use by that time. For
+example, if we choose `ci_builds` as the next partitioning candidate, it will
+already have many different values stored in `ci_builds.partition_id` by the
+time we partition it. Using `RANGE` partitioning in that case might be easier.
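+
+For illustration, a single `RANGE` partition can then cover many logical
+partitions at once (a hypothetical sketch, not a committed design):
+
+```sql
+CREATE TABLE p_ci_builds (
+  id bigint NOT NULL,
+  partition_id bigint NOT NULL
+) PARTITION BY RANGE (partition_id);
+
+-- Builds for logical partitions 100 through 199 share one physical table.
+CREATE TABLE ci_builds_p100 PARTITION OF p_ci_builds
+  FOR VALUES FROM (100) TO (200);
+```
+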
Physical partitioning and logical partitioning will be separated, and a
-strategy will be determined when we implement partitioning for the respective
-database tables. Using `RANGE` partitioning works similarly to using `LIST`
-partitioning in database tables other than `ci_pipelines`, but because we can
-guarantee continuity of `partition_id` values, using `RANGE` partitioning might
-be a better strategy.
+strategy will be determined when we implement physical partitioning for the
+respective database tables. Using `RANGE` partitioning works similarly to
+using `LIST` partitioning, but because we can guarantee continuity of
+`partition_id` values, using `RANGE` partitioning might be a better strategy.
## Why do we want to use explicit logical partition ids?
@@ -201,9 +217,30 @@ find this number, though we might not need to do this.
The single and uniform `partition_id` value for pipeline data gives us more
choices later on than primary-keys-based partitioning.
+## Altering partitioned tables
+
+It will still be possible to run `ALTER TABLE` statements against partitioned tables,
+similarly to how the tables behaved before partitioning. When PostgreSQL runs
+an `ALTER TABLE` statement against a parent partitioned table, it acquires the same
+lock on all child partitions and updates each to keep them in sync. This differs from
+running `ALTER TABLE` on a non-partitioned table in a few key ways:
+
+- PostgreSQL acquires `ACCESS EXCLUSIVE` locks against a larger number of tables, but
+  not a larger amount of data, than it would if the table were not partitioned.
+ Each partition will be locked similarly to the parent table, and all will be updated
+ in a single transaction.
+- Lock duration will be increased based on the number of partitions involved.
+ All `ALTER TABLE` statements executed on the GitLab database (other than `VALIDATE CONSTRAINT`)
+ take small constant amounts of time per table modified. PostgreSQL will need
+ to modify each partition in sequence, increasing the time the lock is held.
+ This duration will remain very small until there are many partitions involved.
+- If thousands of partitions are involved in an `ALTER TABLE`, we will need to verify that
+  the value of `max_locks_per_transaction` is high enough to support all of the locks that
+  need to be taken during the operation (see the sketch below).
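+
+For example (a sketch; the column name is illustrative only):
+
+```sql
+-- One statement on the routing table locks and updates every partition
+-- in a single transaction.
+ALTER TABLE p_ci_builds_metadata
+  ADD COLUMN debug_trace_enabled boolean DEFAULT FALSE NOT NULL;
+
+-- Worth verifying before altering a table with thousands of partitions:
+SHOW max_locks_per_transaction;
+```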
+
## Splitting large partitions into smaller ones
-We want to start with the initial `pipeline_id` number `100` (or higher, like
+We want to start with the initial `partition_id` number `100` (or higher, like
`1000`, depending on our calculations and estimations). We do not want to start
from 1, because existing tables are also large already, and we might want to
split them into smaller partitions. If we start with `100`, we will be able to
@@ -217,6 +254,18 @@ smaller ones (it's not yet clear if we will need to do this), we might be able
to just use background migrations to update partition IDs, and PostgreSQL is
smart enough to move rows between partitions on its own.
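+
+A split could then be a batched update of the partition key (a sketch with an
+illustrative predicate):
+
+```sql
+-- Updating the partition key moves rows from partition 100 to partition 101
+-- (internally a DELETE plus INSERT); partition 101 must already exist.
+UPDATE p_ci_builds_metadata
+SET partition_id = 101
+WHERE partition_id = 100
+  AND id BETWEEN 1 AND 100000;
+```
+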
+### Naming conventions
+
+A partitioned table is called a **routing** table and it will use the `p_`
+prefix, which should help us build automated tooling for query analysis.
+
+A table partition will simply be called a **partition** and it can use the
+physical partition ID as a suffix, preceded by the letter `p`, for example
+`ci_builds_p101`. Existing CI tables will become **zero partitions** of the
+new routing tables. Depending on the chosen
+[partitioning strategy](#how-do-we-want-to-partition-cicd-data) for a given
+table, it is possible to have many logical partitions per physical partition.
+
## Storing partitions metadata in the database
In order to build an efficient mechanism that will be responsible for creating
@@ -225,8 +274,8 @@ metadata table, called `ci_partitions`. In that table we would store metadata
about all the logical partitions, with many pipelines per partition. We may
need to store a range of pipeline ids per logical partition. Using it we will
be able to find the `partition_id` number for a given pipeline ID and we will
-also find information about which logical partitions are “active” or
-“archived”, which will help us to implement a time-decay pattern using database
+also find information about which logical partitions are "active" or
+"archived", which will help us to implement a time-decay pattern using database
declarative partitioning.
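+
+A sketch of what this metadata table could look like (columns are
+illustrative, not a final schema):
+
+```sql
+CREATE TABLE ci_partitions (
+  id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
+  partition_id bigint NOT NULL UNIQUE,
+  pipeline_id_from bigint,
+  pipeline_id_to bigint,
+  status smallint NOT NULL DEFAULT 0 -- for example 0 = active, 1 = archived
+);
+```
+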
The `ci_partitions` table will store information about a partition identifier,
@@ -302,9 +351,116 @@ scope block takes an argument). Preloading instance dependent scopes is not
supported.
```
-We also need to build a proof of concept for removing data on the PostgreSQL
-side (using foreign keys with `ON DELETE CASCADE`) and removing data through
-Rails associations, as this might be an important area of uncertainty.
+### Foreign keys
+
+Foreign keys must reference columns that either are a primary key or form a
+unique constraint. We can define them using these strategies:
+
+#### Between routing tables sharing partition ID
+
+For relations that are part of the same pipeline hierarchy it is possible to
+share the `partition_id` column to define the foreign key constraint:
+
+```plaintext
+p_ci_pipelines:
+ - id
+ - partition_id
+
+p_ci_builds:
+ - id
+ - partition_id
+ - pipeline_id
+```
+
+In this case, `p_ci_builds.partition_id` indicates the partition for the build
+and also for the pipeline. We can add a FK on the routing table using:
+
+```sql
+ALTER TABLE ONLY p_ci_builds
+ ADD CONSTRAINT fk_on_pipeline_and_partition
+ FOREIGN KEY (pipeline_id, partition_id)
+ REFERENCES p_ci_pipelines(id, partition_id) ON DELETE CASCADE;
+```
+
+#### Between routing tables with different partition IDs
+
+It's not possible to reuse the `partition_id` for all relations in the CI domain,
+so in this case we'll need to store the value as a different attribute. For
+example, when canceling redundant pipelines we store the ID of the canceling
+pipeline on the old pipeline row as `auto_canceled_by_id`:
+
+```plaintext
+p_ci_pipelines:
+ - id
+ - partition_id
+ - auto_canceled_by_id
+ - auto_canceled_by_partition_id
+```
+
+In this case we can't ensure that the canceling pipeline is part of the same
+hierarchy as the canceled pipelines, so we need an extra attribute to store its
+partition, `auto_canceled_by_partition_id`, and the FK becomes:
+
+```sql
+ALTER TABLE ONLY p_ci_pipelines
+  ADD CONSTRAINT fk_cancel_redundant_pipelines
+ FOREIGN KEY (auto_canceled_by_id, auto_canceled_by_partition_id)
+ REFERENCES p_ci_pipelines(id, partition_id) ON DELETE SET NULL;
+```
+
+#### Between routing tables and regular tables
+
+Not all of the tables in the CI domain will be partitioned, so we'll have
+routing tables that reference non-partitioned tables. For example, we
+reference `external_pull_requests` from `ci_pipelines`:
+
+```sql
+FOREIGN KEY (external_pull_request_id)
+REFERENCES external_pull_requests(id)
+ON DELETE SET NULL
+```
+
+In this case we only need to move the FK definition from the partition level
+to the routing table so that new pipeline partitions may use it:
+
+```sql
+ALTER TABLE p_ci_pipelines
+ ADD CONSTRAINT fk_external_request
+ FOREIGN KEY (external_pull_request_id)
+ REFERENCES external_pull_requests(id) ON DELETE SET NULL;
+```
+
+#### Between regular tables and routing tables
+
+Most of the tables in the CI domain reference at least one table that will be
+turned into a routing table, for example `ci_pipeline_messages` references
+`ci_pipelines`. These definitions will need to be updated to use the routing
+tables, and for this they will need a `partition_id` column:
+
+```plaintext
+p_ci_pipelines:
+ - id
+ - partition_id
+
+ci_pipeline_messages:
+ - id
+ - pipeline_id
+ - pipeline_partition_id
+```
+
+The foreign key can be defined using:
+
+```sql
+ALTER TABLE ci_pipeline_messages ADD CONSTRAINT fk_pipeline_partitioned
+ FOREIGN KEY (pipeline_id, pipeline_partition_id)
+ REFERENCES p_ci_pipelines(id, partition_id) ON DELETE CASCADE;
+```
+
+The old FK definition will need to be removed, otherwise new inserts into
+`ci_pipeline_messages` with pipeline IDs from a non-zero partition will fail
+with foreign key violations.
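+
+Dropping the old definition could look like this (the constraint name below
+is hypothetical and must be looked up first):
+
+```sql
+ALTER TABLE ci_pipeline_messages
+  DROP CONSTRAINT fk_rails_pipeline_id_legacy;
+```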
+
+### Indexes
We [learned](https://gitlab.com/gitlab-org/gitlab/-/issues/360148) that `PostgreSQL`
does not allow creating a single index (unique or otherwise) across all partitions of a table.
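+
+Instead, an index defined on the routing table becomes a partitioned index,
+backed by a separate physical index on every partition. A sketch:
+
+```sql
+CREATE INDEX index_builds_metadata_on_build_id
+  ON p_ci_builds_metadata (build_id);
+
+-- Unique indexes on a partitioned table must include the partition key.
+CREATE UNIQUE INDEX index_builds_metadata_id_partition_id
+  ON p_ci_builds_metadata (id, partition_id);
+```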
@@ -465,7 +621,7 @@ strategy. The strategy, described in this document, is subject to iteration as
well. Whenever we find a better way to reduce the risk and improve our plan, we
should update this document.
-We’ve managed to find a way to avoid large-scale data migrations, and we are
+We've managed to find a way to avoid large-scale data migrations, and we are
building an iterative strategy for partitioning CI/CD data. We documented our
strategy here to share knowledge and solicit feedback from other team members.