Diffstat (limited to 'doc/development/database')
 doc/development/database/add_foreign_key_to_existing_column.md | 4
 doc/development/database/adding_database_indexes.md | 410
 doc/development/database/avoiding_downtime_in_migrations.md | 40
 doc/development/database/background_migrations.md | 6
 doc/development/database/batched_background_migrations.md | 198
 doc/development/database/ci_mirrored_tables.md | 156
 doc/development/database/client_side_connection_pool.md | 13
 doc/development/database/creating_enums.md | 154
 doc/development/database/database_debugging.md | 177
 doc/development/database/database_dictionary.md | 51
 doc/development/database/database_lab.md | 2
 doc/development/database/database_query_comments.md | 62
 doc/development/database/database_reviewer_guidelines.md | 10
 doc/development/database/db_dump.md | 56
 doc/development/database/filtering_by_label.md | 179
 doc/development/database/foreign_keys.md | 199
 doc/development/database/hash_indexes.md | 26
 doc/development/database/index.md | 52
 doc/development/database/insert_into_tables_in_batches.md | 196
 doc/development/database/iterating_tables_in_batches.md | 598
 doc/development/database/loose_foreign_keys.md | 11
 doc/development/database/multiple_databases.md | 9
 doc/development/database/namespaces_storage_statistics.md | 193
 doc/development/database/not_null_constraints.md | 4
 doc/development/database/ordering_table_columns.md | 152
 doc/development/database/pagination_guidelines.md | 2
 doc/development/database/pagination_performance_guidelines.md | 10
 doc/development/database/polymorphic_associations.md | 152
 doc/development/database/post_deployment_migrations.md | 4
 doc/development/database/query_count_limits.md | 70
 doc/development/database/query_performance.md | 74
 doc/development/database/query_recorder.md | 145
 doc/development/database/rename_database_tables.md | 6
 doc/development/database/serializing_data.md | 90
 doc/development/database/sha1_as_binary.md | 42
 doc/development/database/single_table_inheritance.md | 63
 doc/development/database/strings_and_the_text_data_type.md | 7
 doc/development/database/swapping_tables.md | 51
 doc/development/database/transaction_guidelines.md | 2
 doc/development/database/understanding_explain_plans.md | 829
 doc/development/database/verifying_database_capabilities.md | 38
 41 files changed, 4383 insertions(+), 160 deletions(-)
diff --git a/doc/development/database/add_foreign_key_to_existing_column.md b/doc/development/database/add_foreign_key_to_existing_column.md
index 7a18da2223f..8a8fe3c0a1e 100644
--- a/doc/development/database/add_foreign_key_to_existing_column.md
+++ b/doc/development/database/add_foreign_key_to_existing_column.md
@@ -6,7 +6,7 @@ info: To determine the technical writer assigned to the Stage/Group associated w
# Add a foreign key constraint to an existing column
-Foreign keys ensure consistency between related database tables. The current database review process **always** encourages you to add [foreign keys](../foreign_keys.md) when creating tables that reference records from other tables.
+Foreign keys ensure consistency between related database tables. The current database review process **always** encourages you to add [foreign keys](foreign_keys.md) when creating tables that reference records from other tables.
Starting with Rails version 4, Rails includes migration helpers to add foreign key constraints
to database tables. Before Rails 4, the only way for ensuring some level of consistency was the
@@ -95,7 +95,7 @@ The approach here depends on the data volume and the cleanup strategy. If we can
records by doing a database query and the record count is not high, then the data migration can
be executed in a Rails migration.
-In case the data volume is higher (>1000 records), it's better to create a background migration. If unsure, please contact the database team for advice.
+In case the data volume is higher (>1000 records), it's better to create a background migration. If unsure, contact the database team for advice.
Example for cleaning up records in the `emails` table in a database migration:
diff --git a/doc/development/database/adding_database_indexes.md b/doc/development/database/adding_database_indexes.md
new file mode 100644
index 00000000000..8abd7c8298e
--- /dev/null
+++ b/doc/development/database/adding_database_indexes.md
@@ -0,0 +1,410 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Adding Database Indexes
+
+Indexes can be used to speed up database queries, but when should you add a new
+index? Traditionally the answer to this question has been to add an index for
+every column used for filtering or joining data. For example, consider the
+following query:
+
+```sql
+SELECT *
+FROM projects
+WHERE user_id = 2;
+```
+
+Here we are filtering by the `user_id` column and as such a developer may decide
+to index this column.
+
+While in certain cases indexing columns using the above approach may make sense,
+it can actually have a negative impact. Whenever you write data to a table, any
+existing indexes must also be updated. The more indexes there are, the slower this
+can potentially become. Indexes can also take up significant disk space, depending
+on the amount of data indexed and the index type. For example, PostgreSQL offers
+`GIN` indexes which can be used to index certain data types that cannot be
+indexed by regular B-tree indexes. These indexes, however, generally take up more
+data and are slower to update compared to B-tree indexes.
+
+Because of all this, it's important to make the following considerations
+when adding a new index:
+
+1. Do the new queries re-use as many existing indexes as possible?
+1. Is there enough data that using an index is faster than iterating over
+ rows in the table?
+1. Is the overhead of maintaining the index worth the reduction in query
+ timings?
+
+## Re-using Queries
+
+The first step is to make sure your query re-uses as many existing indexes as
+possible. For example, consider the following query:
+
+```sql
+SELECT *
+FROM todos
+WHERE user_id = 123
+AND state = 'open';
+```
+
+Now imagine we already have an index on the `user_id` column but not on the
+`state` column. One may think this query performs badly due to `state` being
+unindexed. In reality the query may perform just fine given the index on
+`user_id` can filter out enough rows.
+
+The best way to determine if indexes are re-used is to run your query using
+`EXPLAIN ANALYZE`. Depending on the joined tables and the columns being used for filtering,
+you may find an extra index doesn't make much, if any, difference.
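+
+If you want a quick look at the plan from a Rails console, `ActiveRecord::Relation#explain`
+prints the plan for the query the relation would run (a plain `EXPLAIN`; run `EXPLAIN ANALYZE`
+in `psql` when you need actual timings). A minimal sketch, assuming the `Todo` model and the
+values used above:
+
+```ruby
+# Prints the PostgreSQL plan for: SELECT "todos".* FROM "todos" WHERE user_id = 123 AND state = 'open'
+puts Todo.where(user_id: 123, state: 'open').explain
+```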
+
+In short:
+
+1. Try to write your query in such a way that it re-uses as many existing
+ indexes as possible.
+1. Run the query using `EXPLAIN ANALYZE` and study the output to find the most
+ ideal query.
+
+## Data Size
+
+A database may decide not to use an index when a regular sequence scan
+(iterating over all rows) is faster, which is often the case for small tables.
+
+Consider adding an index if a table is expected to grow, and your query has to filter a lot of rows.
+You may _not_ want to add an index if the table size is small (<`1,000` records),
+or if existing indexes already filter out enough rows.
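+
+To get a rough sense of a table's size, you can check the planner's row estimate from a
+Rails console. A minimal sketch, assuming a hypothetical `my_table`:
+
+```ruby
+# Approximate row count from PostgreSQL's statistics (cheap; does not scan the table)
+estimate = ApplicationRecord.connection
+  .select_value("SELECT reltuples::bigint FROM pg_class WHERE relname = 'my_table'")
+```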
+
+## Maintenance Overhead
+
+Indexes have to be updated on every table write. In the case of PostgreSQL, _all_
+existing indexes are updated whenever data is written to a table. As a
+result, having many indexes on the same table slows down writes. It's therefore important
+to balance query performance with the overhead of maintaining an extra index.
+
+Let's say that adding an index reduces SELECT timings by 5 milliseconds but increases
+INSERT/UPDATE/DELETE timings by 10 milliseconds. In this case, the new index may not be worth
+it. A new index is more valuable when SELECT timings are reduced and INSERT/UPDATE/DELETE
+timings are unaffected.
+
+## Finding Unused Indexes
+
+To see which indexes are unused you can run the following query:
+
+```sql
+SELECT relname as table_name, indexrelname as index_name, idx_scan, idx_tup_read, idx_tup_fetch, pg_size_pretty(pg_relation_size(indexrelname::regclass))
+FROM pg_stat_all_indexes
+WHERE schemaname = 'public'
+AND idx_scan = 0
+AND idx_tup_read = 0
+AND idx_tup_fetch = 0
+ORDER BY pg_relation_size(indexrelname::regclass) desc;
+```
+
+This query outputs a list containing all indexes that are never used and sorts
+them by index size in descending order. This query helps in
+determining whether existing indexes are still required. More information on
+the meaning of the various columns can be found at
+<https://www.postgresql.org/docs/current/monitoring-stats.html>.
+
+To determine if an index is still being used on production, use the following
+Thanos query with your index name:
+
+```sql
+sum(rate(pg_stat_user_indexes_idx_tup_read{env="gprd", indexrelname="index_ci_name", type="patroni-ci"}[5m]))
+```
+
+Because the query output relies on the actual usage of your database, it
+may be affected by factors such as:
+
+- Certain queries never being executed, thus not being able to use certain
+ indexes.
+- Certain tables having little data, resulting in PostgreSQL using sequence
+ scans instead of index scans.
+
+This data is only reliable for a frequently used database with
+plenty of data, and one that uses as many GitLab features as possible.
+
+## Requirements for naming indexes
+
+Indexes with complex definitions must be explicitly named rather than
+relying on the implicit naming behavior of migration methods. In short,
+that means you **must** provide an explicit name argument for an index
+created with one or more of the following options:
+
+- `where`
+- `using`
+- `order`
+- `length`
+- `type`
+- `opclass`
+
+### Considerations for index names
+
+Check our [Constraints naming conventions](constraint_naming_convention.md) page.
+
+### Why explicit names are required
+
+As Rails is database agnostic, it generates an index name only
+from the required options of all indexes: table name and column names.
+For example, imagine the following two indexes are created in a migration:
+
+```ruby
+def up
+ add_index :my_table, :my_column
+
+ add_index :my_table, :my_column, where: 'my_column IS NOT NULL'
+end
+```
+
+Creation of the second index would fail, because Rails would generate
+the same name for both indexes.
+
+This naming issue is further complicated by the behavior of the `index_exists?` method.
+It considers only the table name, column names, and uniqueness specification
+of the index when making a comparison. Consider:
+
+```ruby
+def up
+ unless index_exists?(:my_table, :my_column, where: 'my_column IS NOT NULL')
+ add_index :my_table, :my_column, where: 'my_column IS NOT NULL'
+ end
+end
+```
+
+The call to `index_exists?` returns true if **any** index exists on
+`:my_table` and `:my_column`, and index creation is bypassed.
+
+The `add_concurrent_index` helper is a requirement for creating indexes
+on populated tables. Because it cannot be used inside a transactional
+migration, it has a built-in check that detects if the index already
+exists. In the event a match is found, index creation is skipped.
+Without an explicit name argument, Rails can return a false positive
+for `index_exists?`, causing a required index to not be created
+properly. By always requiring a name for certain types of indexes, the
+chance of error is greatly reduced.
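+
+For illustration, here's a sketch of the earlier example rewritten with explicit names (the
+index names are hypothetical):
+
+```ruby
+def up
+  # Explicit names keep the two definitions distinct, so creating the second
+  # (partial) index no longer fails with a duplicate-name error.
+  add_index :my_table, :my_column, name: 'index_my_table_on_my_column'
+  add_index :my_table, :my_column, where: 'my_column IS NOT NULL',
+    name: 'index_my_table_on_my_column_where_not_null'
+end
+```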
+
+## Temporary indexes
+
+There may be times when an index is only needed temporarily.
+
+For example, in a migration, a column of a table might be conditionally
+updated. To query which columns must be updated in the
+[query performance guidelines](query_performance.md), an index is needed
+that would otherwise not be used.
+
+In these cases, consider a temporary index. To specify a
+temporary index:
+
+1. Prefix the index name with `tmp_` and follow the [naming conventions](constraint_naming_convention.md).
+1. Create a follow-up issue to remove the index in the next (or future) milestone.
+1. Add a comment in the migration mentioning the removal issue.
+
+A temporary migration would look like:
+
+```ruby
+INDEX_NAME = 'tmp_index_projects_on_owner_where_emails_disabled'
+
+def up
+ # Temporary index to be removed in 13.9 https://gitlab.com/gitlab-org/gitlab/-/issues/1234
+ add_concurrent_index :projects, :creator_id, where: 'emails_disabled = false', name: INDEX_NAME
+end
+
+def down
+ remove_concurrent_index_by_name :projects, INDEX_NAME
+end
+```
+
+## Create indexes asynchronously
+
+For very large tables, index creation can be a challenge to manage.
+While `add_concurrent_index` creates indexes in a way that does not block
+normal traffic, it can still be problematic when index creation runs for
+many hours. Necessary database operations like `autovacuum` cannot run, and
+on GitLab.com, the deployment process is blocked waiting for index
+creation to finish.
+
+To limit impact on GitLab.com, a process exists to create indexes
+asynchronously during weekend hours. Due to generally lower traffic and fewer deployments,
+index creation can proceed at a lower level of risk.
+
+### Schedule index creation for a low-impact time
+
+1. [Schedule the index to be created](#schedule-the-index-to-be-created).
+1. [Verify the MR was deployed and the index exists in production](#verify-the-mr-was-deployed-and-the-index-exists-in-production).
+1. [Add a migration to create the index synchronously](#add-a-migration-to-create-the-index-synchronously).
+
+### Schedule the index to be created
+
+Create an MR with a post-deployment migration which prepares the index
+for asynchronous creation. An example of creating an index using
+the asynchronous index helpers can be seen in the block below. This migration
+enters the index name and definition into the `postgres_async_indexes`
+table. The process that runs on weekends pulls indexes from this
+table and attempts to create them.
+
+```ruby
+# in db/post_migrate/
+
+INDEX_NAME = 'index_ci_builds_on_some_column'
+
+def up
+ prepare_async_index :ci_builds, :some_column, name: INDEX_NAME
+end
+
+def down
+ unprepare_async_index :ci_builds, :some_column, name: INDEX_NAME
+end
+```
+
+### Verify the MR was deployed and the index exists in production
+
+You can verify if the post-deploy migration was executed on GitLab.com by:
+
+- Executing `/chatops run auto_deploy status <merge_sha>`. If the output returns `db/gprd`,
+ the post-deploy migration has been executed in the production database. More details in this
+ [guide](https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/post_deploy_migration/readme.md#how-to-determine-if-a-post-deploy-migration-has-been-executed-on-gitlabcom).
+- Use a meta-command in #database-lab, such as: `\d <index_name>`.
+ - Ensure that the index is not [`invalid`](https://www.postgresql.org/docs/12/sql-createindex.html#:~:text=The%20psql%20%5Cd%20command%20will%20report%20such%20an%20index%20as%20INVALID).
+- Ask someone in #database to check if the index exists.
+- With proper access, you can also verify directly on production or in a
+ production clone.
+
+### Add a migration to create the index synchronously
+
+After the index is verified to exist on the production database, create a second
+merge request that adds the index synchronously. The schema changes must be
+updated and committed to `structure.sql` in this second merge request.
+The synchronous migration results in a no-op on GitLab.com, but you should still add the
+migration as expected for other installations. The below block
+demonstrates how to create the second migration for the previous
+asynchronous example.
+
+WARNING:
+Verify that the index exists in production before merging a second migration with `add_concurrent_index`.
+If the second migration is deployed before the index has been created,
+the index is created synchronously when the second migration executes.
+
+```ruby
+# in db/post_migrate/
+
+INDEX_NAME = 'index_ci_builds_on_some_column'
+
+disable_ddl_transaction!
+
+def up
+ add_concurrent_index :ci_builds, :some_column, name: INDEX_NAME
+end
+
+def down
+ remove_concurrent_index_by_name :ci_builds, INDEX_NAME
+end
+```
+
+## Test database index changes locally
+
+You must test the database index changes locally before creating a merge request.
+
+### Verify indexes created asynchronously
+
+Use the asynchronous index helpers on your local environment to test changes for creating an index:
+
+1. Enable the feature flags by running `Feature.enable(:database_async_index_creation)` and `Feature.enable(:database_reindexing)` in the Rails console.
+1. Run `bundle exec rails db:migrate` so that it creates an entry in the `postgres_async_indexes` table.
+1. Run `bundle exec rails gitlab:db:reindex` so that the index is created asynchronously.
+1. To verify the index, open the PostgreSQL console using the [GDK](https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/postgresql.md) command `gdk psql` and run the command `\d <index_name>` to check that your newly created index exists.
+
+## Drop indexes asynchronously
+
+For very large tables, index destruction can be a challenge to manage.
+While `remove_concurrent_index` removes indexes in a way that does not block
+normal traffic, it can still be problematic when index destruction runs for
+many hours. Necessary database operations like `autovacuum` cannot run, and
+the deployment process on GitLab.com is blocked while waiting for index
+destruction to finish.
+
+To limit the impact on GitLab.com, use the following process to remove indexes
+asynchronously during weekend hours. Due to generally lower traffic and fewer deployments,
+index destruction can proceed at a lower level of risk.
+
+1. [Schedule the index to be removed](#schedule-the-index-to-be-removed).
+1. [Verify the MR was deployed and the index exists in production](#verify-the-mr-was-deployed-and-the-index-exists-in-production).
+1. [Add a migration to destroy the index synchronously](#add-a-migration-to-destroy-the-index-synchronously).
+
+### Schedule the index to be removed
+
+Create an MR with a post-deployment migration which prepares the index
+for asynchronous destruction. For example, to destroy an index using
+the asynchronous index helpers:
+
+```ruby
+# in db/post_migrate/
+
+INDEX_NAME = 'index_ci_builds_on_some_column'
+
+def up
+ prepare_async_index_removal :ci_builds, :some_column, name: INDEX_NAME
+end
+
+def down
+ unprepare_async_index :ci_builds, :some_column, name: INDEX_NAME
+end
+```
+
+This migration enters the index name and definition into the `postgres_async_indexes`
+table. The process that runs on weekends pulls indexes from this table and attempts
+to remove them.
+
+You must test the database index changes locally before creating a merge request.
+
+### Verify the MR was deployed and the index exists in production
+
+You can verify if the MR was deployed to GitLab.com with
+`/chatops run auto_deploy status <merge_sha>`. To verify the existence of
+the index, you can:
+
+- Use a meta-command in `#database-lab`, for example: `\d <index_name>`.
+ - Make sure the index is not [`invalid`](https://www.postgresql.org/docs/12/sql-createindex.html#:~:text=The%20psql%20%5Cd%20command%20will%20report%20such%20an%20index%20as%20INVALID).
+- Ask someone in `#database` to check if the index exists.
+- If you have access, you can verify directly on production or in a
+ production clone.
+
+### Add a migration to destroy the index synchronously
+
+After you verify the index exists in the production database, create a second
+merge request that removes the index synchronously. The schema changes must be
+updated and committed to `structure.sql` in this second merge request.
+The synchronous migration results in a no-op on GitLab.com, but you should still add the
+migration as expected for other installations. For example, to
+create the second migration for the previous asynchronous example:
+
+WARNING:
+Verify that the index no longer exists in production before merging a second migration with `remove_concurrent_index_by_name`.
+If the second migration is deployed before the index has been destroyed,
+the index is destroyed synchronously when the second migration executes.
+
+```ruby
+# in db/post_migrate/
+
+INDEX_NAME = 'index_ci_builds_on_some_column'
+
+disable_ddl_transaction!
+
+def up
+ remove_concurrent_index_by_name :ci_builds, name: INDEX_NAME
+end
+
+def down
+ add_concurrent_index :ci_builds, :some_column, name: INDEX_NAME
+end
+```
+
+### Verify indexes removed asynchronously
+
+To test changes for removing an index, use the asynchronous index helpers on your local environment:
+
+1. Enable the feature flags by running `Feature.enable(:database_async_index_destruction)` and `Feature.enable(:database_reindexing)` in the Rails console.
+1. Run `bundle exec rails db:migrate` which should create an entry in the `postgres_async_indexes` table.
+1. Run `bundle exec rails gitlab:db:reindex` to destroy the index asynchronously.
+1. To verify the index, open the PostgreSQL console by using the [GDK](https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/postgresql.md)
+ command `gdk psql` and run `\d <index_name>` to check that the destroyed index no longer exists.
diff --git a/doc/development/database/avoiding_downtime_in_migrations.md b/doc/development/database/avoiding_downtime_in_migrations.md
index 2d079656e23..79c76b351c8 100644
--- a/doc/development/database/avoiding_downtime_in_migrations.md
+++ b/doc/development/database/avoiding_downtime_in_migrations.md
@@ -93,9 +93,8 @@ class RemoveUsersUpdatedAtColumn < Gitlab::Database::Migration[2.0]
end
```
-You can consider [enabling lock retries](
-https://docs.gitlab.com/ee/development/migration_style_guide.html#usage-with-transactional-migrations
-) when you run a migration on big tables, because it might take some time to
+You can consider [enabling lock retries](../migration_style_guide.md#usage-with-transactional-migrations)
+when you run a migration on big tables, because it might take some time to
acquire a lock on this table.
#### B. The removed column has an index or constraint that belongs to it
@@ -104,7 +103,7 @@ If the `down` method requires adding back any dropped indexes or constraints, th
be done within a transactional migration, then the migration would look like this:
```ruby
-class RemoveUsersUpdatedAtColumn < Gitlab::Database::Migration[1.0]
+class RemoveUsersUpdatedAtColumn < Gitlab::Database::Migration[2.0]
disable_ddl_transaction!
def up
@@ -126,13 +125,11 @@ end
In the `down` method, we check to see if the column already exists before adding it again.
We do this because the migration is non-transactional and might have failed while it was running.
-The [`disable_ddl_transaction!`](
-https://docs.gitlab.com/ee/development/migration_style_guide.html#usage-with-non-transactional-migrations-disable_ddl_transaction
-) is used to disable the transaction that wraps the whole migration.
+The [`disable_ddl_transaction!`](../migration_style_guide.md#usage-with-non-transactional-migrations-disable_ddl_transaction)
+is used to disable the transaction that wraps the whole migration.
-You can refer to the page [Migration Style Guide](
-https://docs.gitlab.com/ee/development/migration_style_guide.html
-) for more information about database migrations.
+You can refer to the page [Migration Style Guide](../migration_style_guide.md)
+for more information about database migrations.
### Step 3: Removing the ignore rule (release M+2)
@@ -145,9 +142,13 @@ the `remove_after` date has passed.
## Renaming Columns
Renaming columns the normal way requires downtime as an application may continue
-using the old column name during/after a database migration. To rename a column
-without requiring downtime we need two migrations: a regular migration, and a
-post-deployment migration. Both these migration can go in the same release.
+to use the old column names during or after a database migration. To rename a column
+without requiring downtime, we need two migrations: a regular migration and a
+post-deployment migration. Both these migrations can go in the same release.
+
+NOTE:
+It's not possible to rename columns with default values. For more details, see
+[this merge request](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/52032#default-values).
### Step 1: Add The Regular Migration
@@ -157,7 +158,7 @@ renaming. For example
```ruby
# A regular migration in db/migrate
-class RenameUsersUpdatedAtToUpdatedAtTimestamp < Gitlab::Database::Migration[1.0]
+class RenameUsersUpdatedAtToUpdatedAtTimestamp < Gitlab::Database::Migration[2.0]
disable_ddl_transaction!
def up
@@ -185,7 +186,7 @@ We can perform this cleanup using
```ruby
# A post-deployment migration in db/post_migrate
-class CleanupUsersUpdatedAtRename < Gitlab::Database::Migration[1.0]
+class CleanupUsersUpdatedAtRename < Gitlab::Database::Migration[2.0]
disable_ddl_transaction!
def up
@@ -198,7 +199,7 @@ class CleanupUsersUpdatedAtRename < Gitlab::Database::Migration[1.0]
end
```
-If you're renaming a [large table](https://gitlab.com/gitlab-org/gitlab/-/blob/master/rubocop/rubocop-migrations.yml#L3), please carefully consider the state when the first migration has run but the second cleanup migration hasn't been run yet.
+If you're renaming a [large table](https://gitlab.com/gitlab-org/gitlab/-/blob/master/rubocop/rubocop-migrations.yml#L3), carefully consider the state when the first migration has run but the second cleanup migration hasn't been run yet.
With [Canary](https://gitlab.com/gitlab-com/gl-infra/readiness/-/tree/master/library/canary/) it is possible that the system runs in this state for a significant amount of time.
## Changing Column Constraints
@@ -232,7 +233,7 @@ as follows:
```ruby
# A regular migration in db/migrate
-class ChangeUsersUsernameStringToText < Gitlab::Database::Migration[1.0]
+class ChangeUsersUsernameStringToText < Gitlab::Database::Migration[2.0]
disable_ddl_transaction!
def up
@@ -251,7 +252,7 @@ Next we need to clean up our changes using a post-deployment migration:
```ruby
# A post-deployment migration in db/post_migrate
-class ChangeUsersUsernameStringToTextCleanup < Gitlab::Database::Migration[1.0]
+class ChangeUsersUsernameStringToTextCleanup < Gitlab::Database::Migration[2.0]
disable_ddl_transaction!
def up
@@ -295,8 +296,7 @@ when migrating a column in a large table (for example, `issues`). Background
migrations spread the work / load over a longer time period, without slowing
down deployments.
-For more information, see [the documentation on cleaning up background
-migrations](background_migrations.md#cleaning-up).
+For more information, see [the documentation on cleaning up background migrations](background_migrations.md#cleaning-up).
## Adding Indexes
diff --git a/doc/development/database/background_migrations.md b/doc/development/database/background_migrations.md
index 0124dbae51f..9b596eb7379 100644
--- a/doc/development/database/background_migrations.md
+++ b/doc/development/database/background_migrations.md
@@ -8,7 +8,7 @@ info: To determine the technical writer assigned to the Stage/Group associated w
WARNING:
Background migrations are strongly discouraged in favor of the new [batched background migrations framework](batched_background_migrations.md).
-Please check that documentation and determine if that framework suits your needs and fall back
+Check that documentation and determine if that framework suits your needs and fall back
to these only if required.
Background migrations should be used to perform data migrations whenever a
@@ -368,9 +368,9 @@ A strategy to make the migration run faster is to schedule larger batches, and t
within the background migration to perform multiple statements.
The background migration helpers that queue multiple jobs such as
-`queue_background_migration_jobs_by_range_at_intervals` use [`EachBatch`](../iterating_tables_in_batches.md).
+`queue_background_migration_jobs_by_range_at_intervals` use [`EachBatch`](iterating_tables_in_batches.md).
The example above has batches of 1000, where each queued job takes two seconds. If the query has been optimized
-to make the time for the delete statement within the [query performance guidelines](../query_performance.md),
+to make the time for the delete statement within the [query performance guidelines](query_performance.md),
1000 may be the largest number of records that can be deleted in a reasonable amount of time.
The minimum and most common interval for delaying jobs is two minutes. This results in two seconds
diff --git a/doc/development/database/batched_background_migrations.md b/doc/development/database/batched_background_migrations.md
index f3ea82b5c61..edb22fcf436 100644
--- a/doc/development/database/batched_background_migrations.md
+++ b/doc/development/database/batched_background_migrations.md
@@ -105,11 +105,16 @@ for more details.
## Batched background migrations for EE-only features
-All the background migration classes for EE-only features should be present in GitLab CE.
-For this purpose, create an empty class for GitLab CE, and extend it for GitLab EE
+All the background migration classes for EE-only features should be present in GitLab FOSS.
+For this purpose, create an empty class for GitLab FOSS, and extend it for GitLab EE
as explained in the guidelines for
[implementing Enterprise Edition features](../ee_features.md#code-in-libgitlabbackground_migration).
+NOTE:
+Background migration classes for EE-only features that use job arguments should define them
+in the GitLab FOSS class. This is required to prevent job arguments validation from failing when
+the migration is scheduled in the GitLab FOSS context.
+
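+A minimal sketch of that layout (the class name here is hypothetical):
+
+```ruby
+# FOSS: the job arguments are declared here, even though only EE uses them.
+module Gitlab
+  module BackgroundMigration
+    class BackfillWidgetSettings < BatchedMigrationJob
+      job_arguments :source_column, :target_column
+
+      def perform
+        # no-op in FOSS; EE prepends the real implementation
+      end
+    end
+  end
+end
+
+Gitlab::BackgroundMigration::BackfillWidgetSettings.prepend_mod_with('Gitlab::BackgroundMigration::BackfillWidgetSettings')
+```
+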
Batched Background migrations are simple classes that define a `perform` method. A
Sidekiq worker then executes such a class, passing any arguments to it. All
migration classes must be defined in the namespace
@@ -132,6 +137,10 @@ queue_batched_background_migration(
)
```
+NOTE:
+This helper raises an error if the number of provided job arguments does not match
+the number of [job arguments](#job-arguments) defined in `JOB_CLASS_NAME`.
+
Make sure the newly-created data is either migrated, or
saved in both the old and new version upon creation. Removals in
turn can be handled by defining foreign keys with cascading deletes.
@@ -186,6 +195,115 @@ Bump to the [import/export version](../../user/project/settings/import_export.md
be required, if importing a project from a prior version of GitLab requires the
data to be in the new format.
+## Job arguments
+
+`BatchedMigrationJob` provides the `job_arguments` helper method for job classes to define the job arguments they need.
+
+Batched migrations scheduled with `queue_batched_background_migration` **must** use the helper to define the job arguments:
+
+```ruby
+queue_batched_background_migration(
+ 'CopyColumnUsingBackgroundMigrationJob',
+ TABLE_NAME,
+ 'name', 'name_convert_to_text',
+ job_interval: DELAY_INTERVAL
+)
+```
+
+NOTE:
+If the number of defined job arguments does not match the number of job arguments provided when
+scheduling the migration, `queue_batched_background_migration` raises an error.
+
+In this example, `copy_from` returns `name`, and `copy_to` returns `name_convert_to_text`:
+
+```ruby
+class CopyColumnUsingBackgroundMigrationJob < BatchedMigrationJob
+ job_arguments :copy_from, :copy_to
+
+ def perform
+ from_column = connection.quote_column_name(copy_from)
+ to_column = connection.quote_column_name(copy_to)
+
+ assignment_clause = "#{to_column} = #{from_column}"
+
+ each_sub_batch(operation_name: :update_all) do |relation|
+ relation.update_all(assignment_clause)
+ end
+ end
+end
+```
+
+### Additional filters
+
+By default, when creating background jobs to perform the migration, batched background migrations
+iterate over the full specified table. This iteration is done using the
+[`PrimaryKeyBatchingStrategy`](https://gitlab.com/gitlab-org/gitlab/-/blob/c9dabd1f4b8058eece6d8cb4af95e9560da9a2ee/lib/gitlab/database/migrations/batched_background_migration_helpers.rb#L17). If the table has 1000 records
+and the batch size is 100, the work is batched into 10 jobs. For illustrative purposes,
+`EachBatch` is used like this:
+
+```ruby
+# PrimaryKeyBatchingStrategy
+Namespace.each_batch(of: 100) do |relation|
+ relation.where(type: nil).update_all(type: 'User') # this happens in each background job
+end
+```
+
+In some cases, only a subset of records must be examined. If only 10% of the 1000 records
+need examination, apply a filter to the initial relation when the jobs are created:
+
+```ruby
+Namespace.where(type: nil).each_batch(of: 100) do |relation|
+ relation.update_all(type: 'User')
+end
+```
+
+In the first example, we don't know how many records will be updated in each batch.
+In the second (filtered) example, we know exactly 100 will be updated with each batch.
+
+`BatchedMigrationJob` provides a `scope_to` helper method to apply additional filters and achieve this:
+
+1. Create a new migration job class that inherits from `BatchedMigrationJob` and defines the additional filter:
+
+ ```ruby
+ class BackfillNamespaceType < BatchedMigrationJob
+ scope_to ->(relation) { relation.where(type: nil) }
+
+ def perform
+ each_sub_batch(operation_name: :update_all) do |sub_batch|
+ sub_batch.update_all(type: 'User')
+ end
+ end
+ end
+ ```
+
+1. In the post-deployment migration, enqueue the batched background migration:
+
+ ```ruby
+ class BackfillNamespaceType < Gitlab::Database::Migration[2.0]
+ MIGRATION = 'BackfillNamespaceType'
+ DELAY_INTERVAL = 2.minutes
+
+ restrict_gitlab_migration gitlab_schema: :gitlab_main
+
+ def up
+ queue_batched_background_migration(
+ MIGRATION,
+ :namespaces,
+ :id,
+ job_interval: DELAY_INTERVAL
+ )
+ end
+
+ def down
+ delete_batched_background_migration(MIGRATION, :namespaces, :id, [])
+ end
+ end
+ ```
+
+NOTE:
+When applying additional filters, it is important to ensure they are properly covered by an index to optimize `EachBatch` performance.
+In the example above we need an index on `(type, id)` to support the filters. See [the `EachBatch` docs for more information](../iterating_tables_in_batches.md).
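+
+A sketch of a migration adding such a covering index (the migration and index names are
+hypothetical):
+
+```ruby
+class AddTmpIndexNamespacesOnTypeAndId < Gitlab::Database::Migration[2.0]
+  disable_ddl_transaction!
+
+  INDEX_NAME = 'tmp_index_namespaces_on_type_and_id'
+
+  def up
+    # Covers the `scope_to ->(relation) { relation.where(type: nil) }` filter batched by `id`.
+    add_concurrent_index :namespaces, [:type, :id], name: INDEX_NAME
+  end
+
+  def down
+    remove_concurrent_index_by_name :namespaces, INDEX_NAME
+  end
+end
+```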
+
## Example
The `routes` table has a `source_type` field that's used for a polymorphic relationship.
@@ -221,8 +339,6 @@ background migration.
correctly handled by the batched migration framework. Any subclass of
`BatchedMigrationJob` is initialized with necessary arguments to
execute the batch, as well as a connection to the tracking database.
- Additional `job_arguments` set on the migration are passed to the
- job's `perform` method.
1. Add a new trigger to the database to update newly created and updated routes,
similar to this example:
@@ -320,7 +436,7 @@ The default batching strategy provides an efficient way to iterate over primary
However, if you need to iterate over columns where values are not unique, you must use a
different batching strategy.
-The `LooseIndexScanBatchingStrategy` batching strategy uses a special version of [`EachBatch`](../iterating_tables_in_batches.md#loose-index-scan-with-distinct_each_batch)
+The `LooseIndexScanBatchingStrategy` batching strategy uses a special version of [`EachBatch`](iterating_tables_in_batches.md#loose-index-scan-with-distinct_each_batch)
to provide efficient and stable iteration over the distinct column values.
This example shows a batched background migration where the `issues.project_id` column is used as
@@ -374,76 +490,8 @@ module Gitlab
end
```
-### Adding filters to the initial batching
-
-By default, when creating background jobs to perform the migration, batched background migrations will iterate over the full specified table. This is done using the [`PrimaryKeyBatchingStrategy`](https://gitlab.com/gitlab-org/gitlab/-/blob/c9dabd1f4b8058eece6d8cb4af95e9560da9a2ee/lib/gitlab/database/migrations/batched_background_migration_helpers.rb#L17). This means if there are 1000 records in the table and the batch size is 100, there will be 10 jobs. For illustrative purposes, `EachBatch` is used like this:
-
-```ruby
-# PrimaryKeyBatchingStrategy
-Projects.all.each_batch(of: 100) do |relation|
- relation.where(foo: nil).update_all(foo: 'bar') # this happens in each background job
-end
-```
-
-There are cases where we only need to look at a subset of records. Perhaps we only need to update 1 out of every 10 of those 1000 records. It would be best if we could apply a filter to the initial relation when the jobs are created:
-
-```ruby
-Projects.where(foo: nil).each_batch(of: 100) do |relation|
- relation.update_all(foo: 'bar')
-end
-```
-
-In the `PrimaryKeyBatchingStrategy` example, we do not know how many records will be updated in each batch. In the filtered example, we know exactly 100 will be updated with each batch.
-
-The `PrimaryKeyBatchingStrategy` contains [a method that can be overwritten](https://gitlab.com/gitlab-org/gitlab/-/blob/dd1e70d3676891025534dc4a1e89ca9383178fe7/lib/gitlab/background_migration/batching_strategies/primary_key_batching_strategy.rb#L38-52) to apply additional filtering on the initial `EachBatch`.
-
-We can accomplish this by:
-
-1. Create a new class that inherits from `PrimaryKeyBatchingStrategy` and overrides the method using the desired filter (this may be the same filter used in the sub-batch):
-
- ```ruby
- # frozen_string_literal: true
-
- module GitLab
- module BackgroundMigration
- module BatchingStrategies
- class FooStrategy < PrimaryKeyBatchingStrategy
- def apply_additional_filters(relation, job_arguments: [], job_class: nil)
- relation.where(foo: nil)
- end
- end
- end
- end
- end
- ```
-
-1. In the post-deployment migration that queues the batched background migration, specify the new batching strategy using the `batch_class_name` parameter:
-
- ```ruby
- class BackfillProjectsFoo < Gitlab::Database::Migration[2.0]
- MIGRATION = 'BackfillProjectsFoo'
- DELAY_INTERVAL = 2.minutes
- BATCH_CLASS_NAME = 'FooStrategy'
-
- restrict_gitlab_migration gitlab_schema: :gitlab_main
-
- def up
- queue_batched_background_migration(
- MIGRATION,
- :routes,
- :id,
- job_interval: DELAY_INTERVAL,
- batch_class_name: BATCH_CLASS_NAME
- )
- end
-
- def down
- delete_batched_background_migration(MIGRATION, :routes, :id, [])
- end
- end
- ```
-
-When applying a batching strategy, it is important to ensure the filter properly covered by an index to optimize `EachBatch` performance. See [the `EachBatch` docs for more information](../iterating_tables_in_batches.md).
+NOTE:
+[Additional filters](#additional-filters) defined with `scope_to` will be ignored by `LooseIndexScanBatchingStrategy` and `distinct_each_batch`.
## Testing
diff --git a/doc/development/database/ci_mirrored_tables.md b/doc/development/database/ci_mirrored_tables.md
new file mode 100644
index 00000000000..06f0087fafe
--- /dev/null
+++ b/doc/development/database/ci_mirrored_tables.md
@@ -0,0 +1,156 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# CI mirrored tables
+
+## Problem statement
+
+As part of the database [decomposition work](https://gitlab.com/groups/gitlab-org/-/epics/6168),
+which had the goal of splitting the single database GitLab uses into two databases, `main` and
+`ci`, came the big challenge of
+[removing all joins between the `main` and the `ci` tables](multiple_databases.md#removing-joins-between-ci-and-non-ci-tables).
+That is because PostgreSQL doesn't support joins between tables that belong to different databases.
+However, some core application models in the main database are queried very often by the CI side.
+For example:
+
+- `Namespace`, in the `namespaces` table.
+- `Project`, in the `projects` table.
+
+Not being able to do `joins` with these tables presents a significant challenge. The team chose to perform logical
+replication of those tables from the main database to the CI database, into the new tables:
+
+- `ci_namespace_mirrors`, as a mirror of the `namespaces` table
+- `ci_project_mirrors`, as a mirror of the `projects` table
+
+This logical replication means two things:
+
+1. The `main` database tables can be queried and joined to the `namespaces` and `projects` tables.
+1. The `ci` database tables can be joined with the `ci_namespace_mirrors` and `ci_project_mirrors` tables.
+
+```mermaid
+graph LR
+
+ subgraph "Main database (tables)"
+ A[namespaces] -->|updates| B[namespaces_sync_events]
+ A -->|deletes| C[loose_foreign_keys_deleted_records]
+ D[projects] -->|deletes| C
+ D -->|updates| E[projects_sync_events]
+ end
+
+ B --> F
+ C --> G
+ E --> H
+
+ subgraph "Sidekiq worker jobs"
+ F[Namespaces::ProcessSyncEventsWorker]
+ G[LooseForeignKeys::CleanupWorker]
+ H[Projects::ProcessSyncEventsWorker]
+ end
+
+ F -->|do update| I
+ G -->|delete records| I
+ G -->|delete records| J
+ H -->|do update| J
+
+ subgraph "CI database (tables)"
+ I[ci_namespace_mirrors]
+ J[ci_project_mirrors]
+ end
+```
+
+This replication was restricted only to a few attributes that are needed from each model:
+
+- From `Namespace` we replicate `traversal_ids`.
+- From `Project` we replicate only the `namespace_id`, which represents the group which the project belongs to.
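+
+As a result, queries on the `ci` database can join its own tables with these mirrors instead
+of the `main` tables. A hypothetical sketch (not actual GitLab code; `root_namespace_id` is a
+placeholder):
+
+```ruby
+# Builds for all projects under a top-level group, using only `ci` database tables.
+Ci::Build
+  .joins('INNER JOIN ci_project_mirrors ON ci_project_mirrors.project_id = ci_builds.project_id')
+  .joins('INNER JOIN ci_namespace_mirrors ON ci_namespace_mirrors.namespace_id = ci_project_mirrors.namespace_id')
+  .where('ci_namespace_mirrors.traversal_ids[1] = ?', root_namespace_id)
+```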
+
+## Keeping the CI mirrored tables in sync with the source tables
+
+We must care about three types of events to keep
+the source and the target tables in sync:
+
+1. Creation of new namespaces or projects.
+1. Updates to namespaces or projects.
+1. Deletion of namespaces or projects.
+
+```mermaid
+graph TD
+
+ subgraph "CI database (tables)"
+ E[other CI tables]
+ F{queries with joins allowed}
+ G[ci_project_mirrors]
+ H[ci_namespace_mirrors]
+
+ E---F
+ F---G
+ F---H
+ end
+
+ A---B
+ B---C
+ B---D
+
+L["⛔ ← Joins are not allowed → ⛔"]
+
+ subgraph "Main database (tables)"
+ A[other main tables]
+ B{queries with joins allowed}
+ C[projects]
+ D[namespaces]
+ end
+```
+
+### Create and update
+
+Syncing the data of newly created or updated namespaces or projects happens in this
+order:
+
+1. **On the `main` database**: Any `INSERT` or `UPDATE` on the `namespaces` or `projects` tables
+ adds an entry to the tables `namespaces_sync_events`, and `projects_sync_events`. These tables
+ also exist on the `main` database. These entries are added by triggers on both of the tables.
+1. **On the model level**: After a commit happens on either of the source models `Namespace` or
+ `Project`, it schedules the corresponding Sidekiq jobs `Namespaces::ProcessSyncEventsWorker`
+ or `Projects::ProcessSyncEventsWorker` to run.
+1. These workers then:
+ 1. Read the entries from the tables `(namespaces/project)_sync_events`
+ from the `main` database, to check which namespaces or projects to sync.
+ 1. Copy the data for any updated records into the target
+ tables `ci_namespace_mirrors`, `ci_project_mirrors`.
+
+### Delete
+
+When any of `namespaces` or `projects` are deleted, the target records on the mirrored
+CI tables are deleted using the [loose foreign keys](loose_foreign_keys.md) (LFK) mechanism.
+
+By having these items in the `config/gitlab_loose_foreign_keys.yml`, the LFK mechanism
+was already working as expected. It deleted any records on the CI mirrored
+tables that mapped to deleted `namespaces` or `projects` in the `main` database.
+
+```yaml
+ci_namespace_mirrors:
+ - table: namespaces
+ column: namespace_id
+ on_delete: async_delete
+ci_project_mirrors:
+ - table: projects
+ column: project_id
+ on_delete: async_delete
+```
+
+## Consistency Checking
+
+To make sure that both syncing mechanisms work as expected, we deploy
+two extra worker jobs, triggered by cron jobs every few minutes:
+
+1. `Database::CiNamespaceMirrorsConsistencyCheckWorker`
+1. `Database::CiProjectMirrorsConsistencyCheckWorker`
+
+These jobs:
+
+1. Scan both of the source tables on the `main` database, using a cursor.
+1. Compare the items in the `namespaces` and `projects` with the target tables on the `ci` database.
+1. Report the items that are not in sync to Kibana and Prometheus.
+1. Correct any discrepancies.
diff --git a/doc/development/database/client_side_connection_pool.md b/doc/development/database/client_side_connection_pool.md
index dc52a551407..3cd0e836a8d 100644
--- a/doc/development/database/client_side_connection_pool.md
+++ b/doc/development/database/client_side_connection_pool.md
@@ -10,8 +10,8 @@ Ruby processes accessing the database through
ActiveRecord, automatically calculate the connection-pool size for the
process based on the concurrency.
-Because of the way [Ruby on Rails manages database
-connections](#connection-lifecycle), it is important that we have at
+Because of the way [Ruby on Rails manages database connections](#connection-lifecycle),
+it is important that we have at
least as many connections as we have threads. While there is a 'pool'
setting in [`database.yml`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/config/database.yml.postgresql), it is not very practical because you need to
maintain it in tandem with the number of application threads. For this
@@ -28,9 +28,8 @@ because connections are instantiated lazily.
## Troubleshooting connection-pool issues
-The connection-pool usage can be seen per environment in the [connection-pool
-saturation
-dashboard](https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1).
+The connection-pool usage can be seen per environment in the
+[connection-pool saturation dashboard](https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1).
If the connection-pool is too small, this would manifest in
`ActiveRecord::ConnectionTimeoutError`s from the application. Because we alert
@@ -41,8 +40,8 @@ hardcoded value (10).
At this point, we need to investigate what is using more connections
than we anticipated. To do that, we can use the
-`gitlab_ruby_threads_running_threads` metric. For example, [this
-graph](https://thanos.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=0s&g0.expr=sum%20by%20(thread_name)%20(%20gitlab_ruby_threads_running_threads%7Buses_db_connection%3D%22yes%22%7D%20)&g0.tab=0)
+`gitlab_ruby_threads_running_threads` metric. For example,
+[this graph](https://thanos.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=0s&g0.expr=sum%20by%20(thread_name)%20(%20gitlab_ruby_threads_running_threads%7Buses_db_connection%3D%22yes%22%7D%20)&g0.tab=0)
shows all running threads that connect to the database by their
name. Threads labeled `puma worker` or `sidekiq_worker_thread` are
the threads that define `Gitlab::Runtime.max_threads` so those are
diff --git a/doc/development/database/creating_enums.md b/doc/development/database/creating_enums.md
new file mode 100644
index 00000000000..450cb97d978
--- /dev/null
+++ b/doc/development/database/creating_enums.md
@@ -0,0 +1,154 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Creating enums
+
+When creating a new enum, use the database type `SMALLINT`.
+The `SMALLINT` type size is 2 bytes, which is sufficient for an enum
+and helps to save space in the database.
+
+To use this type, add `limit: 2` to the migration that creates the column.
+
+Example:
+
+```ruby
+def change
+ add_column :ci_job_artifacts, :file_format, :integer, limit: 2
+end
+```
+
+## All of the key/value pairs should be defined in FOSS
+
+**Summary:** All enums need to be defined in FOSS if the model is also part of FOSS.
+
+```ruby
+class Model < ApplicationRecord
+ enum platform: {
+ aws: 0,
+ gcp: 1 # EE-only
+ }
+end
+```
+
+When you add a new key/value pair to an `enum` and it's EE-specific, you might be
+tempted to organize the `enum` as follows:
+
+```ruby
+# Define `failure_reason` enum in `Pipeline` model:
+class Pipeline < ApplicationRecord
+ enum failure_reason: Enums::Pipeline.failure_reasons
+end
+```
+
+```ruby
+# Define key/value pairs that are used in FOSS and EE:
+module Enums
+ module Pipeline
+ def self.failure_reasons
+ { unknown_failure: 0, config_error: 1 }
+ end
+ end
+end
+
+Enums::Pipeline.prepend_mod_with('Enums::Pipeline')
+```
+
+```ruby
+# Define key/value pairs that are used in EE only:
+module EE
+ module Enums
+ module Pipeline
+ override :failure_reasons
+ def failure_reasons
+ super.merge(activity_limit_exceeded: 2)
+ end
+ end
+ end
+end
+```
+
+This works as-is; however, it has a couple of downsides:
+
+- Someone could define a key/value pair in EE that **conflicts** with a value defined in FOSS.
+ For example, defining `activity_limit_exceeded: 1` in `EE::Enums::Pipeline`.
+- When that happens, the feature behaves differently from what was intended.
+ For example, we cannot tell whether `failure_reason` is `config_error` or `activity_limit_exceeded`.
+- When that happens, we have to ship a database migration to fix the data integrity,
+ which might be impossible if you cannot recover the original value.
+
+You might also consider working around this concern by setting an offset for EE's values.
+For example, this example sets `1000` as the offset:
+
+```ruby
+module EE
+ module Enums
+ module Pipeline
+ override :failure_reasons
+ def failure_reasons
+ super.merge(activity_limit_exceeded: 1_000, size_limit_exceeded: 1_001)
+ end
+ end
+ end
+end
+```
+
+This might look like a reasonable workaround; however, this approach has some downsides:
+
+- Features could move from EE to FOSS or vice versa. Therefore, the offset might be mixed between FOSS and EE in the future.
+ For example, when you move `activity_limit_exceeded` to FOSS, you see `{ unknown_failure: 0, config_error: 1, activity_limit_exceeded: 1_000 }`.
+- The integer column for the `enum` is likely created [as `SMALLINT`](#creating-enums).
+ Therefore, you need to be careful that the offset doesn't exceed the maximum value of a 2-byte integer.
+
+In conclusion, you should define all of the key/value pairs in FOSS.
+For example, in the above case you can simply write the following code:
+
+```ruby
+class Pipeline < ApplicationRecord
+ enum failure_reason: {
+ unknown_failure: 0,
+ config_error: 1,
+ activity_limit_exceeded: 2
+ }
+end
+```
+
+## Add new values in the gap
+
+After merging some EE and FOSS enums, there might be a gap between the two groups of values:
+
+```ruby
+module Enums
+ module Ci
+ module CommitStatus
+ def self.failure_reasons
+ {
+ # ...
+ data_integrity_failure: 12,
+ forward_deployment_failure: 13,
+ insufficient_bridge_permissions: 1_001,
+ downstream_bridge_project_not_found: 1_002,
+ # ...
+ }
+ end
+ end
+ end
+end
+```
+
+To add new values, you should fill the gap first.
+In the example above add `14` instead of `1_003`:
+
+```ruby
+{
+ # ...
+ data_integrity_failure: 12,
+ forward_deployment_failure: 13,
+ a_new_value: 14,
+ insufficient_bridge_permissions: 1_001,
+ downstream_bridge_project_not_found: 1_002,
+ # ...
+}
+```
diff --git a/doc/development/database/database_debugging.md b/doc/development/database/database_debugging.md
new file mode 100644
index 00000000000..5921dc942f2
--- /dev/null
+++ b/doc/development/database/database_debugging.md
@@ -0,0 +1,177 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Troubleshooting and Debugging Database
+
+This section is to help give some copy-pasta you can use as a reference when you
+run into some head-banging database problems.
+
+A first step is to search for your error in Slack, or search for `GitLab <my error>` with Google.
+
+Available `RAILS_ENV`:
+
+- `production` (generally not for your main GDK database, but you may need this for other installations such as Omnibus).
+- `development` (this is your main GDK db).
+- `test` (used for tests like RSpec).
+
+## Delete everything and start over
+
+If you just want to delete everything and start over with an empty DB (approximately 1 minute):
+
+```shell
+bundle exec rake db:reset RAILS_ENV=development
+```
+
+If you want to seed the empty DB with sample data (approximately 4 minutes):
+
+```shell
+bundle exec rake dev:setup
+```
+
+If you just want to delete everything and start over with sample data (approximately 4 minutes), run the following command. This
+also does `db:reset` and runs DB-specific migrations:
+
+```shell
+bundle exec rake db:setup RAILS_ENV=development
+```
+
+If your test DB is giving you problems, it is safe to delete everything because it doesn't contain important
+data:
+
+```shell
+bundle exec rake db:reset RAILS_ENV=test
+```
+
+## Migration wrangling
+
+- `bundle exec rake db:migrate RAILS_ENV=development`: Execute any pending migrations that you may have picked up from a MR
+- `bundle exec rake db:migrate:status RAILS_ENV=development`: Check if all migrations are `up` or `down`
+- `bundle exec rake db:migrate:down VERSION=20170926203418 RAILS_ENV=development`: Tear down a migration
+- `bundle exec rake db:migrate:up VERSION=20170926203418 RAILS_ENV=development`: Set up a migration
+- `bundle exec rake db:migrate:redo VERSION=20170926203418 RAILS_ENV=development`: Re-run a specific migration
+
+## Manually access the database
+
+Access the database via one of these commands (they all get you to the same place):
+
+```shell
+gdk psql -d gitlabhq_development
+bundle exec rails dbconsole -e development
+bundle exec rails db -e development
+```
+
+- `\q`: Quit/exit
+- `\dt`: List all tables
+- `\d+ issues`: List columns for `issues` table
+- `CREATE TABLE board_labels();`: Create a table called `board_labels`
+- `SELECT * FROM schema_migrations WHERE version = '20170926203418';`: Check if a migration was run
+- `DELETE FROM schema_migrations WHERE version = '20170926203418';`: Manually remove a migration
+
+## Access the database with a GUI
+
+Most GUIs (DataGrip, RubyMine, DBeaver) require a TCP connection to the database, but by default
+the database runs on a UNIX socket. To be able to access the database from these tools, some steps
+are needed:
+
+1. On the GDK root directory, run:
+
+ ```shell
+ gdk config set postgresql.host localhost
+ ```
+
+1. Open your `gdk.yml`, and confirm that it has the following lines:
+
+ ```yaml
+ postgresql:
+ host: localhost
+ ```
+
+1. Reconfigure GDK:
+
+ ```shell
+ gdk reconfigure
+ ```
+
+1. On your database GUI, select `localhost` as host, `5432` as port and `gitlabhq_development` as database.
+ Alternatively, you can use the connection string `postgresql://localhost:5432/gitlabhq_development`.
+
+The new connection should be working now.
+
+## Access the GDK database with Visual Studio Code
+
+Use these instructions for exploring the GitLab database while developing with the GDK:
+
+1. Install or open [Visual Studio Code](https://code.visualstudio.com/download).
+1. Install the [PostgreSQL VSCode Extension](https://marketplace.visualstudio.com/items?itemName=ckolkman.vscode-postgres).
+1. In Visual Studio Code select **PostgreSQL Explorer** in the left toolbar.
+1. In the top bar of the new window, select `+` to **Add Database Connection**, and follow the prompts to fill in the details:
+ 1. **Hostname**: the path to the PostgreSQL folder in your GDK directory (for example `/dev/gitlab-development-kit/postgresql`).
+ 1. **PostgreSQL user to authenticate as**: usually your local username, unless otherwise specified during PostgreSQL installation.
+ 1. **Password of the PostgreSQL user**: the password you set when installing PostgreSQL.
+ 1. **Port number to connect to**: `5432` (default).
+ 1. **Use an SSL connection?** This depends on your installation. Options are:
+ - **Use Secure Connection**
+ - **Standard Connection** (default)
+ 1. **Optional. The database to connect to**: `gitlabhq_development`.
+ 1. **The display name for the database connection**: `gitlabhq_development`.
+
+Your database connection should now be displayed in the PostgreSQL Explorer pane and
+you can explore the `gitlabhq_development` database. If you cannot connect, ensure
+that GDK is running. For further instructions on how to use the PostgreSQL Explorer
+Extension for Visual Studio Code, read the [usage section](https://marketplace.visualstudio.com/items?itemName=ckolkman.vscode-postgres#usage)
+of the extension documentation.
+
+## FAQ
+
+### `ActiveRecord::PendingMigrationError` with Spring
+
+When running specs with the [Spring pre-loader](../rake_tasks.md#speed-up-tests-rake-tasks-and-migrations),
+the test database can get into a corrupted state. Trying to run the migration or
+dropping/resetting the test database has no effect.
+
+```shell
+$ bundle exec spring rspec some_spec.rb
+...
+Failure/Error: ActiveRecord::Migration.maintain_test_schema!
+
+ActiveRecord::PendingMigrationError:
+
+
+ Migrations are pending. To resolve this issue, run:
+
+ bin/rake db:migrate RAILS_ENV=test
+# ~/.rvm/gems/ruby-2.3.3/gems/activerecord-4.2.10/lib/active_record/migration.rb:392:in `check_pending!'
+...
+0 examples, 0 failures, 1 error occurred outside of examples
+```
+
+To resolve this, kill the Spring server and app processes that persist between spec runs.
+
+```shell
+$ ps aux | grep spring
+eric 87304 1.3 2.9 3080836 482596 ?? Ss 10:12AM 4:08.36 spring app | gitlab | started 6 hours ago | test mode
+eric 37709 0.0 0.0 2518640 7524 s006 S Wed11AM 0:00.79 spring server | gitlab | started 29 hours ago
+$ kill 87304
+$ kill 37709
+```
+
+### db:migrate `database version is too old to be migrated` error
+
+Users receive this error when `db:migrate` detects that the current schema version
+is older than the `MIN_SCHEMA_VERSION` defined in the `Gitlab::Database` library
+module.
+
+Over time we clean up or combine old migrations in the codebase, so it is not always
+possible to migrate GitLab from every previous version.
+
+In some cases you may want to bypass this check. For example, if your schema was on a
+version later than the `MIN_SCHEMA_VERSION` and you then rolled back to an older
+migration from before that version. In this case, to migrate forward again,
+set the `SKIP_SCHEMA_VERSION_CHECK` environment variable:
+
+```shell
+bundle exec rake db:migrate SKIP_SCHEMA_VERSION_CHECK=true
+```
diff --git a/doc/development/database/database_dictionary.md b/doc/development/database/database_dictionary.md
new file mode 100644
index 00000000000..c330c5e67bd
--- /dev/null
+++ b/doc/development/database/database_dictionary.md
@@ -0,0 +1,51 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Database Dictionary
+
+This page documents the database schema for GitLab, so data analysts and other groups can
+locate the feature categories responsible for specific database tables.
+
+## Location
+
+Database dictionary metadata files are stored in the `gitlab` project under `db/docs/`.
+
+## Example dictionary file
+
+```yaml
+---
+table_name: terraform_states
+classes:
+- Terraform::State
+feature_categories:
+- infrastructure_as_code
+description: Represents a Terraform state backend
+introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/26619
+milestone: '13.0'
+```
+
+## Schema
+
+| Attribute | Type | Required | Description |
+|----------------------|---------------|----------|--------------------------------------------------------------------------|
+| `table_name` | String | yes | Database table name |
+| `classes` | Array(String) | no | List of classes that respond to `.table_name` with the `table_name` |
+| `feature_categories` | Array(String) | yes | List of feature categories using this table |
+| `description`        | String        | no       | Text description of the information stored in the table and its purpose   |
+| `introduced_by_url` | URL | no | URL to the merge request or commit which introduced this table |
+| `milestone` | String | no | The milestone that introduced this table |
+
+## Adding tables
+
+When adding a new table, create a new file under `db/docs/` named
+`<table_name>.yml` containing as much information as you know about the table.
+
+Include this file in the commit with the migration that creates the table.
+
+## Dropping tables
+
+When dropping a table, you must remove the metadata file from `db/docs/`
+in the same commit with the migration that drops the table.
diff --git a/doc/development/database/database_lab.md b/doc/development/database/database_lab.md
index 5346df2690d..1d584a4ec6f 100644
--- a/doc/development/database/database_lab.md
+++ b/doc/development/database/database_lab.md
@@ -95,7 +95,7 @@ To connect to a clone using `psql`:
1. In the **Clone details** page of the Postgres.ai web interface, copy and run
the command to start SSH port forwarding for the clone.
1. In the **Clone details** page of the Postgres.ai web interface, copy and run the `psql` connection string.
- Use the password provided at setup.
+ Use the password provided at setup and set the `dbname` to `gitlabhq_dblab` (or check what databases are available by using `psql -l` with the same query string but `dbname=postgres`).
After you connect, use clone like you would any `psql` console in production, but with
the added benefit and safety of an isolated writeable environment.
diff --git a/doc/development/database/database_query_comments.md b/doc/development/database/database_query_comments.md
new file mode 100644
index 00000000000..2798071bc06
--- /dev/null
+++ b/doc/development/database/database_query_comments.md
@@ -0,0 +1,62 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Database query comments with Marginalia
+
+The [Marginalia gem](https://github.com/basecamp/marginalia) is used to add
+query comments containing application-related context information to PostgreSQL
+queries generated by ActiveRecord.
+
+It is very useful for tracing problematic queries back to the application source.
+
+During an on-call incident, an engineer has the full context of a query
+and its application source directly from these comments.
+
+## Metadata information in comments
+
+Queries generated from **Rails** include the following metadata in comments:
+
+- `application`
+- `correlation_id`
+- `endpoint_id`
+- `line`
+
+Queries generated from **Sidekiq** workers include the following metadata
+in comments:
+
+- `application`
+- `jid`
+- `correlation_id`
+- `endpoint_id`
+- `line`
+
+`endpoint_id` is a single field that can represent any endpoint in the application:
+
+- For Rails controllers, it's the controller and action. For example, `Projects::BlobController#show`.
+- For Grape API endpoints, it's the route. For example, `/api/:version/users/:id`.
+- For Sidekiq workers, it's the worker class name. For example, `UserStatusCleanup::BatchWorker`.
+
+`line` is not present in production logs due to the additional overhead required.
+
+Examples of queries with comments:
+
+- Rails:
+
+ ```sql
+ /*application:web,controller:blob,action:show,correlation_id:01EZVMR923313VV44ZJDJ7PMEZ,endpoint_id:Projects::BlobController#show*/ SELECT "routes".* FROM "routes" WHERE "routes"."source_id" = 75 AND "routes"."source_type" = 'Namespace' LIMIT 1
+ ```
+
+- Grape:
+
+ ```sql
+ /*application:web,correlation_id:01EZVN0DAYGJF5XHG9N4VX8FAH,endpoint_id:/api/:version/users/:id*/ SELECT COUNT(*) FROM "users" INNER JOIN "user_follow_users" ON "users"."id" = "user_follow_users"."followee_id" WHERE "user_follow_users"."follower_id" = 1
+ ```
+
+- Sidekiq:
+
+ ```sql
+ /*application:sidekiq,correlation_id:df643992563683313bc0a0288fb55e23,jid:15fbc506590c625d7664b074,endpoint_id:UserStatusCleanup::BatchWorker,line:/app/workers/user_status_cleanup/batch_worker.rb:19:in `perform'*/ SELECT $1 AS one FROM "user_statuses" WHERE "user_statuses"."clear_status_at" <= $2 LIMIT $3
+ ```
diff --git a/doc/development/database/database_reviewer_guidelines.md b/doc/development/database/database_reviewer_guidelines.md
index b6bbfe690c1..a85f399a447 100644
--- a/doc/development/database/database_reviewer_guidelines.md
+++ b/doc/development/database/database_reviewer_guidelines.md
@@ -36,7 +36,7 @@ projects:
Create the merge request [using the "Database reviewer" template](https://gitlab.com/gitlab-com/www-gitlab-com/-/blob/master/.gitlab/merge_request_templates/Database%20reviewer.md),
adding your expertise your profile YAML file. Assign to a database maintainer or the
-[Database Team's Engineering Manager](https://about.gitlab.com/handbook/engineering/development/enablement/database/).
+[Database Team's Engineering Manager](https://about.gitlab.com/handbook/engineering/development/enablement/data_stores/database/).
After the `team.yml` update is merged, the [Reviewer roulette](../code_review.md#reviewer-roulette)
may recommend you as a database reviewer.
@@ -53,17 +53,17 @@ that require a more in-depth discussion between the database reviewers and maint
- [Database Office Hours Agenda](https://docs.google.com/document/d/1wgfmVL30F8SdMg-9yY6Y8djPSxWNvKmhR5XmsvYX1EI/edit).
- <i class="fa fa-youtube-play youtube" aria-hidden="true"></i> [YouTube playlist with past recordings](https://www.youtube.com/playlist?list=PL05JrBw4t0Kp-kqXeiF7fF7cFYaKtdqXM).
-You should also join the [#database-lab](../understanding_explain_plans.md#database-lab-engine)
+You should also join the [#database-lab](understanding_explain_plans.md#database-lab-engine)
Slack channel and get familiar with how to use Joe, the Slackbot that provides developers
with their own clone of the production database.
Understanding and efficiently using `EXPLAIN` plans is at the core of the database review process.
The following guides provide a quick introduction and links to follow on more advanced topics:
-- Guide on [understanding EXPLAIN plans](../understanding_explain_plans.md).
+- Guide on [understanding EXPLAIN plans](understanding_explain_plans.md).
- [Explaining the unexplainable series in `depesz`](https://www.depesz.com/tag/unexplainable/).
-We also have licensed access to The Art of PostgreSQL available, if you are interested in getting access please check out the
+We also have licensed access to The Art of PostgreSQL. If you are interested in getting access, check out the
[issue (confidential)](https://gitlab.com/gitlab-org/database-team/team-tasks/-/issues/23).
Finally, you can find various guides in the [Database guides](index.md) page that cover more specific
@@ -95,7 +95,7 @@ are three times as likely to be picked by the [Danger bot](../dangerbot.md) as o
## What to do if you feel overwhelmed
Similar to all types of reviews, [unblocking others is always a top priority](https://about.gitlab.com/handbook/values/#global-optimization).
-Database reviewers are expected to [review assigned merge requests in a timely manner](../code_review.md#review-turnaround-time)
+Database reviewers are expected to [review assigned merge requests in a timely manner](https://about.gitlab.com/handbook/engineering/workflow/code-review/#review-turnaround-time)
or let the author know as soon as possible and help them find another reviewer or maintainer.
We are doing reviews to help the rest of the GitLab team and, at the same time, get exposed
diff --git a/doc/development/database/db_dump.md b/doc/development/database/db_dump.md
new file mode 100644
index 00000000000..f2076cbc410
--- /dev/null
+++ b/doc/development/database/db_dump.md
@@ -0,0 +1,56 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Importing a database dump into a staging environment
+
+Sometimes it is useful to import the database from a production environment
+into a staging environment for testing. The procedure below assumes you have
+SSH and `sudo` access to both the production environment and the staging VM.
+
+**Destroy your staging VM** when you are done with it. It is important to avoid
+data leaks.
+
+On the staging VM, add the following setting to `/etc/gitlab/gitlab.rb` to speed up
+large database imports, then reconfigure GitLab and stop Puma and Sidekiq:
+
+```shell
+# On STAGING
+echo "postgresql['checkpoint_segments'] = 64" | sudo tee -a /etc/gitlab/gitlab.rb
+sudo touch /etc/gitlab/skip-auto-reconfigure
+sudo gitlab-ctl reconfigure
+sudo gitlab-ctl stop puma
+sudo gitlab-ctl stop sidekiq
+```
+
+Next, we let the production environment stream a compressed SQL dump to our
+local machine via SSH, and redirect this stream to a `psql` client on the staging
+VM.
+
+```shell
+# On LOCAL MACHINE
+ssh -C gitlab.example.com sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_dump -Cc gitlabhq_production |\
+ ssh -C staging-vm sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -d template1
+```
+
+## Recreating directory structure
+
+If you need to re-create some directory structure on the staging server you can
+use this procedure.
+
+First, on the production server, create a list of directories you want to
+re-create.
+
+```shell
+# On PRODUCTION
+(umask 077; sudo find /var/opt/gitlab/git-data/repositories -maxdepth 1 -type d -print0 > directories.txt)
+```
+
+Copy `directories.txt` to the staging server and create the directories there.
+
+```shell
+# On STAGING
+sudo -u git xargs -0 mkdir -p < directories.txt
+```
diff --git a/doc/development/database/filtering_by_label.md b/doc/development/database/filtering_by_label.md
new file mode 100644
index 00000000000..29b0c75298e
--- /dev/null
+++ b/doc/development/database/filtering_by_label.md
@@ -0,0 +1,179 @@
+---
+stage: Plan
+group: Project Management
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+# Filtering by label
+
+## Introduction
+
+GitLab has [labels](../../user/project/labels.md) that can be assigned to issues,
+merge requests, and epics. Labels on those objects are a many-to-many relation
+through the polymorphic `label_links` table.
+
+To filter these objects by multiple labels - for instance, 'all open
+issues with the label ~Plan and the label ~backend' - we generate a
+query containing a `GROUP BY` clause. In a simple form, this looks like:
+
+```sql
+SELECT
+ issues.*
+FROM
+ issues
+ INNER JOIN label_links ON label_links.target_id = issues.id
+ AND label_links.target_type = 'Issue'
+ INNER JOIN labels ON labels.id = label_links.label_id
+WHERE
+ issues.project_id = 13083
+ AND (issues.state IN ('opened'))
+ AND labels.title IN ('Plan',
+ 'backend')
+GROUP BY
+ issues.id
+HAVING (COUNT(DISTINCT labels.title) = 2)
+ORDER BY
+ issues.updated_at DESC,
+ issues.id DESC
+LIMIT 20 OFFSET 0
+```
+
+In particular, note that:
+
+1. We `GROUP BY issues.id` so that we can ...
+1. Use the `HAVING (COUNT(DISTINCT labels.title) = 2)` condition to ensure that
+ all matched issues have both labels.
+
+This is more complicated than is ideal. It makes the query construction more
+prone to errors (such as
+[issue #15557](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/15557)).
+
+## Attempt A: `WHERE EXISTS`
+
+### Attempt A1: use multiple subqueries with `WHERE EXISTS`
+
+In [issue #37137](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/37137)
+and its associated [merge request](https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/14022),
+we tried to replace the `GROUP BY` with multiple uses of `WHERE EXISTS`. For the
+example above, this would give:
+
+```sql
+WHERE (EXISTS (
+ SELECT
+ TRUE
+ FROM
+ label_links
+ INNER JOIN labels ON labels.id = label_links.label_id
+ WHERE
+ labels.title = 'Plan'
+ AND target_type = 'Issue'
+ AND target_id = issues.id))
+AND (EXISTS (
+ SELECT
+ TRUE
+ FROM
+ label_links
+ INNER JOIN labels ON labels.id = label_links.label_id
+ WHERE
+ labels.title = 'backend'
+ AND target_type = 'Issue'
+ AND target_id = issues.id))
+```
+
+While this worked without schema changes, and did improve readability somewhat,
+it did not improve query performance.
+
+### Attempt A2: use label IDs in the `WHERE EXISTS` clause
+
+In [merge request #34503](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/34503), we followed a similar approach to A1. But this time, we
+did a separate query to fetch the IDs of the labels used in the filter so that we avoid the `JOIN` in the `EXISTS` clause and filter directly by
+`label_links.label_id`. We also added a new index on `label_links` for the `target_id`, `label_id`, and `target_type` columns to speed up this query.
+
+Finding the label IDs wasn't straightforward because there could be multiple labels with the same title within a single root namespace. We solved
+this by grouping the label IDs by title and then using the array of IDs in the `EXISTS` clauses.
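+
+Below is a minimal sketch of the A2 approach for a single project, written with plain
+ActiveRecord. The way the label IDs are grouped by title here is illustrative only and
+does not mirror the exact GitLab implementation:
+
+```ruby
+# Group the IDs of the requested labels by title (project-scoped example).
+label_ids_by_title = Label
+  .where(project_id: project.id, title: %w[Plan backend])
+  .group(:title)
+  .pluck(:title, Arel.sql('ARRAY_AGG(id)'))
+  .to_h
+
+issues = Issue.where(project_id: project.id, state: 'opened')
+
+# One EXISTS clause per label title, filtering label_links directly by label_id.
+label_ids_by_title.each_value do |label_ids|
+  issues = issues.where(
+    'EXISTS (SELECT 1 FROM label_links WHERE label_links.target_type = ? ' \
+    'AND label_links.target_id = issues.id AND label_links.label_id IN (?))',
+    'Issue', label_ids
+  )
+end
+```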
+
+This resulted in a significant performance improvement. However, this optimization could not be applied to the dashboard pages
+where we do not have a project or group context. We could not easily search for the label IDs here because that would mean searching across all
+projects and groups that the user has access to.
+
+## Attempt B: Denormalize using an array column
+
+Having [removed MySQL support in GitLab 12.1](https://about.gitlab.com/blog/2019/06/27/removing-mysql-support/),
+using [PostgreSQL's arrays](https://www.postgresql.org/docs/11/arrays.html) became more
+tractable as we didn't have to support two databases. We discussed denormalizing
+the `label_links` table for querying in
+[issue #49651](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/49651),
+with two options: label IDs and titles.
+
+We can think of both of those as array columns on `issues`, `merge_requests`,
+and `epics`: `issues.label_ids` would be an array column of label IDs, and
+`issues.label_titles` would be an array of label titles.
+
+These array columns can be complemented with
+[GIN indexes](https://www.postgresql.org/docs/11/gin-intro.html) to improve
+matching.
+
+### Attempt B1: store label IDs for each object
+
+This has some strong advantages over titles:
+
+1. Unless a label is deleted, or a project is moved, we never need to
+ bulk-update the denormalized column.
+1. It uses less storage than the titles.
+
+Unfortunately, our application design makes this hard. If we were able to query
+just by label ID easily, we wouldn't need the `INNER JOIN labels` in the initial
+query at the start of this document. GitLab allows users to filter by label
+title across projects and even across groups, so a filter by the label ~Plan may
+include labels with multiple distinct IDs.
+
+We do not want users to have to know about the different IDs, which means that
+given this data set:
+
+| Project | ~Plan label ID | ~backend label ID |
+| ------- | -------------- | ----------------- |
+| A | 11 | 12 |
+| B | 21 | 22 |
+| C | 31 | 32 |
+
+We would need something like:
+
+```sql
+WHERE
+ label_ids @> ARRAY[11, 12]
+ OR label_ids @> ARRAY[21, 22]
+ OR label_ids @> ARRAY[31, 32]
+```
+
+This can get even more complicated when we consider that in some cases, there
+might be two ~backend labels - with different IDs - that could apply to the same
+object, so the number of combinations would balloon further.
+
+### Attempt B2: store label titles for each object
+
+From the perspective of updating the object, this is the worst
+option. We have to bulk update the objects when:
+
+1. The objects are moved from one project to another.
+1. The project is moved from one group to another.
+1. The label is renamed.
+1. The label is deleted.
+
+It also uses much more storage. Querying is simple, though:
+
+```sql
+WHERE
+ label_titles @> ARRAY['Plan', 'backend']
+```
+
+And our [tests in issue #49651](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/49651#note_188777346)
+showed that this could be fast.
+
+However, at present, the disadvantages outweigh the advantages.
+
+## Conclusion
+
+We found that method A2 does not need denormalization and improves the query performance significantly. This
+did not apply to all cases, but we were able to apply method A1 to the rest of the cases so that we could remove the
+`GROUP BY` and `HAVING` clauses in all scenarios.
+
+This simplified the query and improved the performance in the most common cases.
diff --git a/doc/development/database/foreign_keys.md b/doc/development/database/foreign_keys.md
new file mode 100644
index 00000000000..7834e7d53c3
--- /dev/null
+++ b/doc/development/database/foreign_keys.md
@@ -0,0 +1,199 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Foreign Keys & Associations
+
+When adding an association to a model you must also add a foreign key. For
+example, say you have the following model:
+
+```ruby
+class User < ActiveRecord::Base
+ has_many :posts
+end
+```
+
+Add a foreign key here on the `posts.user_id` column. This ensures
+that data consistency is enforced at the database level. Foreign keys also mean that
+the database can very quickly remove associated data (for example, when removing a
+user), instead of Rails having to do this.
+
+## Adding Foreign Keys In Migrations
+
+Foreign keys can be added concurrently using `add_concurrent_foreign_key` as
+defined in `Gitlab::Database::MigrationHelpers`. See the
+[Migration Style Guide](../migration_style_guide.md) for more information.
+
+Keep in mind that you can only safely add foreign keys to existing tables after
+you have removed any orphaned rows. The method `add_concurrent_foreign_key`
+does not take care of this so you must do so manually. See
+[adding foreign key constraint to an existing column](add_foreign_key_to_existing_column.md).
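+
+A minimal sketch of such a migration, assuming any orphaned rows have already been removed
+(`posts.user_id` is only an example column):
+
+```ruby
+class AddForeignKeyToPostsUserId < Gitlab::Database::Migration[2.0]
+  disable_ddl_transaction!
+
+  def up
+    add_concurrent_foreign_key :posts, :users, column: :user_id, on_delete: :cascade
+  end
+
+  def down
+    with_lock_retries do
+      remove_foreign_key_if_exists :posts, column: :user_id
+    end
+  end
+end
+```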
+
+## Updating Foreign Keys In Migrations
+
+Sometimes a foreign key constraint must be changed, preserving the column
+but updating the constraint condition. For example, moving from
+`ON DELETE CASCADE` to `ON DELETE SET NULL` or vice-versa.
+
+PostgreSQL does not prevent you from adding overlapping foreign keys. It
+honors the most recently added constraint. This allows us to replace foreign keys without
+ever losing foreign key protection on a column.
+
+To replace a foreign key:
+
+1. [Add the new foreign key without validation](add_foreign_key_to_existing_column.md#prevent-invalid-records)
+
+ The name of the foreign key constraint must be changed to add a new
+ foreign key before removing the old one.
+
+ ```ruby
+ class ReplaceFkOnPackagesPackagesProjectId < Gitlab::Database::Migration[2.0]
+ disable_ddl_transaction!
+
+ NEW_CONSTRAINT_NAME = 'fk_new'
+
+ def up
+ add_concurrent_foreign_key(:packages_packages, :projects, column: :project_id, on_delete: :nullify, validate: false, name: NEW_CONSTRAINT_NAME)
+ end
+
+ def down
+ with_lock_retries do
+ remove_foreign_key_if_exists(:packages_packages, column: :project_id, on_delete: :nullify, name: NEW_CONSTRAINT_NAME)
+ end
+ end
+ end
+ ```
+
+1. [Validate the new foreign key](add_foreign_key_to_existing_column.md#validate-the-foreign-key)
+
+ ```ruby
+ class ValidateFkNew < Gitlab::Database::Migration[2.0]
+ NEW_CONSTRAINT_NAME = 'fk_new'
+
+ # foreign key added in <link to MR or path to migration adding new FK>
+ def up
+ validate_foreign_key(:packages_packages, name: NEW_CONSTRAINT_NAME)
+ end
+
+ def down
+ # no-op
+ end
+ end
+ ```
+
+1. Remove the old foreign key:
+
+ ```ruby
+ class RemoveFkOld < Gitlab::Database::Migration[2.0]
+ OLD_CONSTRAINT_NAME = 'fk_old'
+
+ # new foreign key added in <link to MR or path to migration adding new FK>
+ # and validated in <link to MR or path to migration validating new FK>
+ def up
+ remove_foreign_key_if_exists(:packages_packages, column: :project_id, on_delete: :cascade, name: OLD_CONSTRAINT_NAME)
+ end
+
+ def down
+ # Validation is skipped here, so if rolled back, this will need to be revalidated in a separate migration
+ add_concurrent_foreign_key(:packages_packages, :projects, column: :project_id, on_delete: :cascade, validate: false, name: OLD_CONSTRAINT_NAME)
+ end
+ end
+ ```
+
+## Cascading Deletes
+
+Every foreign key must define an `ON DELETE` clause, and in 99% of the cases
+this should be set to `CASCADE`.
+
+## Indexes
+
+When adding a foreign key in PostgreSQL, the column is not indexed automatically,
+thus you must also add a concurrent index. Not doing so results in cascading
+deletes being very slow.
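+
+For example, a sketch of a migration that adds the supporting index concurrently
+(`posts.user_id` is again only an example column):
+
+```ruby
+class AddIndexToPostsUserId < Gitlab::Database::Migration[2.0]
+  disable_ddl_transaction!
+
+  INDEX_NAME = 'index_posts_on_user_id'
+
+  def up
+    add_concurrent_index :posts, :user_id, name: INDEX_NAME
+  end
+
+  def down
+    remove_concurrent_index_by_name :posts, INDEX_NAME
+  end
+end
+```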
+
+## Naming foreign keys
+
+By default Ruby on Rails uses the `_id` suffix for foreign keys. So we should
+only use this suffix for associations between two tables. If you want to
+reference an ID on a third-party platform, the `_xid` suffix is recommended.
+
+The spec `spec/db/schema_spec.rb` tests if all columns with the `_id` suffix
+have a foreign key constraint. So if that spec fails, don't add the column to
+`IGNORED_FK_COLUMNS`, but instead add the FK constraint, or consider naming it
+differently.
+
+## Dependent Removals
+
+Don't define options such as `dependent: :destroy` or `dependent: :delete` when
+defining an association. Defining these options means Rails handles the
+removal of data, instead of letting the database handle this in the most
+efficient way possible.
+
+In other words, this is bad and should be avoided at all costs:
+
+```ruby
+class User < ActiveRecord::Base
+ has_many :posts, dependent: :destroy
+end
+```
+
+Should you truly have a need for this it should be approved by a database
+specialist first.
+
+You should also not define any `before_destroy` or `after_destroy` callbacks on
+your models _unless_ absolutely required and only when approved by database
+specialists. For example, if each row in a table has a corresponding file on a
+file system, it may be tempting to add an `after_destroy` hook. This, however,
+introduces non-database logic into a model, and means we can no longer rely on
+foreign keys to remove the data, as this would result in the file system data
+being left behind. In such a case you should use a service class instead that
+takes care of removing the non-database data.
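+
+As a rough sketch, such a service class could look like the following. The class name and
+the `file_path` attribute are hypothetical; this is not an existing GitLab service:
+
+```ruby
+require 'fileutils'
+
+module Attachments
+  # Hypothetical service: removes the on-disk file, then the database row.
+  class DestroyService
+    def initialize(attachment)
+      @attachment = attachment
+    end
+
+    def execute
+      # Non-database cleanup happens here, not in a model callback.
+      FileUtils.rm_f(@attachment.file_path)
+
+      # Any dependent rows are still removed by the database via foreign keys.
+      @attachment.destroy!
+    end
+  end
+end
+```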
+
+In cases where the relation spans multiple databases you have even
+further problems using `dependent: :destroy` or the above hooks. You can
+read more about alternatives at
+[Avoid `dependent: :nullify` and `dependent: :destroy` across databases](multiple_databases.md#avoid-dependent-nullify-and-dependent-destroy-across-databases).
+
+## Alternative primary keys with `has_one` associations
+
+Sometimes a `has_one` association is used to create a one-to-one relationship:
+
+```ruby
+class User < ActiveRecord::Base
+ has_one :user_config
+end
+
+class UserConfig < ActiveRecord::Base
+ belongs_to :user
+end
+```
+
+In these cases, there may be an opportunity to remove the unnecessary `id`
+column on the associated table, `user_config.id` in this example. Instead,
+the originating table ID can be used as the primary key for the associated
+table:
+
+```ruby
+create_table :user_configs, id: false do |t|
+ t.references :users, primary_key: true, default: nil, index: false, foreign_key: { on_delete: :cascade }
+ ...
+end
+```
+
+Setting `default: nil` ensures a primary key sequence is not created, and since the primary key
+automatically gets an index, we set `index: false` to avoid creating a duplicate.
+You also need to add the new primary key to the model:
+
+```ruby
+class UserConfig < ActiveRecord::Base
+ self.primary_key = :user_id
+
+ belongs_to :user
+end
+```
+
+Using a foreign key as primary key saves space but can make
+[batch counting](../service_ping/implement.md#batch-counters) in [Service Ping](../service_ping/index.md) less efficient.
+Consider using a regular `id` column if the table is relevant for Service Ping.
diff --git a/doc/development/database/hash_indexes.md b/doc/development/database/hash_indexes.md
new file mode 100644
index 00000000000..731639b6f06
--- /dev/null
+++ b/doc/development/database/hash_indexes.md
@@ -0,0 +1,26 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Hash Indexes
+
+PostgreSQL supports hash indexes besides the regular B-tree
+indexes. Hash indexes, however, are to be avoided at all costs. While they may
+_sometimes_ provide better performance, the cost of rehashing can be very high.
+More importantly, at least until PostgreSQL 10.0, hash indexes are not
+WAL-logged, meaning they are not replicated to any replicas. From the PostgreSQL
+documentation:
+
+> Hash index operations are not presently WAL-logged, so hash indexes might need
+> to be rebuilt with REINDEX after a database crash if there were unwritten
+> changes. Also, changes to hash indexes are not replicated over streaming or
+> file-based replication after the initial base backup, so they give wrong
+> answers to queries that subsequently use them. For these reasons, hash index
+> use is presently discouraged.
+
+RuboCop is configured to register an offense when it detects the use of a hash
+index.
+
+Instead of using hash indexes you should use regular B-tree indexes.
diff --git a/doc/development/database/index.md b/doc/development/database/index.md
index b427f54ff3c..8cf9a2eec04 100644
--- a/doc/development/database/index.md
+++ b/doc/development/database/index.md
@@ -16,52 +16,62 @@ info: To determine the technical writer assigned to the Stage/Group associated w
## Tooling
-- [Understanding EXPLAIN plans](../understanding_explain_plans.md)
+- [Understanding EXPLAIN plans](understanding_explain_plans.md)
- [explain.depesz.com](https://explain.depesz.com/) or [explain.dalibo.com](https://explain.dalibo.com/) for visualizing the output of `EXPLAIN`
- [pgFormatter](https://sqlformat.darold.net/) a PostgreSQL SQL syntax beautifier
- [db:check-migrations job](dbcheck-migrations-job.md)
## Migrations
+- [Different types of migrations](../migration_style_guide.md#choose-an-appropriate-migration-type)
+- [Create a regular migration](../migration_style_guide.md#create-a-regular-schema-migration), including creating new models
+- [Post-deployment migrations guidelines](post_deployment_migrations.md) and [how to create one](post_deployment_migrations.md#creating-migrations)
+- [Background migrations guidelines](background_migrations.md)
+- [Batched background migrations guidelines](batched_background_migrations.md)
+- [Deleting migrations](deleting_migrations.md)
+- [Running database migrations](database_debugging.md#migration-wrangling)
- [Migrations for multiple databases](migrations_for_multiple_databases.md)
- [Avoiding downtime in migrations](avoiding_downtime_in_migrations.md)
-- [SQL guidelines](../sql.md) for working with SQL queries
+- [When and how to write Rails migrations tests](../testing_guide/testing_migrations_guide.md)
- [Migrations style guide](../migration_style_guide.md) for creating safe SQL migrations
- [Testing Rails migrations](../testing_guide/testing_migrations_guide.md) guide
- [Post deployment migrations](post_deployment_migrations.md)
- [Background migrations](background_migrations.md)
-- [Swapping tables](../swapping_tables.md)
+- [Swapping tables](swapping_tables.md)
- [Deleting migrations](deleting_migrations.md)
+- [SQL guidelines](../sql.md) for working with SQL queries
- [Partitioning tables](table_partitioning.md)
## Debugging
-- Tracing the source of an SQL query using query comments with [Marginalia](../database_query_comments.md)
+- [Resetting the database](database_debugging.md#delete-everything-and-start-over)
+- [Accessing the database](database_debugging.md#manually-access-the-database)
+- [Troubleshooting and debugging the database](database_debugging.md)
+- Tracing the source of an SQL query using query comments with [Marginalia](database_query_comments.md)
- Tracing the source of an SQL query in Rails console using [Verbose Query Logs](https://guides.rubyonrails.org/debugging_rails_applications.html#verbose-query-logs)
## Best practices
-- [Adding database indexes](../adding_database_indexes.md)
-- [Foreign keys & associations](../foreign_keys.md)
+- [Adding database indexes](adding_database_indexes.md)
+- [Foreign keys & associations](foreign_keys.md)
- [Adding a foreign key constraint to an existing column](add_foreign_key_to_existing_column.md)
- [`NOT NULL` constraints](not_null_constraints.md)
- [Strings and the Text data type](strings_and_the_text_data_type.md)
-- [Single table inheritance](../single_table_inheritance.md)
-- [Polymorphic associations](../polymorphic_associations.md)
-- [Serializing data](../serializing_data.md)
-- [Hash indexes](../hash_indexes.md)
-- [Storing SHA1 hashes as binary](../sha1_as_binary.md)
-- [Iterating tables in batches](../iterating_tables_in_batches.md)
-- [Insert into tables in batches](../insert_into_tables_in_batches.md)
-- [Ordering table columns](../ordering_table_columns.md)
-- [Verifying database capabilities](../verifying_database_capabilities.md)
-- [Database Debugging and Troubleshooting](../database_debugging.md)
-- [Query Count Limits](../query_count_limits.md)
-- [Creating enums](../creating_enums.md)
+- [Single table inheritance](single_table_inheritance.md)
+- [Polymorphic associations](polymorphic_associations.md)
+- [Serializing data](serializing_data.md)
+- [Hash indexes](hash_indexes.md)
+- [Storing SHA1 hashes as binary](sha1_as_binary.md)
+- [Iterating tables in batches](iterating_tables_in_batches.md)
+- [Insert into tables in batches](insert_into_tables_in_batches.md)
+- [Ordering table columns](ordering_table_columns.md)
+- [Verifying database capabilities](verifying_database_capabilities.md)
+- [Query Count Limits](query_count_limits.md)
+- [Creating enums](creating_enums.md)
- [Client-side connection-pool](client_side_connection_pool.md)
- [Updating multiple values](setting_multiple_values.md)
- [Constraints naming conventions](constraint_naming_convention.md)
-- [Query performance guidelines](../query_performance.md)
+- [Query performance guidelines](query_performance.md)
- [Pagination guidelines](pagination_guidelines.md)
- [Pagination performance guidelines](pagination_performance_guidelines.md)
- [Efficient `IN` operator queries](efficient_in_operator_queries.md)
@@ -69,8 +79,8 @@ info: To determine the technical writer assigned to the Stage/Group associated w
## Case studies
-- [Database case study: Filtering by label](../filtering_by_label.md)
-- [Database case study: Namespaces storage statistics](../namespaces_storage_statistics.md)
+- [Database case study: Filtering by label](filtering_by_label.md)
+- [Database case study: Namespaces storage statistics](namespaces_storage_statistics.md)
## Miscellaneous
diff --git a/doc/development/database/insert_into_tables_in_batches.md b/doc/development/database/insert_into_tables_in_batches.md
new file mode 100644
index 00000000000..ebed3d16319
--- /dev/null
+++ b/doc/development/database/insert_into_tables_in_batches.md
@@ -0,0 +1,196 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+description: "Sometimes it is necessary to store large amounts of records at once, which can be inefficient
+when iterating collections and performing individual `save`s. With the arrival of `insert_all`
+in Rails 6, which operates at the row level (that is, using `Hash`es), GitLab has added a set
+of APIs that make it safe and simple to insert ActiveRecord objects in bulk."
+---
+
+# Insert into tables in batches
+
+Sometimes it is necessary to store large amounts of records at once, which can be inefficient
+when iterating collections and saving each record individually. With the arrival of
+[`insert_all`](https://apidock.com/rails/ActiveRecord/Persistence/ClassMethods/insert_all)
+in Rails 6, which operates at the row level (that is, using `Hash` objects), GitLab has added a set
+of APIs that make it safe and simple to insert `ActiveRecord` objects in bulk.
+
+## Prepare `ApplicationRecord`s for bulk insertion
+
+In order for a model class to take advantage of the bulk insertion API, it has to include the
+`BulkInsertSafe` concern first:
+
+```ruby
+class MyModel < ApplicationRecord
+ # other includes here
+ # ...
+ include BulkInsertSafe # include this last
+
+ # ...
+end
+```
+
+The `BulkInsertSafe` concern has two functions:
+
+- It performs checks against your model class to ensure that it does not use ActiveRecord
+ APIs that are not safe to use with respect to bulk insertions (more on that below).
+- It adds new class methods `bulk_insert!` and `bulk_upsert!`, which you can use to insert many records at once.
+
+## Insert records with `bulk_insert!` and `bulk_upsert!`
+
+If the target class passes the checks performed by `BulkInsertSafe`, you can insert an array of
+ActiveRecord model objects as follows:
+
+```ruby
+records = [MyModel.new, ...]
+
+MyModel.bulk_insert!(records)
+```
+
+Calls to `bulk_insert!` always attempt to insert _new records_. If instead
+you would like to replace existing records with new values, while still inserting those
+that do not already exist, then you can use `bulk_upsert!`:
+
+```ruby
+records = [MyModel.new, existing_model, ...]
+
+MyModel.bulk_upsert!(records, unique_by: [:name])
+```
+
+In this example, `unique_by` specifies the columns by which records are considered to be
+unique and as such are updated if they existed prior to insertion. For example, if
+`existing_model` has a `name` attribute, and if a record with the same `name` value already
+exists, its fields are updated with those of `existing_model`.
+
+The `unique_by` parameter can also be passed as a `Symbol`, in which case it specifies
+a database index by which a column is considered unique:
+
+```ruby
+MyModel.bulk_insert!(records, unique_by: :index_on_name)
+```
+
+### Record validation
+
+The `bulk_insert!` method guarantees that `records` are inserted transactionally, and
+runs validations on each record prior to insertion. If any record fails to validate,
+an error is raised and the transaction is rolled back. You can turn off validations via
+the `:validate` option:
+
+```ruby
+MyModel.bulk_insert!(records, validate: false)
+```
+
+### Batch size configuration
+
+In those cases where the number of `records` is above a given threshold, insertions
+occur in multiple batches. The default batch size is defined in
+[`BulkInsertSafe::DEFAULT_BATCH_SIZE`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/concerns/bulk_insert_safe.rb).
+Assuming a default threshold of 500, inserting 950 records
+would result in two batches being written sequentially (of size 500 and 450, respectively).
+You can override the default batch size via the `:batch_size` option:
+
+```ruby
+MyModel.bulk_insert!(records, batch_size: 100)
+```
+
+Assuming the same number of 950 records, this would result in 10 batches being written instead.
+Since this also affects the number of `INSERT` statements that occur, make sure you measure the
+performance impact this might have on your code. There is a trade-off between the number of
+`INSERT` statements the database has to process and the size and cost of each `INSERT`.
+
+### Handling duplicate records
+
+NOTE:
+This parameter applies only to `bulk_insert!`. If you intend to update existing
+records, use `bulk_upsert!` instead.
+
+It may happen that some records you are trying to insert already exist, which would result in
+primary key conflicts. There are two ways to address this problem: failing fast by raising an
+error or skipping duplicate records. The default behavior of `bulk_insert!` is to fail fast
+and raise an `ActiveRecord::RecordNotUnique` error.
+
+If this is undesirable, you can instead skip duplicate records with the `skip_duplicates` flag:
+
+```ruby
+MyModel.bulk_insert!(records, skip_duplicates: true)
+```
+
+### Requirements for safe bulk insertions
+
+Large parts of ActiveRecord's persistence API are built around the notion of callbacks. Many
+of these callbacks fire in response to model life cycle events such as `save` or `create`.
+These callbacks cannot be used with bulk insertions, since they are meant to be called for
+every instance that is saved or created. Since these events do not fire when
+records are inserted in bulk, we currently prevent their use.
+
+The specifics around which callbacks are explicitly allowed are defined in
+[`BulkInsertSafe`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/concerns/bulk_insert_safe.rb).
+Consult the module source code for details. If your class uses callbacks that are not explicitly designated
+safe and you `include BulkInsertSafe`, the application fails with an error.
+
+### `BulkInsertSafe` versus `InsertAll`
+
+Internally, `BulkInsertSafe` is based on `InsertAll`, and you may wonder when to choose
+the former over the latter. To help you make the decision,
+the key differences between these classes are listed in the table below.
+
+| | Input type | Validates input | Specify batch size | Can bypass callbacks | Transactional |
+|--------------- | -------------------- | --------------- | ------------------ | --------------------------------- | ------------- |
+| `bulk_insert!` | ActiveRecord objects | Yes (optional) | Yes (optional) | No (prevents unsafe callback use) | Yes |
+| `insert_all!` | Attribute hashes | No | No | Yes | Yes |
+
+To summarize, `BulkInsertSafe` moves bulk inserts closer to how ActiveRecord objects
+and inserts would normally behave. However, if all you need is to insert raw data in bulk, then
+`insert_all` is more efficient.
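+
+As a short, hypothetical illustration of the difference, assuming `MyModel` has a `name`
+column:
+
+```ruby
+# insert_all! takes raw attribute hashes: no validations, callbacks are bypassed.
+MyModel.insert_all!([{ name: 'foo' }, { name: 'bar' }])
+
+# bulk_insert! takes ActiveRecord objects: validations run, unsafe callbacks are prevented.
+MyModel.bulk_insert!([MyModel.new(name: 'foo'), MyModel.new(name: 'bar')])
+```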
+
+## Insert `has_many` associations in bulk
+
+A common use case is to save collections of associated relations through the owner side of the relation,
+where the owned relation is associated to the owner through the `has_many` class method:
+
+```ruby
+owner = OwnerModel.new(owned_relations: array_of_owned_relations)
+# saves all `owned_relations` one-by-one
+owner.save!
+```
+
+This issues a single `INSERT` statement and transaction for every record in `owned_relations`, which is inefficient if
+`array_of_owned_relations` is large. To remedy this, the `BulkInsertableAssociations` concern can be
+used to declare that the owner defines associations that are safe for bulk insertion:
+
+```ruby
+class OwnerModel < ApplicationRecord
+ # other includes here
+ # ...
+ include BulkInsertableAssociations # include this last
+
+ has_many :my_models
+end
+```
+
+Here `my_models` must be declared `BulkInsertSafe` (as described previously) for bulk insertions
+to happen. You can now insert any yet unsaved records as follows:
+
+```ruby
+BulkInsertableAssociations.with_bulk_insert do
+ owner = OwnerModel.new(my_models: array_of_my_model_instances)
+ # saves `my_models` using a single bulk insert (possibly via multiple batches)
+ owner.save!
+end
+```
+
+You can still save relations that are not `BulkInsertSafe` in this block; they
+simply are treated as if you had invoked `save` from outside the block.
+
+## Known limitations
+
+There are a few restrictions to how these APIs can be used:
+
+- `BulkInsertableAssociations`:
+ - It is currently only compatible with `has_many` relations.
+ - It does not yet support `has_many through: ...` relations.
+
+Moreover, input data should either be limited to around 1000 records at most,
+or already batched prior to calling bulk insert. The `INSERT` statement runs in a single
+transaction, so for large amounts of records it may negatively affect database stability.
diff --git a/doc/development/database/iterating_tables_in_batches.md b/doc/development/database/iterating_tables_in_batches.md
new file mode 100644
index 00000000000..6d7a57ecacb
--- /dev/null
+++ b/doc/development/database/iterating_tables_in_batches.md
@@ -0,0 +1,598 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Iterating tables in batches
+
+Rails provides a method called `in_batches` that can be used to iterate over
+rows in batches. For example:
+
+```ruby
+User.in_batches(of: 10) do |relation|
+ relation.update_all(updated_at: Time.now)
+end
+```
+
+Unfortunately, this method is implemented in a way that is not very efficient,
+both in terms of query and memory usage.
+
+To work around this you can include the `EachBatch` module into your models,
+then use the `each_batch` class method. For example:
+
+```ruby
+class User < ActiveRecord::Base
+ include EachBatch
+end
+
+User.each_batch(of: 10) do |relation|
+ relation.update_all(updated_at: Time.now)
+end
+```
+
+This produces queries such as:
+
+```plaintext
+User Load (0.7ms) SELECT "users"."id" FROM "users" WHERE ("users"."id" >= 41654) ORDER BY "users"."id" ASC LIMIT 1 OFFSET 1000
+ (0.7ms) SELECT COUNT(*) FROM "users" WHERE ("users"."id" >= 41654) AND ("users"."id" < 42687)
+```
+
+The API of this method is similar to `in_batches`, though it doesn't support
+all of the arguments that `in_batches` supports. You should always use
+`each_batch` _unless_ you have a specific need for `in_batches`.
+
+## Iterating over non-unique columns
+
+Proceed with extra caution when you iterate over an attribute that is not unique:
+even with the applied max batch size, there is no guarantee that the resulting batches do not
+surpass it. The following snippet demonstrates this situation: when we attempt to select
+`Ci::Build` entries for users with `id` between `1` and `10,000`, the database returns
+`1,215,178` matching rows.
+
+```ruby
+[ gstg ] production> Ci::Build.where(user_id: (1..10_000)).size
+=> 1215178
+```
+
+This happens because the built relation is translated into the following query:
+
+```ruby
+[ gstg ] production> puts Ci::Build.where(user_id: (1..10_000)).to_sql
+SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."user_id" BETWEEN 1 AND 10000
+=> nil
+```
+
+The query filters a non-unique column by a range (`WHERE "ci_builds"."user_id" BETWEEN ? AND ?`).
+Even though the range size is limited to a certain threshold (`10,000` in the previous example), this
+threshold does not translate to the size of the returned dataset. That happens because when taking
+`n` possible values of an attribute, we can't tell for sure that the number of records containing
+them is at most `n`.
+
+### Loose-index scan with `distinct_each_batch`
+
+When iterating over a non-unique column is necessary, use the `distinct_each_batch` helper
+method. The helper uses the [loose-index scan technique](https://wiki.postgresql.org/wiki/Loose_indexscan)
+(skip-index scan) to skip duplicated values within a database index.
+
+Example: iterating over distinct `author_id` in the Issue model
+
+```ruby
+Issue.distinct_each_batch(column: :author_id, of: 1000) do |relation|
+ users = User.where(id: relation.select(:author_id)).to_a
+end
+```
+
+The technique provides stable performance between the batches regardless of the data distribution.
+The `relation` object returns an ActiveRecord scope where only the given `column` is available.
+Other columns are not loaded.
+
+The underlying database queries use recursive CTEs, which adds extra overhead. We therefore advise using
+smaller batch sizes than those used for a standard `each_batch` iteration.
+
+## Column definition
+
+`EachBatch` uses the primary key of the model by default for the iteration. This works in most
+cases. However, in some cases, you might want to use a different column for the iteration.
+
+```ruby
+Project.distinct.each_batch(column: :creator_id, of: 10) do |relation|
+ puts User.where(id: relation.select(:creator_id)).map(&:id)
+end
+```
+
+The query above iterates over the project creators and prints them out without duplications.
+
+NOTE:
+In case the column is not unique (no unique index definition), calling the `distinct` method on
+the relation is necessary. Using a non-unique column without `distinct` may result in `each_batch`
+falling into an endless loop, as described in the following
+[issue](https://gitlab.com/gitlab-org/gitlab/-/issues/285097).
+
+## `EachBatch` in data migrations
+
+When dealing with data migrations the preferred way to iterate over a large volume of data is using
+`EachBatch`.
+
+A special case of data migration is a [background migration](background_migrations.md#scheduling)
+where the actual data modification is executed in a background job. The migration code that
+determines the data ranges (slices) and schedules the background jobs uses `each_batch`.
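+
+A rough sketch of this scheduling pattern is shown below. The nested model, the batch size,
+and the `schedule_background_job` helper are placeholders, not GitLab's actual scheduling
+helpers:
+
+```ruby
+class ScheduleSomeBackgroundMigration < Gitlab::Database::Migration[2.0]
+  BATCH_SIZE = 1_000
+
+  # Isolated model so the migration does not depend on application code.
+  class Issue < ActiveRecord::Base
+    include EachBatch
+
+    self.table_name = 'issues'
+  end
+
+  def up
+    Issue.each_batch(of: BATCH_SIZE) do |relation|
+      min_id, max_id = relation.pluck(Arel.sql('MIN(id), MAX(id)')).first
+
+      # Placeholder: hand the [min_id, max_id] slice to a background job.
+      schedule_background_job(min_id, max_id)
+    end
+  end
+
+  def down
+    # no-op
+  end
+end
+```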
+
+## Efficient usage of `each_batch`
+
+`EachBatch` helps to iterate over large tables. It's important to highlight that `EachBatch`
+does not magically solve all iteration-related performance problems, and it might not help at
+all in some scenarios. From the database point of view, correctly configured database indexes are
+also necessary to make `EachBatch` perform well.
+
+### Example 1: Simple iteration
+
+Let's consider that we want to iterate over the `users` table and print the `User` records to the
+standard output. The `users` table contains millions of records, thus running one query to fetch
+the users likely times out.
+
+![Users table overview](../img/each_batch_users_table_v13_7.png)
+
+This is a simplified version of the `users` table which contains several rows. We have a few
+smaller gaps in the `id` column to make the example a bit more realistic (a few records were
+already deleted). Currently, we have one index on the `id` field.
+
+Loading all users into memory (avoid):
+
+```ruby
+users = User.all
+
+users.each { |user| puts user.inspect }
+```
+
+Use `each_batch`:
+
+```ruby
+# Note: for this example I picked 5 as the batch size, the default is 1_000
+User.each_batch(of: 5) do |relation|
+ relation.each { |user| puts user.inspect }
+end
+```
+
+#### How `each_batch` works
+
+As the first step, it finds the lowest `id` (start `id`) in the table by executing the following
+database query:
+
+```sql
+SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT 1
+```
+
+![Reading the start ID value](../img/each_batch_users_table_iteration_1_v13_7.png)
+
+Notice that the query only reads data from the index (`INDEX ONLY SCAN`); the table is not
+accessed. Database indexes are sorted, so taking out the first item is a very cheap operation.
+
+The next step is to find the next `id` (end `id`) which should respect the batch size
+configuration. In this example we used a batch size of 5. `EachBatch` uses the `OFFSET` clause
+to get a "shifted" `id` value.
+
+```sql
+SELECT "users"."id" FROM "users" WHERE "users"."id" >= 1 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 5
+```
+
+![Reading the end ID value](../img/each_batch_users_table_iteration_2_v13_7.png)
+
+Again, the query only looks into the index. The `OFFSET 5` takes out the sixth `id` value: this
+query reads a maximum of six items from the index regardless of the table size or the iteration
+count.
+
+At this point, we know the `id` range for the first batch. Now it's time to construct the query
+for the `relation` block.
+
+```sql
+SELECT "users".* FROM "users" WHERE "users"."id" >= 1 AND "users"."id" < 302
+```
+
+![Reading the rows from the `users` table](../img/each_batch_users_table_iteration_3_v13_7.png)
+
+Notice the `<` sign. Previously six items were read from the index and in this query, the last
+value is "excluded". The query looks at the index to get the location of the five `user`
+rows on the disk and reads the rows from the table. The returned array is processed in Ruby.
+
+The first iteration is done. For the next iteration, the last `id` value is reused from the
+previous iteration in order to find out the next end `id` value.
+
+```sql
+SELECT "users"."id" FROM "users" WHERE "users"."id" >= 302 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 5
+```
+
+![Reading the second end ID value](../img/each_batch_users_table_iteration_4_v13_7.png)
+
+Now we can easily construct the `users` query for the second iteration.
+
+```sql
+SELECT "users".* FROM "users" WHERE "users"."id" >= 302 AND "users"."id" < 353
+```
+
+![Reading the rows for the second iteration from the users table](../img/each_batch_users_table_iteration_5_v13_7.png)
+
+### Example 2: Iteration with filters
+
+Building on top of the previous example, we want to print users with zero sign-in count. We keep
+track of the number of sign-ins in the `sign_in_count` column so we write the following code:
+
+```ruby
+users = User.where(sign_in_count: 0)
+
+users.each_batch(of: 5) do |relation|
+ relation.each { |user| puts user.inspect }
+end
+```
+
+`each_batch` produces the following SQL query for the start `id` value:
+
+```sql
+SELECT "users"."id" FROM "users" WHERE "users"."sign_in_count" = 0 ORDER BY "users"."id" ASC LIMIT 1
+```
+
+Selecting only the `id` column and ordering by `id` forces the database to use the
+index on the `id` column (the primary key index). However, we also have an extra condition on the
+`sign_in_count` column. The column is not part of the index, so the database needs to look into
+the actual table to find the first matching row.
+
+![Reading the index with extra filter](../img/each_batch_users_table_filter_v13_7.png)
+
+NOTE:
+The number of scanned rows depends on the data distribution in the table.
+
+- Best case scenario: the first user was never logged in. The database reads only one row.
+- Worst case scenario: all users were logged in at least once. The database reads all rows.
+
+In this particular example, the database had to read 10 rows (regardless of our batch size setting)
+to determine the first `id` value. In a "real-world" application it's hard to predict whether the
+filtering causes problems or not. In the case of GitLab, verifying the data on a
+production replica is a good start, but keep in mind that data distribution on GitLab.com can be
+different from self-managed instances.
+
+#### Improve filtering with `each_batch`
+
+##### Specialized conditional index
+
+```sql
+CREATE INDEX index_on_users_never_logged_in ON users (id) WHERE sign_in_count = 0
+```
+
+This is how our table and the newly created index look:
+
+![Reading the specialized index](../img/each_batch_users_table_filtered_index_v13_7.png)
+
+This index definition covers the conditions on the `id` and `sign_in_count` columns, and thus makes the
+`each_batch` queries very effective (similar to the simple iteration example).
+
+It's rare for a user to have never signed in, so we anticipate a small index size. Including only the
+`id` in the index definition also helps to keep the index size small.
+
+##### Index on columns
+
+Later on, we might want to iterate over the table filtering for different `sign_in_count` values. In
+those cases, we cannot use the previously suggested conditional index because the `WHERE` condition
+does not match our new filter (`sign_in_count > 10`).
+
+To address this problem, we have two options:
+
+- Create another, conditional index to cover the new query.
+- Replace the index with a more generalized configuration.
+
+NOTE:
+Having multiple indexes on the same table and on the same columns could be a performance bottleneck
+when writing data.
+
+Let's consider the following index (avoid):
+
+```sql
+CREATE INDEX index_on_users_never_logged_in ON users (id, sign_in_count)
+```
+
+The index definition starts with the `id` column, which makes the index very inefficient from a data
+selectivity point of view.
+
+```sql
+SELECT "users"."id" FROM "users" WHERE "users"."sign_in_count" = 0 ORDER BY "users"."id" ASC LIMIT 1
+```
+
+Executing the query above results in an `INDEX ONLY SCAN`. However, the query still needs to
+iterate over an unknown number of entries in the index, and then find the first item where the
+`sign_in_count` is `0`.
+
+![Reading an ineffective index](../img/each_batch_users_table_bad_index_v13_7.png)
+
+We can improve the query significantly by swapping the columns in the index definition (prefer).
+
+```sql
+CREATE INDEX index_on_users_never_logged_in ON users (sign_in_count, id)
+```
+
+![Reading a good index](../img/each_batch_users_table_good_index_v13_7.png)
+
+The following index definition does not work well with `each_batch` (avoid).
+
+```sql
+CREATE INDEX index_on_users_never_logged_in ON users (sign_in_count)
+```
+
+Since `each_batch` builds range queries based on the `id` column, this index cannot be used
+efficiently. The DB reads the rows from the table or uses a bitmap search where the primary
+key index is also read.
+
+##### "Slow" iteration
+
+Slow iteration means that we use a good index configuration to iterate over the table and
+apply filtering on the yielded relation.
+
+```ruby
+User.each_batch(of: 5) do |relation|
+  relation.where(sign_in_count: 0).each { |user| puts user.inspect }
+end
+```
+
+The iteration uses the primary key index (on the `id` column) which makes it safe from statement
+timeouts. The filter (`sign_in_count: 0`) is applied on the `relation` where the `id` is already
+constrained (range). The number of rows is limited.
+
+Slow iteration generally takes more time to finish. The iteration count is higher and
+one iteration could yield fewer records than the batch size. Iterations may even yield
+0 records. This is not an optimal solution; however, in some cases (especially when
+dealing with large tables) this is the only viable option.
+
+### Using Subqueries
+
+Using subqueries in your `each_batch` query does not work well in most cases. Consider the following example:
+
+```ruby
+projects = Project.where(creator_id: Issue.where(confidential: true).select(:author_id))
+
+projects.each_batch do |relation|
+ # do something
+end
+```
+
+The iteration uses the `id` column of the `projects` table. The batching does not affect the
+subquery. This means for each iteration, the subquery is executed by the database. This adds a
+constant "load" on the query which often ends up in statement timeouts. We have an unknown number
+of [confidential issues](../../user/project/issues/confidential_issues.md), the execution time
+and the accessed database rows depend on the data distribution in the `issues` table.
+
+NOTE:
+Using subqueries works only when the subquery returns a small number of rows.
+
+#### Improving Subqueries
+
+When dealing with subqueries, a slow iteration approach could work: the filter on `creator_id`
+can be part of the generated `relation` object.
+
+```ruby
+projects = Project.all
+
+projects.each_batch do |relation|
+ relation.where(creator_id: Issue.where(confidential: true).select(:author_id))
+end
+```
+
+If the query on the `issues` table itself is not performant enough, a nested loop could be
+constructed. Try to avoid it when possible.
+
+```ruby
+projects = Project.all
+
+projects.each_batch do |relation|
+  issues = Issue.where(confidential: true)
+
+  issues.each_batch do |issues_relation|
+    relation.where(creator_id: issues_relation.select(:author_id))
+  end
+end
+```
+
+If we know that the `issues` table has many more rows than `projects`, it would make sense to flip
+the queries, where the `issues` table is batched first.
+
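+A rough sketch of the flipped iteration (batch sizes and scopes are illustrative, not prescriptive):
+
+```ruby
+# The larger issues table drives the outer loop; each batch of confidential
+# issues constrains the creator_id filter applied to the projects batches.
+issues = Issue.where(confidential: true)
+projects = Project.all
+
+issues.each_batch(of: 1000) do |issues_relation|
+  projects.each_batch do |projects_relation|
+    projects_relation.where(creator_id: issues_relation.select(:author_id))
+  end
+end
+```
+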
+### Using `JOIN` and `EXISTS`
+
+When to use a `JOIN`:
+
+- When there's a 1:1 or 1:N relationship between the tables where we know that the joined record
+(almost) always exists. This works well for "extension-like" tables:
+ - `projects` - `project_settings`
+ - `users` - `user_details`
+ - `users` - `user_statuses`
+- `LEFT JOIN` works well in this case. Conditions on the joined table need to go to the yielded
+relation so the iteration is not affected by the data distribution in the joined table.
+
+Example:
+
+```ruby
+users = User.joins("LEFT JOIN personal_access_tokens on personal_access_tokens.user_id = users.id")
+
+users.each_batch do |relation|
+ relation.where("personal_access_tokens.name = 'name'")
+end
+```
+
+`EXISTS` queries should be added only to the inner `relation` of the `each_batch` query:
+
+```ruby
+User.each_batch do |relation|
+ relation.where("EXISTS (SELECT 1 FROM ...")
+end
+```
+
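+A more complete, hypothetical version of the same pattern (the `personal_access_tokens` filter is
+only an example):
+
+```ruby
+User.each_batch(of: 1000) do |relation|
+  # The EXISTS subquery is evaluated only against the users in the current batch.
+  relation.where("EXISTS (SELECT 1 FROM personal_access_tokens WHERE personal_access_tokens.user_id = users.id)")
+end
+```
+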
+### Complex queries on the relation object
+
+When the `relation` object has several extra conditions, the execution plans might become
+"unstable".
+
+Example:
+
+```ruby
+Issue.each_batch do |relation|
+  relation
+    .joins(:metrics)
+    .joins(:merge_requests_closing_issues)
+    .where("id IN (SELECT ...)")
+    .where(confidential: true)
+end
+```
+
+Here, we expect that the `relation` query reads `BATCH_SIZE` issue records and then
+filters down the results according to the provided queries. The planner might decide that
+using a bitmap index lookup with the index on the `confidential` column is a better way to
+execute the query. This can cause an unexpectedly high amount of rows to be read and the
+query could time out.
+
+Problem: we know for sure that the relation returns at most `BATCH_SIZE` records;
+however, the planner does not know this.
+
+A common table expression (CTE) trick can force the range query to execute first:
+
+```ruby
+Issue.each_batch(of: 1000) do |relation|
+  cte = Gitlab::SQL::CTE.new(:batched_relation, relation.limit(1000))
+
+  scope = cte
+    .apply_to(Issue.all)
+    .joins(:metrics)
+    .joins(:merge_requests_closing_issues)
+    .where("id IN (SELECT ...)")
+    .where(confidential: true)
+
+  puts scope.to_a
+end
+```
+
+### `EachBatch` vs `BatchCount`
+
+When adding new counters for Service Ping, the preferred way to count records is using the
+`Gitlab::Database::BatchCount` class. The iteration logic implemented in `BatchCount`
+has performance characteristics similar to `EachBatch`. Most of the tips and suggestions
+for improving `EachBatch` mentioned above apply to `BatchCount` as well.
+
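+A minimal sketch of counting with `BatchCount` (assuming the `batch_count` and
+`batch_distinct_count` helpers; the scopes below are only examples):
+
+```ruby
+# Counts users who never signed in, in batches, without one long-running COUNT(*).
+Gitlab::Database::BatchCount.batch_count(User.where(sign_in_count: 0))
+
+# Counts distinct authors across all issues.
+Gitlab::Database::BatchCount.batch_distinct_count(Issue.all, :author_id)
+```
+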
+## Iterate with keyset pagination
+
+There are a few special cases where iterating with `EachBatch` does not work. `EachBatch`
+requires one distinct column (usually the primary key), which makes the iteration impossible
+for timestamp columns and tables with composite primary keys.
+
+Where `EachBatch` does not work, you can use
+[keyset pagination](pagination_guidelines.md#keyset-pagination) to iterate over the
+table or a range of rows. The scaling and performance characteristics are very similar to
+`EachBatch`.
+
+Examples:
+
+- Iterate over the table in a specific order (timestamp columns) in combination with a tie-breaker
+  if the column used to sort by does not contain unique values.
+- Iterate over the table with composite primary keys.
+
+### Iterate over the issues in a project by creation date
+
+You can use keyset pagination to iterate over any database column in a specific order (for example,
+`created_at DESC`). To ensure consistent order of the returned records with the same values for
+`created_at`, use a tie-breaker column with unique values (for example, `id`).
+
+Assume you have the following index in the `issues` table:
+
+```sql
+"idx_issues_on_project_id_and_created_at_and_id" btree (project_id, created_at, id)
+```
+
+### Fetching records for further processing
+
+The following snippet iterates over issue records within the project using the specified order
+(`created_at, id`).
+
+```ruby
+scope = Issue.where(project_id: 278964).order(:created_at, :id) # id is the tie-breaker
+
+iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)
+
+iterator.each_batch(of: 100) do |records|
+ puts records.map(&:id)
+end
+```
+
+You can add extra filters to the query. This example only lists the issue IDs created in the last
+30 days:
+
+```ruby
+scope = Issue.where(project_id: 278964).where('created_at > ?', 30.days.ago).order(:created_at, :id) # id is the tie-breaker
+
+iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)
+
+iterator.each_batch(of: 100) do |records|
+ puts records.map(&:id)
+end
+```
+
+### Updating records in the batch
+
+For complex `ActiveRecord` queries, the `.update_all` method does not work well, because it
+generates an incorrect `UPDATE` statement.
+You can use raw SQL for updating records in batches:
+
+```ruby
+scope = Issue.where(project_id: 278964).order(:created_at, :id) # id is the tie-breaker
+
+iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)
+
+iterator.each_batch(of: 100) do |records|
+ ApplicationRecord.connection.execute("UPDATE issues SET updated_at=NOW() WHERE issues.id in (#{records.dup.reselect(:id).to_sql})")
+end
+```
+
+NOTE:
+To keep the iteration stable and predictable, avoid updating the columns in the `ORDER BY` clause.
+
+### Iterate over the `merge_request_diff_commits` table
+
+The `merge_request_diff_commits` table uses a composite primary key (`merge_request_diff_id,
+relative_order`), which makes `EachBatch` impossible to use efficiently.
+
+To paginate over the `merge_request_diff_commits` table, you can use the following snippet:
+
+```ruby
+# Custom order object configuration:
+order = Gitlab::Pagination::Keyset::Order.build([
+  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
+    attribute_name: 'merge_request_diff_id',
+    order_expression: MergeRequestDiffCommit.arel_table[:merge_request_diff_id].asc,
+    nullable: :not_nullable,
+    distinct: false,
+  ),
+  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
+    attribute_name: 'relative_order',
+    order_expression: MergeRequestDiffCommit.arel_table[:relative_order].asc,
+    nullable: :not_nullable,
+    distinct: false,
+  )
+])
+MergeRequestDiffCommit.include(FromUnion) # keyset pagination generates UNION queries
+
+scope = MergeRequestDiffCommit.order(order)
+
+iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)
+
+iterator.each_batch(of: 100) do |records|
+ puts records.map { |record| [record.merge_request_diff_id, record.relative_order] }.inspect
+end
+```
+
+### Order object configuration
+
+Keyset pagination works well with simple `ActiveRecord` `order` scopes
+([first example](#iterate-over-the-issues-in-a-project-by-creation-date)).
+However, in special cases, you need to describe the columns in the `ORDER BY` clause (second example)
+for the underlying keyset pagination library. When the `ORDER BY` configuration cannot be
+automatically determined by the keyset pagination library, an error is raised.
+
+The code comments of the
+[`Gitlab::Pagination::Keyset::Order`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pagination/keyset/order.rb)
+and [`Gitlab::Pagination::Keyset::ColumnOrderDefinition`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pagination/keyset/column_order_definition.rb)
+classes give an overview of the possible options for configuring the `ORDER BY` clause. You can
+also find a few code examples in the
+[keyset pagination](keyset_pagination.md#complex-order-configuration) documentation.
diff --git a/doc/development/database/loose_foreign_keys.md b/doc/development/database/loose_foreign_keys.md
index 6aa1b9c40ff..8dbccf048d7 100644
--- a/doc/development/database/loose_foreign_keys.md
+++ b/doc/development/database/loose_foreign_keys.md
@@ -10,7 +10,7 @@ info: To determine the technical writer assigned to the Stage/Group associated w
In relational databases (including PostgreSQL), foreign keys provide a way to link
two database tables together, and ensure data-consistency between them. In GitLab,
-[foreign keys](../foreign_keys.md) are vital part of the database design process.
+[foreign keys](foreign_keys.md) are vital part of the database design process.
Most of our database tables have foreign keys.
With the ongoing database [decomposition work](https://gitlab.com/groups/gitlab-org/-/epics/6168),
@@ -221,8 +221,8 @@ ON DELETE CASCADE;
```
The migration must run after the `DELETE` trigger is installed and the loose
-foreign key definition is deployed. As such, it must be a [post-deployment
-migration](post_deployment_migrations.md) dated after the migration for the
+foreign key definition is deployed. As such, it must be a
+[post-deployment migration](post_deployment_migrations.md) dated after the migration for the
trigger. If the foreign key is deleted earlier, there is a good chance of
introducing data inconsistency which needs manual cleanup:
@@ -251,8 +251,8 @@ When the loose foreign key definition is no longer needed (parent table is remov
we need to remove the definition from the YAML file and ensure that we don't leave pending deleted
records in the database.
-1. Remove the loose foreign key definition from the configuration (`config/gitlab_loose_foreign_keys.yml`).
1. Remove the deletion tracking trigger from the parent table (if the parent table is still there).
+1. Remove the loose foreign key definition from the configuration (`config/gitlab_loose_foreign_keys.yml`).
1. Remove leftover deleted records from the `loose_foreign_keys_deleted_records` table.
Migration for removing the trigger:
@@ -395,8 +395,7 @@ We considered using these Rails features as an alternative to foreign keys but t
For non-trivial objects that need to clean up data outside the
database (for example, object storage) where you might wish to use `dependent: :destroy`,
see alternatives in
-[Avoid `dependent: :nullify` and `dependent: :destroy` across
-databases](./multiple_databases.md#avoid-dependent-nullify-and-dependent-destroy-across-databases).
+[Avoid `dependent: :nullify` and `dependent: :destroy` across databases](multiple_databases.md#avoid-dependent-nullify-and-dependent-destroy-across-databases).
## Risks of loose foreign keys and possible mitigations
diff --git a/doc/development/database/multiple_databases.md b/doc/development/database/multiple_databases.md
index 9641ea37002..31fc454f8a7 100644
--- a/doc/development/database/multiple_databases.md
+++ b/doc/development/database/multiple_databases.md
@@ -1,15 +1,14 @@
---
stage: Data Stores
-group: Sharding
+group: Pods
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
# Multiple Databases
To allow GitLab to scale further we
-[decomposed the GitLab application database into multiple
-databases](https://gitlab.com/groups/gitlab-org/-/epics/6168). The two databases
-are `main` and `ci`. GitLab supports being run with either one database or two databases.
+[decomposed the GitLab application database into multiple databases](https://gitlab.com/groups/gitlab-org/-/epics/6168).
+The two databases are `main` and `ci`. GitLab supports being run with either one database or two databases.
On GitLab.com we are using two separate databases.
## GitLab Schema
@@ -246,7 +245,7 @@ where projects_with_ci_feature_usage.ci_feature = 'code_coverage'
```
The above example uses as a text column for simplicity but we should probably
-use an [enum](../creating_enums.md) to save space.
+use an [enum](creating_enums.md) to save space.
The downside of this new design is that this may need to be
updated (removed if the `ci_daily_build_group_report_results` is deleted).
diff --git a/doc/development/database/namespaces_storage_statistics.md b/doc/development/database/namespaces_storage_statistics.md
new file mode 100644
index 00000000000..702129b9ddb
--- /dev/null
+++ b/doc/development/database/namespaces_storage_statistics.md
@@ -0,0 +1,193 @@
+---
+stage: none
+group: unassigned
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Database case study: Namespaces storage statistics
+
+## Introduction
+
+On [Storage and limits management for groups](https://gitlab.com/groups/gitlab-org/-/epics/886),
+we want to facilitate a method for easily viewing the amount of
+storage consumed by a group, and allow easy management.
+
+## Proposal
+
+1. Create a new ActiveRecord model to hold the namespaces' statistics in an aggregated form (only for root [namespaces](../../user/namespace/index.md)).
+1. Refresh the statistics in this model every time a project belonging to this namespace is changed.
+
+## Problem
+
+In GitLab, we update the project storage statistics through a
+[callback](https://gitlab.com/gitlab-org/gitlab/-/blob/4ab54c2233e91f60a80e5b6fa2181e6899fdcc3e/app/models/project.rb#L97)
+every time the project is saved.
+
+The summary of those statistics per namespace is then retrieved
+by the [`Namespaces#with_statistics`](https://gitlab.com/gitlab-org/gitlab/-/blob/4ab54c2233e91f60a80e5b6fa2181e6899fdcc3e/app/models/namespace.rb#L70) scope. Analyzing this query, we noticed that:
+
+- It takes up to `1.2` seconds for namespaces with over `15k` projects.
+- It can't be analyzed with [ChatOps](../chatops_on_gitlabcom.md), as it times out.
+
+Additionally, the pattern that is currently used to update the project statistics
+(the callback) doesn't scale adequately. It is currently one of the largest
+[database queries transactions on production](https://gitlab.com/gitlab-org/gitlab/-/issues/29070)
+that takes the most time overall. We can't add one more query to it as
+it increases the transaction's length.
+
+Because of all of the above, we can't apply the same pattern to store
+and update the namespaces' statistics, as the `namespaces` table is one
+of the largest tables on GitLab.com. Therefore, we needed to find a performant
+alternative method.
+
+## Attempts
+
+### Attempt A: PostgreSQL materialized view
+
+The model can be updated through a refresh strategy based on a project routes SQL query and a [materialized view](https://www.postgresql.org/docs/11/rules-materializedviews.html):
+
+```sql
+SELECT split_part("rs".path, '/', 1) as root_path,
+ COALESCE(SUM(ps.storage_size), 0) AS storage_size,
+ COALESCE(SUM(ps.repository_size), 0) AS repository_size,
+ COALESCE(SUM(ps.wiki_size), 0) AS wiki_size,
+ COALESCE(SUM(ps.lfs_objects_size), 0) AS lfs_objects_size,
+ COALESCE(SUM(ps.build_artifacts_size), 0) AS build_artifacts_size,
+ COALESCE(SUM(ps.pipeline_artifacts_size), 0) AS pipeline_artifacts_size,
+ COALESCE(SUM(ps.packages_size), 0) AS packages_size,
+ COALESCE(SUM(ps.snippets_size), 0) AS snippets_size,
+ COALESCE(SUM(ps.uploads_size), 0) AS uploads_size
+FROM "projects"
+ INNER JOIN routes rs ON rs.source_id = projects.id AND rs.source_type = 'Project'
+ INNER JOIN project_statistics ps ON ps.project_id = projects.id
+GROUP BY root_path
+```
+
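+For this approach, the query above would presumably be wrapped in a materialized view definition
+so that it can later be refreshed by name, along these lines:
+
+```sql
+CREATE MATERIALIZED VIEW root_namespace_storage_statistics AS
+SELECT split_part("rs".path, '/', 1) as root_path,
+       COALESCE(SUM(ps.storage_size), 0) AS storage_size
+       -- remaining aggregated columns as in the query above
+FROM "projects"
+  INNER JOIN routes rs ON rs.source_id = projects.id AND rs.source_type = 'Project'
+  INNER JOIN project_statistics ps ON ps.project_id = projects.id
+GROUP BY root_path;
+```
+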
+We could then execute the query with:
+
+```sql
+REFRESH MATERIALIZED VIEW root_namespace_storage_statistics;
+```
+
+While this implied a single query update (and probably a fast one), it has some downsides:
+
+- Materialized view syntax differs between PostgreSQL and MySQL. While this feature was worked on, MySQL was still supported by GitLab.
+- Rails does not have native support for materialized views. We'd need to use a specialized gem to take care of the management of the database views, which implies additional work.
+
+### Attempt B: An update through a CTE
+
+Similar to Attempt A: the model is updated through a refresh strategy with a [Common Table Expression](https://www.postgresql.org/docs/9.1/queries-with.html):
+
+```sql
+WITH refresh AS (
+ SELECT split_part("rs".path, '/', 1) as root_path,
+ COALESCE(SUM(ps.storage_size), 0) AS storage_size,
+ COALESCE(SUM(ps.repository_size), 0) AS repository_size,
+ COALESCE(SUM(ps.wiki_size), 0) AS wiki_size,
+ COALESCE(SUM(ps.lfs_objects_size), 0) AS lfs_objects_size,
+ COALESCE(SUM(ps.build_artifacts_size), 0) AS build_artifacts_size,
+ COALESCE(SUM(ps.pipeline_artifacts_size), 0) AS pipeline_artifacts_size,
+ COALESCE(SUM(ps.packages_size), 0) AS packages_size,
+ COALESCE(SUM(ps.snippets_size), 0) AS snippets_size,
+ COALESCE(SUM(ps.uploads_size), 0) AS uploads_size
+ FROM "projects"
+ INNER JOIN routes rs ON rs.source_id = projects.id AND rs.source_type = 'Project'
+ INNER JOIN project_statistics ps ON ps.project_id = projects.id
+ GROUP BY root_path)
+UPDATE namespace_storage_statistics
+SET storage_size = refresh.storage_size,
+ repository_size = refresh.repository_size,
+ wiki_size = refresh.wiki_size,
+ lfs_objects_size = refresh.lfs_objects_size,
+ build_artifacts_size = refresh.build_artifacts_size,
+ pipeline_artifacts_size = refresh.pipeline_artifacts_size,
+ packages_size = refresh.packages_size,
+ snippets_size = refresh.snippets_size,
+ uploads_size = refresh.uploads_size
+FROM refresh
+ INNER JOIN routes rs ON rs.path = refresh.root_path AND rs.source_type = 'Namespace'
+WHERE namespace_storage_statistics.namespace_id = rs.source_id
+```
+
+Same benefits and downsides as attempt A.
+
+### Attempt C: Get rid of the model and store the statistics on Redis
+
+We could get rid of the model that stores the statistics in aggregated form and instead use a Redis Set.
+This would be the [boring solution](https://about.gitlab.com/handbook/values/#boring-solutions) and the fastest one
+to implement, as GitLab already includes Redis as part of its [Architecture](../architecture.md#redis).
+
+The downside of this approach is that Redis does not provide the same persistence/consistency guarantees as PostgreSQL,
+and this is information we can't afford to lose in a Redis failure.
+
+### Attempt D: Tag the root namespace and its child namespaces
+
+Directly relate the root namespace to its child namespaces, so
+whenever a namespace is created without a parent, this one is tagged
+with the root namespace ID:
+
+| ID | root ID | parent ID |
+|:---|:--------|:----------|
+| 1 | 1 | NULL |
+| 2 | 1 | 1 |
+| 3 | 1 | 2 |
+
+To aggregate the statistics inside a namespace we'd execute something like:
+
+```sql
+SELECT COUNT(...)
+FROM projects
+WHERE namespace_id IN (
+ SELECT id
+ FROM namespaces
+ WHERE root_id = X
+)
+```
+
+Even though this approach would make aggregating much easier, it has some major downsides:
+
+- We'd have to migrate **all namespaces** by adding and filling a new column. Because of the size of the table, the time and cost of doing so would be significant. The background migration would take approximately `153h`, see <https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/29772>.
+- Background migration has to be shipped one release before, delaying the functionality by another milestone.
+
+### Attempt E (final): Update the namespace storage statistics asynchronously
+
+This approach consists of continuing to use the incremental statistics updates we already have,
+but we refresh them through Sidekiq jobs and in different transactions (a rough sketch follows the list):
+
+1. Create a second table (`namespace_aggregation_schedules`) with two columns `id` and `namespace_id`.
+1. Whenever the statistics of a project change, insert a row into `namespace_aggregation_schedules`.
+   - We don't insert a new row if there's already one related to the root namespace.
+   - Keeping in mind the length of the transaction that involves updating `project_statistics` (<https://gitlab.com/gitlab-org/gitlab/-/issues/29070>), the insertion should be done in a different transaction and through a Sidekiq job.
+1. After inserting the row, we schedule another worker to be executed asynchronously at two different moments:
+   - One is enqueued for immediate execution and another one is scheduled in `1.5` hours.
+   - We only schedule the jobs if we can obtain a `1.5` hour lease on Redis on a key based on the root namespace ID.
+   - If we can't obtain the lease, it indicates there's another aggregation already in progress, or one scheduled within the next `1.5` hours.
+1. This worker will:
+   - Update the root namespace storage statistics by querying all the namespaces through a service.
+   - Delete the related `namespace_aggregation_schedules` row after the update.
+1. Another Sidekiq job is also included to traverse any remaining rows on the `namespace_aggregation_schedules` table and schedule jobs for every pending row.
+   - This job is scheduled with cron to run every night (UTC).
+
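+A rough sketch of the scheduling side, with hypothetical class names (the real implementation
+lives in the linked merge request):
+
+```ruby
+class ScheduleNamespaceAggregationWorker # hypothetical name
+  include ApplicationWorker
+
+  LEASE_TIMEOUT = 1.5.hours.to_i
+
+  def perform(project_id)
+    root_namespace = Project.find(project_id).namespace.root_ancestor
+
+    # Only one aggregation per root namespace within the lease window.
+    lease_key = "namespace_aggregation:#{root_namespace.id}"
+    return unless Gitlab::ExclusiveLease.new(lease_key, timeout: LEASE_TIMEOUT).try_obtain
+
+    # Track the pending aggregation and schedule the refresh twice:
+    # once immediately and once after the lease window.
+    NamespaceAggregationSchedule.safe_find_or_create_by!(namespace_id: root_namespace.id) # hypothetical model
+    AggregateRootNamespaceWorker.perform_async(root_namespace.id)                          # hypothetical worker
+    AggregateRootNamespaceWorker.perform_in(LEASE_TIMEOUT, root_namespace.id)
+  end
+end
+```
+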
+This implementation has the following benefits:
+
+- All the updates are done asynchronously, so we're not increasing the length of the transactions for `project_statistics`.
+- We're doing the update in a single SQL query.
+- It is compatible with PostgreSQL and MySQL.
+- No background migration required.
+
+The only downside of this approach is that namespaces' statistics are updated up to `1.5` hours after the change is done,
+which means there's a time window in which the statistics are inaccurate. Because we're still not
+[enforcing storage limits](https://gitlab.com/gitlab-org/gitlab/-/issues/17664), this is not a major problem.
+
+## Conclusion
+
+Updating the storage statistics asynchronously was the least problematic and
+most performant approach to aggregating the root namespaces.
+
+All the details regarding this use case can be found on:
+
+- <https://gitlab.com/gitlab-org/gitlab-foss/-/issues/62214>
+- Merge Request with the implementation: <https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/28996>
+
+Performance of the namespace storage statistics was measured in staging and production (GitLab.com). All results were posted
+on <https://gitlab.com/gitlab-org/gitlab-foss/-/issues/64092>: no problem has been reported so far.
diff --git a/doc/development/database/not_null_constraints.md b/doc/development/database/not_null_constraints.md
index 3962307f80d..9b3d017b09f 100644
--- a/doc/development/database/not_null_constraints.md
+++ b/doc/development/database/not_null_constraints.md
@@ -135,7 +135,7 @@ post-deployment migration or a background data migration:
- If the data volume is less than `1000` records, then the data migration can be executed within the post-migration.
- If the data volume is higher than `1000` records, it's advised to create a background migration.
-When unsure about which option to use, please contact the Database team for advice.
+When unsure about which option to use, contact the Database team for advice.
Back to our example, the epics table is not considerably large nor frequently accessed,
so we add a post-deployment migration for the 13.0 milestone (current),
@@ -206,6 +206,6 @@ In that rare case you need 3 releases end-to-end:
1. Release `N.M+1` - Cleanup the background migration.
1. Release `N.M+2` - Validate the `NOT NULL` constraint.
-For these cases, please consult the database team early in the update cycle. The `NOT NULL`
+For these cases, consult the database team early in the update cycle. The `NOT NULL`
constraint may not be required or other options could exist that do not affect really large
or frequently accessed tables.
diff --git a/doc/development/database/ordering_table_columns.md b/doc/development/database/ordering_table_columns.md
new file mode 100644
index 00000000000..7cd3d4fb208
--- /dev/null
+++ b/doc/development/database/ordering_table_columns.md
@@ -0,0 +1,152 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Ordering Table Columns in PostgreSQL
+
+For GitLab we require that columns of new tables are ordered to use the
+least amount of space. An easy way of doing this is to order them based on the
+type size in descending order with variable sizes (`text`, `varchar`, arrays,
+`json`, `jsonb`, and so on) at the end.
+
+Similar to C structures, the space of a table is influenced by the order of
+columns. This is because the size of columns is aligned depending on the type of
+the following column. Let's consider an example:
+
+- `id` (integer, 4 bytes)
+- `name` (text, variable)
+- `user_id` (integer, 4 bytes)
+
+The first column is a 4-byte integer. The next is text of variable length. The
+`text` data type requires 1-word alignment, and on a 64-bit platform, 1 word is 8
+bytes. To meet the alignment requirements, four zero bytes are added right
+after the first column, so `id` occupies 4 bytes, then 4 bytes of alignment
+padding, and only then is `name` stored. Therefore, in this case, 8 bytes
+are spent storing a 4-byte integer.
+
+The space between rows is also subject to alignment padding. The `user_id`
+column takes only 4 bytes, and on a 64-bit platform, 4 zero bytes are added as
+alignment padding, so that the next row begins on a whole-word boundary.
+
+As a result, the actual size of each column would be (omitting variable length
+data and 24-byte tuple header): 8 bytes, variable, 8 bytes. This means that
+each row requires at least 16 bytes for the two 4-byte integers. If a table
+has a few rows this is not an issue. However, once you start storing millions of
+rows you can save space by using a different order. For the above example, the
+ideal column order would be the following:
+
+- `id` (integer, 4 bytes)
+- `user_id` (integer, 4 bytes)
+- `name` (text, variable)
+
+or
+
+- `name` (text, variable)
+- `id` (integer, 4 bytes)
+- `user_id` (integer, 4 bytes)
+
+In these examples, the `id` and `user_id` columns are packed together, which
+means we only need 8 bytes to store _both_ of them. This in turn means each row
+requires 8 bytes less space.
+
+Since Ruby on Rails 5.1, the default data type for IDs is `bigint`, which uses 8 bytes.
+We are using `integer` in the examples to showcase a more realistic reordering scenario.
+
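+An illustrative way to see the padding (not part of the original guidelines) is `pg_column_size`
+on a row value; the absolute numbers vary, but the reordered row is smaller because the 8-byte
+timestamp no longer forces alignment padding after the 2-byte `smallint`:
+
+```sql
+SELECT
+  pg_column_size(ROW(1::smallint, now()::timestamp, 2::smallint)) AS padded,
+  pg_column_size(ROW(now()::timestamp, 1::smallint, 2::smallint)) AS packed;
+```
+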
+## Type Sizes
+
+While the [PostgreSQL documentation](https://www.postgresql.org/docs/current/datatype.html) contains plenty
+of information, we list the sizes of common types here so it's easier to
+look them up. Here "word" refers to the word size, which is 4 bytes on a 32-bit
+platform and 8 bytes on a 64-bit platform.
+
+| Type | Size | Alignment needed |
+|:-----------------|:-------------------------------------|:-----------|
+| `smallint` | 2 bytes | 1 word |
+| `integer` | 4 bytes | 1 word |
+| `bigint` | 8 bytes | 8 bytes |
+| `real` | 4 bytes | 1 word |
+| `double precision` | 8 bytes | 8 bytes |
+| `boolean` | 1 byte | not needed |
+| `text` / `string` | variable, 1 byte plus the data | 1 word |
+| `bytea` | variable, 1 or 4 bytes plus the data | 1 word |
+| `timestamp` | 8 bytes | 8 bytes |
+| `timestamptz` | 8 bytes | 8 bytes |
+| `date` | 4 bytes | 1 word |
+
+A "variable" size means the actual size depends on the value being stored. If
+PostgreSQL determines this can be embedded directly into a row it may do so, but
+for very large values it stores the data externally and stores a pointer (of
+1 word in size) in the column. Because of this, variable-sized columns should
+always be at the end of a table.
+
+## Real Example
+
+Let's use the `events` table as an example, which currently has the following
+layout:
+
+| Column | Type | Size |
+|:--------------|:----------------------------|:---------|
+| `id` | integer | 4 bytes |
+| `target_type` | character varying | variable |
+| `target_id` | integer | 4 bytes |
+| `title` | character varying | variable |
+| `data` | text | variable |
+| `project_id` | integer | 4 bytes |
+| `created_at` | timestamp without time zone | 8 bytes |
+| `updated_at` | timestamp without time zone | 8 bytes |
+| `action` | integer | 4 bytes |
+| `author_id` | integer | 4 bytes |
+
+After adding padding to align the columns this would translate to columns being
+divided into fixed size chunks as follows:
+
+| Chunk Size | Columns |
+|:-----------|:----------------------|
+| 8 bytes | `id` |
+| variable | `target_type` |
+| 8 bytes | `target_id` |
+| variable | `title` |
+| variable | `data` |
+| 8 bytes | `project_id` |
+| 8 bytes | `created_at` |
+| 8 bytes | `updated_at` |
+| 8 bytes | `action`, `author_id` |
+
+This means that excluding the variable sized data and tuple header, we need at
+least 8 * 6 = 48 bytes per row.
+
+We can optimize this by using the following column order instead:
+
+| Column | Type | Size |
+|:--------------|:----------------------------|:---------|
+| `created_at` | timestamp without time zone | 8 bytes |
+| `updated_at` | timestamp without time zone | 8 bytes |
+| `id` | integer | 4 bytes |
+| `target_id` | integer | 4 bytes |
+| `project_id` | integer | 4 bytes |
+| `action` | integer | 4 bytes |
+| `author_id` | integer | 4 bytes |
+| `target_type` | character varying | variable |
+| `title` | character varying | variable |
+| `data` | text | variable |
+
+This would produce the following chunks:
+
+| Chunk Size | Columns |
+|:-----------|:-----------------------|
+| 8 bytes | `created_at` |
+| 8 bytes | `updated_at` |
+| 8 bytes | `id`, `target_id` |
+| 8 bytes | `project_id`, `action` |
+| 8 bytes | `author_id` |
+| variable | `target_type` |
+| variable | `title` |
+| variable | `data` |
+
+Here we only need 40 bytes per row excluding the variable-sized data and 24-byte
+tuple header. Saving 8 bytes may not sound like much, but for tables as
+large as the `events` table it does begin to matter. For example, when storing
+80 000 000 rows this translates to a space saving of at least 610 MB, all by
+just changing the order of a few columns.
diff --git a/doc/development/database/pagination_guidelines.md b/doc/development/database/pagination_guidelines.md
index 1641708ce01..fe2e3b46939 100644
--- a/doc/development/database/pagination_guidelines.md
+++ b/doc/development/database/pagination_guidelines.md
@@ -192,7 +192,7 @@ The query execution plan shows that this query is efficient, the database only r
(6 rows)
```
-See the [Understanding EXPLAIN plans](../understanding_explain_plans.md) to find more information about reading execution plans.
+See the [Understanding EXPLAIN plans](understanding_explain_plans.md) to find more information about reading execution plans.
Let's visit the 50_000th page:
diff --git a/doc/development/database/pagination_performance_guidelines.md b/doc/development/database/pagination_performance_guidelines.md
index b5040e499e4..0fef246f133 100644
--- a/doc/development/database/pagination_performance_guidelines.md
+++ b/doc/development/database/pagination_performance_guidelines.md
@@ -12,11 +12,11 @@ The following document gives a few ideas for improving the pagination (sorting)
When ordering the columns it's advised to order by distinct columns only. Consider the following example:
-|`id`|`created_at`|
-|-|-|
-|1|2021-01-04 14:13:43|
-|2|2021-01-05 19:03:12|
-|3|2021-01-05 19:03:12|
+| `id` | `created_at` |
+|------|-----------------------|
+| `1` | `2021-01-04 14:13:43` |
+| `2` | `2021-01-05 19:03:12` |
+| `3` | `2021-01-05 19:03:12` |
If we order by `created_at`, the result would likely depend on how the records are located on the disk.
diff --git a/doc/development/database/polymorphic_associations.md b/doc/development/database/polymorphic_associations.md
new file mode 100644
index 00000000000..ac4dc7720a5
--- /dev/null
+++ b/doc/development/database/polymorphic_associations.md
@@ -0,0 +1,152 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Polymorphic Associations
+
+**Summary:** always use separate tables instead of polymorphic associations.
+
+Rails makes it possible to define so-called "polymorphic associations". This
+usually works by adding two columns to a table: a target type column, and a
+target ID. For example, at the time of writing we have such a setup for
+`members` with the following columns:
+
+- `source_type`: a string defining the model to use, can be either `Project` or
+ `Namespace`.
+- `source_id`: the ID of the row to retrieve based on `source_type`. For
+ example, when `source_type` is `Project` then `source_id` contains a
+ project ID.
+
+While such a setup may appear to be useful, it comes with many drawbacks; enough
+that you should avoid this at all costs.
+
+## Space Wasted
+
+Because this setup relies on string values to determine the model to use, it
+wastes a lot of space. For example, for `Project` and `Namespace` the
+maximum size is 9 bytes, plus 1 extra byte for every string when using
+PostgreSQL. While this may only be 10 bytes per row, given enough tables and
+rows using such a setup we can end up wasting quite a bit of disk space and
+memory (for any indexes).
+
+## Indexes
+
+Because our associations are broken up into two columns, this may result in
+requiring composite indexes for queries to be performed efficiently. While
+composite indexes are not wrong at all, they can be tricky to set up as the
+ordering of columns in these indexes is important to ensure optimal performance.
+
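+For example, to serve lookups such as `WHERE source_type = 'Project' AND source_id = 13083`
+efficiently, a composite index along these lines (illustrative only) is typically needed:
+
+```sql
+CREATE INDEX index_members_on_source_type_and_source_id
+  ON members (source_type, source_id);
+```
+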
+## Consistency
+
+One really big problem with polymorphic associations is being unable to enforce
+data consistency on the database level using foreign keys. For consistency to be
+enforced on the database level one would have to write their own foreign key
+logic to support polymorphic associations.
+
+Enforcing consistency on the database level is absolutely crucial for
+maintaining a healthy environment, and thus is another reason to avoid
+polymorphic associations.
+
+## Query Overhead
+
+When using polymorphic associations you always need to filter using both
+columns. For example, you may end up writing a query like this:
+
+```sql
+SELECT *
+FROM members
+WHERE source_type = 'Project'
+AND source_id = 13083;
+```
+
+Here PostgreSQL can perform the query quite efficiently if both columns are
+indexed. As the query gets more complex, it may not be able to use these
+indexes effectively.
+
+## Mixed Responsibilities
+
+Similar to functions and classes, a table should have a single responsibility:
+storing data with a certain set of pre-defined columns. When using polymorphic
+associations, you are storing different types of data (possibly with
+different columns set) in the same table.
+
+## The Solution
+
+Fortunately, there is a solution to these problems: use a
+separate table for every type you would otherwise store in the same table. Using
+a separate table allows you to use everything a database may provide to ensure
+consistency and query data efficiently, without any additional application logic
+being necessary.
+
+Let's say you have a `members` table storing both approved and pending members,
+for both projects and groups, and the pending state is determined by the column
+`requested_at` being set or not. Schema-wise, such a setup can lead to various
+columns only being set for certain rows, wasting space. It's also possible that
+certain indexes are only set for certain rows, again wasting space. Finally,
+querying such a table requires less than ideal queries. For example:
+
+```sql
+SELECT *
+FROM members
+WHERE requested_at IS NULL
+AND source_type = 'GroupMember'
+AND source_id = 4
+```
+
+Instead such a table should be broken up into separate tables. For example, you
+may end up with 4 tables in this case:
+
+- `project_members`
+- `group_members`
+- `pending_project_members`
+- `pending_group_members`
+
+This makes querying data trivial. For example, to get the members of a group
+you'd run:
+
+```sql
+SELECT *
+FROM group_members
+WHERE group_id = 4
+```
+
+To get all the pending members of a group in turn you'd run:
+
+```sql
+SELECT *
+FROM pending_group_members
+WHERE group_id = 4
+```
+
+If you want to get both you can use a `UNION`, though you need to be explicit
+about what columns you want to `SELECT` as otherwise the result set uses the
+columns of the first query. For example:
+
+```sql
+SELECT id, 'Group' AS target_type, group_id AS target_id
+FROM group_members
+
+UNION ALL
+
+SELECT id, 'Project' AS target_type, project_id AS target_id
+FROM project_members
+```
+
+The above example is perhaps a bit silly, but it shows that there's nothing
+stopping you from merging the data together and presenting it on the same page.
+Selecting columns explicitly can also speed up queries as the database has to do
+less work to get the data (compared to selecting all columns, even ones you're
+not using).
+
+Our schema also becomes simpler. No longer do we need to both store and index the
+`source_type` column, we can define foreign keys easily, and we don't need to
+filter rows using the `IS NULL` condition.
+
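+For example, a dedicated table can carry a real foreign key, which is not possible with a
+polymorphic `source_type`/`source_id` pair (table and column names below are illustrative):
+
+```sql
+CREATE TABLE group_members (
+  id bigserial PRIMARY KEY,
+  group_id bigint NOT NULL REFERENCES namespaces (id) ON DELETE CASCADE,
+  user_id bigint NOT NULL REFERENCES users (id) ON DELETE CASCADE
+);
+```
+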
+To summarize: using separate tables allows us to use foreign keys effectively,
+create indexes only where necessary, conserve space, query data more
+efficiently, and scale these tables more easily (for example, by storing them on
+separate disks). A nice side effect of this is that code can also become simpler,
+as a single model isn't responsible for handling different kinds of
+data.
diff --git a/doc/development/database/post_deployment_migrations.md b/doc/development/database/post_deployment_migrations.md
index a49c77ca047..8166fcc8905 100644
--- a/doc/development/database/post_deployment_migrations.md
+++ b/doc/development/database/post_deployment_migrations.md
@@ -25,6 +25,10 @@ This however skips post deployment migrations:
SKIP_POST_DEPLOYMENT_MIGRATIONS=true bundle exec rake db:migrate
```
+For GitLab.com, these migrations are executed on a daily basis at the discretion of
+release managers through the
+[post-deploy migration pipeline](https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/post_deploy_migration/readme.md).
+
## Deployment Integration
Say you're using Chef for deploying new versions of GitLab and you'd like to run
diff --git a/doc/development/database/query_count_limits.md b/doc/development/database/query_count_limits.md
new file mode 100644
index 00000000000..a888bbfc6e7
--- /dev/null
+++ b/doc/development/database/query_count_limits.md
@@ -0,0 +1,70 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Query Count Limits
+
+Each controller or API endpoint is allowed to execute up to 100 SQL queries, and
+in test environments we raise an error when this threshold is exceeded.
+
+## Solving Failing Tests
+
+When a test fails because it executes more than 100 SQL queries there are two
+solutions to this problem:
+
+- Reduce the number of SQL queries that are executed.
+- Disable query limiting for the controller or API endpoint.
+
+You should only resort to disabling query limits when an existing controller or endpoint
+is to blame as in this case reducing the number of SQL queries can take a lot of
+effort. Newly added controllers and endpoints are not allowed to execute more
+than 100 SQL queries and no exceptions are made for this rule. _If_ a large
+number of SQL queries is necessary to perform certain work it's best to have
+this work performed by Sidekiq instead of doing this directly in a web request.
+
+## Disable query limiting
+
+In the event that you _have_ to disable query limits for a controller, you must first
+create an issue. This issue should (preferably in the title) mention the
+controller or endpoint and include the appropriate labels (`database`,
+`performance`, and at least a team specific label such as `Discussion`).
+
+After the issue has been created, you can disable query limits on the code in question. For
+Rails controllers it's best to create a `before_action` hook that runs as early
+as possible. The called method in turn should call
+`Gitlab::QueryLimiting.disable!('issue URL here')`. For example:
+
+```ruby
+class MyController < ApplicationController
+  before_action :disable_query_limiting, only: [:show]
+
+  def index
+    # ...
+  end
+
+  def show
+    # ...
+  end
+
+  def disable_query_limiting
+    Gitlab::QueryLimiting.disable!('https://gitlab.com/gitlab-org/...')
+  end
+end
+```
+
+By using a `before_action` you don't have to modify the controller method in
+question, reducing the likelihood of merge conflicts.
+
+For Grape API endpoints, there is unfortunately no reliable way of running a
+hook before a specific endpoint. This means that you have to add the allowlist
+call directly into the endpoint like so:
+
+```ruby
+get '/projects/:id/foo' do
+ Gitlab::QueryLimiting.disable!('...')
+
+ # ...
+end
+```
diff --git a/doc/development/database/query_performance.md b/doc/development/database/query_performance.md
new file mode 100644
index 00000000000..41dbd08d726
--- /dev/null
+++ b/doc/development/database/query_performance.md
@@ -0,0 +1,74 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Query performance guidelines
+
+This document describes various guidelines to follow when optimizing SQL queries.
+
+When you are optimizing your SQL queries, there are two dimensions to pay attention to:
+
+1. The query execution time. This is paramount as it reflects how the user experiences GitLab.
+1. The query plan. Optimizing the query plan is important in allowing queries to independently scale over time. Verifying that an index keeps a query performing well as the table grows, before the query degrades, is an example of why we analyze these plans.
+
+## Timing guidelines for queries
+
+| Query Type | Maximum Query Time | Notes |
+|----|----|---|
+| General queries | `100ms` | This is not a hard limit, but if a query is getting above it, it is important to spend time understanding why it can or cannot be optimized. |
+| Queries in a migration | `100ms` | This is different from the total [migration time](../migration_style_guide.md#how-long-a-migration-should-take). |
+| Concurrent operations in a migration | `5min` | Concurrent operations do not block the database, but they block the GitLab update. This includes operations such as `add_concurrent_index` and `add_concurrent_foreign_key`. |
+| Background migrations | `1s` | |
+| Service Ping | `1s` | See the [Service Ping docs](../service_ping/implement.md) for more details. |
+
+- When analyzing your query's performance, pay attention to whether the time you are seeing is on a [cold or warm cache](#cold-and-warm-cache). These guidelines apply to both cache types.
+- When working with batched queries, change the range and batch size to see how it affects the query timing and caching.
+- If an existing query is not performing well, make an effort to improve it. If it is too complex or would stall development, create a follow-up so it can be addressed in a timely manner. You can always ask the database reviewer or maintainer for help and guidance.
+
+## Cold and warm cache
+
+When evaluating query performance it is important to understand the difference between
+cold and warm cached queries.
+
+The first time a query is made, it is made on a "cold cache", meaning it needs
+to read from disk. If you run the query again, the data can be read from the
+cache, or what PostgreSQL calls shared buffers. This is the "warm cache" query.
+
+When analyzing an [`EXPLAIN` plan](understanding_explain_plans.md), you can see
+the difference not only in the timing, but by looking at the output for `Buffers`
+by running your explain with `EXPLAIN(analyze, buffers)`. [Database Lab](understanding_explain_plans.md#database-lab-engine)
+automatically includes these options.
+
+If you are making a warm cache query, you see only the `shared hits`.
+
+For example in #database-lab:
+
+```plaintext
+Shared buffers:
+ - hits: 36467 (~284.90 MiB) from the buffer pool
+ - reads: 0 from the OS file cache, including disk I/O
+```
+
+Or in the explain plan from `psql`:
+
+```sql
+Buffers: shared hit=7323
+```
+
+If the cache is cold, you also see `reads`.
+
+In #database-lab:
+
+```plaintext
+Shared buffers:
+ - hits: 17204 (~134.40 MiB) from the buffer pool
+ - reads: 15229 (~119.00 MiB) from the OS file cache, including disk I/O
+```
+
+In `psql`:
+
+```sql
+Buffers: shared hit=7202 read=121
+```
diff --git a/doc/development/database/query_recorder.md b/doc/development/database/query_recorder.md
new file mode 100644
index 00000000000..da5c6c8e6cb
--- /dev/null
+++ b/doc/development/database/query_recorder.md
@@ -0,0 +1,145 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# QueryRecorder
+
+QueryRecorder is a tool for detecting the [N+1 queries problem](https://guides.rubyonrails.org/active_record_querying.html#eager-loading-associations) from tests.
+
+> Implemented in [spec/support/query_recorder.rb](https://gitlab.com/gitlab-org/gitlab/-/blob/master/spec/support/helpers/query_recorder.rb) via [9c623e3e](https://gitlab.com/gitlab-org/gitlab-foss/commit/9c623e3e5d7434f2e30f7c389d13e5af4ede770a)
+
+As a rule, merge requests [should not increase query counts](../merge_request_performance_guidelines.md#query-counts). If you find yourself adding something like `.includes(:author, :assignee)` to avoid having `N+1` queries, consider using QueryRecorder to enforce this with a test. Without this, a new feature which causes an additional model to be accessed can silently reintroduce the problem.
+
+## How it works
+
+This style of test works by counting the number of SQL queries executed by ActiveRecord. First, a control count is taken; then you add new records to the database and rerun the count. If the number of queries has significantly increased, an `N+1` queries problem exists.
+
+```ruby
+it "avoids N+1 database queries" do
+ control = ActiveRecord::QueryRecorder.new { visit_some_page }
+ create_list(:issue, 5)
+ expect { visit_some_page }.not_to exceed_query_limit(control)
+end
+```
+
+You can, if you wish, have both the expectation and the control as
+`QueryRecorder` instances:
+
+```ruby
+it "avoids N+1 database queries" do
+ control = ActiveRecord::QueryRecorder.new { visit_some_page }
+ create_list(:issue, 5)
+ action = ActiveRecord::QueryRecorder.new { visit_some_page }
+
+ expect(action).not_to exceed_query_limit(control)
+end
+```
+
+As an example you might create 5 issues in between counts, which would cause the query count to increase by 5 if an N+1 problem exists.
+
+In some cases the query count might change slightly between runs for unrelated reasons. In this case you might need to test `exceed_query_limit(control_count + acceptable_change)`, but this should be avoided if possible.
+
+If this test fails, and the control was passed as a `QueryRecorder`, then the
+failure message indicates where the extra queries are by matching queries on
+the longest common prefix, grouping similar queries together.
+
+## Cached queries
+
+By default, QueryRecorder ignores [cached queries](../merge_request_performance_guidelines.md#cached-queries) in the count. However, it may be better to count
+all queries to avoid introducing an N+1 query that may be masked by the statement cache.
+To do this, set the `:use_sql_query_cache` flag, pass `skip_cached: false` to `QueryRecorder`,
+and use the `exceed_all_query_limit` matcher:
+
+```ruby
+it "avoids N+1 database queries", :use_sql_query_cache do
+ control = ActiveRecord::QueryRecorder.new(skip_cached: false) { visit_some_page }
+ create_list(:issue, 5)
+ expect { visit_some_page }.not_to exceed_all_query_limit(control)
+end
+```
+
+## Use request specs instead of controller specs
+
+Use a [request spec](https://gitlab.com/gitlab-org/gitlab-foss/tree/master/spec/requests) when writing an N+1 test at the controller level.
+
+Controller specs should not be used to write N+1 tests as the controller is only initialized once per example.
+This could lead to false successes where subsequent "requests" could have queries reduced (for example, because of memoization).
+
+## Finding the source of the query
+
+There are multiple ways to find the source of queries.
+
+- Inspect the `QueryRecorder` `data` attribute. It stores queries by `file_name:line_number:method_name`.
+ Each entry is a `hash` with the following fields:
+
+ - `count`: the number of times a query from this `file_name:line_number:method_name` was called
+ - `occurrences`: the actual `SQL` of each call
+ - `backtrace`: the stack trace of each call (if either of the two following options were enabled)
+
+ `QueryRecorder#find_query` allows filtering queries by their `file_name:line_number:method_name` and
+ `count` attributes. For example:
+
+ ```ruby
+ control = ActiveRecord::QueryRecorder.new(skip_cached: false) { visit_some_page }
+ control.find_query(/.*note.rb.*/, 0, first_only: true)
+ ```
+
+ `QueryRecorder#occurrences_by_line_method` returns a sorted array based on `data`, sorted by `count`.
+
+- View the call backtrace for the specific `QueryRecorder` instance you want
+ by using `ActiveRecord::QueryRecorder.new(query_recorder_debug: true)`. The output
+ is stored in file `test.log`.
+
+- Enable the call backtrace for all tests using the `QUERY_RECORDER_DEBUG` environment variable.
+
+ To enable this, run the specs with the `QUERY_RECORDER_DEBUG` environment variable set. For example:
+
+ ```shell
+ QUERY_RECORDER_DEBUG=1 bundle exec rspec spec/requests/api/projects_spec.rb
+ ```
+
+ This logs calls to QueryRecorder into the `test.log` file. For example:
+
+ ```sql
+ QueryRecorder SQL: SELECT COUNT(*) FROM "issues" WHERE "issues"."deleted_at" IS NULL AND "issues"."project_id" = $1 AND ("issues"."state" IN ('opened')) AND "issues"."confidential" = $2
+ --> /home/user/gitlab/gdk/gitlab/spec/support/query_recorder.rb:19:in `callback'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activesupport-4.2.8/lib/active_support/notifications/fanout.rb:127:in `finish'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activesupport-4.2.8/lib/active_support/notifications/fanout.rb:46:in `block in finish'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activesupport-4.2.8/lib/active_support/notifications/fanout.rb:46:in `each'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activesupport-4.2.8/lib/active_support/notifications/fanout.rb:46:in `finish'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activesupport-4.2.8/lib/active_support/notifications/instrumenter.rb:36:in `finish'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activesupport-4.2.8/lib/active_support/notifications/instrumenter.rb:25:in `instrument'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/abstract_adapter.rb:478:in `log'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/postgresql_adapter.rb:601:in `exec_cache'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/postgresql_adapter.rb:585:in `execute_and_clear'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/postgresql/database_statements.rb:160:in `exec_query'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/abstract/database_statements.rb:356:in `select'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/abstract/database_statements.rb:32:in `select_all'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/abstract/query_cache.rb:68:in `block in select_all'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/abstract/query_cache.rb:83:in `cache_sql'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/abstract/query_cache.rb:68:in `select_all'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activerecord-4.2.8/lib/active_record/relation/calculations.rb:270:in `execute_simple_calculation'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activerecord-4.2.8/lib/active_record/relation/calculations.rb:227:in `perform_calculation'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activerecord-4.2.8/lib/active_record/relation/calculations.rb:133:in `calculate'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activerecord-4.2.8/lib/active_record/relation/calculations.rb:48:in `count'
+ --> /home/user/gitlab/gdk/gitlab/app/services/base_count_service.rb:20:in `uncached_count'
+ --> /home/user/gitlab/gdk/gitlab/app/services/base_count_service.rb:12:in `block in count'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activesupport-4.2.8/lib/active_support/cache.rb:299:in `block in fetch'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activesupport-4.2.8/lib/active_support/cache.rb:585:in `block in save_block_result_to_cache'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activesupport-4.2.8/lib/active_support/cache.rb:547:in `block in instrument'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activesupport-4.2.8/lib/active_support/notifications.rb:166:in `instrument'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activesupport-4.2.8/lib/active_support/cache.rb:547:in `instrument'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activesupport-4.2.8/lib/active_support/cache.rb:584:in `save_block_result_to_cache'
+ --> /home/user/.rbenv/versions/2.3.5/lib/ruby/gems/2.3.0/gems/activesupport-4.2.8/lib/active_support/cache.rb:299:in `fetch'
+ --> /home/user/gitlab/gdk/gitlab/app/services/base_count_service.rb:12:in `count'
+ --> /home/user/gitlab/gdk/gitlab/app/models/project.rb:1296:in `open_issues_count'
+ ```
+
+## See also
+
+- [Bullet](../profiling.md#bullet) for finding `N+1` query problems
+- [Performance guidelines](../performance.md)
+- [Merge request performance guidelines - Query counts](../merge_request_performance_guidelines.md#query-counts)
+- [Merge request performance guidelines - Cached queries](../merge_request_performance_guidelines.md#cached-queries)
diff --git a/doc/development/database/rename_database_tables.md b/doc/development/database/rename_database_tables.md
index cbcbd507204..d6827cb9e03 100644
--- a/doc/development/database/rename_database_tables.md
+++ b/doc/development/database/rename_database_tables.md
@@ -81,10 +81,10 @@ Execute a standard migration (not a post-migration):
when naming indexes, so there is a possibility that not all indexes are properly renamed. After running
the migration locally, check if there are inconsistently named indexes (`db/structure.sql`). Those can be
renamed manually in a separate migration, which can be also part of the release M.N+1.
-- Foreign key columns might still contain the old table name. For smaller tables, follow our [standard column
-rename process](avoiding_downtime_in_migrations.md#renaming-columns)
+- Foreign key columns might still contain the old table name. For smaller tables, follow our
+ [standard column rename process](avoiding_downtime_in_migrations.md#renaming-columns)
- Avoid renaming database tables which are using with triggers.
-- Table modifications (add or remove columns) are not allowed during the rename process, please make sure that all changes to the table happen before the rename migration is started (or in the next release).
+- Table modifications (add or remove columns) are not allowed during the rename process. Make sure that all changes to the table happen before the rename migration is started (or in the next release).
- As the index names might change, verify that the model does not use bulk insert
(for example, `insert_all` and `upsert_all`) with the `unique_by: index_name` option.
Renaming an index while using these methods may break functionality.
diff --git a/doc/development/database/serializing_data.md b/doc/development/database/serializing_data.md
new file mode 100644
index 00000000000..97e6f665484
--- /dev/null
+++ b/doc/development/database/serializing_data.md
@@ -0,0 +1,90 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Serializing Data
+
+**Summary:** don't store serialized data in the database; use separate columns
+and/or tables instead. This includes storing comma-separated values as a
+string.
+
+Rails makes it possible to store serialized data in JSON, YAML or other formats.
+Such a field can be defined as follows:
+
+```ruby
+class Issue < ActiveRecord::Base
+ serialize :custom_fields
+end
+```
+
+While it may be tempting to store serialized data in the database, there are
+many problems with this. This document outlines these problems and provides an
+alternative.
+
+## Serialized Data Is Less Powerful
+
+When using a relational database you have the ability to query individual
+fields, change the schema, index data, and so forth. When you use serialized
+data, all of that becomes either very difficult or downright impossible. While
+PostgreSQL does offer the ability to query JSON fields, it is mostly meant for
+very specialized use cases, and not for more general use. If you use YAML
+instead, there's no way to query the data at all.
+
+## Waste Of Space
+
+Storing serialized data such as JSON or YAML ends up wasting a lot of space.
+This is because these formats often include additional characters (for example, double
+quotes or newlines) besides the data that you are storing.
+
+## Difficult To Manage
+
+There comes a time when you must add a new field to the serialized data, or
+change an existing one. With serialized data this becomes difficult and very
+time consuming, as the only way of doing so is to rewrite all the stored
+values. To do so you would have to (see the sketch after this list):
+
+1. Retrieve the data
+1. Parse it into a Ruby structure
+1. Mutate it
+1. Serialize it back to a String
+1. Store it in the database
+
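+For example, a hypothetical rewrite of the serialized `custom_fields` column
+from the model above could look like this sketch, which has to walk through
+all five steps for every single row:
+
+```ruby
+# Hypothetical sketch: backfill a new key into a serialized column.
+Issue.find_each do |issue|
+  fields = issue.custom_fields || {}    # retrieve and parse
+  fields['severity'] ||= 'unknown'      # mutate
+  issue.update!(custom_fields: fields)  # serialize and store again
+end
+```
+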
+On the other hand, if one were to use regular columns, adding a column would be:
+
+```sql
+ALTER TABLE table_name ADD COLUMN column_name type;
+```
+
+Such a query would take very little to no time and would immediately apply to
+all rows, without having to re-write large JSON or YAML structures.
+
+Finally, there comes a time when the JSON or YAML structure is no longer
+sufficient and you must migrate away from it. When storing only a few rows
+this may not be a problem, but when storing millions of rows such a migration
+can take hours or even days to complete.
+
+## Relational Databases Are Not Document Stores
+
+When storing data as JSON or YAML you're essentially using your database as if
+it were a document store (for example, MongoDB), except you're not using any of the
+powerful features provided by a typical RDBMS _nor_ are you using any of the
+features provided by a typical document store (for example, the ability to
+index documents with varying fields). In other words, it's a waste.
+
+## Consistent Fields
+
+One argument sometimes made in favour of serialized data is the need to store
+widely varying fields and values. Sometimes this is truly the case, and then it
+might make sense to use serialized data. However, in 99% of cases the fields
+and types stored tend to be the same for every row. Even if there is a slight
+difference, you can still use separate columns and simply not set the ones you
+don't need.
+
+## The Solution
+
+The solution is to use separate columns and/or separate tables.
+This allows you to use all the features provided by your database, it
+makes it easier to manage and migrate the data, you conserve space, you can
+index the data efficiently and so forth.
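+
+As an illustration only, a hypothetical migration replacing the serialized
+`custom_fields` column from the earlier example with dedicated columns could
+look like the following sketch (the column names and migration class are made
+up for this example):
+
+```ruby
+# Hypothetical sketch: each field becomes a real, queryable, indexable column.
+class AddCustomFieldColumnsToIssues < Gitlab::Database::Migration[2.0]
+  def change
+    add_column :issues, :severity, :integer
+    add_column :issues, :customer_reference, :bigint
+  end
+end
+```
+
+Each field can then be validated, indexed, and queried like any other column.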
diff --git a/doc/development/database/sha1_as_binary.md b/doc/development/database/sha1_as_binary.md
new file mode 100644
index 00000000000..dab9b0fe72e
--- /dev/null
+++ b/doc/development/database/sha1_as_binary.md
@@ -0,0 +1,42 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Storing SHA1 Hashes As Binary
+
+Storing SHA1 hashes as strings is not very space efficient. A SHA1 as a string
+requires at least 40 bytes, an additional byte to store the encoding, and
+perhaps more space depending on the internals of PostgreSQL.
+
+On the other hand, if one were to store a SHA1 as binary one would only need 20
+bytes for the actual SHA1, and 1 or 4 bytes of additional space (again depending
+on database internals). This means that in the best case scenario we can reduce
+the space usage by 50%.
+
+To make this easier to work with you can include the concern `ShaAttribute` into
+a model and define a SHA attribute using the `sha_attribute` class method. For
+example:
+
+```ruby
+class Commit < ActiveRecord::Base
+ include ShaAttribute
+
+ sha_attribute :sha
+end
+```
+
+This allows you to use the value of the `sha` attribute as if it were a string,
+while storing it as binary. This means that you can do something like this,
+without having to worry about converting data to the right binary format:
+
+```ruby
+commit = Commit.find_by(sha: '88c60307bd1f215095834f09a1a5cb18701ac8ad')
+commit.sha = '971604de4cfa324d91c41650fabc129420c8d1cc'
+commit.save
+```
+
+There is, however, one requirement: the column used to store the SHA hash
+_must_ be of a binary type. For Rails this means you need to use the `:binary`
+type instead of `:text` or `:string`.
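+
+For example, a hypothetical migration creating such a column for the `Commit`
+model above could look like this sketch (table and column names are
+illustrative):
+
+```ruby
+# Hypothetical sketch: the backing column for `sha_attribute :sha` is binary,
+# not text or string.
+class AddShaToCommits < Gitlab::Database::Migration[2.0]
+  def change
+    add_column :commits, :sha, :binary
+  end
+end
+```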
diff --git a/doc/development/database/single_table_inheritance.md b/doc/development/database/single_table_inheritance.md
new file mode 100644
index 00000000000..c8d082e8a67
--- /dev/null
+++ b/doc/development/database/single_table_inheritance.md
@@ -0,0 +1,63 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Single Table Inheritance
+
+**Summary:** don't use Single Table Inheritance (STI), use separate tables
+instead.
+
+Rails makes it possible to have multiple models stored in the same table and map
+these rows to the correct models using a `type` column. This can be used to, for
+example, store two different types of SSH keys in the same table.
+
+While tempting to use, you should avoid this at all costs, for the same reasons
+as outlined in the document ["Polymorphic Associations"](polymorphic_associations.md).
+
+## Solution
+
+The solution is very simple: just use a separate table for every type you'd
+otherwise store in the same table. For example, instead of having a `keys` table
+with `type` set to either `Key` or `DeployKey` you'd have two separate tables:
+`keys` and `deploy_keys`.
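+
+A minimal sketch of what this can look like on the model side (class and table
+names follow the example above; the exact setup in the application may differ):
+
+```ruby
+# Each type gets its own model and its own table; no `type` column is needed.
+class Key < ApplicationRecord
+  self.table_name = 'keys'
+end
+
+class DeployKey < ApplicationRecord
+  self.table_name = 'deploy_keys'
+end
+```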
+
+## In migrations
+
+Whenever a model is used in a migration, single table inheritance should be disabled.
+Due to the way Rails loads associations (even in migrations), failing to disable STI
+could result in loading unexpected code or associations which may cause unintended
+side effects or failures during upgrades.
+
+```ruby
+class SomeMigration < Gitlab::Database::Migration[2.0]
+ class Services < MigrationRecord
+ self.table_name = 'services'
+ self.inheritance_column = :_type_disabled
+ end
+
+ def up
+ ...
+```
+
+If nothing needs to be added to the model other than disabling STI or `EachBatch`,
+use the helper `define_batchable_model` instead of defining the class.
+This ensures that the migration loads the columns for the migration in isolation,
+and the helper disables STI by default.
+
+```ruby
+class EnqueueSomeBackgroundMigration < Gitlab::Database::Migration[1.0]
+ disable_ddl_transaction!
+
+ def up
+ define_batchable_model('services').select(:id).in_batches do |relation|
+ jobs = relation.pluck(:id).map do |id|
+ ['ExtractServicesUrl', [id]]
+ end
+
+ BackgroundMigrationWorker.bulk_perform_async(jobs)
+ end
+ end
+ ...
+```
diff --git a/doc/development/database/strings_and_the_text_data_type.md b/doc/development/database/strings_and_the_text_data_type.md
index 73e023f8d45..e2e1191018b 100644
--- a/doc/development/database/strings_and_the_text_data_type.md
+++ b/doc/development/database/strings_and_the_text_data_type.md
@@ -148,8 +148,9 @@ to update the `title_html` with a title that has more than 1024 characters, the
a database error.
Adding or removing a constraint to an existing attribute requires that any application changes are
-deployed _first_, [otherwise servers still in the old version of the application may try to update the
-attribute with invalid values](../multi_version_compatibility.md#ci-artifact-uploads-were-failing).
+deployed _first_,
+otherwise servers still in the old version of the application
+[may try to update the attribute with invalid values](../multi_version_compatibility.md#ci-artifact-uploads-were-failing).
For these reasons, `add_text_limit` should run in a post-deployment migration.
Still in our example, for the 13.0 milestone (current), consider that the following validation
@@ -188,7 +189,7 @@ migration or a background data migration:
- If the data volume is less than `1,000` records, then the data migration can be executed within the post-migration.
- If the data volume is higher than `1,000` records, it's advised to create a background migration.
-When unsure about which option to use, please contact the Database team for advice.
+When unsure about which option to use, contact the Database team for advice.
Back to our example, the issues table is considerably large and frequently accessed, so we are going
to add a background migration for the 13.0 milestone (current),
diff --git a/doc/development/database/swapping_tables.md b/doc/development/database/swapping_tables.md
new file mode 100644
index 00000000000..efb481ccf35
--- /dev/null
+++ b/doc/development/database/swapping_tables.md
@@ -0,0 +1,51 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Swapping Tables
+
+Sometimes you need to replace one table with another. For example, when
+migrating data in a very large table it's often better to create a copy of the
+table and insert & migrate the data into this new table in the background.
+
+Let's say you want to swap the table `events` with `events_for_migration`. In
+this case you need to follow 3 steps:
+
+1. Rename `events` to `events_temporary`
+1. Rename `events_for_migration` to `events`
+1. Rename `events_temporary` to `events_for_migration`
+
+Rails allows you to do this using the `rename_table` method:
+
+```ruby
+rename_table :events, :events_temporary
+rename_table :events_for_migration, :events
+rename_table :events_temporary, :events_for_migration
+```
+
+This does not require any downtime as long as the 3 `rename_table` calls are
+executed in the _same_ database transaction. Rails by default uses database
+transactions for migrations, but if it doesn't, you need to start one
+manually:
+
+```ruby
+Event.transaction do
+ rename_table :events, :events_temporary
+ rename_table :events_for_migration, :events
+ rename_table :events_temporary, :events_for_migration
+end
+```
+
+Once swapped, you _have to_ reset the primary key of the new table. For
+PostgreSQL, you can use the `reset_pk_sequence!` method like so:
+
+```ruby
+reset_pk_sequence!('events')
+```
+
+Failure to reset the primary key results in newly created rows starting
+with an ID value of 1. Depending on the existing data, this can then lead to
+duplicate key constraint errors, preventing users from creating new
+data.
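+
+Putting this together, a migration performing the swap could look like the
+following sketch (the class name is illustrative; the helpers shown are the
+ones discussed above):
+
+```ruby
+# Hypothetical sketch: swap the tables and reset the primary key sequence
+# in a single, transactional migration.
+class SwapEventsTables < Gitlab::Database::Migration[2.0]
+  def up
+    rename_table :events, :events_temporary
+    rename_table :events_for_migration, :events
+    rename_table :events_temporary, :events_for_migration
+
+    reset_pk_sequence!('events')
+  end
+
+  def down
+    # The same three renames swap the tables back again.
+    rename_table :events, :events_temporary
+    rename_table :events_for_migration, :events
+    rename_table :events_temporary, :events_for_migration
+
+    reset_pk_sequence!('events')
+  end
+end
+```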
diff --git a/doc/development/database/transaction_guidelines.md b/doc/development/database/transaction_guidelines.md
index 255de19a420..1583bbc02c2 100644
--- a/doc/development/database/transaction_guidelines.md
+++ b/doc/development/database/transaction_guidelines.md
@@ -12,7 +12,7 @@ For further reference, check PostgreSQL documentation about [transactions](https
## Database decomposition and sharding
-The [sharding group](https://about.gitlab.com/handbook/engineering/development/enablement/sharding/) plans
+The [Pods Group](https://about.gitlab.com/handbook/engineering/development/enablement/data_stores/pods/) plans
to split the main GitLab database and move some of the database tables to other database servers.
We start decomposing the `ci_*`-related database tables first. To maintain the current application
diff --git a/doc/development/database/understanding_explain_plans.md b/doc/development/database/understanding_explain_plans.md
new file mode 100644
index 00000000000..446a84d5232
--- /dev/null
+++ b/doc/development/database/understanding_explain_plans.md
@@ -0,0 +1,829 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Understanding EXPLAIN plans
+
+PostgreSQL allows you to obtain query plans using the `EXPLAIN` command. This
+command can be invaluable when trying to determine how a query performs.
+You can use this command directly in your SQL query, as long as the query starts
+with it:
+
+```sql
+EXPLAIN
+SELECT COUNT(*)
+FROM projects
+WHERE visibility_level IN (0, 20);
+```
+
+When running this on GitLab.com, we are presented with the following output:
+
+```sql
+Aggregate (cost=922411.76..922411.77 rows=1 width=8)
+ -> Seq Scan on projects (cost=0.00..908044.47 rows=5746914 width=0)
+ Filter: (visibility_level = ANY ('{0,20}'::integer[]))
+```
+
+When using _just_ `EXPLAIN`, PostgreSQL does not actually execute our query;
+instead, it produces an _estimated_ execution plan based on the available
+statistics. This means the actual plan can differ quite a bit. Fortunately,
+PostgreSQL provides us with the option to execute the query as well. To do so,
+we need to use `EXPLAIN ANALYZE` instead of just `EXPLAIN`:
+
+```sql
+EXPLAIN ANALYZE
+SELECT COUNT(*)
+FROM projects
+WHERE visibility_level IN (0, 20);
+```
+
+This produces:
+
+```sql
+Aggregate (cost=922420.60..922420.61 rows=1 width=8) (actual time=3428.535..3428.535 rows=1 loops=1)
+ -> Seq Scan on projects (cost=0.00..908053.18 rows=5746969 width=0) (actual time=0.041..2987.606 rows=5746940 loops=1)
+ Filter: (visibility_level = ANY ('{0,20}'::integer[]))
+ Rows Removed by Filter: 65677
+Planning time: 2.861 ms
+Execution time: 3428.596 ms
+```
+
+As we can see this plan is quite different, and includes a lot more data. Let's
+discuss this step by step.
+
+Because `EXPLAIN ANALYZE` executes the query, care should be taken when using a
+query that writes data or might time out. If the query modifies data,
+consider wrapping it in a transaction that you roll back afterwards, like so:
+
+```sql
+BEGIN;
+EXPLAIN ANALYZE
+DELETE FROM users WHERE id = 1;
+ROLLBACK;
+```
+
+The `EXPLAIN` command also takes additional options, such as `BUFFERS`:
+
+```sql
+EXPLAIN (ANALYZE, BUFFERS)
+SELECT COUNT(*)
+FROM projects
+WHERE visibility_level IN (0, 20);
+```
+
+This then produces:
+
+```sql
+Aggregate (cost=922420.60..922420.61 rows=1 width=8) (actual time=3428.535..3428.535 rows=1 loops=1)
+ Buffers: shared hit=208846
+ -> Seq Scan on projects (cost=0.00..908053.18 rows=5746969 width=0) (actual time=0.041..2987.606 rows=5746940 loops=1)
+ Filter: (visibility_level = ANY ('{0,20}'::integer[]))
+ Rows Removed by Filter: 65677
+ Buffers: shared hit=208846
+Planning time: 2.861 ms
+Execution time: 3428.596 ms
+```
+
+For more information, refer to the official
+[`EXPLAIN` documentation](https://www.postgresql.org/docs/current/sql-explain.html)
+and [using `EXPLAIN` guide](https://www.postgresql.org/docs/current/using-explain.html).
+
+## Nodes
+
+Every query plan consists of nodes. Nodes can be nested, and are executed from
+the inside out. This means that the innermost node is executed before an outer
+node. This can be best thought of as nested function calls, returning their
+results as they unwind. For example, a plan starting with an `Aggregate`
+followed by a `Nested Loop`, followed by an `Index Only scan` can be thought of
+as the following Ruby code:
+
+```ruby
+aggregate(
+ nested_loop(
+    index_only_scan(),
+    index_only_scan()
+ )
+)
+```
+
+Nodes are indicated using a `->` followed by the type of node taken. For
+example:
+
+```sql
+Aggregate (cost=922411.76..922411.77 rows=1 width=8)
+ -> Seq Scan on projects (cost=0.00..908044.47 rows=5746914 width=0)
+ Filter: (visibility_level = ANY ('{0,20}'::integer[]))
+```
+
+Here the first node executed is `Seq Scan on projects`. The `Filter:` is an
+additional filter applied to the results of the node. A filter is very similar
+to Ruby's `Array#select`: it takes the input rows, applies the filter, and
+produces a new list of rows. After the node is done, we perform the `Aggregate`
+above it.
+
+Nested nodes look like this:
+
+```sql
+Aggregate (cost=176.97..176.98 rows=1 width=8) (actual time=0.252..0.252 rows=1 loops=1)
+ Buffers: shared hit=155
+ -> Nested Loop (cost=0.86..176.75 rows=87 width=0) (actual time=0.035..0.249 rows=36 loops=1)
+ Buffers: shared hit=155
+ -> Index Only Scan using users_pkey on users users_1 (cost=0.43..4.95 rows=87 width=4) (actual time=0.029..0.123 rows=36 loops=1)
+ Index Cond: (id < 100)
+ Heap Fetches: 0
+ -> Index Only Scan using users_pkey on users (cost=0.43..1.96 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=36)
+ Index Cond: (id = users_1.id)
+ Heap Fetches: 0
+Planning time: 2.585 ms
+Execution time: 0.310 ms
+```
+
+Here we first perform two separate "Index Only" scans, followed by performing a
+"Nested Loop" on the result of these two scans.
+
+## Node statistics
+
+Each node in a plan has a set of associated statistics, such as the cost, the
+number of rows produced, the number of loops performed, and more. For example:
+
+```sql
+Seq Scan on projects (cost=0.00..908044.47 rows=5746914 width=0)
+```
+
+Here we can see that our cost ranges from `0.00..908044.47` (we cover this in
+a moment), and we estimate (since we're using `EXPLAIN` and not `EXPLAIN
+ANALYZE`) a total of 5,746,914 rows to be produced by this node. The `width`
+statistic describes the estimated width of each row, in bytes.
+
+The `costs` field specifies how expensive a node was. The cost is measured in
+arbitrary units determined by the query planner's cost parameters. What
+influences the costs depends on a variety of settings, such as `seq_page_cost`,
+`cpu_tuple_cost`, and various others.
+The format of the costs field is as follows:
+
+```sql
+STARTUP COST..TOTAL COST
+```
+
+The startup cost states how expensive it was to start the node, with the total
+cost describing how expensive the entire node was. In general: the greater the
+values, the more expensive the node.
+
+When using `EXPLAIN ANALYZE`, these statistics also include the actual time
+(in milliseconds) spent, and other runtime statistics (for example, the actual number of
+produced rows):
+
+```sql
+Seq Scan on projects (cost=0.00..908053.18 rows=5746969 width=0) (actual time=0.041..2987.606 rows=5746940 loops=1)
+```
+
+Here we can see we estimated 5,746,969 rows to be returned, but in reality we
+returned 5,746,940 rows. We can also see that _just_ this sequential scan took
+2.98 seconds to run.
+
+Using `EXPLAIN (ANALYZE, BUFFERS)` also gives us information about the
+number of rows removed by a filter, the number of buffers used, and more. For
+example:
+
+```sql
+Seq Scan on projects (cost=0.00..908053.18 rows=5746969 width=0) (actual time=0.041..2987.606 rows=5746940 loops=1)
+ Filter: (visibility_level = ANY ('{0,20}'::integer[]))
+ Rows Removed by Filter: 65677
+ Buffers: shared hit=208846
+```
+
+Here we can see that our filter has to remove 65,677 rows, and that we use
+208,846 buffers. Each buffer in PostgreSQL is 8 KB (8192 bytes), meaning our
+above node uses *1.6 GB of buffers*. That's a lot!
+
+Keep in mind that some statistics are per-loop averages, while others are total values:
+
+| Field name | Value type |
+| --- | --- |
+| Actual Total Time | per-loop average |
+| Actual Rows | per-loop average |
+| Buffers Shared Hit | total value |
+| Buffers Shared Read | total value |
+| Buffers Shared Dirtied | total value |
+| Buffers Shared Written | total value |
+| I/O Read Time | total value |
+| I/O Write Time | total value |
+
+For example:
+
+```sql
+ -> Index Scan using users_pkey on public.users (cost=0.43..3.44 rows=1 width=1318) (actual time=0.025..0.025 rows=1 loops=888)
+ Index Cond: (users.id = issues.author_id)
+ Buffers: shared hit=3543 read=9
+ I/O Timings: read=17.760 write=0.000
+```
+
+Here we can see that this node used 3552 buffers (3543 + 9), returned 888 rows (`888 * 1`), and the actual duration was 22.2 milliseconds (`888 * 0.025`).
+17.76 milliseconds of the total duration was spent in reading from disk, to retrieve data that was not in the cache.
+
+## Node types
+
+There are quite a few different types of nodes, so we only cover some of the
+more common ones here.
+
+A full list of all the available nodes and their descriptions can be found in
+the [PostgreSQL source file `plannodes.h`](https://gitlab.com/postgres/postgres/blob/master/src/include/nodes/plannodes.h).
+pgMustard's [EXPLAIN docs](https://www.pgmustard.com/docs/explain) also offer detailed look into nodes and their fields.
+
+### Seq Scan
+
+A sequential scan over (a chunk of) a database table. This is like using
+`Array#each`, but on a database table. Sequential scans can be quite slow when
+retrieving lots of rows, so it's best to avoid these for large tables.
+
+### Index Only Scan
+
+A scan on an index that did not require fetching anything from the table. In
+certain cases an index only scan may still fetch data from the table; in that
+case the node includes a `Heap Fetches:` statistic.
+
+### Index Scan
+
+A scan on an index that required retrieving some data from the table.
+
+### Bitmap Index Scan and Bitmap Heap scan
+
+Bitmap scans fall between sequential scans and index scans. These are typically
+used when a plain index scan would read too much data to be efficient, but the
+query matches too few rows to justify a full sequential scan. A bitmap scan
+uses what is known as a
+[bitmap index](https://en.wikipedia.org/wiki/Bitmap_index) to perform its work.
+
+The [source code of PostgreSQL](https://gitlab.com/postgres/postgres/blob/REL_11_STABLE/src/include/nodes/plannodes.h#L441)
+states the following on bitmap scans:
+
+> Bitmap Index Scan delivers a bitmap of potential tuple locations; it does not
+> access the heap itself. The bitmap is used by an ancestor Bitmap Heap Scan
+> node, possibly after passing through intermediate Bitmap And and/or Bitmap Or
+> nodes to combine it with the results of other Bitmap Index Scans.
+
+### Limit
+
+Applies a `LIMIT` on the input rows.
+
+### Sort
+
+Sorts the input rows as specified using an `ORDER BY` statement.
+
+### Nested Loop
+
+A nested loop executes its child nodes for every row produced by a node that
+precedes it. For example:
+
+```sql
+-> Nested Loop (cost=0.86..176.75 rows=87 width=0) (actual time=0.035..0.249 rows=36 loops=1)
+ Buffers: shared hit=155
+ -> Index Only Scan using users_pkey on users users_1 (cost=0.43..4.95 rows=87 width=4) (actual time=0.029..0.123 rows=36 loops=1)
+ Index Cond: (id < 100)
+ Heap Fetches: 0
+ -> Index Only Scan using users_pkey on users (cost=0.43..1.96 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=36)
+ Index Cond: (id = users_1.id)
+ Heap Fetches: 0
+```
+
+Here the first child node (`Index Only Scan using users_pkey on users users_1`)
+produces 36 rows, and is executed once (`rows=36 loops=1`). The next node
+produces 1 row (`rows=1`), but is repeated 36 times (`loops=36`). This is
+because the previous node produced 36 rows.
+
+This means that nested loops can quickly slow the query down if the various
+child nodes keep producing many rows.
+
+## Optimising queries
+
+With that out of the way, let's see how we can optimise a query. Let's use the
+following query as an example:
+
+```sql
+SELECT COUNT(*)
+FROM users
+WHERE twitter != '';
+```
+
+This query counts the number of users that have a Twitter profile set.
+Let's run this using `EXPLAIN (ANALYZE, BUFFERS)`:
+
+```sql
+EXPLAIN (ANALYZE, BUFFERS)
+SELECT COUNT(*)
+FROM users
+WHERE twitter != '';
+```
+
+This produces the following plan:
+
+```sql
+Aggregate (cost=845110.21..845110.22 rows=1 width=8) (actual time=1271.157..1271.158 rows=1 loops=1)
+ Buffers: shared hit=202662
+ -> Seq Scan on users (cost=0.00..844969.99 rows=56087 width=0) (actual time=0.019..1265.883 rows=51833 loops=1)
+ Filter: ((twitter)::text <> ''::text)
+ Rows Removed by Filter: 2487813
+ Buffers: shared hit=202662
+Planning time: 0.390 ms
+Execution time: 1271.180 ms
+```
+
+From this query plan we can see the following:
+
+1. We need to perform a sequential scan on the `users` table.
+1. This sequential scan filters out 2,487,813 rows using a `Filter`.
+1. We use 202,662 buffers, which equals 1.58 GB of memory.
+1. It takes us 1.2 seconds to do all of this.
+
+Considering we are just counting users, that's quite expensive!
+
+Before we start making any changes, let's see if there are any existing indexes
+on the `users` table that we might be able to use. We can obtain this
+information by running `\d users` in a `psql` console, then scrolling down to
+the `Indexes:` section:
+
+```sql
+Indexes:
+ "users_pkey" PRIMARY KEY, btree (id)
+ "index_users_on_confirmation_token" UNIQUE, btree (confirmation_token)
+ "index_users_on_email" UNIQUE, btree (email)
+ "index_users_on_reset_password_token" UNIQUE, btree (reset_password_token)
+ "index_users_on_static_object_token" UNIQUE, btree (static_object_token)
+ "index_users_on_unlock_token" UNIQUE, btree (unlock_token)
+ "index_on_users_name_lower" btree (lower(name::text))
+ "index_users_on_accepted_term_id" btree (accepted_term_id)
+ "index_users_on_admin" btree (admin)
+ "index_users_on_created_at" btree (created_at)
+ "index_users_on_email_trigram" gin (email gin_trgm_ops)
+ "index_users_on_feed_token" btree (feed_token)
+ "index_users_on_group_view" btree (group_view)
+ "index_users_on_incoming_email_token" btree (incoming_email_token)
+ "index_users_on_managing_group_id" btree (managing_group_id)
+ "index_users_on_name" btree (name)
+ "index_users_on_name_trigram" gin (name gin_trgm_ops)
+ "index_users_on_public_email" btree (public_email) WHERE public_email::text <> ''::text
+ "index_users_on_state" btree (state)
+ "index_users_on_state_and_user_type" btree (state, user_type)
+ "index_users_on_unconfirmed_email" btree (unconfirmed_email) WHERE unconfirmed_email IS NOT NULL
+ "index_users_on_user_type" btree (user_type)
+ "index_users_on_username" btree (username)
+ "index_users_on_username_trigram" gin (username gin_trgm_ops)
+ "tmp_idx_on_user_id_where_bio_is_filled" btree (id) WHERE COALESCE(bio, ''::character varying)::text IS DISTINCT FROM ''::text
+```
+
+Here we can see there is no index on the `twitter` column, which means
+PostgreSQL has to perform a sequential scan in this case. Let's try to fix this
+by adding the following index:
+
+```sql
+CREATE INDEX CONCURRENTLY twitter_test ON users (twitter);
+```
+
+If we now re-run our query using `EXPLAIN (ANALYZE, BUFFERS)` we get the
+following plan:
+
+```sql
+Aggregate (cost=61002.82..61002.83 rows=1 width=8) (actual time=297.311..297.312 rows=1 loops=1)
+ Buffers: shared hit=51854 dirtied=19
+ -> Index Only Scan using twitter_test on users (cost=0.43..60873.13 rows=51877 width=0) (actual time=279.184..293.532 rows=51833 loops=1)
+ Filter: ((twitter)::text <> ''::text)
+ Rows Removed by Filter: 2487830
+ Heap Fetches: 26037
+ Buffers: shared hit=51854 dirtied=19
+Planning time: 0.191 ms
+Execution time: 297.334 ms
+```
+
+Now it takes just under 300 milliseconds to get our data, instead of 1.2
+seconds. However, we still use 51,854 buffers, which is about 400 MB of memory.
+300 milliseconds is also quite slow for such a simple query. To understand why
+this query is still expensive, let's take a look at the following:
+
+```sql
+Index Only Scan using twitter_test on users (cost=0.43..60873.13 rows=51877 width=0) (actual time=279.184..293.532 rows=51833 loops=1)
+ Filter: ((twitter)::text <> ''::text)
+ Rows Removed by Filter: 2487830
+```
+
+We start with an index only scan on our index, but we somehow still apply a
+`Filter` that filters out 2,487,830 rows. Why is that? Well, let's look at how
+we created the index:
+
+```sql
+CREATE INDEX CONCURRENTLY twitter_test ON users (twitter);
+```
+
+We told PostgreSQL to index all possible values of the `twitter` column,
+even empty strings. Our query in turn uses `WHERE twitter != ''`. This means
+that the index does improve things, as we don't need to do a sequential scan,
+but we may still encounter empty strings. This means PostgreSQL _has_ to apply a
+Filter on the index results to get rid of those values.
+
+Fortunately, we can improve this even further using "partial indexes". Partial
+indexes are indexes with a `WHERE` condition that is applied when indexing data.
+For example:
+
+```sql
+CREATE INDEX CONCURRENTLY some_index ON users (email) WHERE id < 100
+```
+
+This index would only index the `email` value of rows that match `WHERE id <
+100`. We can use partial indexes to change our Twitter index to the following:
+
+```sql
+CREATE INDEX CONCURRENTLY twitter_test ON users (twitter) WHERE twitter != '';
+```
+
+Once the index is created, if we run our query again we get the following plan:
+
+```sql
+Aggregate (cost=1608.26..1608.27 rows=1 width=8) (actual time=19.821..19.821 rows=1 loops=1)
+ Buffers: shared hit=44036
+ -> Index Only Scan using twitter_test on users (cost=0.41..1479.71 rows=51420 width=0) (actual time=0.023..15.514 rows=51833 loops=1)
+ Heap Fetches: 1208
+ Buffers: shared hit=44036
+Planning time: 0.123 ms
+Execution time: 19.848 ms
+```
+
+That's _a lot_ better! Now it only takes 20 milliseconds to get the data, and we
+only use about 344 MB of buffers (instead of the original 1.58 GB). The reason
+this works is that now PostgreSQL no longer needs to apply a `Filter`, as the
+index only contains `twitter` values that are not empty.
+
+Keep in mind that you shouldn't just add partial indexes every time you want to
+optimise a query. Every index has to be updated for every write, and they may
+require quite a bit of space, depending on the amount of indexed data. As a
+result, first check if there are any existing indexes you may be able to reuse.
+If there aren't any, check if you can perhaps slightly change an existing one to
+fit both the existing and new queries. Only add a new index if none of the
+existing indexes can be used in any way.
+
+When comparing execution plans, don't take timing as the only important metric.
+Good timing is the main goal of any optimization, but it can be too volatile to
+be used for comparison (for example, it depends a lot on the state of cache).
+When optimizing a query, we usually need to reduce the amount of data we're
+dealing with. Indexes are the way to work with fewer pages (buffers) to get the
+result, so, during optimization, look at the number of buffers used (read and hit),
+and work on reducing these numbers. Reduced timing is the consequence of reduced
+buffer numbers. [Database Lab Engine](#database-lab-engine) guarantees that the plan is structurally
+identical to production (and overall number of buffers is the same as on production),
+but differences in cache state and I/O speed may lead to different timings.
+
+## Queries that can't be optimised
+
+Now that we have seen how to optimise a query, let's look at another query that
+we might not be able to optimise:
+
+```sql
+EXPLAIN (ANALYZE, BUFFERS)
+SELECT COUNT(*)
+FROM projects
+WHERE visibility_level IN (0, 20);
+```
+
+The output of `EXPLAIN (ANALYZE, BUFFERS)` is as follows:
+
+```sql
+Aggregate (cost=922420.60..922420.61 rows=1 width=8) (actual time=3428.535..3428.535 rows=1 loops=1)
+ Buffers: shared hit=208846
+ -> Seq Scan on projects (cost=0.00..908053.18 rows=5746969 width=0) (actual time=0.041..2987.606 rows=5746940 loops=1)
+ Filter: (visibility_level = ANY ('{0,20}'::integer[]))
+ Rows Removed by Filter: 65677
+ Buffers: shared hit=208846
+Planning time: 2.861 ms
+Execution time: 3428.596 ms
+```
+
+Looking at the output we see the following Filter:
+
+```sql
+Filter: (visibility_level = ANY ('{0,20}'::integer[]))
+Rows Removed by Filter: 65677
+```
+
+Looking at the number of rows removed by the filter, we may be tempted to add an
+index on `projects.visibility_level` to somehow turn this Sequential scan +
+filter into an index-only scan.
+
+Unfortunately, doing so is unlikely to improve anything. Contrary to what some
+might believe, an index being present _does not guarantee_ that PostgreSQL
+actually uses it. For example, when doing a `SELECT * FROM projects` it is much
+cheaper to just scan the entire table, instead of using an index and then
+fetching data from the table. In such cases PostgreSQL may decide to not use an
+index.
+
+Second, let's think for a moment what our query does: it gets all projects with
+visibility level 0 or 20. In the above plan we can see this produces quite a lot
+of rows (5,746,940), but how much is that relative to the total? Let's find out
+by running the following query:
+
+```sql
+SELECT visibility_level, count(*) AS amount
+FROM projects
+GROUP BY visibility_level
+ORDER BY visibility_level ASC;
+```
+
+For GitLab.com this produces:
+
+```sql
+ visibility_level | amount
+------------------+---------
+ 0 | 5071325
+ 10 | 65678
+ 20 | 674801
+```
+
+Here the total number of projects is 5,811,804, and 5,746,126 of those are of
+level 0 or 20. That's 98% of the entire table!
+
+So no matter what we do, this query retrieves 98% of the entire table. Since
+most time is spent doing exactly that, there isn't really much we can do to
+improve this query, other than _not_ running it at all.
+
+What is important here is that while some may recommend simply adding an
+index the moment you see a sequential scan, it is _much more important_ to first
+understand what your query does, how much data it retrieves, and so on. After
+all, you cannot optimise something you do not understand.
+
+### Cardinality and selectivity
+
+Earlier we saw that our query had to retrieve 98% of the rows in the table.
+There are two terms commonly used for databases: cardinality, and selectivity.
+Cardinality refers to the number of unique values in a particular column in a
+table.
+
+Selectivity is the number of unique values produced by an operation (for example, an
+index scan or filter), relative to the total number of rows. The higher the
+selectivity, the more likely PostgreSQL is able to use an index.
+
+In the above example, there are only 3 unique values: 0, 10, and 20. This means
+the cardinality is 3. The selectivity in turn is also very low: roughly 0.00003%
+(2 / 5,811,804), because our `Filter` only filters using two values (`0` and `20`).
+With such a low selectivity value it's not surprising that PostgreSQL decides
+using an index is not worth it, because it would produce almost no unique rows.
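+
+If you want to verify these numbers yourself, the following sketch computes
+them from a Rails console. Run it against a copy of the data (for example, a
+Database Lab clone), as both queries scan the whole table:
+
+```ruby
+# Cardinality: the number of unique values in the column.
+cardinality = Project.distinct.count(:visibility_level)
+# => 3
+
+# Selectivity as used above: the number of unique values matched by the
+# filter, relative to the total number of rows.
+matched_values = Project.where(visibility_level: [0, 20]).distinct.count(:visibility_level)
+selectivity = matched_values.to_f / Project.count
+# => roughly 0.0000003 (that is, about 0.00003%)
+```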
+
+## Rewriting queries
+
+So the above query can't really be optimised as-is, or at least not much. But
+what if we slightly change the purpose of it? What if instead of retrieving all
+projects with `visibility_level` 0 or 20, we retrieve those that a user
+interacted with somehow?
+
+Fortunately, GitLab has an answer for this, and it's a table called
+`user_interacted_projects`. This table has the following schema:
+
+```sql
+Table "public.user_interacted_projects"
+ Column | Type | Modifiers
+------------+---------+-----------
+ user_id | integer | not null
+ project_id | integer | not null
+Indexes:
+ "index_user_interacted_projects_on_project_id_and_user_id" UNIQUE, btree (project_id, user_id)
+ "index_user_interacted_projects_on_user_id" btree (user_id)
+Foreign-key constraints:
+ "fk_rails_0894651f08" FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
+ "fk_rails_722ceba4f7" FOREIGN KEY (project_id) REFERENCES projects(id) ON DELETE CASCADE
+```
+
+Let's rewrite our query to `JOIN` this table onto our projects, and get the
+projects for a specific user:
+
+```sql
+EXPLAIN ANALYZE
+SELECT COUNT(*)
+FROM projects
+INNER JOIN user_interacted_projects ON user_interacted_projects.project_id = projects.id
+WHERE projects.visibility_level IN (0, 20)
+AND user_interacted_projects.user_id = 1;
+```
+
+What we do here is the following:
+
+1. Get our projects.
+1. `INNER JOIN` `user_interacted_projects`, meaning we're only left with rows in
+ `projects` that have a corresponding row in `user_interacted_projects`.
+1. Limit this to the projects with `visibility_level` of 0 or 20, and to
+ projects that the user with ID 1 interacted with.
+
+If we run this query we get the following plan:
+
+```sql
+ Aggregate (cost=871.03..871.04 rows=1 width=8) (actual time=9.763..9.763 rows=1 loops=1)
+ -> Nested Loop (cost=0.86..870.52 rows=203 width=0) (actual time=1.072..9.748 rows=143 loops=1)
+ -> Index Scan using index_user_interacted_projects_on_user_id on user_interacted_projects (cost=0.43..160.71 rows=205 width=4) (actual time=0.939..2.508 rows=145 loops=1)
+ Index Cond: (user_id = 1)
+ -> Index Scan using projects_pkey on projects (cost=0.43..3.45 rows=1 width=4) (actual time=0.049..0.050 rows=1 loops=145)
+ Index Cond: (id = user_interacted_projects.project_id)
+ Filter: (visibility_level = ANY ('{0,20}'::integer[]))
+ Rows Removed by Filter: 0
+ Planning time: 2.614 ms
+ Execution time: 9.809 ms
+```
+
+Here it only took us just under 10 milliseconds to get the data. We can also see
+we're retrieving far fewer projects:
+
+```sql
+Index Scan using projects_pkey on projects (cost=0.43..3.45 rows=1 width=4) (actual time=0.049..0.050 rows=1 loops=145)
+ Index Cond: (id = user_interacted_projects.project_id)
+ Filter: (visibility_level = ANY ('{0,20}'::integer[]))
+ Rows Removed by Filter: 0
+```
+
+Here we see we perform 145 loops (`loops=145`), with every loop producing 1 row
+(`rows=1`). This is much less than before, and our query performs much better!
+
+If we look at the plan we also see our costs are very low:
+
+```sql
+Index Scan using projects_pkey on projects (cost=0.43..3.45 rows=1 width=4) (actual time=0.049..0.050 rows=1 loops=145)
+```
+
+Here our cost is only 3.45, and it takes us 7.25 milliseconds to do so (0.05 * 145).
+The next index scan is a bit more expensive:
+
+```sql
+Index Scan using index_user_interacted_projects_on_user_id on user_interacted_projects (cost=0.43..160.71 rows=205 width=4) (actual time=0.939..2.508 rows=145 loops=1)
+```
+
+Here the cost is 160.71 (`cost=0.43..160.71`), taking about 2.5 milliseconds
+(based on the output of `actual time=....`).
+
+The most expensive part here is the "Nested Loop" that acts upon the result of
+these two index scans:
+
+```sql
+Nested Loop (cost=0.86..870.52 rows=203 width=0) (actual time=1.072..9.748 rows=143 loops=1)
+```
+
+Here the loop has an estimated cost of 870.52 for an estimated 203 rows, and it
+takes 9.748 milliseconds to produce 143 rows in a single loop.
+
+The key takeaway here is that sometimes you have to rewrite (parts of) a query
+to make it better. Sometimes that means having to slightly change your feature
+to accommodate better performance.
+
+## What makes a bad plan
+
+This is a bit of a difficult question to answer, because the definition of "bad"
+is relative to the problem you are trying to solve. However, some patterns are
+best avoided in most cases, such as:
+
+- Sequential scans on large tables
+- Filters that remove a lot of rows
+- Performing a certain step that requires _a lot_ of
+ buffers (for example, an index scan for GitLab.com that requires more than 512 MB).
+
+As a general guideline, aim for a query that:
+
+1. Takes no more than 10 milliseconds. Our target time spent in SQL per request
+ is around 100 milliseconds, so every query should be as fast as possible.
+1. Does not use an excessive number of buffers, relative to the workload. For
+ example, retrieving ten rows shouldn't require 1 GB of buffers.
+1. Does not spend a lot of time performing disk IO operations. The setting
+   `track_io_timing` must be enabled for this data to be included in the output
+   of `EXPLAIN ANALYZE` (see the example after this list).
+1. Applies a `LIMIT` when retrieving rows without aggregating them, such as
+ `SELECT * FROM users`.
+1. Doesn't use a `Filter` to filter out too many rows, especially if the query
+ does not use a `LIMIT` to limit the number of returned rows. Filters can
+ usually be removed by adding a (partial) index.
+
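+For example, you can check whether `track_io_timing` is enabled from a Rails
+console as in the sketch below. Enabling the setting usually requires
+superuser privileges, so on shared environments you may only be able to read
+it:
+
+```ruby
+# Minimal sketch: inspect the PostgreSQL setting through ActiveRecord.
+ActiveRecord::Base.connection.select_value('SHOW track_io_timing')
+# => "on" or "off"
+```
+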
+These are _guidelines_ and not hard requirements, as different needs may require
+different queries. The only _rule_ is that you _must always measure_ your query
+(preferably using a production-like database) using `EXPLAIN (ANALYZE, BUFFERS)`
+and related tools such as:
+
+- [`explain.depesz.com`](https://explain.depesz.com/).
+- [`explain.dalibo.com`](https://explain.dalibo.com/).
+
+## Producing query plans
+
+There are a few ways to get the output of a query plan. Of course you
+can directly run the `EXPLAIN` query in the `psql` console, or you can
+follow one of the other options below.
+
+### Database Lab Engine
+
+GitLab team members can use [Database Lab Engine](https://gitlab.com/postgres-ai/database-lab), and the companion
+SQL optimization tool - [Joe Bot](https://gitlab.com/postgres-ai/joe).
+
+Database Lab Engine provides developers with their own clone of the production database, while Joe Bot helps with exploring execution plans.
+
+Joe Bot is available in the [`#database-lab`](https://gitlab.slack.com/archives/CLJMDRD8C) channel on Slack,
+and through its [web interface](https://console.postgres.ai/gitlab/joe-instances).
+
+With Joe Bot you can execute DDL statements (like creating indexes, tables, and columns) and get query plans for `SELECT`, `UPDATE`, and `DELETE` statements.
+
+For example, to test a new index on a column that does not exist on production yet, you can do the following:
+
+Create the column:
+
+```sql
+exec ALTER TABLE projects ADD COLUMN last_at timestamp without time zone
+```
+
+Create the index:
+
+```sql
+exec CREATE INDEX index_projects_on_last_at ON projects (last_at) WHERE last_at IS NOT NULL
+```
+
+Analyze the table to update its statistics:
+
+```sql
+exec ANALYZE projects
+```
+
+Get the query plan:
+
+```sql
+explain SELECT * FROM projects WHERE last_at < CURRENT_DATE
+```
+
+Once done you can rollback your changes:
+
+```sql
+reset
+```
+
+For more information about the available options, run:
+
+```sql
+help
+```
+
+The web interface comes with the following execution plan visualizers included:
+
+- [Depesz](https://explain.depesz.com/)
+- [PEV2](https://github.com/dalibo/pev2)
+- [FlameGraph](https://github.com/mgartner/pg_flame)
+
+#### Tips & Tricks
+
+The database connection is now maintained during your whole session, so you can use `exec set ...` for any session variables (such as `enable_seqscan` or `work_mem`). These settings are applied to all subsequent commands until you reset them. For example, you can disable parallel queries with:
+
+```sql
+exec SET max_parallel_workers_per_gather = 0
+```
+
+### Rails console
+
+Using the [`activerecord-explain-analyze`](https://github.com/6/activerecord-explain-analyze) gem,
+you can generate the query plan directly from the Rails console:
+
+```ruby
+pry(main)> require 'activerecord-explain-analyze'
+=> true
+pry(main)> Project.where('build_timeout > ?', 3600).explain(analyze: true)
+ Project Load (1.9ms) SELECT "projects".* FROM "projects" WHERE (build_timeout > 3600)
+ ↳ (pry):12
+=> EXPLAIN for: SELECT "projects".* FROM "projects" WHERE (build_timeout > 3600)
+Seq Scan on public.projects (cost=0.00..2.17 rows=1 width=742) (actual time=0.040..0.041 rows=0 loops=1)
+ Output: id, name, path, description, created_at, updated_at, creator_id, namespace_id, ...
+ Filter: (projects.build_timeout > 3600)
+ Rows Removed by Filter: 14
+ Buffers: shared hit=2
+Planning time: 0.411 ms
+Execution time: 0.113 ms
+```
+
+### ChatOps
+
+GitLab team members can also use our ChatOps solution, available in Slack
+using the [`/chatops` slash command](../chatops_on_gitlabcom.md).
+
+NOTE:
+While ChatOps is still available, the recommended way to generate execution plans is to use [Database Lab Engine](#database-lab-engine).
+
+You can use ChatOps to get a query plan by running the following:
+
+```sql
+/chatops run explain SELECT COUNT(*) FROM projects WHERE visibility_level IN (0, 20)
+```
+
+Visualising the plan using <https://explain.depesz.com/> is also supported:
+
+```sql
+/chatops run explain --visual SELECT COUNT(*) FROM projects WHERE visibility_level IN (0, 20)
+```
+
+Quoting the query is not necessary.
+
+For more information about the available options, run:
+
+```sql
+/chatops run explain --help
+```
+
+## Further reading
+
+A more extensive guide on understanding query plans can be found in
+the [presentation](https://public.dalibo.com/exports/conferences/_archives/_2012/201211_explain/understanding_explain.pdf)
+from [Dalibo.org](https://www.dalibo.com/en/).
+
+Depesz's blog also has a good [section](https://www.depesz.com/tag/unexplainable/) dedicated to query plans.
diff --git a/doc/development/database/verifying_database_capabilities.md b/doc/development/database/verifying_database_capabilities.md
new file mode 100644
index 00000000000..55347edf4ec
--- /dev/null
+++ b/doc/development/database/verifying_database_capabilities.md
@@ -0,0 +1,38 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Verifying Database Capabilities
+
+Sometimes certain bits of code may only work on a certain database
+version. While we try to avoid such code as much as possible, it is sometimes
+necessary to add database (version) specific behavior.
+
+To facilitate this we have the following methods that you can use:
+
+- `ApplicationRecord.database.version`: returns the PostgreSQL version number as a string
+ in the format `X.Y.Z`.
+
+This allows you to write code such as:
+
+```ruby
+if ApplicationRecord.database.version.to_f >= 11.7
+ run_really_fast_query
+else
+ run_fast_query
+end
+```
+
+## Read-only database
+
+The database can be used in read-only mode. In this case we have to
+make sure all GET requests don't attempt any write operations to the
+database. If one of those requests does need to write to the database, it
+needs to be wrapped in a `Gitlab::Database.read_only?` or
+`Gitlab::Database.read_write?` guard, to make sure the write is skipped when
+the database is read-only.
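+
+A hypothetical sketch of such a guard (the method and column are made up for
+illustration):
+
+```ruby
+# Skip the write entirely when the database is in read-only mode.
+def track_visit(project)
+  return if Gitlab::Database.read_only?
+
+  project.increment!(:visit_count)
+end
+```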
+
+We have a Rails middleware that filters any potentially writing
+operations (the `CUD` operations of CRUD) and prevents the user from trying
+to update the database and getting a 500 error (see `Gitlab::Middleware::ReadOnly`).