gitlab.com/gitlab-org/gitlab-foss.git
author     GitLab Bot <gitlab-bot@gitlab.com>  2021-10-20 11:43:02 +0300
committer  GitLab Bot <gitlab-bot@gitlab.com>  2021-10-20 11:43:02 +0300
commit     d9ab72d6080f594d0b3cae15f14b3ef2c6c638cb (patch)
tree       2341ef426af70ad1e289c38036737e04b0aa5007 /doc/development/database
parent     d6e514dd13db8947884cd58fe2a9c2a063400a9b (diff)

Add latest changes from gitlab-org/gitlab@14-4-stable-ee (tag: v14.4.0-rc42)
Diffstat (limited to 'doc/development/database')
-rw-r--r--  doc/development/database/database_migration_pipeline.md    |   2
-rw-r--r--  doc/development/database/database_reviewer_guidelines.md   |   1
-rw-r--r--  doc/development/database/efficient_in_operator_queries.md  |  61
-rw-r--r--  doc/development/database/keyset_pagination.md              |   2
-rw-r--r--  doc/development/database/multiple_databases.md             | 118
-rw-r--r--  doc/development/database/transaction_guidelines.md         |  58
6 files changed, 196 insertions, 46 deletions
diff --git a/doc/development/database/database_migration_pipeline.md b/doc/development/database/database_migration_pipeline.md
index 5a8ce89a362..ce7e1801abc 100644
--- a/doc/development/database/database_migration_pipeline.md
+++ b/doc/development/database/database_migration_pipeline.md
@@ -50,6 +50,6 @@ Some additional information is included at the bottom of the comment:
| Result | Description |
|----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Migrations pending on GitLab.com | A summary of migrations not deployed yet to GitLab.com. This info is useful when testing a migration that was merged but not deployed yet. |
+| Migrations pending on GitLab.com | A summary of migrations not deployed yet to GitLab.com. This information is useful when testing a migration that was merged but not deployed yet. |
| Clone details | A link to the `Postgres.ai` thin clone created for this testing pipeline, along with information about its expiry. This can be used to further explore the results of running the migration. Only accessible by database maintainers or with an access request. |
| Artifacts | A link to the pipeline's artifacts. Full query logs for each migration (ending in `.log`) are available there and only accessible by database maintainers or with an access request. |
diff --git a/doc/development/database/database_reviewer_guidelines.md b/doc/development/database/database_reviewer_guidelines.md
index 59653c6dde3..bc18e606f21 100644
--- a/doc/development/database/database_reviewer_guidelines.md
+++ b/doc/development/database/database_reviewer_guidelines.md
@@ -71,6 +71,7 @@ topics and use cases. The most frequently required during database reviewing are
- [Migrations style guide](../migration_style_guide.md) for creating safe SQL migrations.
- [Avoiding downtime in migrations](../avoiding_downtime_in_migrations.md).
- [SQL guidelines](../sql.md) for working with SQL queries.
+- [Guidelines for JiHu contributions with database migrations](https://about.gitlab.com/handbook/ceo/chief-of-staff-team/jihu-support/jihu-database-change-process.html)
## How to apply to become a database maintainer
diff --git a/doc/development/database/efficient_in_operator_queries.md b/doc/development/database/efficient_in_operator_queries.md
index bc72bce30bf..1e706890f64 100644
--- a/doc/development/database/efficient_in_operator_queries.md
+++ b/doc/development/database/efficient_in_operator_queries.md
@@ -66,9 +66,9 @@ The execution of the query can be largely broken down into three steps:
1. The database sorts the `issues` rows in memory by `created_at` and returns `LIMIT 20` rows to
the end-user. For large groups, this final step requires both large memory and CPU resources.
-<details>
-<summary>Expand this sentence to see the execution plan for this DB query.</summary>
-<pre><code>
+Execution plan for this DB query:
+
+```sql
Limit (cost=90170.07..90170.12 rows=20 width=1329) (actual time=967.597..967.607 rows=20 loops=1)
Buffers: shared hit=239127 read=3060
I/O Timings: read=336.879
@@ -106,8 +106,7 @@ The execution of the query can be largely broken down into three steps:
Planning Time: 7.750 ms
Execution Time: 967.973 ms
(36 rows)
-</code></pre>
-</details>
+```
The performance of the query depends on the number of rows in the database.
On average, we can say the following:
@@ -226,7 +225,12 @@ Gitlab::Pagination::Keyset::InOperatorOptimization::QueryBuilder.new(
- `finder_query` loads the actual record row from the database. It must also be a lambda, where
  the order by column expressions are available for locating the record. In this example, the
yielded values are `created_at` and `id` SQL expressions. Finding a record is very fast via the
- primary key, so we don't use the `created_at` value.
+ primary key, so we don't use the `created_at` value. Providing the `finder_query` lambda is optional.
+ If it's not given, the IN operator optimization will only make the ORDER BY columns available to
+ the end-user and not the full database row.
+
The following database index on the `issues` table must be present
to make the query execute efficiently:
@@ -235,9 +239,9 @@ to make the query execute efficiently:
"idx_issues_on_project_id_and_created_at_and_id" btree (project_id, created_at, id)
```
-<details>
-<summary>Expand this sentence to see the SQL query.</summary>
-<pre><code>
+The SQL query:
+
+```sql
SELECT "issues".*
FROM
(WITH RECURSIVE "array_cte" AS MATERIALIZED
@@ -348,8 +352,7 @@ SELECT (records).*
FROM "recursive_keyset_cte" AS "issues"
WHERE (COUNT <> 0)) issues -- filtering out the initializer row
LIMIT 20
-</code></pre>
-</details>
+```
### Using the `IN` query optimization
@@ -461,9 +464,9 @@ Gitlab::Pagination::Keyset::InOperatorOptimization::QueryBuilder.new(
).execute.limit(20)
```
-<details>
-<summary>Expand this sentence to see the SQL query.</summary>
-<pre><code>
+The SQL query:
+
+```sql
SELECT "issues".*
FROM
(WITH RECURSIVE "array_cte" AS MATERIALIZED
@@ -581,9 +584,7 @@ FROM
FROM "recursive_keyset_cte" AS "issues"
WHERE (COUNT <> 0)) issues
LIMIT 20
-</code>
-</pre>
-</details>
+```
NOTE:
To make the query efficient, the following columns need to be covered with an index: `project_id`, `issue_type`, `created_at`, and `id`.
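
As a sketch, such an index could be added with a GitLab-style migration. The class name and
index name below are illustrative; `add_concurrent_index` and `remove_concurrent_index_by_name`
are assumed here as the usual helpers for creating indexes without downtime:

```ruby
class AddIssuesInOperatorOptimizationIndex < Gitlab::Database::Migration[1.0]
  disable_ddl_transaction!

  INDEX_NAME = 'index_issues_on_project_id_issue_type_created_at_and_id'

  def up
    add_concurrent_index :issues, %i[project_id issue_type created_at id], name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :issues, INDEX_NAME
  end
end
```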
@@ -611,6 +612,32 @@ Gitlab::Pagination::Keyset::Iterator.new(scope: scope, **opts).each_batch(of: 10
end
```
+NOTE:
+The query loads complete database rows from the disk. This may cause increased I/O and slower
+database queries. Depending on the use case, you often need only the primary key of each
+row in the batch, for example to invoke additional statements such as `UPDATE` or `DELETE`. The
+`id` column is included in the `ORDER BY` columns (`created_at` and `id`) and is therefore already
+loaded. In this case, you can omit the `finder_query` parameter.
+
+Example of loading the `ORDER BY` columns only:
+
+```ruby
+scope = Issue.order(:created_at, :id)
+array_scope = Group.find(9970).all_projects.select(:id)
+array_mapping_scope = -> (id_expression) { Issue.where(Issue.arel_table[:project_id].eq(id_expression)) }
+
+opts = {
+ in_operator_optimization_options: {
+ array_scope: array_scope,
+ array_mapping_scope: array_mapping_scope
+ }
+}
+
+Gitlab::Pagination::Keyset::Iterator.new(scope: scope, **opts).each_batch(of: 100) do |records|
+ puts records.select(:id).map { |r| [r.id] } # only id and created_at are available
+end
+```
+
#### Keyset pagination
The optimization works out of the box with GraphQL and the `keyset_paginate` helper method.
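
A minimal sketch of the cursor-based `keyset_paginate` interface. The `per_page:` and `cursor:`
parameters and the `records` and `cursor_for_next_page` accessors are taken from the keyset
pagination documentation; how the IN operator optimization options are passed in is not shown here:

```ruby
paginator = scope.keyset_paginate(per_page: 20)

first_page = paginator.records             # the first 20 issues
cursor     = paginator.cursor_for_next_page # opaque cursor, for example returned to a GraphQL client

scope.keyset_paginate(cursor: cursor, per_page: 20).records # the next 20 issues
```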
diff --git a/doc/development/database/keyset_pagination.md b/doc/development/database/keyset_pagination.md
index fd62c36b753..4f0b353a37f 100644
--- a/doc/development/database/keyset_pagination.md
+++ b/doc/development/database/keyset_pagination.md
@@ -169,7 +169,7 @@ Consider the following scope:
scope = Issue.where(project_id: 10).order(Gitlab::Database.nulls_last_order('relative_position', 'DESC'))
# SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = 10 ORDER BY relative_position DESC NULLS LAST
-scope.keyset_paginate # raises: Gitlab::Pagination::Keyset::Paginator::UnsupportedScopeOrder: The order on the scope does not support keyset pagination
+scope.keyset_paginate # raises: Gitlab::Pagination::Keyset::UnsupportedScopeOrder: The order on the scope does not support keyset pagination
```
The `keyset_paginate` method raises an error because the order value on the query is a custom SQL string and not an [`Arel`](https://www.rubydoc.info/gems/arel) AST node. The keyset library cannot automatically infer configuration values from these kinds of queries.
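
For comparison, a sketch of a scope whose order is built from plain column references (with `id`
as the unique tie-breaker). Such an order is expected to be inferred automatically, so
`keyset_paginate` works without extra configuration:

```ruby
scope = Issue.where(project_id: 10).order(created_at: :desc, id: :desc)
# SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = 10
#   ORDER BY "issues"."created_at" DESC, "issues"."id" DESC

scope.keyset_paginate(per_page: 20).records # no UnsupportedScopeOrder error is raised
```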
diff --git a/doc/development/database/multiple_databases.md b/doc/development/database/multiple_databases.md
index 0fd9f821fab..0ba752ba3a6 100644
--- a/doc/development/database/multiple_databases.md
+++ b/doc/development/database/multiple_databases.md
@@ -6,16 +6,14 @@ info: To determine the technical writer assigned to the Stage/Group associated w
# Multiple Databases
-In order to scale GitLab, the GitLab application database
-will be [decomposed into multiple
-databases](https://gitlab.com/groups/gitlab-org/-/epics/6168).
+To scale GitLab, we are
+[decomposing the GitLab application database into multiple databases](https://gitlab.com/groups/gitlab-org/-/epics/6168).
-## CI Database
+## CI/CD Database
-Support for configuring the GitLab Rails application to use a distinct
-database for CI tables was added in [GitLab
-14.1](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/64289). This
-feature is still under development, and is not ready for production use.
+> Support for configuring the GitLab Rails application to use a distinct
+database for CI/CD tables was [introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/64289)
+in GitLab 14.1. This feature is still under development, and is not ready for production use.
By default, GitLab is configured to use only one main database. To
opt-in to use a main database, and CI database, modify the
@@ -92,8 +90,8 @@ test: &test
### Migrations
-Any migrations that affect `Ci::CiDatabaseRecord` models
-and their tables must be placed in two directories for now:
+Place any migrations that affect `Ci::CiDatabaseRecord` models
+and their tables in two directories:
- `db/migrate`
- `db/ci_migrate`
@@ -394,7 +392,8 @@ You can see a real example of using this method for fixing a cross-join in
#### Allowlist for existing cross-joins
A cross-join across databases can be explicitly allowed by wrapping the code in the
-`::Gitlab::Database.allow_cross_joins_across_databases` helper method.
+`::Gitlab::Database.allow_cross_joins_across_databases` helper method. Alternatively,
+you can mark a given relation with `relation.allow_cross_joins_across_databases`.
This method should only be used:
@@ -405,16 +404,113 @@ This method should only be used:
The `allow_cross_joins_across_databases` helper method can be used as follows:
```ruby
+# Wrap the code that performs the cross-database join in a block
::Gitlab::Database.allow_cross_joins_across_databases(url: 'https://gitlab.com/gitlab-org/gitlab/-/issues/336590') do
subject.perform(1, 4)
end
```
+```ruby
+# Mark a relation as allowed to cross-join databases
+def find_actual_head_pipeline
+ all_pipelines
+ .allow_cross_joins_across_databases(url: 'https://gitlab.com/gitlab-org/gitlab/-/issues/336891')
+ .for_sha_or_source_sha(diff_head_sha)
+ .first
+end
+```
+
The `url` parameter should point to an issue with a milestone for when we intend
to fix the cross-join. If the cross-join is being used in a migration, we do not
need to fix the code. See <https://gitlab.com/gitlab-org/gitlab/-/issues/340017>
for more details.
+### Removing cross-database transactions
+
+When dealing with multiple databases, it's important to pay close attention to data modification
+that affects more than one database.
+An automated check that prevents cross-database modifications was
+[introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/339811) in GitLab 14.4.
+
+When at least two different databases are modified during a transaction initiated on any database
+server, the application triggers a cross-database modification error (only in the test environment).
+
+Example:
+
+```ruby
+# Open transaction on Main DB
+ApplicationRecord.transaction do
+ ci_build.update!(updated_at: Time.current) # UPDATE on CI DB
+ ci_build.project.update!(updated_at: Time.current) # UPDATE on Main DB
+end
+# raises error: Cross-database data modification of 'main, ci' were detected within
+# a transaction modifying the 'ci_build, projects' tables
+```
+
+The code example above updates the timestamp for two records within a transaction. With the
+ongoing work on the CI database decomposition, we cannot rely on the usual semantics of a single
+database transaction.
+If the second update query fails, the first update query will not be
+rolled back because the `ci_build` record is located on a different database server. For
+more information, look at the
+[transaction guidelines](transaction_guidelines.md#dangerous-example-third-party-api-calls)
+page.
+
+#### Fixing cross-database errors
+
+##### Removing the transaction block
+
+Without an open transaction, the cross-database modification check cannot raise an error.
+By making this change, we sacrifice consistency. In case of an application failure after the
+first `UPDATE` query, the second `UPDATE` query will never execute.
+
+The same code without the `transaction` block:
+
+```ruby
+ci_build.update!(updated_at: Time.current) # CI DB
+ci_build.project.update!(updated_at: Time.current) # Main DB
+```
+
+##### Async processing
+
+If we need a stronger guarantee that an operation finishes its work consistently, we can execute it
+within a background job. A background job is scheduled asynchronously and retried several times
+in case of an error. There is still a very small chance of introducing inconsistency.
+
+Example:
+
+```ruby
+current_time = Time.current
+
+MyAsyncConsistencyJob.perform_async(ci_build.id)
+
+ci_build.update!(updated_at: current_time)
+ci_build.project.update!(updated_at: current_time)
+```
+
+The `MyAsyncConsistencyJob` would also attempt to update the timestamps if they differ, as sketched below.
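+
+A minimal sketch of what such a hypothetical job might look like. The class, the
+`ApplicationWorker` mixin, and the reconciliation logic are illustrative:
+
+```ruby
+class MyAsyncConsistencyJob
+  include ApplicationWorker
+
+  def perform(ci_build_id)
+    ci_build = Ci::Build.find_by(id: ci_build_id)
+    return unless ci_build
+
+    project = ci_build.project
+    return if project.updated_at == ci_build.updated_at
+
+    # Each statement runs on its own database; no cross-database transaction is opened.
+    project.update!(updated_at: ci_build.updated_at)
+  end
+end
+```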
+
+##### Aiming for perfect consistency
+
+At this point, we don't have the tooling (we might not even need it) to ensure similar consistency
+characteristics as we had with one database. If you think that the code you're working on requires
+these properties, then you can disable the cross-database modification check by wrapping the
+offending database queries in a block and creating a follow-up issue mentioning the sharding group
+(`gitlab-org/sharding-group`).
+
+```ruby
+Gitlab::Database.allow_cross_joins_across_databases(url: 'gitlab issue URL') do
+ ApplicationRecord.transaction do
+ ci_build.update!(updated_at: Time.current) # UPDATE on CI DB
+ ci_build.project.update!(updated_at: Time.current) # UPDATE on Main DB
+ end
+end
+```
+
+Don't hesitate to reach out to the
+[sharding group](https://about.gitlab.com/handbook/engineering/development/enablement/sharding/)
+for advice.
+
## `config/database.yml`
GitLab will support running multiple databases in the future, for example to [separate tables for the continuous integration features](https://gitlab.com/groups/gitlab-org/-/epics/6167) from the main database. In order to prepare for this change, we [validate the structure of the configuration](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/67877) in `database.yml` to ensure that only known databases are used.
diff --git a/doc/development/database/transaction_guidelines.md b/doc/development/database/transaction_guidelines.md
index 4c586135015..2806bd217db 100644
--- a/doc/development/database/transaction_guidelines.md
+++ b/doc/development/database/transaction_guidelines.md
@@ -8,17 +8,21 @@ info: To determine the technical writer assigned to the Stage/Group associated w
This document gives a few examples of the usage of database transactions in application code.
-For further reference please check PostgreSQL documentation about [transactions](https://www.postgresql.org/docs/current/tutorial-transactions.html).
+For further reference, check PostgreSQL documentation about [transactions](https://www.postgresql.org/docs/current/tutorial-transactions.html).
## Database decomposition and sharding
-The [sharding group](https://about.gitlab.com/handbook/engineering/development/enablement/sharding/) plans to split the main GitLab database and move some of the database tables to other database servers.
+The [sharding group](https://about.gitlab.com/handbook/engineering/development/enablement/sharding/) plans
+to split the main GitLab database and move some of the database tables to other database servers.
-The group will start decomposing the `ci_*` related database tables first. To maintain the current application development experience, tooling and static analyzers will be added to the codebase to ensure correct data access and data modification methods. By using the correct form for defining database transactions, we can save significant refactoring work in the future.
+We'll start decomposing the `ci_*`-related database tables first. To maintain the current application
+development experience, we'll add tooling and static analyzers to the codebase to ensure correct
+data access and data modification methods. By using the correct form for defining database transactions,
+we can save significant refactoring work in the future.
## The transaction block
-The `ActiveRecord` library provides a convenient way to group database statements into a transaction.
+The `ActiveRecord` library provides a convenient way to group database statements into a transaction:
```ruby
issue = Issue.find(10)
@@ -30,16 +34,19 @@ ApplicationRecord.transaction do
end
```
-This transaction involves two database tables, in case of an error, each `UPDATE` statement will be rolled back to the previous, consistent state.
+This transaction involves two database tables. In case of an error, each `UPDATE`
+statement rolls back to the previous consistent state.
NOTE:
Avoid referencing the `ActiveRecord::Base` class and use `ApplicationRecord` instead.
## Transaction and database locks
-When a transaction block is opened, the database will try to acquire the necessary locks on the resources. The type of locks will depend on the actual database statements.
+When a transaction block is opened, the database tries to acquire the necessary
+locks on the resources. The type of locks depends on the actual database statements.
-Consider a concurrent update scenario where the following code is executed at the same time from two different processes:
+Consider a concurrent update scenario where the following code is executed at the
+same time from two different processes:
```ruby
issue = Issue.find(10)
@@ -51,15 +58,22 @@ ApplicationRecord.transaction do
end
```
-The database will try to acquire the `FOR UPDATE` lock for the referenced `issue` and `project` records. In our case, we have two competing transactions for these locks, one of them will successfully acquire them. The other transaction will have to wait in the lock queue until the first transaction finishes. The execution of the second transaction is blocked at this point.
+The database tries to acquire the `FOR UPDATE` lock for the referenced `issue` and
+`project` records. In our case, we have two competing transactions for these locks,
+and only one of them will successfully acquire them. The other transaction will have
+to wait in the lock queue until the first transaction finishes. The execution of the
+second transaction is blocked at this point.
## Transaction speed
-To prevent lock contention and maintain stable application performance, the transaction block should finish as fast as possible. When a transaction acquires locks, it will hold on to them until the transaction finishes.
+To prevent lock contention and maintain stable application performance, the transaction
+block should finish as fast as possible. When a transaction acquires locks, it holds
+on to them until the transaction finishes.
-Apart from application performance, long-running transactions can also affect the application upgrade processes by blocking database migrations.
+Apart from application performance, long-running transactions can also affect application
+upgrade processes by blocking database migrations.
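+
+As a sketch, expensive work can be done before the transaction is opened so that locks are held
+only for the quick `UPDATE` statements. The helper and columns below are illustrative:
+
+```ruby
+# Hypothetical slow, CPU-intensive computation happens outside the transaction...
+summary = build_expensive_summary(issue)
+
+# ...so the transaction only covers two fast writes and releases its locks quickly.
+ApplicationRecord.transaction do
+  issue.update!(description: summary)
+  issue.project.update!(updated_at: Time.current)
+end
+```
+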
-### Dangerous example: 3rd party API calls
+### Dangerous example: third-party API calls
Consider the following example:
@@ -73,20 +87,29 @@ Member.transaction do
end
```
-Here, we ensure that the `notification_email_sent` column is updated only when the `send_notification_email` method succeeds. The `send_notification_email` method executes a network request to an email sending service. If the underlying infrastructure does not specify timeouts or the network call takes too long time, the database transaction will stay open.
+Here, we ensure that the `notification_email_sent` column is updated only when the
+`send_notification_email` method succeeds. The `send_notification_email` method
+executes a network request to an email sending service. If the underlying infrastructure
+does not specify timeouts or the network call takes too long, the database transaction
+stays open. A refactored version of this example is sketched after the list below.
Ideally, a transaction should only contain database statements.
Avoid doing the following in a `transaction` block:
-- External network requests such as: triggering Sidekiq jobs, sending emails, HTTP API calls and running database statements using a different connection.
+- External network requests such as:
+ - Triggering Sidekiq jobs.
+ - Sending emails.
+ - HTTP API calls.
+ - Running database statements using a different connection.
- File system operations.
- Long, CPU intensive computation.
- Calling `sleep(n)`.
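+
+One possible refactor of the notification example above keeps the network call outside the
+transaction entirely; this is a sketch, with `member` loaded as in the example:
+
+```ruby
+# The slow network call runs first, while no transaction (and therefore no lock) is open...
+member.send_notification_email
+
+# ...and the remaining fast UPDATE still only runs after a successful send.
+member.update!(notification_email_sent: true)
+```
+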
## Explicit model referencing
-If a transaction modifies records from the same database table, it's advised to use the `Model.transaction` block:
+If a transaction modifies records from the same database table, we advise using the
+`Model.transaction` block:
```ruby
build_1 = Ci::Build.find(1)
@@ -98,7 +121,8 @@ Ci::Build.transaction do
end
```
-The transaction above will use the same database connection for the transaction as the models in the `transaction` block. In a multi-database environment the following example would be dangerous:
+The transaction above uses the same database connection for the transaction as the models
+in the `transaction` block. In a multi-database environment the following example is dangerous:
```ruby
# `ci_builds` table is located on another database
@@ -114,4 +138,6 @@ ActiveRecord::Base.transaction do
end
```
-The `ActiveRecord::Base` class uses a different database connection than the `Ci::Build` records. The two statements in the transaction block will not be part of the transaction and will not be rolled back in case something goes wrong. They act as 3rd part calls.
+The `ActiveRecord::Base` class uses a different database connection than the `Ci::Build` records.
+The two statements in the transaction block will not be part of the transaction and will not be
+rolled back in case something goes wrong. They act as third-party calls.
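+
+A minimal sketch of the safer form, following the explicit model referencing advice above
+(the updated attributes are illustrative):
+
+```ruby
+# Opening the transaction through `Ci::Build` uses the connection that hosts the
+# `ci_builds` table, so both statements are covered by the same transaction.
+Ci::Build.transaction do
+  build_1.update!(name: 'A')
+  build_2.update!(name: 'B')
+end
+```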