diff options
author | GitLab Bot <gitlab-bot@gitlab.com> | 2022-08-18 11:17:02 +0300 |
---|---|---|
committer | GitLab Bot <gitlab-bot@gitlab.com> | 2022-08-18 11:17:02 +0300 |
commit | b39512ed755239198a9c294b6a45e65c05900235 (patch) | |
tree | d234a3efade1de67c46b9e5a38ce813627726aa7 /doc/development/iterating_tables_in_batches.md | |
parent | d31474cf3b17ece37939d20082b07f6657cc79a9 (diff) |
Add latest changes from gitlab-org/gitlab@15-3-stable-ee (tag: v15.3.0-rc42)
Diffstat (limited to 'doc/development/iterating_tables_in_batches.md')
-rw-r--r-- | doc/development/iterating_tables_in_batches.md | 601 |
1 files changed, 7 insertions, 594 deletions
diff --git a/doc/development/iterating_tables_in_batches.md b/doc/development/iterating_tables_in_batches.md
index 1159e3755e5..589e38a5cb0 100644
--- a/doc/development/iterating_tables_in_batches.md
+++ b/doc/development/iterating_tables_in_batches.md
@@ -1,598 +1,11 @@

The page now contains only a redirect stub:

---
redirect_to: 'database/iterating_tables_in_batches.md'
remove_date: '2022-11-06'
---

This document was moved to [another location](database/iterating_tables_in_batches.md).

The removed content follows.

# Iterating tables in batches

Rails provides a method called `in_batches` that can be used to iterate over rows in batches. For example:

```ruby
User.in_batches(of: 10) do |relation|
  relation.update_all(updated_at: Time.now)
end
```

Unfortunately, this method is implemented in a way that is not very efficient, from both a query and a memory usage perspective.

To work around this you can include the `EachBatch` module into your models, then use the `each_batch` class method. For example:

```ruby
class User < ActiveRecord::Base
  include EachBatch
end

User.each_batch(of: 10) do |relation|
  relation.update_all(updated_at: Time.now)
end
```

This produces queries such as:

```plaintext
User Load (0.7ms)  SELECT  "users"."id" FROM "users" WHERE ("users"."id" >= 41654)  ORDER BY "users"."id" ASC LIMIT 1 OFFSET 1000
  (0.7ms)  SELECT COUNT(*) FROM "users" WHERE ("users"."id" >= 41654) AND ("users"."id" < 42687)
```

The API of this method is similar to `in_batches`, though it doesn't support all of the arguments that `in_batches` supports. You should always use `each_batch` _unless_ you have a specific need for `in_batches`.

## Iterating over non-unique columns

One should proceed with extra caution.
When you iterate over an attribute that is not unique, even with the applied max batch size, there is no guarantee that the resulting batches do not surpass it. The following snippet demonstrates this situation: when one attempts to select `Ci::Build` entries for users with `id` between `1` and `10,000`, the database returns `1,215,178` matching rows.

```ruby
[ gstg ] production> Ci::Build.where(user_id: (1..10_000)).size
=> 1215178
```

This happens because the built relation is translated into the following query:

```ruby
[ gstg ] production> puts Ci::Build.where(user_id: (1..10_000)).to_sql
SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."user_id" BETWEEN 1 AND 10000
=> nil
```

Queries that filter a non-unique column by range (`WHERE "ci_builds"."user_id" BETWEEN ? AND ?`) cap the number of distinct attribute values, not the number of returned rows: even though the range size is limited to a certain threshold (`10,000` in the previous example), this threshold does not translate to the size of the returned dataset. That happens because when taking `n` possible values of an attribute, one can't tell for sure that the number of records containing them is less than `n`.

### Loose-index scan with `distinct_each_batch`

When iterating over a non-unique column is necessary, use the `distinct_each_batch` helper method. The helper uses the [loose-index scan technique](https://wiki.postgresql.org/wiki/Loose_indexscan) (skip-index scan) to skip duplicated values within a database index.

Example: iterating over distinct `author_id` in the Issue model

```ruby
Issue.distinct_each_batch(column: :author_id, of: 1000) do |relation|
  users = User.where(id: relation.select(:author_id)).to_a
end
```

The technique provides stable performance between the batches regardless of the data distribution. The `relation` object returns an ActiveRecord scope where only the given `column` is available. Other columns are not loaded.
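The skip-scan access pattern itself is independent of PostgreSQL. As a rough illustration (plain Ruby, not GitLab code; `distinct_values` is a made-up helper), this is the pattern the technique follows over a sorted index: seek the first value strictly greater than the last one found, instead of visiting every entry:

```ruby
# Simulates a loose-index (skip) scan over a sorted, non-unique column.
# Rather than reading every index entry, each step seeks the first entry
# strictly greater than the last value found, skipping all duplicates.
def distinct_values(sorted_index)
  results = []
  cursor = sorted_index.first
  until cursor.nil?
    results << cursor
    # bsearch (find-minimum mode) jumps straight past the duplicates.
    cursor = sorted_index.bsearch { |value| value > cursor }
  end
  results
end

author_ids = [1, 1, 1, 2, 2, 5, 5, 5, 5, 9]
puts distinct_values(author_ids).inspect # => [1, 2, 5, 9]
```

Each seek is an `O(log n)` jump, which is why performance stays stable regardless of how many duplicates each value has.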
The underlying database queries use recursive CTEs, which adds extra overhead. We therefore advise using smaller batch sizes than those used for a standard `each_batch` iteration.

## Column definition

`EachBatch` uses the primary key of the model by default for the iteration. This works in most cases; however, sometimes you might want to use a different column for the iteration.

```ruby
Project.distinct.each_batch(column: :creator_id, of: 10) do |relation|
  puts User.where(id: relation.select(:creator_id)).map(&:id)
end
```

The query above iterates over the project creators and prints them out without duplications.

NOTE:
In case the column is not unique (no unique index definition), calling the `distinct` method on the relation is necessary. Using a non-unique column without `distinct` may result in `each_batch` falling into an endless loop, as described in the following [issue](https://gitlab.com/gitlab-org/gitlab/-/issues/285097).

## `EachBatch` in data migrations

When dealing with data migrations, the preferred way to iterate over a large volume of data is using `EachBatch`.

A special case of data migration is a [background migration](database/background_migrations.md#scheduling) where the actual data modification is executed in a background job. The migration code that determines the data ranges (slices) and schedules the background jobs uses `each_batch`.

## Efficient usage of `each_batch`

`EachBatch` helps to iterate over large tables. It's important to highlight that `EachBatch` does not magically solve all iteration-related performance problems, and it might not help at all in some scenarios. From the database point of view, correctly configured database indexes are also necessary to make `EachBatch` perform well.

### Example 1: Simple iteration

Let's consider that we want to iterate over the `users` table and print the `User` records to the standard output.
The `users` table contains millions of records, thus running one query to fetch the users likely times out.

![Users table overview](img/each_batch_users_table_v13_7.png)

This is a simplified version of the `users` table which contains several rows. We have a few smaller gaps in the `id` column to make the example a bit more realistic (a few records were already deleted). Currently, we have one index on the `id` field.

Loading all users into memory (avoid):

```ruby
users = User.all

users.each { |user| puts user.inspect }
```

Use `each_batch`:

```ruby
# Note: for this example I picked 5 as the batch size, the default is 1_000
User.each_batch(of: 5) do |relation|
  relation.each { |user| puts user.inspect }
end
```

#### How `each_batch` works

As the first step, it finds the lowest `id` (start `id`) in the table by executing the following database query:

```sql
SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT 1
```

![Reading the start ID value](img/each_batch_users_table_iteration_1_v13_7.png)

Notice that the query only reads data from the index (`INDEX ONLY SCAN`); the table is not accessed. Database indexes are sorted, so taking out the first item is a very cheap operation.

The next step is to find the next `id` (end `id`) which should respect the batch size configuration. In this example we used a batch size of 5. `EachBatch` uses the `OFFSET` clause to get a "shifted" `id` value.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."id" >= 1 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 5
```

![Reading the end ID value](img/each_batch_users_table_iteration_2_v13_7.png)

Again, the query only looks into the index. The `OFFSET 5` takes out the sixth `id` value: this query reads a maximum of six items from the index regardless of the table size or the iteration count.

At this point, we know the `id` range for the first batch. Now it's time to construct the query for the `relation` block.

```sql
SELECT "users".* FROM "users" WHERE "users"."id" >= 1 AND "users"."id" < 302
```

![Reading the rows from the `users` table](img/each_batch_users_table_iteration_3_v13_7.png)

Notice the `<` sign. Previously six items were read from the index, and in this query the last value is "excluded". The query looks at the index to get the location of the five `user` rows on the disk and reads the rows from the table. The returned array is processed in Ruby.

The first iteration is done. For the next iteration, the last `id` value is reused from the previous iteration in order to find out the next end `id` value.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."id" >= 302 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 5
```

![Reading the second end ID value](img/each_batch_users_table_iteration_4_v13_7.png)

Now we can easily construct the `users` query for the second iteration.

```sql
SELECT "users".* FROM "users" WHERE "users"."id" >= 302 AND "users"."id" < 353
```

![Reading the rows for the second iteration from the users table](img/each_batch_users_table_iteration_5_v13_7.png)

### Example 2: Iteration with filters

Building on top of the previous example, we want to print users with zero sign-in count. We keep track of the number of sign-ins in the `sign_in_count` column, so we write the following code:

```ruby
users = User.where(sign_in_count: 0)

users.each_batch(of: 5) do |relation|
  relation.each { |user| puts user.inspect }
end
```

`each_batch` produces the following SQL query for the start `id` value:

```sql
SELECT "users"."id" FROM "users" WHERE "users"."sign_in_count" = 0 ORDER BY "users"."id" ASC LIMIT 1
```

Selecting only the `id` column and ordering by `id` forces the database to use the index on the `id` (primary key index) column; however, we also have an extra condition on the `sign_in_count` column.
The column is not part of the index, so the database needs to look into the actual table to find the first matching row.

![Reading the index with extra filter](img/each_batch_users_table_filter_v13_7.png)

NOTE:
The number of scanned rows depends on the data distribution in the table.

- Best case scenario: the first user has never logged in. The database reads only one row.
- Worst case scenario: all users have logged in at least once. The database reads all rows.

In this particular example, the database had to read 10 rows (regardless of our batch size setting) to determine the first `id` value. In a "real-world" application it's hard to predict whether the filtering causes problems or not. In the case of GitLab, verifying the data on a production replica is a good start, but keep in mind that data distribution on GitLab.com can be different from self-managed instances.

#### Improve filtering with `each_batch`

##### Specialized conditional index

```sql
CREATE INDEX index_on_users_never_logged_in ON users (id) WHERE sign_in_count = 0
```

This is how our table and the newly created index look:

![Reading the specialized index](img/each_batch_users_table_filtered_index_v13_7.png)

This index definition covers the conditions on the `id` and `sign_in_count` columns, thus making the `each_batch` queries very effective (similar to the simple iteration example).

It's rare for a user to have never signed in, so we anticipate a small index size. Including only the `id` in the index definition also helps to keep the index size small.

##### Index on columns

Later on, we might want to iterate over the table filtering for different `sign_in_count` values; in those cases we cannot use the previously suggested conditional index because the `WHERE` condition does not match our new filter (`sign_in_count > 10`).

To address this problem, we have two options:

- Create another, conditional index to cover the new query.
- Replace the index with a more generalized configuration.

NOTE:
Having multiple indexes on the same table and on the same columns could be a performance bottleneck when writing data.

Let's consider the following index (avoid):

```sql
CREATE INDEX index_on_users_never_logged_in ON users (id, sign_in_count)
```

The index definition starts with the `id` column, which makes the index very inefficient from a data selectivity point of view.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."sign_in_count" = 0 ORDER BY "users"."id" ASC LIMIT 1
```

Executing the query above results in an `INDEX ONLY SCAN`. However, the query still needs to iterate over an unknown number of entries in the index, and then find the first item where the `sign_in_count` is `0`.

![Reading an ineffective index](img/each_batch_users_table_bad_index_v13_7.png)

We can improve the query significantly by swapping the columns in the index definition (prefer).

```sql
CREATE INDEX index_on_users_never_logged_in ON users (sign_in_count, id)
```

![Reading a good index](img/each_batch_users_table_good_index_v13_7.png)

The following index definition does not work well with `each_batch` (avoid).

```sql
CREATE INDEX index_on_users_never_logged_in ON users (sign_in_count)
```

Since `each_batch` builds range queries based on the `id` column, this index cannot be used efficiently. The DB reads the rows from the table or uses a bitmap search where the primary key index is also read.

##### "Slow" iteration

Slow iteration means that we use a good index configuration to iterate over the table and apply filtering on the yielded relation.

```ruby
User.each_batch(of: 5) do |relation|
  relation.where(sign_in_count: 0).each { |user| puts user.inspect }
end
```

The iteration uses the primary key index (on the `id` column), which makes it safe from statement timeouts.
The filter (`sign_in_count: 0`) is applied on the `relation` where the `id` is already constrained (range). The number of rows is limited.

Slow iteration generally takes more time to finish. The iteration count is higher and one iteration could yield fewer records than the batch size. Iterations may even yield 0 records. This is not an optimal solution; however, in some cases (especially when dealing with large tables) this is the only viable option.

### Using Subqueries

Using subqueries in your `each_batch` query does not work well in most cases. Consider the following example:

```ruby
projects = Project.where(creator_id: Issue.where(confidential: true).select(:author_id))

projects.each_batch do |relation|
  # do something
end
```

The iteration uses the `id` column of the `projects` table. The batching does not affect the subquery. This means for each iteration, the subquery is executed by the database. This adds a constant "load" on the query which often ends up in statement timeouts. We have an unknown number of [confidential issues](../user/project/issues/confidential_issues.md); the execution time and the accessed database rows depend on the data distribution in the `issues` table.

NOTE:
Using subqueries works only when the subquery returns a small number of rows.

#### Improving Subqueries

When dealing with subqueries, a slow iteration approach could work: the filter on `creator_id` can be part of the generated `relation` object.

```ruby
projects = Project.all

projects.each_batch do |relation|
  relation.where(creator_id: Issue.where(confidential: true).select(:author_id))
end
```

If the query on the `issues` table itself is not performant enough, a nested loop could be constructed. Try to avoid it when possible.

```ruby
projects = Project.all

projects.each_batch do |relation|
  issues = Issue.where(confidential: true)

  issues.each_batch do |issues_relation|
    relation.where(creator_id: issues_relation.select(:author_id))
  end
end
```

If we know that the `issues` table has many more rows than `projects`, it would make sense to flip the queries, where the `issues` table is batched first.

### Using `JOIN` and `EXISTS`

When to use `JOIN`:

- When there's a 1:1 or 1:N relationship between the tables where we know that the joined record (almost) always exists. This works well for "extension-like" tables:
  - `projects` - `project_settings`
  - `users` - `user_details`
  - `users` - `user_statuses`
- `LEFT JOIN` works well in this case. Conditions on the joined table need to go to the yielded relation so the iteration is not affected by the data distribution in the joined table.

Example:

```ruby
users = User.joins("LEFT JOIN personal_access_tokens on personal_access_tokens.user_id = users.id")

users.each_batch do |relation|
  relation.where("personal_access_tokens.name = 'name'")
end
```

`EXISTS` queries should be added only to the inner `relation` of the `each_batch` query:

```ruby
User.each_batch do |relation|
  relation.where("EXISTS (SELECT 1 FROM ...)")
end
```

### Complex queries on the relation object

When the `relation` object has several extra conditions, the execution plans might become "unstable".

Example:

```ruby
Issue.each_batch do |relation|
  relation
    .joins(:metrics)
    .joins(:merge_requests_closing_issues)
    .where("id IN (SELECT ...)")
    .where(confidential: true)
end
```

Here, we expect that the `relation` query reads the `BATCH_SIZE` of user records and then filters down the results according to the provided queries. The planner might decide that using a bitmap index lookup with the index on the `confidential` column is a better way to execute the query.
This can cause an unexpectedly high number of rows to be read and the query could time out.

Problem: we know for sure that the relation returns a maximum of `BATCH_SIZE` records; however, the planner does not know this.

Common table expression (CTE) trick to force the range query to execute first:

```ruby
Issue.each_batch(of: 1000) do |relation|
  cte = Gitlab::SQL::CTE.new(:batched_relation, relation.limit(1000))

  scope = cte
    .apply_to(Issue.all)
    .joins(:metrics)
    .joins(:merge_requests_closing_issues)
    .where("id IN (SELECT ...)")
    .where(confidential: true)

  puts scope.to_a
end
```

### `EachBatch` vs `BatchCount`

When adding new counters for Service Ping, the preferred way to count records is using the `Gitlab::Database::BatchCount` class. The iteration logic implemented in `BatchCount` has performance characteristics similar to `EachBatch`. Most of the tips and suggestions for improving `EachBatch` mentioned above apply to `BatchCount` as well.

## Iterate with keyset pagination

There are a few special cases where iterating with `EachBatch` does not work. `EachBatch` requires one distinct column (usually the primary key), which makes the iteration impossible for timestamp columns and tables with composite primary keys.

Where `EachBatch` does not work, you can use [keyset pagination](database/pagination_guidelines.md#keyset-pagination) to iterate over the table or a range of rows. The scaling and performance characteristics are very similar to `EachBatch`.

Examples:

- Iterate over the table in a specific order (timestamp columns) in combination with a tie-breaker if the column used to sort by does not contain unique values.
- Iterate over the table with composite primary keys.

### Iterate over the issues in a project by creation date

You can use keyset pagination to iterate over any database column in a specific order (for example, `created_at DESC`).
To ensure a consistent order of the returned records with the same values for `created_at`, use a tie-breaker column with unique values (for example, `id`).

Assume you have the following index in the `issues` table:

```sql
"idx_issues_on_project_id_and_created_at_and_id" btree (project_id, created_at, id)
```

### Fetching records for further processing

The following snippet iterates over issue records within the project using the specified order (`created_at, id`).

```ruby
scope = Issue.where(project_id: 278964).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map(&:id)
end
```

You can add extra filters to the query. This example only lists the issue IDs created in the last 30 days:

```ruby
scope = Issue.where(project_id: 278964).where('created_at > ?', 30.days.ago).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map(&:id)
end
```

### Updating records in the batch

For complex `ActiveRecord` queries, the `.update_all` method does not work well, because it generates an incorrect `UPDATE` statement. You can use raw SQL for updating records in batches:

```ruby
scope = Issue.where(project_id: 278964).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  ApplicationRecord.connection.execute("UPDATE issues SET updated_at=NOW() WHERE issues.id in (#{records.dup.reselect(:id).to_sql})")
end
```

NOTE:
To keep the iteration stable and predictable, avoid updating the columns in the `ORDER BY` clause.
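To see why updating an `ORDER BY` column is dangerous, here is a small in-memory model (plain Ruby, not GitLab code; `keyset_each` is a made-up helper) of keyset iteration ordered by `[updated_at, id]`. When the block rewrites the very column the iteration orders by, touched rows re-sort past the cursor and are yielded again:

```ruby
# Minimal in-memory model of keyset iteration ordered by [updated_at, id].
# Demonstrates that rewriting the ORDER BY column mid-iteration revisits rows:
# touched rows sort past the cursor and are fetched again in later batches.
def keyset_each(rows, batch_size:, max_batches: 10)
  cursor = nil
  seen = []
  max_batches.times do
    batch = rows
      .sort_by { |row| [row[:updated_at], row[:id]] }
      .select { |row| cursor.nil? || ([row[:updated_at], row[:id]] <=> cursor) > 0 }
      .first(batch_size)
    break if batch.empty?

    cursor = [batch.last[:updated_at], batch.last[:id]] # cursor from fetched values
    batch.each { |row| seen << row[:id]; yield row }
  end
  seen
end

rows = (1..4).map { |i| { id: i, updated_at: i } }

# Read-only iteration: every row is yielded exactly once.
stable = keyset_each(rows.map(&:dup), batch_size: 2) { |row| }

# "Touching" rows bumps updated_at past the cursor, so rows are revisited.
clock = 100
unstable = keyset_each(rows.map(&:dup), batch_size: 2) { |row| row[:updated_at] = (clock += 1) }

puts stable.inspect # => [1, 2, 3, 4]
puts unstable.size  # more than 4: rows were processed repeatedly
```

Updating a column that is not part of the `ORDER BY` clause (as in the `updated_at` example above, which orders by `created_at, id`) leaves the cursor positions untouched, so every row is visited exactly once.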

### Iterate over the `merge_request_diff_commits` table

The `merge_request_diff_commits` table uses a composite primary key (`merge_request_diff_id, relative_order`), which makes `EachBatch` impossible to use efficiently.

To paginate over the `merge_request_diff_commits` table, you can use the following snippet:

```ruby
# Custom order object configuration:
order = Gitlab::Pagination::Keyset::Order.build([
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'merge_request_diff_id',
    order_expression: MergeRequestDiffCommit.arel_table[:merge_request_diff_id].asc,
    nullable: :not_nullable,
    distinct: false,
  ),
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'relative_order',
    order_expression: MergeRequestDiffCommit.arel_table[:relative_order].asc,
    nullable: :not_nullable,
    distinct: false,
  )
])
MergeRequestDiffCommit.include(FromUnion) # keyset pagination generates UNION queries

scope = MergeRequestDiffCommit.order(order)

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map { |record| [record.merge_request_diff_id, record.relative_order] }.inspect
end
```

### Order object configuration

Keyset pagination works well with simple `ActiveRecord` `order` scopes ([first example](iterating_tables_in_batches.md#iterate-over-the-issues-in-a-project-by-creation-date)). However, in special cases, you need to describe the columns in the `ORDER BY` clause (second example) for the underlying keyset pagination library. When the `ORDER BY` configuration cannot be automatically determined by the keyset pagination library, an error is raised.

The code comments of the [`Gitlab::Pagination::Keyset::Order`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pagination/keyset/order.rb) and [`Gitlab::Pagination::Keyset::ColumnOrderDefinition`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pagination/keyset/column_order_definition.rb) classes give an overview of the possible options for configuring the `ORDER BY` clause. You can also find a few code examples in the [keyset pagination](database/keyset_pagination.md#complex-order-configuration) documentation.

<!-- This redirect file can be deleted after <2022-11-06>. -->
<!-- Redirects that point to other docs in the same project expire in three months. -->
<!-- Redirects that point to docs in a different project or site (for example, link is not relative and starts with `https:`) expire in one year. -->
<!-- Before deletion, see: https://docs.gitlab.com/ee/development/documentation/redirects.html -->