Welcome to mirror list, hosted at ThFree Co, Russian Federation.

gitlab.com/gitlab-org/gitlab-foss.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
Diffstat (limited to 'doc/development/database/batched_background_migrations.md')
-rw-r--r--doc/development/database/batched_background_migrations.md198
1 files changed, 123 insertions, 75 deletions
diff --git a/doc/development/database/batched_background_migrations.md b/doc/development/database/batched_background_migrations.md
index f3ea82b5c61..edb22fcf436 100644
--- a/doc/development/database/batched_background_migrations.md
+++ b/doc/development/database/batched_background_migrations.md
@@ -105,11 +105,16 @@ for more details.
## Batched background migrations for EE-only features
-All the background migration classes for EE-only features should be present in GitLab CE.
-For this purpose, create an empty class for GitLab CE, and extend it for GitLab EE
+All the background migration classes for EE-only features should be present in GitLab FOSS.
+For this purpose, create an empty class for GitLab FOSS, and extend it for GitLab EE
as explained in the guidelines for
[implementing Enterprise Edition features](../ee_features.md#code-in-libgitlabbackground_migration).
+NOTE:
+Background migration classes for EE-only features that use job arguments should define them
+in the GitLab FOSS class. This is required to prevent job arguments validation from failing when
+migration is scheduled in GitLab FOSS context.
+
Batched Background migrations are simple classes that define a `perform` method. A
Sidekiq worker then executes such a class, passing any arguments to it. All
migration classes must be defined in the namespace
@@ -132,6 +137,10 @@ queue_batched_background_migration(
)
```
+NOTE:
+This helper raises an error if the number of provided job arguments does not match
+the number of [job arguments](#job-arguments) defined in `JOB_CLASS_NAME`.
+
Make sure the newly-created data is either migrated, or
saved in both the old and new version upon creation. Removals in
turn can be handled by defining foreign keys with cascading deletes.
@@ -186,6 +195,115 @@ Bump to the [import/export version](../../user/project/settings/import_export.md
be required, if importing a project from a prior version of GitLab requires the
data to be in the new format.
+## Job arguments
+
+`BatchedMigrationJob` provides the `job_arguments` helper method for job classes to define the job arguments they need.
+
+Batched migrations scheduled with `queue_batched_background_migration` **must** use the helper to define the job arguments:
+
+```ruby
+queue_batched_background_migration(
+ 'CopyColumnUsingBackgroundMigrationJob',
+ TABLE_NAME,
+ 'name', 'name_convert_to_text',
+ job_interval: DELAY_INTERVAL
+)
+```
+
+NOTE:
+If the number of defined job arguments does not match the number of job arguments provided when
+scheduling the migration, `queue_batched_background_migration` raises an error.
+
+In this example, `copy_from` returns `name`, and `copy_to` returns `name_convert_to_text`:
+
+```ruby
+class CopyColumnUsingBackgroundMigrationJob < BatchedMigrationJob
+ job_arguments :copy_from, :copy_to
+
+ def perform
+ from_column = connection.quote_column_name(copy_from)
+ to_column = connection.quote_column_name(copy_to)
+
+ assignment_clause = "#{to_column} = #{from_column}"
+
+ each_sub_batch(operation_name: :update_all) do |relation|
+ relation.update_all(assignment_clause)
+ end
+ end
+end
+```
+
+### Additional filters
+
+By default, when creating background jobs to perform the migration, batched background migrations
+iterate over the full specified table. This iteration is done using the
+[`PrimaryKeyBatchingStrategy`](https://gitlab.com/gitlab-org/gitlab/-/blob/c9dabd1f4b8058eece6d8cb4af95e9560da9a2ee/lib/gitlab/database/migrations/batched_background_migration_helpers.rb#L17). If the table has 1000 records
+and the batch size is 100, the work is batched into 10 jobs. For illustrative purposes,
+`EachBatch` is used like this:
+
+```ruby
+# PrimaryKeyBatchingStrategy
+Namespace.each_batch(of: 100) do |relation|
+ relation.where(type: nil).update_all(type: 'User') # this happens in each background job
+end
+```
+
+In some cases, only a subset of records must be examined. If only 10% of the 1000 records
+need examination, apply a filter to the initial relation when the jobs are created:
+
+```ruby
+Namespace.where(type: nil).each_batch(of: 100) do |relation|
+ relation.update_all(type: 'User')
+end
+```
+
+In the first example, we don't know how many records will be updated in each batch.
+In the second (filtered) example, we know exactly 100 will be updated with each batch.
+
+`BatchedMigrationJob` provides a `scope_to` helper method to apply additional filters and achieve this:
+
+1. Create a new migration job class that inherits from `BatchedMigrationJob` and defines the additional filter:
+
+ ```ruby
+ class BackfillNamespaceType < BatchedMigrationJob
+ scope_to ->(relation) { relation.where(type: nil) }
+
+ def perform
+ each_sub_batch(operation_name: :update_all) do |sub_batch|
+ sub_batch.update_all(type: 'User')
+ end
+ end
+ end
+ ```
+
+1. In the post-deployment migration, enqueue the batched background migration:
+
+ ```ruby
+ class BackfillNamespaceType < Gitlab::Database::Migration[2.0]
+ MIGRATION = 'BackfillNamespaceType'
+ DELAY_INTERVAL = 2.minutes
+
+ restrict_gitlab_migration gitlab_schema: :gitlab_main
+
+ def up
+ queue_batched_background_migration(
+ MIGRATION,
+ :namespaces,
+ :id,
+ job_interval: DELAY_INTERVAL
+ )
+ end
+
+ def down
+ delete_batched_background_migration(MIGRATION, :namespaces, :id, [])
+ end
+ end
+ ```
+
+NOTE:
+When applying additional filters, it is important to ensure they are properly covered by an index to optimize `EachBatch` performance.
+In the example above we need an index on `(type, id)` to support the filters. See [the `EachBatch` docs for more information](../iterating_tables_in_batches.md).
+
## Example
The `routes` table has a `source_type` field that's used for a polymorphic relationship.
@@ -221,8 +339,6 @@ background migration.
correctly handled by the batched migration framework. Any subclass of
`BatchedMigrationJob` is initialized with necessary arguments to
execute the batch, as well as a connection to the tracking database.
- Additional `job_arguments` set on the migration are passed to the
- job's `perform` method.
1. Add a new trigger to the database to update newly created and updated routes,
similar to this example:
@@ -320,7 +436,7 @@ The default batching strategy provides an efficient way to iterate over primary
However, if you need to iterate over columns where values are not unique, you must use a
different batching strategy.
-The `LooseIndexScanBatchingStrategy` batching strategy uses a special version of [`EachBatch`](../iterating_tables_in_batches.md#loose-index-scan-with-distinct_each_batch)
+The `LooseIndexScanBatchingStrategy` batching strategy uses a special version of [`EachBatch`](iterating_tables_in_batches.md#loose-index-scan-with-distinct_each_batch)
to provide efficient and stable iteration over the distinct column values.
This example shows a batched background migration where the `issues.project_id` column is used as
@@ -374,76 +490,8 @@ module Gitlab
end
```
-### Adding filters to the initial batching
-
-By default, when creating background jobs to perform the migration, batched background migrations will iterate over the full specified table. This is done using the [`PrimaryKeyBatchingStrategy`](https://gitlab.com/gitlab-org/gitlab/-/blob/c9dabd1f4b8058eece6d8cb4af95e9560da9a2ee/lib/gitlab/database/migrations/batched_background_migration_helpers.rb#L17). This means if there are 1000 records in the table and the batch size is 100, there will be 10 jobs. For illustrative purposes, `EachBatch` is used like this:
-
-```ruby
-# PrimaryKeyBatchingStrategy
-Projects.all.each_batch(of: 100) do |relation|
- relation.where(foo: nil).update_all(foo: 'bar') # this happens in each background job
-end
-```
-
-There are cases where we only need to look at a subset of records. Perhaps we only need to update 1 out of every 10 of those 1000 records. It would be best if we could apply a filter to the initial relation when the jobs are created:
-
-```ruby
-Projects.where(foo: nil).each_batch(of: 100) do |relation|
- relation.update_all(foo: 'bar')
-end
-```
-
-In the `PrimaryKeyBatchingStrategy` example, we do not know how many records will be updated in each batch. In the filtered example, we know exactly 100 will be updated with each batch.
-
-The `PrimaryKeyBatchingStrategy` contains [a method that can be overwritten](https://gitlab.com/gitlab-org/gitlab/-/blob/dd1e70d3676891025534dc4a1e89ca9383178fe7/lib/gitlab/background_migration/batching_strategies/primary_key_batching_strategy.rb#L38-52) to apply additional filtering on the initial `EachBatch`.
-
-We can accomplish this by:
-
-1. Create a new class that inherits from `PrimaryKeyBatchingStrategy` and overrides the method using the desired filter (this may be the same filter used in the sub-batch):
-
- ```ruby
- # frozen_string_literal: true
-
- module GitLab
- module BackgroundMigration
- module BatchingStrategies
- class FooStrategy < PrimaryKeyBatchingStrategy
- def apply_additional_filters(relation, job_arguments: [], job_class: nil)
- relation.where(foo: nil)
- end
- end
- end
- end
- end
- ```
-
-1. In the post-deployment migration that queues the batched background migration, specify the new batching strategy using the `batch_class_name` parameter:
-
- ```ruby
- class BackfillProjectsFoo < Gitlab::Database::Migration[2.0]
- MIGRATION = 'BackfillProjectsFoo'
- DELAY_INTERVAL = 2.minutes
- BATCH_CLASS_NAME = 'FooStrategy'
-
- restrict_gitlab_migration gitlab_schema: :gitlab_main
-
- def up
- queue_batched_background_migration(
- MIGRATION,
- :routes,
- :id,
- job_interval: DELAY_INTERVAL,
- batch_class_name: BATCH_CLASS_NAME
- )
- end
-
- def down
- delete_batched_background_migration(MIGRATION, :routes, :id, [])
- end
- end
- ```
-
-When applying a batching strategy, it is important to ensure the filter properly covered by an index to optimize `EachBatch` performance. See [the `EachBatch` docs for more information](../iterating_tables_in_batches.md).
+NOTE:
+[Additional filters](#additional-filters) defined with `scope_to` will be ignored by `LooseIndexScanBatchingStrategy` and `distinct_each_batch`.
## Testing