diff options
Diffstat (limited to 'doc/development/database/batched_background_migrations.md')
-rw-r--r-- | doc/development/database/batched_background_migrations.md | 371 |
1 files changed, 371 insertions, 0 deletions
diff --git a/doc/development/database/batched_background_migrations.md b/doc/development/database/batched_background_migrations.md new file mode 100644 index 00000000000..3a0fa77eff9 --- /dev/null +++ b/doc/development/database/batched_background_migrations.md @@ -0,0 +1,371 @@ +--- +type: reference, dev +stage: Enablement +group: Database +info: "See the Technical Writers assigned to Development Guidelines: https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments-to-development-guidelines" +--- + +# Batched background migrations + +Batched Background Migrations should be used to perform data migrations whenever a +migration exceeds [the time limits](../migration_style_guide.md#how-long-a-migration-should-take) +in our guidelines. For example, you can use batched background +migrations to migrate data that's stored in a single JSON column +to a separate table instead. + +## When to use batched background migrations + +Use a batched background migration when you migrate _data_ in tables containing +so many rows that the process would exceed +[the time limits in our guidelines](../migration_style_guide.md#how-long-a-migration-should-take) +if performed using a regular Rails migration. + +- Batched background migrations should be used when migrating data in + [high-traffic tables](../migration_style_guide.md#high-traffic-tables). +- Batched background migrations may also be used when executing numerous single-row queries + for every item on a large dataset. Typically, for single-record patterns, runtime is + largely dependent on the size of the dataset. Split the dataset accordingly, + and put it into background migrations. +- Don't use batched background migrations to perform schema migrations. + +Background migrations can help when: + +- Migrating events from one table to multiple separate tables. +- Populating one column based on JSON stored in another column. +- Migrating data that depends on the output of external services. (For example, an API.) + +NOTE: +If the batched background migration is part of an important upgrade, it must be announced +in the release post. Discuss with your Project Manager if you're unsure if the migration falls +into this category. + +## Isolation + +Batched background migrations must be isolated and can not use application code (for example, +models defined in `app/models` except the `ApplicationRecord` classes). +Because these migrations can take a long time to run, it's possible +for new versions to deploy while the migrations are still running. + +## Accessing data for multiple databases + +Background Migration contrary to regular migrations does have access to multiple databases +and can be used to efficiently access and update data across them. To properly indicate +a database to be used it is desired to create ActiveRecord model inline the migration code. +Such model should use a correct [`ApplicationRecord`](multiple_databases.md#gitlab-schema) +depending on which database the table is located. As such usage of `ActiveRecord::Base` +is disallowed as it does not describe a explicitly database to be used to access given table. + +```ruby +# good +class Gitlab::BackgroundMigration::ExtractIntegrationsUrl + class Project < ::ApplicationRecord + self.table_name = 'projects' + end + + class Build < ::Ci::ApplicationRecord + self.table_name = 'ci_builds' + end +end + +# bad +class Gitlab::BackgroundMigration::ExtractIntegrationsUrl + class Project < ActiveRecord::Base + self.table_name = 'projects' + end + + class Build < ActiveRecord::Base + self.table_name = 'ci_builds' + end +end +``` + +Similarly the usage of `ActiveRecord::Base.connection` is disallowed and needs to be +replaced preferably with the usage of model connection. + +```ruby +# good +Project.connection.execute("SELECT * FROM projects") + +# acceptable +ApplicationRecord.connection.execute("SELECT * FROM projects") + +# bad +ActiveRecord::Base.connection.execute("SELECT * FROM projects") +``` + +## Idempotence + +Batched background migrations are executed in a context of a Sidekiq process. +The usual Sidekiq rules apply, especially the rule that jobs should be small +and idempotent. Make sure that in case that your migration job is retried, data +integrity is guaranteed. + +See [Sidekiq best practices guidelines](https://github.com/mperham/sidekiq/wiki/Best-Practices) +for more details. + +## Batched background migrations for EE-only features + +All the background migration classes for EE-only features should be present in GitLab CE. +For this purpose, create an empty class for GitLab CE, and extend it for GitLab EE +as explained in the guidelines for +[implementing Enterprise Edition features](../ee_features.md#code-in-libgitlabbackground_migration). + +Batched Background migrations are simple classes that define a `perform` method. A +Sidekiq worker then executes such a class, passing any arguments to it. All +migration classes must be defined in the namespace +`Gitlab::BackgroundMigration`. Place the files in the directory +`lib/gitlab/background_migration/`. + +## Queueing + +Queueing a batched background migration should be done in a post-deployment +migration. Use this `queue_batched_background_migration` example, queueing the +migration to be executed in batches. Replace the class name and arguments with the values +from your migration: + +```ruby +queue_batched_background_migration( + JOB_CLASS_NAME, + TABLE_NAME, + JOB_ARGUMENTS, + JOB_INTERVAL + ) +``` + +Make sure the newly-created data is either migrated, or +saved in both the old and new version upon creation. Removals in +turn can be handled by defining foreign keys with cascading deletes. + +### Requeuing batched background migrations + +If one of the batched background migrations contains a bug that is fixed in a patch +release, you must requeue the batched background migration so the migration +repeats on systems that already performed the initial migration. + +When you requeue the batched background migration, turn the original +queuing into a no-op by clearing up the `#up` and `#down` methods of the +migration performing the requeuing. Otherwise, the batched background migration is +queued multiple times on systems that are upgrading multiple patch releases at +once. + +When you start the second post-deployment migration, delete the +previously batched migration with the provided code: + +```ruby +Gitlab::Database::BackgroundMigration::BatchedMigration + .for_configuration(MIGRATION_NAME, TABLE_NAME, COLUMN, JOB_ARGUMENTS) + .delete_all +``` + +## Cleaning up + +NOTE: +Cleaning up any remaining background migrations must be done in either a major +or minor release. You must not do this in a patch release. + +Because background migrations can take a long time, you can't immediately clean +things up after queueing them. For example, you can't drop a column used in the +migration process, as jobs would fail. You must add a separate _post-deployment_ +migration in a future release that finishes any remaining +jobs before cleaning things up. (For example, removing a column.) + +To migrate the data from column `foo` (containing a big JSON blob) to column `bar` +(containing a string), you would: + +1. Release A: + 1. Create a migration class that performs the migration for a row with a given ID. + 1. Update new rows using one of these techniques: + - Create a new trigger for simple copy operations that don't need application logic. + - Handle this operation in the model/service as the records are created or updated. + - Create a new custom background job that updates the records. + 1. Queue the batched background migration for all existing rows in a post-deployment migration. +1. Release B: + 1. Add a post-deployment migration that checks if the batched background migration is completed. + 1. Deploy code so that the application starts using the new column and stops to update new records. + 1. Remove the old column. + +Bump to the [import/export version](../../user/project/settings/import_export.md) may +be required, if importing a project from a prior version of GitLab requires the +data to be in the new format. + +## Example + +The `routes` table has a `source_type` field that's used for a polymorphic relationship. +As part of a database redesign, we're removing the polymorphic relationship. One step of +the work will be migrating data from the `source_id` column into a new singular foreign key. +Because we intend to delete old rows later, there's no need to update them as part of the +background migration. + +1. Start by defining our migration class, which should inherit + from `Gitlab::BackgroundMigration::BatchedMigrationJob`: + + ```ruby + class Gitlab::BackgroundMigration::BackfillRouteNamespaceId < BatchedMigrationJob + # For illustration purposes, if we were to use a local model we could + # define it like below, using an `ApplicationRecord` as the base class + # class Route < ::ApplicationRecord + # self.table_name = 'routes' + # end + + def perform + each_sub_batch( + operation_name: :update_all, + batching_scope: -> (relation) { relation.where("source_type <> 'UnusedType'") } + ) do |sub_batch| + sub_batch.update_all('namespace_id = source_id') + end + end + end + ``` + + NOTE: + Job classes must be subclasses of `BatchedMigrationJob` to be + correctly handled by the batched migration framework. Any subclass of + `BatchedMigrationJob` will be initialized with necessary arguments to + execute the batch, as well as a connection to the tracking database. + Additional `job_arguments` set on the migration will be passed to the + job's `perform` method. + +1. Add a new trigger to the database to update newly created and updated routes, + similar to this example: + + ```ruby + execute(<<~SQL) + CREATE OR REPLACE FUNCTION example() RETURNS trigger + LANGUAGE plpgsql + AS $$ + BEGIN + NEW."namespace_id" = NEW."source_id" + RETURN NEW; + END; + $$; + SQL + ``` + +1. Create a post-deployment migration that queues the migration for existing data: + + ```ruby + class QueueBackfillRoutesNamespaceId < Gitlab::Database::Migration[1.0] + disable_ddl_transaction! + + MIGRATION = 'BackfillRouteNamespaceId' + DELAY_INTERVAL = 2.minutes + + def up + queue_batched_background_migration( + MIGRATION, + :routes, + :id, + job_interval: DELAY_INTERVAL + ) + end + + def down + Gitlab::Database::BackgroundMigration::BatchedMigration + .for_configuration(MIGRATION, :routes, :id, []).delete_all + end + end + ``` + + After deployment, our application: + - Continues using the data as before. + - Ensures that both existing and new data are migrated. + +1. In the next release, remove the trigger. We must also add a new post-deployment migration + that checks that the batched background migration is completed. For example: + + ```ruby + class FinalizeBackfillRouteNamespaceId < Gitlab::Database::Migration[1.0] + MIGRATION = 'BackfillRouteNamespaceId' + disable_ddl_transaction! + + def up + ensure_batched_background_migration_is_finished( + job_class_name: MIGRATION, + table_name: :routes, + column_name: :id, + job_arguments: [] + ) + end + + def down + # no-op + end + end + ``` + + If the application does not depend on the data being 100% migrated (for + instance, the data is advisory, and not mission-critical), then you can skip this + final step. This step confirms that the migration is completed, and all of the rows were migrated. + +After the batched migration is completed, you can safely depend on the +data in `routes.namespace_id` being populated. + +## Testing + +Writing tests is required for: + +- The batched background migrations' queueing migration. +- The batched background migration itself. +- A cleanup migration. + +The `:migration` and `schema: :latest` RSpec tags are automatically set for +background migration specs. Refer to the +[Testing Rails migrations](../testing_guide/testing_migrations_guide.md#testing-a-non-activerecordmigration-class) +style guide. + +Remember that `before` and `after` RSpec hooks +migrate your database down and up. These hooks can result in other batched background +migrations being called. Using `spy` test doubles with +`have_received` is encouraged, instead of using regular test doubles, because +your expectations defined in a `it` block can conflict with what is +called in RSpec hooks. Refer to [issue #35351](https://gitlab.com/gitlab-org/gitlab/-/issues/18839) +for more details. + +## Best practices + +1. Know how much data you're dealing with. +1. Make sure the batched background migration jobs are idempotent. +1. Confirm the tests you write are not false positives. +1. If the data being migrated is critical and cannot be lost, the + clean-up migration must also check the final state of the data before completing. +1. Discuss the numbers with a database specialist. The migration may add + more pressure on DB than you expect. Measure on staging, + or ask someone to measure on production. +1. Know how much time is required to run the batched background migration. + +## Additional tips and strategies + +### Viewing failure error logs + +You can view failures in two ways: + +- Via GitLab logs: + 1. After running a batched background migration, if any jobs fail, + view the logs in [Kibana](https://log.gprd.gitlab.net/goto/5f06a57f768c6025e1c65aefb4075694). + View the production Sidekiq log and filter for: + + - `json.new_state: failed` + - `json.job_class_name: <Batched Background Migration job class name>` + - `json.job_arguments: <Batched Background Migration job class arguments>` + + 1. Review the `json.exception_class` and `json.exception_message` values to help + understand why the jobs failed. + + 1. Remember the retry mechanism. Having a failure does not mean the job failed. + Always check the last status of the job. + +- Via database: + + 1. Get the batched background migration `CLASS_NAME`. + 1. Execute the following query in the PostgreSQL console: + + ```sql + SELECT migration.id, migration.job_class_name, transition_logs.exception_class, transition_logs.exception_message + FROM batched_background_migrations as migration + INNER JOIN batched_background_migration_jobs as jobs + ON jobs.batched_background_migration_id = migration.id + INNER JOIN batched_background_migration_job_transition_logs as transition_logs + ON transition_logs.batched_background_migration_job_id = jobs.id + WHERE transition_logs.next_status = '2' AND migration.job_class_name = "CLASS_NAME"; + ``` |