Diffstat (limited to 'doc/development/background_migrations.md')
-rw-r--r-- | doc/development/background_migrations.md | 500
1 files changed, 7 insertions, 493 deletions
diff --git a/doc/development/background_migrations.md b/doc/development/background_migrations.md
index 9fffbd25518..3c9c34bccf8 100644
--- a/doc/development/background_migrations.md
+++ b/doc/development/background_migrations.md
@@ -1,497 +1,11 @@
 ---
-type: reference, dev
-stage: none
-group: Development
-info: "See the Technical Writers assigned to Development Guidelines: https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments-to-development-guidelines"
+redirect_to: 'database/background_migrations.md'
+remove_date: '2022-07-08'
 ---

+This document was moved to [another location](database/background_migrations.md).

# Background migrations

Background migrations should be used to perform data migrations whenever a
migration exceeds
[the time limits in our guidelines](migration_style_guide.md#how-long-a-migration-should-take).
For example, you can use background migrations to migrate data that's stored in
a single JSON column to a separate table instead.

If the database cluster is considered to be in an unhealthy state, background
migrations automatically reschedule themselves for a later point in time.

## When To Use Background Migrations

You should use a background migration when you migrate _data_ in tables that
have so many rows that the process would exceed
[the time limits in our guidelines](migration_style_guide.md#how-long-a-migration-should-take)
if performed using a regular Rails migration.

- Background migrations should be used when migrating data in
  [high-traffic tables](migration_style_guide.md#high-traffic-tables).
- Background migrations may also be used when executing numerous single-row
  queries for every item in a large dataset. Typically, for single-record
  patterns, runtime is largely dependent on the size of the dataset, so it
  should be split accordingly and put into background migrations.
- Background migrations should not be used to perform schema migrations.
Some examples where background migrations can be useful:

- Migrating events from one table to multiple separate tables.
- Populating one column based on JSON stored in another column.
- Migrating data that depends on the output of external services (for example, an API).

NOTE:
If the background migration is part of an important upgrade, make sure it's announced
in the release post. Discuss with your Project Manager if you're not sure whether the
migration falls into this category.

## Isolation

Background migrations must be isolated and cannot use application code (for example,
models defined in `app/models`). Because these migrations can take a long time to
run, it's possible for new versions to be deployed while they are still running.

It's also possible for different migrations to be executed at the same time.
This means that different background migrations should not migrate data in a
way that would cause conflicts.

## Idempotence

Background migrations are executed in the context of a Sidekiq process. The
usual Sidekiq rules apply, especially the rule that jobs should be small
and idempotent.

See the [Sidekiq best practices guidelines](https://github.com/mperham/sidekiq/wiki/Best-Practices)
for more details.

Make sure that data integrity is guaranteed even if your migration job is retried.

## Background migrations for EE-only features

All background migration classes for EE-only features should be present in GitLab CE.
For this purpose, an empty class can be created for GitLab CE, and it can be extended for GitLab EE
as explained in the [guidelines for implementing Enterprise Edition features](ee_features.md#code-in-libgitlabbackground_migration).

## How It Works

Background migrations are simple classes that define a `perform` method. A
Sidekiq worker then executes such a class, passing any arguments to it.
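As an illustrative sketch of this shape (the class name and the ID-range arguments below are made up, not an existing GitLab migration), a background migration is just a plain class with a `perform` method:

```ruby
# Illustrative skeleton only: Sidekiq instantiates the class and calls
# `perform` with whatever arguments were scheduled for the job.
module Gitlab
  module BackgroundMigration
    class BackfillExampleColumn
      def perform(start_id, end_id)
        # Migrate the rows in start_id..end_id here. Keep the work small and
        # idempotent so a retried job cannot corrupt data.
      end
    end
  end
end
```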
All migration classes must be defined in the namespace
`Gitlab::BackgroundMigration`, and the files should be placed in the directory
`lib/gitlab/background_migration/`.

## Scheduling

Scheduling a background migration should be done in a post-deployment
migration that includes `Gitlab::Database::MigrationHelpers`.
To do so, use the following code while replacing the class name and arguments
with whatever values are necessary for your migration:

```ruby
migrate_in('BackgroundMigrationClassName', [arg1, arg2, ...])
```

You can use the function `queue_background_migration_jobs_by_range_at_intervals`
to automatically split the job into batches:

```ruby
queue_background_migration_jobs_by_range_at_intervals(
  ClassName,
  BackgroundMigrationClassName,
  2.minutes,
  batch_size: 10_000
)
```

You also need to make sure that newly created data is either migrated, or
saved in both the old and new version upon creation. For complex and
time-consuming migrations it's best to schedule a background job using an
`after_create` hook so this doesn't affect response timings. The same applies to
updates. Removals, in turn, can be handled by defining foreign keys with
cascading deletes.

### Rescheduling background migrations

If one of the background migrations contains a bug that is fixed in a patch
release, the background migration needs to be rescheduled so the migration is
repeated on systems that already performed the initial migration.

When you reschedule the background migration, make sure to turn the original
scheduling into a no-op by clearing the `#up` and `#down` methods of the
migration performing the scheduling. Otherwise the background migration would be
scheduled multiple times on systems that are upgrading multiple patch releases at
once.
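A sketch of that cleared-out scheduling migration might look like the following. The class name is hypothetical, and a stand-in base class is defined here only to keep the sketch self-contained; in the GitLab tree the migration would inherit from `Gitlab::Database::Migration[1.0]`:

```ruby
# Hypothetical example: the original scheduling migration after being turned
# into a no-op in the patch release. MigrationBase stands in for
# Gitlab::Database::Migration[1.0] so this sketch runs on its own.
MigrationBase = Class.new

class ScheduleExampleBackfill < MigrationBase
  def up
    # no-op: the fixed background migration is rescheduled by a later
    # post-deployment migration instead.
  end

  def down
    # no-op
  end
end
```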
When you start the second post-deployment migration, you should delete any
previously queued jobs from the initial migration with the provided helper:

```ruby
delete_queued_jobs('BackgroundMigrationClassName')
```

## Cleaning Up

NOTE:
Cleaning up any remaining background migrations _must_ be done in either a major
or minor release; you _must not_ do this in a patch release.

Because background migrations can take a long time, you can't immediately clean
things up after scheduling them. For example, you can't drop a column that's
used in the migration process, as this would cause jobs to fail. This means that
you need to add a separate _post-deployment_ migration in a future release
that finishes any remaining jobs before cleaning things up (for example, removing
a column).

As an example, say you want to migrate the data from column `foo` (containing a
big JSON blob) to column `bar` (containing a string). The process would
roughly be as follows:

1. Release A:
   1. Create a migration class that performs the migration for a row with a given ID.
      You can use [background jobs tracking](#background-jobs-tracking) to simplify cleaning up.
   1. Deploy the code for this release. This should include some code that
      schedules jobs for newly created data (for example, using an `after_create` hook).
   1. Schedule jobs for all existing rows in a post-deployment migration. It's
      possible some newly created rows may be scheduled twice, so your migration
      should take care of this.
1. Release B:
   1. Deploy code so that the application starts using the new column and stops
      scheduling jobs for newly created data.
   1. In a post-deployment migration, finalize all jobs that have not succeeded by now.
      If you used [background jobs tracking](#background-jobs-tracking) in release A,
      you can use `finalize_background_migration` from `BackgroundMigrationHelpers`
      to ensure no jobs remain. This helper will:
      1. Use `Gitlab::BackgroundMigration.steal` to process any remaining
         jobs in Sidekiq.
      1. Reschedule the migration to be run directly (that is, not through Sidekiq)
         on any rows that weren't migrated by Sidekiq. This can happen if, for
         instance, Sidekiq received a SIGKILL, or if a particular batch failed
         enough times to be marked as dead.
      1. Remove `Gitlab::Database::BackgroundMigrationJob` rows where
         `status = succeeded`. To retain diagnostic information that may
         help with future bug tracking, you can skip this step by specifying
         the `delete_tracking_jobs: false` parameter.
   1. Remove the old column.

This may also require a bump to the [import/export version](../user/project/settings/import_export.md), if
importing a project from a prior version of GitLab requires the data to be in
the new format.

## Example

To explain all this, let's use the following example: the table `integrations` has a
field called `properties`, stored as JSON. For all rows you want to
extract the `url` key from this JSON object and store it in the `integrations.url`
column. There are millions of integrations and parsing JSON is slow, so you can't
do this in a regular migration.

To do this using a background migration, we start by defining our migration
class:

```ruby
class Gitlab::BackgroundMigration::ExtractIntegrationsUrl
  class Integration < ActiveRecord::Base
    self.table_name = 'integrations'
  end

  def perform(start_id, end_id)
    Integration.where(id: start_id..end_id).each do |integration|
      json = JSON.load(integration.properties)

      integration.update(url: json['url']) if json['url']
    rescue JSON::ParserError
      # If the JSON is invalid we don't want to keep the job around forever;
      # instead we just leave the "url" field at whatever the default value is.
      next
    end
  end
end
```

Next we need to adjust our code so we schedule the above migration for newly
created and updated integrations.
We can do this using something along the lines of
the following:

```ruby
class Integration < ActiveRecord::Base
  after_commit :schedule_integration_migration, on: :update
  after_commit :schedule_integration_migration, on: :create

  def schedule_integration_migration
    BackgroundMigrationWorker.perform_async('ExtractIntegrationsUrl', [id, id])
  end
end
```

We're using `after_commit` here to ensure the Sidekiq job is not scheduled
before the transaction completes, as doing so can lead to race conditions where
the changes are not yet visible to the worker.

Next, we need a post-deployment migration that schedules the migration for
existing data:

```ruby
class ScheduleExtractIntegrationsUrl < Gitlab::Database::Migration[1.0]
  disable_ddl_transaction!

  MIGRATION = 'ExtractIntegrationsUrl'
  DELAY_INTERVAL = 2.minutes

  def up
    queue_background_migration_jobs_by_range_at_intervals(
      define_batchable_model('integrations'),
      MIGRATION,
      DELAY_INTERVAL)
  end

  def down
  end
end
```

Once deployed, our application continues using the data as before, but at the
same time ensures that both existing and new data is migrated.

In the next release we can remove the `after_commit` hooks and related code. We
also need to add a post-deployment migration that consumes any remaining
jobs and manually runs on any un-migrated rows. Such a migration would look like
this:

```ruby
class ConsumeRemainingExtractIntegrationsUrlJobs < Gitlab::Database::Migration[1.0]
  disable_ddl_transaction!
  def up
    # This must be included
    Gitlab::BackgroundMigration.steal('ExtractIntegrationsUrl')

    # This should be included, but can be skipped - see below
    define_batchable_model('integrations').where(url: nil).each_batch(of: 50) do |batch|
      range = batch.pluck('MIN(id)', 'MAX(id)').first

      Gitlab::BackgroundMigration::ExtractIntegrationsUrl.new.perform(*range)
    end
  end

  def down
  end
end
```

The final step runs for any un-migrated rows after all of the jobs have been
processed. This is in case a Sidekiq process running the background migrations
received a SIGKILL, leading to the jobs being lost. (See
[more reliable Sidekiq queue](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/36791) for more information.)

If the application does not depend on the data being 100% migrated (for
instance, the data is advisory, and not mission-critical), then this final step
can be skipped.

This migration then processes any jobs for the `ExtractIntegrationsUrl` migration
and continues once all jobs have been processed. Once done, you can safely remove
the `integrations.properties` column.

## Testing

It is required to write tests for:

- The background migration's scheduling migration.
- The background migration itself.
- A cleanup migration.

The `:migration` and `schema: :latest` RSpec tags are automatically set for
background migration specs. See the
[Testing Rails migrations](testing_guide/testing_migrations_guide.md#testing-a-non-activerecordmigration-class)
style guide.

Keep in mind that `before` and `after` RSpec hooks
migrate your database down and up, which can result in other background
migrations being called. Using `spy` test doubles with
`have_received` is therefore encouraged, instead of regular test doubles,
because expectations defined in an `it` block can conflict with what is being
called in RSpec hooks.
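The spy pattern itself (record calls first, assert afterwards) can be shown framework-free. This plain-Ruby sketch stands in for RSpec's `spy` and `have_received`; the class and argument names are illustrative:

```ruby
# Framework-free sketch of the spy pattern: the double records every call it
# receives, and expectations are checked afterwards, so calls triggered by
# setup hooks cannot trip an eagerly defined expectation.
class WorkerSpy
  attr_reader :calls

  def initialize
    @calls = []
  end

  def perform_async(*args)
    @calls << args
  end
end

spy = WorkerSpy.new
spy.perform_async('ExtractIntegrationsUrl', [1, 1]) # the call under test

# Assert after the fact, like `expect(spy).to have_received(:perform_async)`:
raise 'expected call missing' unless spy.calls.include?(['ExtractIntegrationsUrl', [1, 1]])
```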
See [issue #35351](https://gitlab.com/gitlab-org/gitlab/-/issues/18839)
for more details.

## Best practices

1. Make sure you know how much data you're dealing with.
1. Make sure that background migration jobs are idempotent.
1. Make sure that the tests you write are not false positives.
1. Make sure that if the data being migrated is critical and cannot be lost, the
   clean-up migration also checks the final state of the data before completing.
1. When migrating many columns, make sure it won't generate too many
   dead tuples in the process (you may need to directly query the number of dead
   tuples and adjust the scheduling accordingly).
1. Make sure to discuss the numbers with a database specialist; the migration may
   add more pressure on the database than you expect (measure on staging,
   or ask someone to measure on production).
1. Make sure you know how much time it takes to run all scheduled migrations.
1. Provide an estimation section in the description, estimating both the total migration
   run time and the query times for each background migration job. Explain plans for each query
   should also be provided.

   For example, assuming a migration that deletes data, include information similar to
   the following section:

   ```plaintext
   Background Migration Details:

   47600 items to delete
   batch size = 1000
   47600 / 1000 = 48 batches

   Estimated times per batch:
   - 820ms for select statement with 1000 items (see linked explain plan)
   - 900ms for delete statement with 1000 items (see linked explain plan)
   Total: ~2 sec per batch

   2 mins delay per batch (safe for the given total time per batch)

   48 batches * 2 min per batch = 96 mins to run all the scheduled jobs
   ```

   The execution time per batch (2 sec in this example) is not included in the calculation
   for total migration time. The jobs are scheduled 2 minutes apart without knowledge of
   the execution time.
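The arithmetic behind such an estimation can be sketched directly; the figures below are the example's assumptions (item count, batch size, and delay), not measurements:

```ruby
# Back-of-the-envelope estimate for a batched background migration, using the
# example numbers above. All inputs are assumptions from the example.
total_items   = 47_600
batch_size    = 1_000
delay_seconds = 2 * 60 # jobs are scheduled 2 minutes apart

batches       = (total_items.to_f / batch_size).ceil
total_minutes = batches * delay_seconds / 60

puts batches       # 48 batches
puts total_minutes # 96 minutes to run all the scheduled jobs
```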
## Additional tips and strategies

### Nested batching

A strategy to make the migration run faster is to schedule larger batches, and then use `EachBatch`
within the background migration to perform multiple statements.

The background migration helpers that queue multiple jobs, such as
`queue_background_migration_jobs_by_range_at_intervals`, use [`EachBatch`](iterating_tables_in_batches.md).
The example above has batches of 1000, where each queued job takes two seconds. If the query has been
optimized to bring the delete statement within the
[query performance guidelines](query_performance.md),
1000 may be the largest number of records that can be deleted in a reasonable amount of time.

The minimum and most common interval for delaying jobs is two minutes. This results in two seconds
of work for each two-minute job. There's nothing that prevents you from executing multiple delete
statements in each background migration job.

Looking at the example above, you could alternatively do:

```plaintext
Background Migration Details:

47600 items to delete
batch size = 10_000
47600 / 10_000 = 5 batches

Estimated times per batch:
- Records are updated in sub-batches of 1000 => 10_000 / 1000 = 10 total updates
- 820ms for select statement with 1000 items (see linked explain plan)
- 900ms for delete statement with 1000 items (see linked explain plan)
Sub-batch total: ~2 sec per sub-batch
Total batch time: 2 * 10 = 20 sec per batch

2 mins delay per batch

5 batches * 2 min per batch = 10 mins to run all the scheduled jobs
```

The batch time of 20 seconds still fits comfortably within the two-minute delay, yet the total run
time is cut to a tenth, from around 100 minutes to 10 minutes! When dealing with large background
migrations, this can cut the total migration time by days.

When batching in this way, it is important to look at query times on the higher end
of the table or relation being updated.
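In plain Ruby, the sub-batching inside a single queued job can be sketched as follows. This is a conceptual sketch, not GitLab code; in a real migration the inner loop would use `EachBatch` and issue one statement per sub-batch:

```ruby
# Conceptual sketch: one queued job covers a large ID range, and `perform`
# works through it in smaller sub-batches, one statement per sub-batch.
SUB_BATCH_SIZE = 1_000

def sub_batches(start_id, end_id)
  start_id.step(end_id, SUB_BATCH_SIZE).map do |sub_start|
    [sub_start, [sub_start + SUB_BATCH_SIZE - 1, end_id].min]
  end
end

# A single job covering 10_000 records runs ten sub-batches:
puts sub_batches(1, 10_000).size # 10
```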
`EachBatch` may generate some queries that become much
slower when dealing with higher ID ranges.

### Delay time

When looking at the batch execution time versus the delay time, the execution time
should fit comfortably within the delay time for a few reasons:

- To allow for variance in query times.
- To allow autovacuum to catch up after periods of high churn.

Never try to optimize by fully filling the delay window, even if you are confident
the queries themselves have no timing variance.

### Background jobs tracking

`queue_background_migration_jobs_by_range_at_intervals` can create records for each job that is
scheduled to run. You can enable this behavior by passing `track_jobs: true`. Each record starts
with a `pending` status. Make sure that your worker updates the job status to `succeeded` by
calling `Gitlab::Database::BackgroundMigrationJob.mark_all_as_succeeded` in the `perform` method
of your background migration.

```ruby
# Background migration code

def perform(start_id, end_id)
  # do work here

  mark_job_as_succeeded(start_id, end_id)
end

private

# Make sure that the arguments passed here match those passed to the background
# migration
def mark_job_as_succeeded(*arguments)
  Gitlab::Database::BackgroundMigrationJob.mark_all_as_succeeded(
    self.class.name.demodulize,
    arguments
  )
end
```

```ruby
# Post deployment migration
MIGRATION = 'YourBackgroundMigrationName'
DELAY_INTERVAL = 2.minutes.to_i # can be different
BATCH_SIZE = 10_000 # can be different

disable_ddl_transaction!
def up
  queue_background_migration_jobs_by_range_at_intervals(
    define_batchable_model('name_of_the_table_backing_the_model'),
    MIGRATION,
    DELAY_INTERVAL,
    batch_size: BATCH_SIZE,
    track_jobs: true
  )
end

def down
  # no-op
end
```

See [`lib/gitlab/background_migration/drop_invalid_vulnerabilities.rb`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/background_migration/drop_invalid_vulnerabilities.rb) for a full example.

#### Rescheduling pending jobs

You can reschedule pending migrations from the `background_migration_jobs` table by creating a
post-deployment migration and calling `requeue_background_migration_jobs_by_range_at_intervals`
with the migration name and delay interval.

```ruby
# Post deployment migration
MIGRATION = 'YourBackgroundMigrationName'
DELAY_INTERVAL = 2.minutes

disable_ddl_transaction!

def up
  requeue_background_migration_jobs_by_range_at_intervals(MIGRATION, DELAY_INTERVAL)
end

def down
  # no-op
end
```

See [`db/post_migrate/20210604070207_retry_backfill_traversal_ids.rb`](https://gitlab.com/gitlab-org/gitlab/blob/master/db/post_migrate/20210604070207_retry_backfill_traversal_ids.rb) for a full example.

### Viewing failure error logs

After running a background migration, if any jobs have failed, you can view the logs in
[Kibana](https://log.gprd.gitlab.net/goto/5f06a57f768c6025e1c65aefb4075694).
View the production Sidekiq log and filter for:

- `json.class: BackgroundMigrationWorker`
- `json.job_status: fail`
- `json.meta.caller_id: <MyBackgroundMigrationSchedulingMigrationClassName>`
- `json.args: <MyBackgroundMigrationClassName>`

Looking at the `json.error_class`, `json.error_message` and `json.error_backtrace` values may be
helpful in understanding why the jobs failed.

Depending on when and how the failure occurred, you may find other helpful information by filtering
with `json.class: <MyBackgroundMigrationClassName>`.
+<!-- This redirect file can be deleted after <2022-07-08>. -->
+<!-- Redirects that point to other docs in the same project expire in three months. -->
+<!-- Redirects that point to docs in a different project or site (for example, link is not relative and starts with `https:`) expire in one year. -->
+<!-- Before deletion, see: https://docs.gitlab.com/ee/development/documentation/redirects.html -->