Diffstat (limited to 'doc/development/insert_into_tables_in_batches.md')
-rw-r--r--  doc/development/insert_into_tables_in_batches.md  199
1 file changed, 7 insertions(+), 192 deletions(-)
diff --git a/doc/development/insert_into_tables_in_batches.md b/doc/development/insert_into_tables_in_batches.md
index ebed3d16319..ced5332e880 100644
--- a/doc/development/insert_into_tables_in_batches.md
+++ b/doc/development/insert_into_tables_in_batches.md
@@ -1,196 +1,11 @@
 ---
-stage: Data Stores
-group: Database
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
-description: "Sometimes it is necessary to store large amounts of records at once, which can be inefficient
-when iterating collections and performing individual `save`s. With the arrival of `insert_all`
-in Rails 6, which operates at the row level (that is, using `Hash`es), GitLab has added a set
-of APIs that make it safe and simple to insert ActiveRecord objects in bulk."
+redirect_to: 'database/insert_into_tables_in_batches.md'
+remove_date: '2022-11-05'
 ---
 
-# Insert into tables in batches
+This document was moved to [another location](database/insert_into_tables_in_batches.md).
 
-Sometimes it is necessary to store large amounts of records at once, which can be inefficient
-when iterating collections and saving each record individually. With the arrival of
-[`insert_all`](https://apidock.com/rails/ActiveRecord/Persistence/ClassMethods/insert_all)
-in Rails 6, which operates at the row level (that is, using `Hash` objects), GitLab has added a set
-of APIs that make it safe and simple to insert `ActiveRecord` objects in bulk.
-
-## Prepare `ApplicationRecord`s for bulk insertion
-
-In order for a model class to take advantage of the bulk insertion API, it has to include the
-`BulkInsertSafe` concern first:
-
-```ruby
-class MyModel < ApplicationRecord
-  # other includes here
-  # ...
-  include BulkInsertSafe # include this last
-
-  # ...
-end
-```
-
-The `BulkInsertSafe` concern has two functions:
-
-- It performs checks against your model class to ensure that it does not use ActiveRecord
-  APIs that are not safe to use with respect to bulk insertions (more on that below).
-- It adds new class methods `bulk_insert!` and `bulk_upsert!`, which you can use to insert many records at once.
-
-## Insert records with `bulk_insert!` and `bulk_upsert!`
-
-If the target class passes the checks performed by `BulkInsertSafe`, you can insert an array of
-ActiveRecord model objects as follows:
-
-```ruby
-records = [MyModel.new, ...]
-
-MyModel.bulk_insert!(records)
-```
-
-Calls to `bulk_insert!` always attempt to insert _new records_. If instead
-you would like to replace existing records with new values, while still inserting those
-that do not already exist, then you can use `bulk_upsert!`:
-
-```ruby
-records = [MyModel.new, existing_model, ...]
-
-MyModel.bulk_upsert!(records, unique_by: [:name])
-```
-
-In this example, `unique_by` specifies the columns by which records are considered to be
-unique and as such are updated if they existed prior to insertion. For example, if
-`existing_model` has a `name` attribute, and if a record with the same `name` value already
-exists, its fields are updated with those of `existing_model`.
-
-The `unique_by` parameter can also be passed as a `Symbol`, in which case it specifies
-a database index by which a column is considered unique:
-
-```ruby
-MyModel.bulk_insert!(records, unique_by: :index_on_name)
-```
-
-### Record validation
-
-The `bulk_insert!` method guarantees that `records` are inserted transactionally, and
-runs validations on each record prior to insertion. If any record fails to validate,
-an error is raised and the transaction is rolled back. You can turn off validations via
-the `:validate` option:
-
-```ruby
-MyModel.bulk_insert!(records, validate: false)
-```
-
-### Batch size configuration
-
-In those cases where the number of `records` is above a given threshold, insertions
-occur in multiple batches. The default batch size is defined in
-[`BulkInsertSafe::DEFAULT_BATCH_SIZE`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/concerns/bulk_insert_safe.rb).
-Assuming a default threshold of 500, inserting 950 records
-would result in two batches being written sequentially (of size 500 and 450 respectively.)
-You can override the default batch size via the `:batch_size` option:
-
-```ruby
-MyModel.bulk_insert!(records, batch_size: 100)
-```
-
-Assuming the same number of 950 records, this would result in 10 batches being written instead.
-Since this also affects the number of `INSERT` statements that occur, make sure you measure the
-performance impact this might have on your code. There is a trade-off between the number of
-`INSERT` statements the database has to process and the size and cost of each `INSERT`.
-
-### Handling duplicate records
-
-NOTE:
-This parameter applies only to `bulk_insert!`. If you intend to update existing
-records, use `bulk_upsert!` instead.
-
-It may happen that some records you are trying to insert already exist, which would result in
-primary key conflicts. There are two ways to address this problem: failing fast by raising an
-error or skipping duplicate records. The default behavior of `bulk_insert!` is to fail fast
-and raise an `ActiveRecord::RecordNotUnique` error.
-
-If this is undesirable, you can instead skip duplicate records with the `skip_duplicates` flag:
-
-```ruby
-MyModel.bulk_insert!(records, skip_duplicates: true)
-```
-
-### Requirements for safe bulk insertions
-
-Large parts of ActiveRecord's persistence API are built around the notion of callbacks. Many
-of these callbacks fire in response to model life cycle events such as `save` or `create`.
-These callbacks cannot be used with bulk insertions, since they are meant to be called for
-every instance that is saved or created. Since these events do not fire when
-records are inserted in bulk, we currently prevent their use.
-
-The specifics around which callbacks are explicitly allowed are defined in
-[`BulkInsertSafe`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/concerns/bulk_insert_safe.rb).
-Consult the module source code for details. If your class uses callbacks that are not explicitly designated
-safe and you `include BulkInsertSafe` the application fails with an error.
-
-### `BulkInsertSafe` versus `InsertAll`
-
-Internally, `BulkInsertSafe` is based on `InsertAll`, and you may wonder when to choose
-the former over the latter. To help you make the decision,
-the key differences between these classes are listed in the table below.
-
-|                | Input type           | Validates input | Specify batch size | Can bypass callbacks              | Transactional |
-|--------------- | -------------------- | --------------- | ------------------ | --------------------------------- | ------------- |
-| `bulk_insert!` | ActiveRecord objects | Yes (optional)  | Yes (optional)     | No (prevents unsafe callback use) | Yes           |
-| `insert_all!`  | Attribute hashes     | No              | No                 | Yes                               | Yes           |
-
-To summarize, `BulkInsertSafe` moves bulk inserts closer to how ActiveRecord objects
-and inserts would normally behave. However, if all you need is to insert raw data in bulk, then
-`insert_all` is more efficient.
-
-## Insert `has_many` associations in bulk
-
-A common use case is to save collections of associated relations through the owner side of the relation,
-where the owned relation is associated to the owner through the `has_many` class method:
-
-```ruby
-owner = OwnerModel.new(owned_relations: array_of_owned_relations)
-# saves all `owned_relations` one-by-one
-owner.save!
-```
-
-This issues a single `INSERT`, and transaction, for every record in `owned_relations`, which is inefficient if
-`array_of_owned_relations` is large. To remedy this, the `BulkInsertableAssociations` concern can be
-used to declare that the owner defines associations that are safe for bulk insertion:
-
-```ruby
-class OwnerModel < ApplicationRecord
-  # other includes here
-  # ...
-  include BulkInsertableAssociations # include this last
-
-  has_many :my_models
-end
-```
-
-Here `my_models` must be declared `BulkInsertSafe` (as described previously) for bulk insertions
-to happen. You can now insert any yet unsaved records as follows:
-
-```ruby
-BulkInsertableAssociations.with_bulk_insert do
-  owner = OwnerModel.new(my_models: array_of_my_model_instances)
-  # saves `my_models` using a single bulk insert (possibly via multiple batches)
-  owner.save!
-end
-```
-
-You can still save relations that are not `BulkInsertSafe` in this block; they
-simply are treated as if you had invoked `save` from outside the block.
-
-## Known limitations
-
-There are a few restrictions to how these APIs can be used:
-
-- `BulkInsertableAssociations`:
-  - It is currently only compatible with `has_many` relations.
-  - It does not yet support `has_many through: ...` relations.
-
-Moreover, input data should either be limited to around 1000 records at most,
-or already batched prior to calling bulk insert. The `INSERT` statement runs in a single
-transaction, so for large amounts of records it may negatively affect database stability.
+<!-- This redirect file can be deleted after <2022-11-05>. -->
+<!-- Redirects that point to other docs in the same project expire in three months. -->
+<!-- Redirects that point to docs in a different project or site (for example, link is not relative and starts with `https:`) expire in one year. -->
+<!-- Before deletion, see: https://docs.gitlab.com/ee/development/documentation/redirects.html -->