diff options
Diffstat (limited to 'doc/development/database/adding_database_indexes.md')
-rw-r--r-- | doc/development/database/adding_database_indexes.md | 410 |
1 files changed, 410 insertions, 0 deletions
diff --git a/doc/development/database/adding_database_indexes.md b/doc/development/database/adding_database_indexes.md new file mode 100644 index 00000000000..8abd7c8298e --- /dev/null +++ b/doc/development/database/adding_database_indexes.md @@ -0,0 +1,410 @@ +--- +stage: Data Stores +group: Database +info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments +--- + +# Adding Database Indexes + +Indexes can be used to speed up database queries, but when should you add a new +index? Traditionally the answer to this question has been to add an index for +every column used for filtering or joining data. For example, consider the +following query: + +```sql +SELECT * +FROM projects +WHERE user_id = 2; +``` + +Here we are filtering by the `user_id` column and as such a developer may decide +to index this column. + +While in certain cases indexing columns using the above approach may make sense, +it can actually have a negative impact. Whenever you write data to a table, any +existing indexes must also be updated. The more indexes there are, the slower this +can potentially become. Indexes can also take up significant disk space, depending +on the amount of data indexed and the index type. For example, PostgreSQL offers +`GIN` indexes which can be used to index certain data types that cannot be +indexed by regular B-tree indexes. These indexes, however, generally take up more +data and are slower to update compared to B-tree indexes. + +Because of all this, it's important make the following considerations +when adding a new index: + +1. Do the new queries re-use as many existing indexes as possible? +1. Is there enough data that using an index is faster than iterating over + rows in the table? +1. Is the overhead of maintaining the index worth the reduction in query + timings? + +## Re-using Queries + +The first step is to make sure your query re-uses as many existing indexes as +possible. For example, consider the following query: + +```sql +SELECT * +FROM todos +WHERE user_id = 123 +AND state = 'open'; +``` + +Now imagine we already have an index on the `user_id` column but not on the +`state` column. One may think this query performs badly due to `state` being +unindexed. In reality the query may perform just fine given the index on +`user_id` can filter out enough rows. + +The best way to determine if indexes are re-used is to run your query using +`EXPLAIN ANALYZE`. Depending on the joined tables and the columns being used for filtering, +you may find an extra index doesn't make much, if any, difference. + +In short: + +1. Try to write your query in such a way that it re-uses as many existing + indexes as possible. +1. Run the query using `EXPLAIN ANALYZE` and study the output to find the most + ideal query. + +## Data Size + +A database may not use an index even when a regular sequence scan +(iterating over all rows) is faster, especially for small tables. + +Consider adding an index if a table is expected to grow, and your query has to filter a lot of rows. +You may _not_ want to add an index if the table size is small (<`1,000` records), +or if existing indexes already filter out enough rows. + +## Maintenance Overhead + +Indexes have to be updated on every table write. In the case of PostgreSQL, _all_ +existing indexes are updated whenever data is written to a table. As a +result, having many indexes on the same table slows down writes. It's therefore important +to balance query performance with the overhead of maintaining an extra index. + +Let's say that adding an index reduces SELECT timings by 5 milliseconds but increases +INSERT/UPDATE/DELETE timings by 10 milliseconds. In this case, the new index may not be worth +it. A new index is more valuable when SELECT timings are reduced and INSERT/UPDATE/DELETE +timings are unaffected. + +## Finding Unused Indexes + +To see which indexes are unused you can run the following query: + +```sql +SELECT relname as table_name, indexrelname as index_name, idx_scan, idx_tup_read, idx_tup_fetch, pg_size_pretty(pg_relation_size(indexrelname::regclass)) +FROM pg_stat_all_indexes +WHERE schemaname = 'public' +AND idx_scan = 0 +AND idx_tup_read = 0 +AND idx_tup_fetch = 0 +ORDER BY pg_relation_size(indexrelname::regclass) desc; +``` + +This query outputs a list containing all indexes that are never used and sorts +them by indexes sizes in descending order. This query helps in +determining whether existing indexes are still required. More information on +the meaning of the various columns can be found at +<https://www.postgresql.org/docs/current/monitoring-stats.html>. + +To determine if an index is still being used on production, use the following +Thanos query with your index name: + +```sql +sum(rate(pg_stat_user_indexes_idx_tup_read{env="gprd", indexrelname="index_ci_name", type="patroni-ci"}[5m])) +``` + +Because the query output relies on the actual usage of your database, it +may be affected by factors such as: + +- Certain queries never being executed, thus not being able to use certain + indexes. +- Certain tables having little data, resulting in PostgreSQL using sequence + scans instead of index scans. + +This data is only reliable for a frequently used database with +plenty of data, and using as many GitLab features as possible. + +## Requirements for naming indexes + +Indexes with complex definitions must be explicitly named rather than +relying on the implicit naming behavior of migration methods. In short, +that means you **must** provide an explicit name argument for an index +created with one or more of the following options: + +- `where` +- `using` +- `order` +- `length` +- `type` +- `opclass` + +### Considerations for index names + +Check our [Constraints naming conventions](constraint_naming_convention.md) page. + +### Why explicit names are required + +As Rails is database agnostic, it generates an index name only +from the required options of all indexes: table name and column names. +For example, imagine the following two indexes are created in a migration: + +```ruby +def up + add_index :my_table, :my_column + + add_index :my_table, :my_column, where: 'my_column IS NOT NULL' +end +``` + +Creation of the second index would fail, because Rails would generate +the same name for both indexes. + +This naming issue is further complicated by the behavior of the `index_exists?` method. +It considers only the table name, column names, and uniqueness specification +of the index when making a comparison. Consider: + +```ruby +def up + unless index_exists?(:my_table, :my_column, where: 'my_column IS NOT NULL') + add_index :my_table, :my_column, where: 'my_column IS NOT NULL' + end +end +``` + +The call to `index_exists?` returns true if **any** index exists on +`:my_table` and `:my_column`, and index creation is bypassed. + +The `add_concurrent_index` helper is a requirement for creating indexes +on populated tables. Because it cannot be used inside a transactional +migration, it has a built-in check that detects if the index already +exists. In the event a match is found, index creation is skipped. +Without an explicit name argument, Rails can return a false positive +for `index_exists?`, causing a required index to not be created +properly. By always requiring a name for certain types of indexes, the +chance of error is greatly reduced. + +## Temporary indexes + +There may be times when an index is only needed temporarily. + +For example, in a migration, a column of a table might be conditionally +updated. To query which columns must be updated in the +[query performance guidelines](query_performance.md), an index is needed +that would otherwise not be used. + +In these cases, consider a temporary index. To specify a +temporary index: + +1. Prefix the index name with `tmp_` and follow the [naming conventions](constraint_naming_convention.md). +1. Create a follow-up issue to remove the index in the next (or future) milestone. +1. Add a comment in the migration mentioning the removal issue. + +A temporary migration would look like: + +```ruby +INDEX_NAME = 'tmp_index_projects_on_owner_where_emails_disabled' + +def up + # Temporary index to be removed in 13.9 https://gitlab.com/gitlab-org/gitlab/-/issues/1234 + add_concurrent_index :projects, :creator_id, where: 'emails_disabled = false', name: INDEX_NAME +end + +def down + remove_concurrent_index_by_name :projects, INDEX_NAME +end +``` + +## Create indexes asynchronously + +For very large tables, index creation can be a challenge to manage. +While `add_concurrent_index` creates indexes in a way that does not block +normal traffic, it can still be problematic when index creation runs for +many hours. Necessary database operations like `autovacuum` cannot run, and +on GitLab.com, the deployment process is blocked waiting for index +creation to finish. + +To limit impact on GitLab.com, a process exists to create indexes +asynchronously during weekend hours. Due to generally lower traffic and fewer deployments, +index creation can proceed at a lower level of risk. + +### Schedule index creation for a low-impact time + +1. [Schedule the index to be created](#schedule-the-index-to-be-created). +1. [Verify the MR was deployed and the index exists in production](#verify-the-mr-was-deployed-and-the-index-exists-in-production). +1. [Add a migration to create the index synchronously](#add-a-migration-to-create-the-index-synchronously). + +### Schedule the index to be created + +Create an MR with a post-deployment migration which prepares the index +for asynchronous creation. An example of creating an index using +the asynchronous index helpers can be seen in the block below. This migration +enters the index name and definition into the `postgres_async_indexes` +table. The process that runs on weekends pulls indexes from this +table and attempt to create them. + +```ruby +# in db/post_migrate/ + +INDEX_NAME = 'index_ci_builds_on_some_column' + +def up + prepare_async_index :ci_builds, :some_column, name: INDEX_NAME +end + +def down + unprepare_async_index :ci_builds, :some_column, name: INDEX_NAME +end +``` + +### Verify the MR was deployed and the index exists in production + +You can verify if the post-deploy migration was executed on GitLab.com by: + +- Executing `/chatops run auto_deploy status <merge_sha>`. If the output returns `db/gprd`, + the post-deploy migration has been executed in the production database. More details in this + [guide](https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/post_deploy_migration/readme.md#how-to-determine-if-a-post-deploy-migration-has-been-executed-on-gitlabcom). +- Use a meta-command in #database-lab, such as: `\d <index_name>`. + - Ensure that the index is not [`invalid`](https://www.postgresql.org/docs/12/sql-createindex.html#:~:text=The%20psql%20%5Cd%20command%20will%20report%20such%20an%20index%20as%20INVALID). +- Ask someone in #database to check if the index exists. +- With proper access, you can also verify directly on production or in a + production clone. + +### Add a migration to create the index synchronously + +After the index is verified to exist on the production database, create a second +merge request that adds the index synchronously. The schema changes must be +updated and committed to `structure.sql` in this second merge request. +The synchronous migration results in a no-op on GitLab.com, but you should still add the +migration as expected for other installations. The below block +demonstrates how to create the second migration for the previous +asynchronous example. + +**WARNING:** +Verify that the index exists in production before merging a second migration with `add_concurrent_index`. +If the second migration is deployed before the index has been created, +the index is created synchronously when the second migration executes. + +```ruby +# in db/post_migrate/ + +INDEX_NAME = 'index_ci_builds_on_some_column' + +disable_ddl_transaction! + +def up + add_concurrent_index :ci_builds, :some_column, name: INDEX_NAME +end + +def down + remove_concurrent_index_by_name :ci_builds, INDEX_NAME +end +``` + +## Test database index changes locally + +You must test the database index changes locally before creating a merge request. + +### Verify indexes created asynchronously + +Use the asynchronous index helpers on your local environment to test changes for creating an index: + +1. Enable the feature flags by running `Feature.enable(:database_async_index_creation)` and `Feature.enable(:database_reindexing)` in the Rails console. +1. Run `bundle exec rails db:migrate` so that it creates an entry in the `postgres_async_indexes` table. +1. Run `bundle exec rails gitlab:db:reindex` so that the index is created asynchronously. +1. To verify the index, open the PostgreSQL console using the [GDK](https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/postgresql.md) command `gdk psql` and run the command `\d <index_name>` to check that your newly created index exists. + +## Drop indexes asynchronously + +For very large tables, index destruction can be a challenge to manage. +While `remove_concurrent_index` removes indexes in a way that does not block +normal traffic, it can still be problematic if index destruction runs for +during autovacuum. Necessary database operations like `autovacuum` cannot run, and +the deployment process on GitLab.com is blocked while waiting for index +destruction to finish. + +To limit the impact on GitLab.com, use the following process to remove indexes +asynchronously during weekend hours. Due to generally lower traffic and fewer deployments, +index destruction can proceed at a lower level of risk. + +1. [Schedule the index to be removed](#schedule-the-index-to-be-removed). +1. [Verify the MR was deployed and the index exists in production](#verify-the-mr-was-deployed-and-the-index-exists-in-production). +1. [Add a migration to create the index synchronously](#add-a-migration-to-create-the-index-synchronously). + +### Schedule the index to be removed + +Create an MR with a post-deployment migration which prepares the index +for asynchronous destruction. For example. to destroy an index using +the asynchronous index helpers: + +```ruby +# in db/post_migrate/ + +INDEX_NAME = 'index_ci_builds_on_some_column' + +def up + prepare_async_index_removal :ci_builds, :some_column, name: INDEX_NAME +end + +def down + unprepare_async_index :ci_builds, :some_column, name: INDEX_NAME +end +``` + +This migration enters the index name and definition into the `postgres_async_indexes` +table. The process that runs on weekends pulls indexes from this table and attempt +to remove them. + +You must test the database index changes locally before creating a merge request. + +### Verify the MR was deployed and the index exists in production + +You can verify if the MR was deployed to GitLab.com with +`/chatops run auto_deploy status <merge_sha>`. To verify the existence of +the index, you can: + +- Use a meta-command in `#database-lab`, for example: `\d <index_name>`. + - Make sure the index is not [`invalid`](https://www.postgresql.org/docs/12/sql-createindex.html#:~:text=The%20psql%20%5Cd%20command%20will%20report%20such%20an%20index%20as%20INVALID). +- Ask someone in `#database` to check if the index exists. +- If you have access, you can verify directly on production or in a + production clone. + +### Add a migration to destroy the index synchronously + +After you verify the index exists in the production database, create a second +merge request that removes the index synchronously. The schema changes must be +updated and committed to `structure.sql` in this second merge request. +The synchronous migration results in a no-op on GitLab.com, but you should still add the +migration as expected for other installations. For example, to +create the second migration for the previous asynchronous example: + +**WARNING:** +Verify that the index no longer exist in production before merging a second migration with `remove_concurrent_index_by_name`. +If the second migration is deployed before the index has been destroyed, +the index is destroyed synchronously when the second migration executes. + +```ruby +# in db/post_migrate/ + +INDEX_NAME = 'index_ci_builds_on_some_column' + +disable_ddl_transaction! + +def up + remove_concurrent_index_by_name :ci_builds, name: INDEX_NAME +end + +def down + add_concurrent_index :ci_builds, :some_column, INDEX_NAME +end +``` + +### Verify indexes removed asynchronously + +To test changes for removing an index, use the asynchronous index helpers on your local environment: + +1. Enable the feature flags by running `Feature.enable(:database_async_index_destruction)` and `Feature.enable(:database_reindexing)` in the Rails console. +1. Run `bundle exec rails db:migrate` which should create an entry in the `postgres_async_indexes` table. +1. Run `bundle exec rails gitlab:db:reindex` destroy the index asynchronously. +1. To verify the index, open the PostgreSQL console by using the [GDK](https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/postgresql.md) + command `gdk psql` and run `\d <index_name>` to check that the destroyed index no longer exists. |