diff options
Diffstat (limited to 'doc/development/database/multiple_databases.md')
-rw-r--r-- | doc/development/database/multiple_databases.md | 198 |
1 files changed, 195 insertions, 3 deletions
diff --git a/doc/development/database/multiple_databases.md b/doc/development/database/multiple_databases.md index 2895cef86fc..71dcc5bb866 100644 --- a/doc/development/database/multiple_databases.md +++ b/doc/development/database/multiple_databases.md @@ -61,7 +61,6 @@ development: adapter: postgresql encoding: unicode database: gitlabhq_development_ci - migrations_paths: db/ci_migrate host: /path/to/gdk/postgresql pool: 10 prepared_statements: false @@ -82,7 +81,6 @@ test: &test adapter: postgresql encoding: unicode database: gitlabhq_test_ci - migrations_paths: db/ci_migrate host: /path/to/gdk/postgresql pool: 10 prepared_statements: false @@ -92,10 +90,204 @@ test: &test ### Migrations -Any migrations that affect `Ci::BaseModel` models +Any migrations that affect `Ci::CiDatabaseRecord` models and their tables must be placed in two directories for now: - `db/migrate` - `db/ci_migrate` We aim to keep the schema for both tables the same across both databases. + +### Removing joins between `ci_*` and non `ci_*` tables + +We are planning on moving all the `ci_*` tables to a separate database so +referencing `ci_*` tables with other tables will not be possible. This means, +that using any kind of `JOIN` in SQL queries will not work. We have identified +already many such examples that need to be fixed in +<https://gitlab.com/groups/gitlab-org/-/epics/6289> . + +The following are some real examples that have resulted from this and these +patterns may apply to future cases. + +#### Remove the code + +The simplest solution we've seen several times now has been an existing scope +that is unused. This is the easiest example to fix. So the first step is to +investigate if the code is unused and then remove it. These are some +real examples: + +- <https://gitlab.com/gitlab-org/gitlab/-/merge_requests/67162> +- <https://gitlab.com/gitlab-org/gitlab/-/merge_requests/66714> +- <https://gitlab.com/gitlab-org/gitlab/-/merge_requests/66503> + +There may be more examples where the code is used, but we can evaluate +if we need it or if the feature should behave this way. +Before complicating things by adding new columns and tables, +consider if you can simplify the solution and still meet the requirements. +One case being evaluated involves changing how certain `UsageData` is +calculated to remove a join query in +<https://gitlab.com/gitlab-org/gitlab/-/issues/336170>. This is a good candidate +to evaluate, because `UsageData` is not critical to users and it may be possible +to get a similarly useful metric with a simpler approach. Alternatively we may +find that nobody is using these metrics, so we can remove them. + +#### Use `preload` instead of `includes` + +The `includes` and `preload` methods in Rails are both ways to avoid an N+1 +query. The `includes` method in Rails uses a heuristic approach to determine +if it needs to join to the table, or if it can load all of the +records in a separate query. This method assumes it needs to join if it thinks +you need to query the columns from the other table, but sometimes +this method gets it wrong and executes a join even when not needed. In +this case using `preload` to explicitly load the data in a separate query +allows you to avoid the join, while still avoiding the N+1 query. + +You can see a real example of this solution being used in +<https://gitlab.com/gitlab-org/gitlab/-/merge_requests/67655>. + +#### De-normalize some foreign key to the table + +De-normalization refers to adding redundant precomputed (duplicated) data to +a table to simplify certain queries or to improve performance. In this +case, it can be useful when you are doing a join that involves three tables, where +you are joining through some intermediate table. + +Generally when modeling a database schema, a "normalized" structure is +preferred because of the following reasons: + +- Duplicate data uses extra storage. +- Duplicate data needs to be kept in sync. + +Sometimes normalized data is less performant so de-normalization has been a +common technique GitLab has used to improve the performance of database queries +for a while. The above problems are mitigated when the following conditions are +met: + +1. There isn't much data (for example, it's just an integer column). +1. The data does not update often (for example, the `project_id` column is almost + never updated for most tables). + +One example we found was the `security_scans` table. This table has a foreign +key `security_scans.build_id` which allows you to join to the build. Therefore +you could join to the project like so: + +```sql +select projects.* from security_scans +inner join ci_builds on security_scans.build_id = ci_builds.id +inner join projects on ci_builds.project_id = projects.id +``` + +The problem with this query is that `ci_builds` is in a different database +from the other two tables. + +The solution in this case is to add the `project_id` column to +`security_scans`. This doesn't use much extra storage, and due to the way +these features work, it's never updated (a build never moves projects). + +This simplified the query to: + +```sql +select projects.* from security_scans +inner join projects on security_scans.project_id = projects.id +``` + +This also improves performance because you don't need to join through an extra +table. + +You can see this approach implemented in +<https://gitlab.com/gitlab-org/gitlab/-/merge_requests/66963> . This MR also +de-normalizes `pipeline_id` to fix a similar query. + +#### De-normalize into an extra table + +Sometimes the previous de-normalization (adding an extra column) doesn't work for +your specific case. This may be due to the fact that your data is not 1:1, or +because the table you're adding to is already too wide (for example, the `projects` +table shouldn't have more columns added). + +In this case you may decide to just store the extra data in a separate table. + +One example where this approach is being used was to implement the +`Project.with_code_coverage` scope. This scope was essentially used to narrow +down a list of projects to only those that have at one point in time used code +coverage features. This query (simplified) was: + +```sql +select projects.* from projects +inner join ci_daily_build_group_report_results on ci_daily_build_group_report_results.project_id = projects.id +where ((data->'coverage') is not null) +and ci_daily_build_group_report_results.default_branch = true +group by projects.id +``` + +This work is still in progress but the current plan is to introduce a new table +called `projects_with_ci_feature_usage` which has 2 columns `project_id` and +`ci_feature`. This table would be written to the first time a project creates a +`ci_daily_build_group_report_results` for code coverage. Therefore the new +query would be: + +```sql +select projects.* from projects +inner join projects_with_ci_feature_usage on projects_with_ci_feature_usage.project_id = projects.id +where projects_with_ci_feature_usage.ci_feature = 'code_coverage' +``` + +The above example uses as a text column for simplicity but we should probably +use an [enum](../creating_enums.md) to save space. + +The downside of this new design is that this may need to be +updated (removed if the `ci_daily_build_group_report_results` is deleted). +Depending on your domain, however, this may not be necessary because deletes are +edge cases or impossible, or because the user impact of seeing the project on the +list page may not be problematic. It's also possible to implement the +logic to delete these rows if or whenever necessary in your domain. + +Finally, this de-normalization and new query also improves performance because +it does less joins and needs less filtering. + +#### Summary of cross-join removal patterns + +A quick checklist for fixing a specific join query would be: + +1. Is the code even used? If not just remove it +1. If the code is used, then is this feature even used or can we implement the + feature in a simpler way and still meet the requirements. Always prefer the + simplest option. +1. Can we remove the join if we de-normalize the data you are joining to by + adding a new column +1. Can we remove the join by adding a new table in the correct database that + replicates the minimum data needed to do the join + +#### How to validate you have correctly removed a cross-join + +Using RSpec tests, you can validate all SQL queries within a code block to +ensure that none of them are joining across the two databases. This is a useful +tool to confirm you have correctly fixed an existing cross-join. + +At some point in the future we will have fixed all cross-joins and this tool +will run by default in all tests. For now, the tool needs to be explicitly enabled +for your test. + +You can use this method like so: + +```ruby +it 'does not join across databases' do + with_cross_joins_prevented do + ::Ci::Build.joins(:project).to_a + end +end +``` + +This will raise an exception if the query joins across the two databases. The +previous example is fixed by removing the join, like so: + +```ruby +it 'does not join across databases' do + with_cross_joins_prevented do + ::Ci::Build.preload(:project).to_a + end +end +``` + +You can see a real example of using this method for fixing a cross-join in +<https://gitlab.com/gitlab-org/gitlab/-/merge_requests/67655>. |