| author | GitLab Bot <gitlab-bot@gitlab.com> | 2023-06-07 00:07:14 +0300 |
|---|---|---|
| committer | GitLab Bot <gitlab-bot@gitlab.com> | 2023-06-07 00:07:14 +0300 |
| commit | b04f912deb494b6dc0d0dba36776a5e53f622b43 (patch) | |
| tree | b6af17cf69082cda35c42801a123c6e090b28c12 /doc | |
| parent | 9e5484cee690f8bb2c1796013345d8cbc1872d77 (diff) | |
Add latest changes from gitlab-org/gitlab@master
Diffstat (limited to 'doc')
| -rw-r--r-- | doc/architecture/blueprints/database/automated_query_analysis/index.md | 226 |
| -rw-r--r-- | doc/development/contributing/index.md | 13 |
2 files changed, 230 insertions, 9 deletions
diff --git a/doc/architecture/blueprints/database/automated_query_analysis/index.md b/doc/architecture/blueprints/database/automated_query_analysis/index.md new file mode 100644 index 00000000000..c08784dab48 --- /dev/null +++ b/doc/architecture/blueprints/database/automated_query_analysis/index.md @@ -0,0 +1,226 @@ +--- +status: proposed +creation-date: "2023-02-08" +authors: [ "@mattkasa", "@jon_jenkins" ] +coach: "@DylanGriffith" +approvers: [ "@rogerwoo", "@alexives" ] +owning-stage: "~devops::data_stores" +participating-stages: [] +--- + +# Automated Query Analysis + +## Problem Summary + +Our overarching goal is to improve the reliability and throughput of GitLab’s +database review process. The current process requires merge request authors to +manually provide query plans and raw SQL when introducing new queries or +updating existing queries. This is both time consuming and error prone. + +We believe we can improve operational efficiency by automatically identifying +and analyzing newly introduced SQL queries. This will reduce the risk of human +error, leading to improved system stability and an overall reduction in +performance regressions. + +Our key success metric is a reduction in the number of manual actions required +by both code contributors and database reviewers, while maintaining a consistent +standard for database related code contributions. + +## Goals + +1. Replace the current process of the author manually obtaining SQL and query + plans with an automated process. +1. Decrease the incidence of performance regressions when poorly performing + queries are missed by a manual process. +1. Increase contributor and reviewer efficiency by automating the query testing + portion of database review. + +## Challenges + +- Capturing the number of SQL queries generated by an application the size of + `gitlab-org/gitlab` without causing an increase in CI time and/or resources + may present a challenge. 
+- Storing the number of SQL queries generated by an application the size of + `gitlab-org/gitlab` may consume large amounts of database storage. + +## Opportunity + +- Automated test suites already generate a large number of SQL queries, for + instance `rspec` test suites, that can be captured and used to perform + automated analysis. +- We already utilize `postgres.ai` to analyze query performance, and it has an + API that will allow us to automate the creation of database clones with + realistic production-like data in order to perform automated analysis. +- For customers who do not use something like `postgres.ai`, but who are + connecting to a test database in CI, we would use this connection to generate + query plans. The accuracy of these query plans will be affected by how + realistic the test data is, and can be improved by seeding the test database + with production-like data. +- By storing queries and their query plans, we can tokenize the query plan into + plan components, assign a cost and weight, then match those against a machine + learning model. We can build this model by generating query plans for queries + in our slow query logs, and assign actual cost and weight to their plan + components. This will allow us to leverage our corpus of queries and slow + query logs to predict the performance of arbitrary query text for other + applications and our customers. + +## Proposal + +We plan to automate the process of identifying new and changed database queries, +so that contributors and reviewers can more accurately and efficiently assess +the database performance impact of a code change. + +We will capture queries generated as a side effect of running tests in CI, +normalize them, deduplicate them, analyze them using one or more analyzers, and +store them with their analyses and other metadata for future retrieval and +comparison. 
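The capture step described above (normalize, deduplicate, then store) could be sketched roughly as follows. This is an illustrative sketch only, not the implementation from !111638; the `QueryCapture` module name and the regex-based normalization are assumptions made for the example:

```ruby
require 'digest'
require 'set'

# Hypothetical sketch of normalizing and deduplicating captured SQL.
module QueryCapture
  module_function

  # Replace literal values with placeholders so that, for example,
  # `WHERE id = 1` and `WHERE id = 2` normalize to the same text.
  def normalize(sql)
    sql.strip
       .gsub(/'(?:[^']|'')*'/, '?') # string literals (incl. escaped '')
       .gsub(/\b\d+\b/, '?')        # bare numeric literals
       .gsub(/\s+/, ' ')            # collapse whitespace
  end

  # A stable fingerprint of the normalized query, suitable for a
  # `hash` column used to deduplicate across pipelines.
  def fingerprint(sql)
    Digest::SHA256.hexdigest(normalize(sql))
  end

  # Keep only the first occurrence of each normalized query.
  def deduplicate(queries)
    seen = Set.new
    queries.select { |q| seen.add?(fingerprint(q)) }
  end
end
```

A capture hook (for example, an `ActiveSupport::Notifications` subscriber on the `sql.active_record` event) would feed raw query text through `deduplicate` before storage; a production normalizer would more likely reuse PostgreSQL's own query fingerprinting rather than regexes.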

We will post a comment to the originating merge request, containing a summary of
the new and changed queries, with links to their analyses, and highlighting any
queries that exceed established timing or other performance guidelines.

## Design and implementation details

### Iteration 1

In the first iteration we will focus on how we capture queries, including
normalization, deduplication, and storage. We must consider the performance and
resource impacts on CI pipelines during capture, and include things like
partitioning and time decay for the information we are storing.

#### Capturing queries

We will strive to limit the time and resource impacts on our CI pipelines as
much as possible. These are some of the options we will consider for capturing
queries:

- **Instrumenting `ActiveRecord` in `ruby`**
  - Challenges:
    - Only applies to `ruby` projects so it would not be applicable to projects
      like `container-registry`.
    - Has a non-zero impact on time and resources in CI pipelines (these impacts
      can be observed in
      [!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638)).
  - Opportunities:
    - Simple and straightforward to implement.
    - Allows access to more information (e.g. stacktrace and calling locations).
- **Connection proxy with logging**
  - Challenges:
    - Adds complexity and possible performance overhead.
    - Requires maintaining the code for the proxy.
  - Opportunities:
    - Allows us to customize the capture.
    - Allows us to perform normalization/deduplication at capture time.
- **Built-in logging in `postgresql`**
  - Challenges:
    - Requires adding a configuration to enable logging.
    - May be difficult to obtain the resulting logs.
  - Opportunities:
    - Doesn't require maintaining any code.
    - Lightweight in terms of performance impact.
- **Capture from `pg_stat_statements`**
  - Challenges:
    - Requires creating the extension in the test database.
+ - Requires adding a configuration to set `pg_stat_statements.max` to a value + high enough to capture all queries. + - Consumes shared memory proportional to `pg_stat_statements.max`. + - Opportunities: + - Requires minimal code. + - Simple to obtain the data. + - Data is already normalized. + +We have already built a proof of concept for instrumenting `ActiveRecord` in +`ruby` in +[!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638), so as a +first step we will benchmark the other capture methods against it and select the +best option. + +#### Storing queries + +For the next step of the first iteration we will use the proof of concept in +[!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638) as well +as any data gathered from testing other capture methods to estimate the number +of rows per project, and use the pipeline execution statistics for +`gitlab-org/gitlab` to estimate throughput. These estimates will allow us to +evaluate storage mechanisms that are suitable for our purpose. + +Some of the storage mechanisms we plan to evaluate are: + +- **In the `ci` database in the GitLab database instance** + - Challenges: + - Places additional strain on this resource for `GitLab.com`. + - Opportunities: + - Allows us to utilize existing authentication and access control in the form of `CI_JOB_TOKEN`. + - Allows us to leverage associations with `ci_builds` and `ci_pipelines`. + - Simplifies deployment for self-managed. +- **In a new decomposed database in the GitLab database instance** + - Challenges: + - Adds to required development and testing effort. + - Adds to deployment effort for `GitLab.com`. + - Opportunities: + - Isolates database performance impacts from the existing `main` and `ci` database instances. +- **In a new external service** + - Challenges: + - Adds to required development and testing effort. + - Adds to deployment effort for `GitLab.com` and for self-managed. 
  - Opportunities:
    - Isolates performance impacts from `gitlab-org/gitlab`.
    - Allows us to iterate faster without impacting the main application.
- **In ClickHouse**
  - Challenges:
    - Not yet available for self-managed.
  - Opportunities:
    - Isolates database performance impacts from the existing `main` and `ci` database instances.

An example database schema for storing queries:

```sql
CREATE TABLE queries (
    created_at timestamp with time zone NOT NULL,
    updated_at timestamp with time zone NOT NULL,
    id bigint NOT NULL,
    project_id bigint NOT NULL,
    analysis_id bigint,
    hash text,
    sql text
);
CREATE TABLE pipeline_queries (
    id bigint NOT NULL,
    project_id bigint NOT NULL,
    pipeline_id bigint NOT NULL,
    query_id bigint NOT NULL
);
CREATE TABLE analyses (
    created_at timestamp with time zone NOT NULL,
    updated_at timestamp with time zone NOT NULL,
    id bigint NOT NULL,
    project_id bigint NOT NULL,
    query_id bigint NOT NULL,
    buffers int,
    walltime int,
    explain text,
    analysis_url text
);
```

One possible method of partitioning a schema like the above example would be to
utilize
[sub-partitioning](https://github.com/pgpartman/pg_partman/blob/master/doc/pg_partman.md#sub-partitioning).
If we partition by `project_id` then by some interval of `updated_at`, and touch
the row when we see a query, we can store only queries that the codebase is
still executing, and prune partitions that only contain queries the code is no
longer generating.

### Iteration 2

In the second iteration we plan to identify new and changed queries, and post MR
comments containing a summary. We will begin soliciting feedback on the accuracy
and utility of the information, and improve or filter it to maximize its
usefulness.

### Iteration 3+

In the third and following iterations we plan to automate query analysis using
one or more analyzers, store these analyses, and add them to the MR comments.
We +also intend to re-evaluate our use of the database to store query information, +and the API to retrieve it, and potentially move this to an external service. diff --git a/doc/development/contributing/index.md b/doc/development/contributing/index.md index 98139af23d6..5e255921a29 100644 --- a/doc/development/contributing/index.md +++ b/doc/development/contributing/index.md @@ -84,7 +84,7 @@ For details, see the [merge request workflow](merge_request_workflow.md). 1. When you create a merge request, the [`@gitlab-bot`](https://gitlab.com/gitlab-bot) automatically applies the ["~Community contribution"](https://about.gitlab.com/handbook/engineering/quality/triage-operations/#ensure-quick-feedback-for-community-contributions) label. 1. In the 24-48 hours after you create the merge request, a - [Merge Request Coach](https://about.gitlab.com/handbook/marketing/community-relations/contributor-success/merge-request-coach-lifecycle.html) + [Merge Request Coach](https://about.gitlab.com/handbook/marketing/developer-relations/contributor-success/merge-request-coach-lifecycle.html) will review your merge request and apply stage, group, and type labels. 1. If a merge request was not automatically assigned, ask for a review by typing `@gitlab-bot ready` in a comment. If your code has not been assigned a reviewer within two working days of its initial submission, you can ask @@ -139,15 +139,10 @@ Lastly, keep the following in mind when submitting merge requests: - For the criteria for closing issues, see [the Issue Triage handbook page](https://about.gitlab.com/handbook/engineering/quality/issue-triage/#outdated-issues). - For the criteria for closing merge requests, see [the Merge Request Workflow](merge_request_workflow.md). -## Getting an Enterprise Edition license +## Contributing to Premium/Ultimate features with an Enterprise Edition license -GitLab has two development platforms: - -- GitLab Community Edition (CE), our free and open source edition. 
-- GitLab Enterprise Edition (EE), which is our commercial edition.
-
-If you need a license for contributing to an EE-feature, see
-[relevant information](https://about.gitlab.com/handbook/marketing/community-relations/contributor-success/community-contributors-workflows.html#contributing-to-the-gitlab-enterprise-edition-ee).
+Working on GitLab features that are within a paid tier (the code that lives in the [EE folder](https://gitlab.com/gitlab-org/gitlab/-/tree/master/ee)) requires a GitLab Enterprise Edition license.
+Request an Enterprise Edition Developers License according to the [documented process](https://about.gitlab.com/handbook/marketing/developer-relations/contributor-success/community-contributors-workflows.html#contributing-to-the-gitlab-enterprise-edition-ee).

## Get help