gitlab.com/gitlab-org/gitlab-foss.git
author    GitLab Bot <gitlab-bot@gitlab.com> 2023-06-07 00:07:14 +0300
committer GitLab Bot <gitlab-bot@gitlab.com> 2023-06-07 00:07:14 +0300
commit    b04f912deb494b6dc0d0dba36776a5e53f622b43 (patch)
tree      b6af17cf69082cda35c42801a123c6e090b28c12 /doc
parent    9e5484cee690f8bb2c1796013345d8cbc1872d77 (diff)
Add latest changes from gitlab-org/gitlab@master
Diffstat (limited to 'doc')
-rw-r--r--  doc/architecture/blueprints/database/automated_query_analysis/index.md | 226
-rw-r--r--  doc/development/contributing/index.md | 13
2 files changed, 230 insertions(+), 9 deletions(-)
diff --git a/doc/architecture/blueprints/database/automated_query_analysis/index.md b/doc/architecture/blueprints/database/automated_query_analysis/index.md
new file mode 100644
index 00000000000..c08784dab48
--- /dev/null
+++ b/doc/architecture/blueprints/database/automated_query_analysis/index.md
@@ -0,0 +1,226 @@
+---
+status: proposed
+creation-date: "2023-02-08"
+authors: [ "@mattkasa", "@jon_jenkins" ]
+coach: "@DylanGriffith"
+approvers: [ "@rogerwoo", "@alexives" ]
+owning-stage: "~devops::data_stores"
+participating-stages: []
+---
+
+# Automated Query Analysis
+
+## Problem Summary
+
+Our overarching goal is to improve the reliability and throughput of GitLab’s
+database review process. The current process requires merge request authors to
+manually provide query plans and raw SQL when introducing new queries or
+updating existing queries. This is both time-consuming and error-prone.
+
+We believe we can improve operational efficiency by automatically identifying
+and analyzing newly introduced SQL queries. This will reduce the risk of human
+error, leading to improved system stability and an overall reduction in
+performance regressions.
+
+Our key success metric is a reduction in the number of manual actions required
+by both code contributors and database reviewers, while maintaining a consistent
+standard for database-related code contributions.
+
+## Goals
+
+1. Replace the current process of the author manually obtaining SQL and query
+ plans with an automated process.
+1. Decrease the incidence of performance regressions caused by poorly
+   performing queries that a manual process can miss.
+1. Increase contributor and reviewer efficiency by automating the query testing
+ portion of database review.
+
+## Challenges
+
+- Capturing the volume of SQL queries generated by an application the size of
+  `gitlab-org/gitlab` without causing an increase in CI time and/or resources
+  may present a challenge.
+- Storing the volume of SQL queries generated by an application the size of
+  `gitlab-org/gitlab` may consume large amounts of database storage.
+
+## Opportunity
+
+- Automated test suites, such as `rspec` suites, already generate a large
+  number of SQL queries that can be captured and used to perform automated
+  analysis.
+- We already utilize `postgres.ai` to analyze query performance, and it has an
+ API that will allow us to automate the creation of database clones with
+ realistic production-like data in order to perform automated analysis.
+- For customers who do not use something like `postgres.ai`, but who are
+  connecting to a test database in CI, we would use this connection to
+  generate query plans (see the sketch after this list). The accuracy of
+  these query plans will be affected by how realistic the test data is, and
+  can be improved by seeding the test database with production-like data.
+- By storing queries and their query plans, we can tokenize the query plan into
+ plan components, assign a cost and weight, then match those against a machine
+ learning model. We can build this model by generating query plans for queries
+ in our slow query logs, and assign actual cost and weight to their plan
+ components. This will allow us to leverage our corpus of queries and slow
+ query logs to predict the performance of arbitrary query text for other
+ applications and our customers.
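+
+For illustration, the plan generation referenced above amounts to running
+`EXPLAIN` for each captured query against a database containing representative
+data. A minimal sketch, with a hypothetical table and predicate:
+
+```sql
+-- BUFFERS adds I/O detail; FORMAT JSON produces a machine-readable plan that
+-- can later be tokenized into plan components.
+EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
+SELECT "users".*
+FROM "users"
+WHERE "users"."state" = 'active';
+```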
+
+## Proposal
+
+We plan to automate the process of identifying new and changed database queries,
+so that contributors and reviewers can more accurately and efficiently assess
+the database performance impact of a code change.
+
+We will capture queries generated as a side effect of running tests in CI,
+normalize them, deduplicate them, analyze them using one or more analyzers, and
+store them with their analyses and other metadata for future retrieval and
+comparison.
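+
+As a sketch of the normalization and deduplication steps, assuming a `queries`
+table like the example schema later in this document, with a unique index on
+`hash` and `id` generation omitted:
+
+```sql
+-- Two raw queries captured from a test run, for example:
+--   SELECT "users".* FROM "users" WHERE "users"."id" = 1;
+--   SELECT "users".* FROM "users" WHERE "users"."id" = 2;
+-- normalize to the same parameterized text, so they deduplicate to a single
+-- stored row keyed by a hash of the normalized text.
+INSERT INTO queries (project_id, hash, sql, created_at, updated_at)
+VALUES (
+  1,
+  md5('SELECT "users".* FROM "users" WHERE "users"."id" = $1'),
+  'SELECT "users".* FROM "users" WHERE "users"."id" = $1',
+  now(),
+  now()
+)
+ON CONFLICT (hash) DO UPDATE SET updated_at = now();
+```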
+
+We will post a comment to the originating merge request, containing a summary of
+the new and changed queries, with links to their analyses, and highlighting any
+queries that exceed established timing or other performance guidelines.
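+
+The guideline check itself can stay simple. As a sketch against the example
+schema later in this document, assuming `walltime` is stored in milliseconds
+and a 100 ms guideline:
+
+```sql
+-- Flag queries whose measured wall time exceeds the guideline.
+SELECT q.sql, a.walltime, a.analysis_url
+FROM analyses a
+JOIN queries q ON q.id = a.query_id
+WHERE a.walltime > 100;
+```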
+
+## Design and implementation details
+
+### Iteration 1
+
+In the first iteration we will focus on how we capture queries, including
+normalization, deduplication, and storage. We must consider the performance and
+resource impacts on CI pipelines during capture, and include things like
+partitioning and time decay for the information we are storing.
+
+#### Capturing queries
+
+We will strive to limit the time and resource impacts on our CI pipelines as
+much as possible. These are some of the options we will consider for capturing
+queries:
+
+- **Instrumenting `ActiveRecord` in `ruby`**
+ - Challenges:
+    - Only applies to `ruby` projects, so it would not apply to projects like
+      `container-registry`.
+ - Has a non-zero impact on time and resources in CI pipelines (these impacts
+ can be observed in
+      [!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638)).
+ - Opportunities:
+ - Simple and straightforward to implement.
+    - Allows access to more information (for example, stack traces and calling locations).
+- **Connection proxy with logging**
+ - Challenges:
+ - Adds complexity and possible performance overhead.
+ - Requires maintaining the code for the proxy.
+ - Opportunities:
+ - Allows us to customize the capture.
+ - Allows us to perform normalization/deduplication at capture time.
+- **Built-in logging in `postgresql`**
+ - Challenges:
+    - Requires adding configuration to enable logging.
+ - May be difficult to obtain the resulting logs.
+ - Opportunities:
+ - Doesn't require maintaining any code.
+    - Lightweight in terms of performance impact.
+- **Capture from `pg_stat_statements`**
+ - Challenges:
+ - Requires creating the extension in the test database.
+ - Requires adding a configuration to set `pg_stat_statements.max` to a value
+ high enough to capture all queries.
+ - Consumes shared memory proportional to `pg_stat_statements.max`.
+ - Opportunities:
+ - Requires minimal code.
+ - Simple to obtain the data.
+    - Data is already normalized (see the sketch after this list).
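+
+As a sketch of the `pg_stat_statements` option (configuration values are
+illustrative, not recommendations):
+
+```sql
+-- Requires shared_preload_libraries = 'pg_stat_statements' and a
+-- sufficiently high pg_stat_statements.max in postgresql.conf.
+CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
+
+-- Reset counters before the test suite runs ...
+SELECT pg_stat_statements_reset();
+
+-- ... then read the already-normalized query texts after it finishes.
+SELECT queryid, query, calls
+FROM pg_stat_statements
+ORDER BY calls DESC;
+```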
+
+We have already built a proof of concept for instrumenting `ActiveRecord` in
+`ruby` in
+[!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638), so as a
+first step we will benchmark the other capture methods against it and select the
+best option.
+
+#### Storing queries
+
+For the next step of the first iteration we will use the proof of concept in
+[!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638) as well
+as any data gathered from testing other capture methods to estimate the number
+of rows per project, and use the pipeline execution statistics for
+`gitlab-org/gitlab` to estimate throughput. These estimates will allow us to
+evaluate storage mechanisms that are suitable for our purpose.
+
+Some of the storage mechanisms we plan to evaluate are:
+
+- **In the `ci` database in the GitLab database instance**
+ - Challenges:
+ - Places additional strain on this resource for `GitLab.com`.
+ - Opportunities:
+ - Allows us to utilize existing authentication and access control in the form of `CI_JOB_TOKEN`.
+ - Allows us to leverage associations with `ci_builds` and `ci_pipelines`.
+ - Simplifies deployment for self-managed.
+- **In a new decomposed database in the GitLab database instance**
+ - Challenges:
+ - Adds to required development and testing effort.
+ - Adds to deployment effort for `GitLab.com`.
+ - Opportunities:
+ - Isolates database performance impacts from the existing `main` and `ci` database instances.
+- **In a new external service**
+ - Challenges:
+ - Adds to required development and testing effort.
+ - Adds to deployment effort for `GitLab.com` and for self-managed.
+ - Opportunities:
+ - Isolates performance impacts from `gitlab-org/gitlab`.
+ - Allows us to iterate faster without impacting the main application.
+- **In ClickHouse**
+ - Challenges:
+ - Not yet available for self-managed.
+ - Opportunities:
+ - Isolates database performance impacts from the existing `main` and `ci` database instances.
+
+An example database schema for storing queries:
+
+```sql
+CREATE TABLE queries (
+ created_at timestamp with time zone NOT NULL,
+ updated_at timestamp with time zone NOT NULL,
+ id bigint NOT NULL,
+ project_id bigint NOT NULL,
+ analysis_id bigint,
+ hash text,
+ sql text
+);
+CREATE TABLE pipeline_queries (
+ id bigint NOT NULL,
+ project_id bigint NOT NULL,
+ pipeline_id bigint NOT NULL,
+ query_id bigint NOT NULL
+);
+CREATE TABLE analyses (
+ created_at timestamp with time zone NOT NULL,
+ updated_at timestamp with time zone NOT NULL,
+ id bigint NOT NULL,
+ project_id bigint NOT NULL,
+ query_id bigint NOT NULL,
+ buffers int,
+ walltime int,
+ explain text,
+ analysis_url text
+);
+```
+
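+Under this example schema, retrieving a pipeline's queries and their analyses
+is a straightforward join (the pipeline ID is illustrative):
+
+```sql
+SELECT q.sql, a.walltime, a.buffers, a.analysis_url
+FROM pipeline_queries pq
+JOIN queries q ON q.id = pq.query_id
+LEFT JOIN analyses a ON a.id = q.analysis_id
+WHERE pq.pipeline_id = 12345;
+```
+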
+One possible method of partitioning a schema like the above example would be to
+utilize
+[sub-partitioning](https://github.com/pgpartman/pg_partman/blob/master/doc/pg_partman.md#sub-partitioning).
+If we partition by `project_id` then by some interval of `updated_at`, and touch
+the row when we see a query, we can store only queries that the codebase is
+still executing, and prune partitions that only contain queries the code is no
+longer generating.
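+
+A minimal sketch of that layout, shown here with native declarative
+partitioning and a simplified `queries` table (partition names, bounds, and
+the project ID are illustrative):
+
+```sql
+-- Partition by project, then by updated_at, so that stale partitions can be
+-- dropped wholesale.
+CREATE TABLE queries (
+  id bigint NOT NULL,
+  project_id bigint NOT NULL,
+  updated_at timestamp with time zone NOT NULL,
+  hash text,
+  sql text
+) PARTITION BY LIST (project_id);
+
+CREATE TABLE queries_p1 PARTITION OF queries
+  FOR VALUES IN (1)
+  PARTITION BY RANGE (updated_at);
+
+CREATE TABLE queries_p1_2023_06 PARTITION OF queries_p1
+  FOR VALUES FROM ('2023-06-01') TO ('2023-07-01');
+
+-- Once every row in a monthly partition has aged out, drop it to prune
+-- queries the code is no longer generating.
+DROP TABLE queries_p1_2023_06;
+```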
+
+### Iteration 2
+
+In the second iteration we plan to identify new and changed queries, and post MR
+comments containing a summary. We will begin soliciting feedback on the accuracy
+and utility of the information, and improve or filter it to maximize its
+usefulness.
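+
+As a sketch of the identification step against the example schema from
+iteration 1 (pipeline IDs are illustrative):
+
+```sql
+-- Queries seen in the MR pipeline (12346) but not in the latest pipeline on
+-- the target branch (12345) are "new" queries.
+SELECT q.id, q.sql
+FROM pipeline_queries pq
+JOIN queries q ON q.id = pq.query_id
+WHERE pq.pipeline_id = 12346
+  AND NOT EXISTS (
+    SELECT 1
+    FROM pipeline_queries base
+    WHERE base.pipeline_id = 12345
+      AND base.query_id = pq.query_id
+  );
+```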
+
+### Iteration 3+
+
+In the third and following iterations we plan to automate query analysis using
+one or more analyzers, store these analyses, and add them to the MR comments. We
+also intend to re-evaluate our use of the database to store query information,
+and the API to retrieve it, and potentially move this to an external service.
diff --git a/doc/development/contributing/index.md b/doc/development/contributing/index.md
index 98139af23d6..5e255921a29 100644
--- a/doc/development/contributing/index.md
+++ b/doc/development/contributing/index.md
@@ -84,7 +84,7 @@ For details, see the [merge request workflow](merge_request_workflow.md).
1. When you create a merge request, the [`@gitlab-bot`](https://gitlab.com/gitlab-bot) automatically applies
the ["~Community contribution"](https://about.gitlab.com/handbook/engineering/quality/triage-operations/#ensure-quick-feedback-for-community-contributions) label.
1. In the 24-48 hours after you create the merge request, a
- [Merge Request Coach](https://about.gitlab.com/handbook/marketing/community-relations/contributor-success/merge-request-coach-lifecycle.html)
+ [Merge Request Coach](https://about.gitlab.com/handbook/marketing/developer-relations/contributor-success/merge-request-coach-lifecycle.html)
will review your merge request and apply stage, group, and type labels.
1. If a merge request was not automatically assigned, ask for a review by typing `@gitlab-bot ready` in a comment.
If your code has not been assigned a reviewer within two working days of its initial submission, you can ask
@@ -139,15 +139,10 @@ Lastly, keep the following in mind when submitting merge requests:
- For the criteria for closing issues, see [the Issue Triage handbook page](https://about.gitlab.com/handbook/engineering/quality/issue-triage/#outdated-issues).
- For the criteria for closing merge requests, see [the Merge Request Workflow](merge_request_workflow.md).
-## Getting an Enterprise Edition license
+## Contributing to Premium/Ultimate features with an Enterprise Edition license
-GitLab has two development platforms:
-
-- GitLab Community Edition (CE), our free and open source edition.
-- GitLab Enterprise Edition (EE), which is our commercial edition.
-
-If you need a license for contributing to an EE-feature, see
-[relevant information](https://about.gitlab.com/handbook/marketing/community-relations/contributor-success/community-contributors-workflows.html#contributing-to-the-gitlab-enterprise-edition-ee).
+Working on GitLab features that are within a paid tier, also known as the code that lives in the [EE folder](https://gitlab.com/gitlab-org/gitlab/-/tree/master/ee), requires a GitLab Enterprise Edition license.
+Request an Enterprise Edition Developers License according to the [documented process](https://about.gitlab.com/handbook/marketing/developer-relations/contributor-success/community-contributors-workflows.html#contributing-to-the-gitlab-enterprise-edition-ee).
## Get help