| author | GitLab Bot <gitlab-bot@gitlab.com> | 2023-06-07 00:07:14 +0300 |
|---|---|---|
| committer | GitLab Bot <gitlab-bot@gitlab.com> | 2023-06-07 00:07:14 +0300 |
| commit | b04f912deb494b6dc0d0dba36776a5e53f622b43 (patch) | |
| tree | b6af17cf69082cda35c42801a123c6e090b28c12 /doc | |
| parent | 9e5484cee690f8bb2c1796013345d8cbc1872d77 (diff) | |
Add latest changes from gitlab-org/gitlab@master
Diffstat (limited to 'doc')
| -rw-r--r-- | doc/architecture/blueprints/database/automated_query_analysis/index.md | 226 |
| -rw-r--r-- | doc/development/contributing/index.md | 13 |
2 files changed, 230 insertions, 9 deletions
diff --git a/doc/architecture/blueprints/database/automated_query_analysis/index.md b/doc/architecture/blueprints/database/automated_query_analysis/index.md new file mode 100644 index 00000000000..c08784dab48 --- /dev/null +++ b/doc/architecture/blueprints/database/automated_query_analysis/index.md @@ -0,0 +1,226 @@ +--- +status: proposed +creation-date: "2023-02-08" +authors: [ "@mattkasa", "@jon_jenkins" ] +coach: "@DylanGriffith" +approvers: [ "@rogerwoo", "@alexives" ] +owning-stage: "~devops::data_stores" +participating-stages: [] +--- + +# Automated Query Analysis + +## Problem Summary + +Our overarching goal is to improve the reliability and throughput of GitLab’s +database review process. The current process requires merge request authors to +manually provide query plans and raw SQL when introducing new queries or +updating existing queries. This is both time consuming and error prone. + +We believe we can improve operational efficiency by automatically identifying +and analyzing newly introduced SQL queries. This will reduce the risk of human +error, leading to improved system stability and an overall reduction in +performance regressions. + +Our key success metric is a reduction in the number of manual actions required +by both code contributors and database reviewers, while maintaining a consistent +standard for database related code contributions. + +## Goals + +1. Replace the current process of the author manually obtaining SQL and query + plans with an automated process. +1. Decrease the incidence of performance regressions when poorly performing + queries are missed by a manual process. +1. Increase contributor and reviewer efficiency by automating the query testing + portion of database review. + +## Challenges + +- Capturing the number of SQL queries generated by an application the size of + `gitlab-org/gitlab` without causing an increase in CI time and/or resources + may present a challenge. 
+- Storing the number of SQL queries generated by an application the size of + `gitlab-org/gitlab` may consume large amounts of database storage. + +## Opportunity + +- Automated test suites already generate a large number of SQL queries, for + instance `rspec` test suites, that can be captured and used to perform + automated analysis. +- We already utilize `postgres.ai` to analyze query performance, and it has an + API that will allow us to automate the creation of database clones with + realistic production-like data in order to perform automated analysis. +- For customers who do not use something like `postgres.ai`, but who are + connecting to a test database in CI, we would use this connection to generate + query plans. The accuracy of these query plans will be affected by how + realistic the test data is, and can be improved by seeding the test database + with production-like data. +- By storing queries and their query plans, we can tokenize the query plan into + plan components, assign a cost and weight, then match those against a machine + learning model. We can build this model by generating query plans for queries + in our slow query logs, and assign actual cost and weight to their plan + components. This will allow us to leverage our corpus of queries and slow + query logs to predict the performance of arbitrary query text for other + applications and our customers. + +## Proposal + +We plan to automate the process of identifying new and changed database queries, +so that contributors and reviewers can more accurately and efficiently assess +the database performance impact of a code change. + +We will capture queries generated as a side effect of running tests in CI, +normalize them, deduplicate them, analyze them using one or more analyzers, and +store them with their analyses and other metadata for future retrieval and +comparison. 
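The capture step described above (normalize, deduplicate, then store) could be sketched roughly as follows. This is an illustrative sketch only, not the implementation from !111638; the `QueryCapture` module name and the regex-based normalization are assumptions made for the example:

```ruby
require 'digest'
require 'set'

# Hypothetical sketch of normalizing and deduplicating captured SQL.
module QueryCapture
  module_function

  # Replace literal values with placeholders so that, for example,
  # `WHERE id = 1` and `WHERE id = 2` normalize to the same text.
  def normalize(sql)
    sql.strip
       .gsub(/'(?:[^']|'')*'/, '?') # string literals (incl. escaped '')
       .gsub(/\b\d+\b/, '?')        # bare numeric literals
       .gsub(/\s+/, ' ')            # collapse whitespace
  end

  # A stable fingerprint of the normalized query, suitable for a
  # `hash` column used to deduplicate across pipelines.
  def fingerprint(sql)
    Digest::SHA256.hexdigest(normalize(sql))
  end

  # Keep only the first occurrence of each normalized query.
  def deduplicate(queries)
    seen = Set.new
    queries.select { |q| seen.add?(fingerprint(q)) }
  end
end
```

A capture hook (for example, an `ActiveSupport::Notifications` subscriber on the `sql.active_record` event) would feed raw query text through `deduplicate` before storage; a production normalizer would more likely reuse PostgreSQL's own query fingerprinting rather than regexes.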

We will post a comment to the originating merge request, containing a summary of
the new and changed queries, with links to their analyses, and highlighting any
queries that exceed established timing or other performance guidelines.

## Design and implementation details

### Iteration 1

In the first iteration we will focus on how we capture queries, including
normalization, deduplication, and storage. We must consider the performance and
resource impacts on CI pipelines during capture, and include things like
partitioning and time decay for the information we are storing.

#### Capturing queries

We will strive to limit the time and resource impacts on our CI pipelines as
much as possible. These are some of the options we will consider for capturing
queries:

- **Instrumenting `ActiveRecord` in `ruby`**
  - Challenges:
    - Only applies to `ruby` projects so it would not be applicable to projects
      like `container-registry`.
    - Has a non-zero impact on time and resources in CI pipelines (these impacts
      can be observed in
      [!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638)).
  - Opportunities:
    - Simple and straightforward to implement.
    - Allows access to more information (e.g. stacktrace and calling locations).
- **Connection proxy with logging**
  - Challenges:
    - Adds complexity and possible performance overhead.
    - Requires maintaining the code for the proxy.
  - Opportunities:
    - Allows us to customize the capture.
    - Allows us to perform normalization/deduplication at capture time.
- **Built-in logging in `postgresql`**
  - Challenges:
    - Requires adding a configuration to enable logging.
    - May be difficult to obtain the resulting logs.
  - Opportunities:
    - Doesn't require maintaining any code.
    - Lightweight in terms of performance impact.
- **Capture from `pg_stat_statements`**
  - Challenges:
    - Requires creating the extension in the test database.
+ - Requires adding a configuration to set `pg_stat_statements.max` to a value + high enough to capture all queries. + - Consumes shared memory proportional to `pg_stat_statements.max`. + - Opportunities: + - Requires minimal code. + - Simple to obtain the data. + - Data is already normalized. + +We have already built a proof of concept for instrumenting `ActiveRecord` in +`ruby` in +[!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638), so as a +first step we will benchmark the other capture methods against it and select the +best option. + +#### Storing queries + +For the next step of the first iteration we will use the proof of concept in +[!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638) as well +as any data gathered from testing other capture methods to estimate the number +of rows per project, and use the pipeline execution statistics for +`gitlab-org/gitlab` to estimate throughput. These estimates will allow us to +evaluate storage mechanisms that are suitable for our purpose. + +Some of the storage mechanisms we plan to evaluate are: + +- **In the `ci` database in the GitLab database instance** + - Challenges: + - Places additional strain on this resource for `GitLab.com`. + - Opportunities: + - Allows us to utilize existing authentication and access control in the form of `CI_JOB_TOKEN`. + - Allows us to leverage associations with `ci_builds` and `ci_pipelines`. + - Simplifies deployment for self-managed. +- **In a new decomposed database in the GitLab database instance** + - Challenges: + - Adds to required development and testing effort. + - Adds to deployment effort for `GitLab.com`. + - Opportunities: + - Isolates database performance impacts from the existing `main` and `ci` database instances. +- **In a new external service** + - Challenges: + - Adds to required development and testing effort. + - Adds to deployment effort for `GitLab.com` and for self-managed. 
  - Opportunities:
    - Isolates performance impacts from `gitlab-org/gitlab`.
    - Allows us to iterate faster without impacting the main application.
- **In ClickHouse**
  - Challenges:
    - Not yet available for self-managed.
  - Opportunities:
    - Isolates database performance impacts from the existing `main` and `ci` database instances.

An example database schema for storing queries:

```sql
CREATE TABLE queries (
    created_at timestamp with time zone NOT NULL,
    updated_at timestamp with time zone NOT NULL,
    id bigint NOT NULL,
    project_id bigint NOT NULL,
    analysis_id bigint,
    hash text,
    sql text
);
CREATE TABLE pipeline_queries (
    id bigint NOT NULL,
    project_id bigint NOT NULL,
    pipeline_id bigint NOT NULL,
    query_id bigint NOT NULL
);
CREATE TABLE analyses (
    created_at timestamp with time zone NOT NULL,
    updated_at timestamp with time zone NOT NULL,
    id bigint NOT NULL,
    project_id bigint NOT NULL,
    query_id bigint NOT NULL,
    buffers int,
    walltime int,
    explain text,
    analysis_url text
);
```

One possible method of partitioning a schema like the above example would be to
utilize
[sub-partitioning](https://github.com/pgpartman/pg_partman/blob/master/doc/pg_partman.md#sub-partitioning).
If we partition by `project_id` then by some interval of `updated_at`, and touch
the row when we see a query, we can store only queries that the codebase is
still executing, and prune partitions that only contain queries the code is no
longer generating.

### Iteration 2

In the second iteration we plan to identify new and changed queries, and post MR
comments containing a summary. We will begin soliciting feedback on the accuracy
and utility of the information, and improve or filter it to maximize its
usefulness.

### Iteration 3+

In the third and following iterations we plan to automate query analysis using
one or more analyzers, store these analyses, and add them to the MR comments.
We +also intend to re-evaluate our use of the database to store query information, +and the API to retrieve it, and potentially move this to an external service. diff --git a/doc/development/contributing/index.md b/doc/development/contributing/index.md index 98139af23d6..5e255921a29 100644 --- a/doc/development/contributing/index.md +++ b/doc/development/contributing/index.md @@ -84,7 +84,7 @@ For details, see the [merge request workflow](merge_request_workflow.md). 1. When you create a merge request, the [`@gitlab-bot`](https://gitlab.com/gitlab-bot) automatically applies the ["~Community contribution"](https://about.gitlab.com/handbook/engineering/quality/triage-operations/#ensure-quick-feedback-for-community-contributions) label. 1. In the 24-48 hours after you create the merge request, a - [Merge Request Coach](https://about.gitlab.com/handbook/marketing/community-relations/contributor-success/merge-request-coach-lifecycle.html) + [Merge Request Coach](https://about.gitlab.com/handbook/marketing/developer-relations/contributor-success/merge-request-coach-lifecycle.html) will review your merge request and apply stage, group, and type labels. 1. If a merge request was not automatically assigned, ask for a review by typing `@gitlab-bot ready` in a comment. If your code has not been assigned a reviewer within two working days of its initial submission, you can ask @@ -139,15 +139,10 @@ Lastly, keep the following in mind when submitting merge requests: - For the criteria for closing issues, see [the Issue Triage handbook page](https://about.gitlab.com/handbook/engineering/quality/issue-triage/#outdated-issues). - For the criteria for closing merge requests, see [the Merge Request Workflow](merge_request_workflow.md). -## Getting an Enterprise Edition license +## Contributing to Premium/Ultimate features with an Enterprise Edition license -GitLab has two development platforms: - -- GitLab Community Edition (CE), our free and open source edition. 
-- GitLab Enterprise Edition (EE), which is our commercial edition.
-
-If you need a license for contributing to an EE-feature, see
-[relevant information](https://about.gitlab.com/handbook/marketing/community-relations/contributor-success/community-contributors-workflows.html#contributing-to-the-gitlab-enterprise-edition-ee).
+Working on GitLab features that are within a paid tier (the code that lives in the [EE folder](https://gitlab.com/gitlab-org/gitlab/-/tree/master/ee)) requires a GitLab Enterprise Edition license.
+Request an Enterprise Edition Developers License according to the [documented process](https://about.gitlab.com/handbook/marketing/developer-relations/contributor-success/community-contributors-workflows.html#contributing-to-the-gitlab-enterprise-edition-ee).

## Get help