gitlab.com/gitlab-org/gitlab-foss.git
author    GitLab Bot <gitlab-bot@gitlab.com> 2023-02-06 21:09:43 +0300
committer GitLab Bot <gitlab-bot@gitlab.com> 2023-02-06 21:09:43 +0300
commit    33998a0e768263828f497685ae030f585193317f (patch)
tree      791e2c3322a17f668a2a4abdd251a8362885f381 /doc
parent    7b69a22d499787378aa30561822ef797a99c22e5 (diff)
Add latest changes from gitlab-org/gitlab@master
Diffstat (limited to 'doc')
-rw-r--r--  doc/administration/inactive_project_deletion.md                        |   3
-rw-r--r--  doc/development/database/avoiding_downtime_in_migrations.md            |  37
-rw-r--r--  doc/development/database/clickhouse/gitlab_activity_data.md            | 482
-rw-r--r--  doc/development/database/clickhouse/index.md                            |  85
-rw-r--r--  doc/development/database/clickhouse/optimization.md                     |  60
-rw-r--r--  doc/development/database/index.md                                       |   6
-rw-r--r--  doc/development/testing_guide/end_to_end/package_and_test_pipeline.md  |  10
-rw-r--r--  doc/user/application_security/policies/scan-execution-policies.md      |   9
-rw-r--r--  doc/user/group/epics/manage_epics.md                                    |   2
9 files changed, 669 insertions, 25 deletions
diff --git a/doc/administration/inactive_project_deletion.md b/doc/administration/inactive_project_deletion.md
index ed75373448e..aad6a420246 100644
--- a/doc/administration/inactive_project_deletion.md
+++ b/doc/administration/inactive_project_deletion.md
@@ -32,6 +32,9 @@ To configure deletion of inactive projects:
- The email duration must be less than the **Delete project after** duration.
1. Select **Save changes**.
+Inactive projects that meet the criteria are scheduled for deletion and a warning email is sent. If the
+projects remain inactive, they are deleted after the specified duration.
+
### Configuration example
If you use these settings:
diff --git a/doc/development/database/avoiding_downtime_in_migrations.md b/doc/development/database/avoiding_downtime_in_migrations.md
index fc1f94cbc2c..8e1eeee7a42 100644
--- a/doc/development/database/avoiding_downtime_in_migrations.md
+++ b/doc/development/database/avoiding_downtime_in_migrations.md
@@ -447,14 +447,16 @@ without downtime and causing too much load on the database is described below.
To start the process, add a regular migration to create the new `bigint` columns. Use the provided
`initialize_conversion_of_integer_to_bigint` helper. The helper also creates a database trigger
-to keep in sync both columns for any new records ([see an example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/migrate/20210608072312_initialize_conversion_of_ci_stages_to_bigint.rb)):
+to keep in sync both columns for any new records ([code](https://gitlab.com/gitlab-org/gitlab/-/blob/97aee76c4bfc2043dc0a1ef9ffbb71c58e0e2857/db/migrate/20230127093353_initialize_conversion_of_merge_request_metrics_to_bigint.rb)):
```ruby
-class InitializeConversionOfCiStagesToBigint < Gitlab::Database::Migration[2.1]
- TABLE = :ci_stages
- COLUMNS = %i[id]
+# frozen_string_literal: true
+
+class InitializeConversionOfMergeRequestMetricsToBigint < Gitlab::Database::Migration[2.1]
+ disable_ddl_transaction!
- enable_lock_retries!
+ TABLE = :merge_request_metrics
+ COLUMNS = %i[id]
def up
initialize_conversion_of_integer_to_bigint(TABLE, COLUMNS)
@@ -469,25 +471,28 @@ end
Ignore the new `bigint` columns:
```ruby
-module Ci
- class Stage < Ci::ApplicationRecord
- include IgnorableColumns
- ignore_column :id_convert_to_bigint, remove_with: '14.2', remove_after: '2021-08-22'
- end
+# frozen_string_literal: true
+
+class MergeRequest::Metrics < ApplicationRecord
+ include IgnorableColumns
+ ignore_column :id_convert_to_bigint, remove_with: '16.0', remove_after: '2023-05-22'
+end
```
-Enqueue batched background migration ([another example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/migrate/20210608072346_backfill_ci_stages_for_bigint_conversion.rb))
+Enqueue batched background migration ([code](https://gitlab.com/gitlab-org/gitlab/-/blob/97aee76c4bfc2043dc0a1ef9ffbb71c58e0e2857/db/post_migrate/20230127101834_backfill_merge_request_metrics_for_bigint_conversion.rb))
to migrate the existing data:
```ruby
-class BackfillCiStagesForBigintConversion < Gitlab::Database::Migration[2.1]
- TABLE = :ci_stages
- COLUMNS = %i[id]
+# frozen_string_literal: true
- restrict_gitlab_migration gitlab_schema: :gitlab_ci
+class BackfillMergeRequestMetricsForBigintConversion < Gitlab::Database::Migration[2.1]
+ restrict_gitlab_migration gitlab_schema: :gitlab_main
+
+ TABLE = :merge_request_metrics
+ COLUMNS = %i[id]
def up
- backfill_conversion_of_integer_to_bigint(TABLE, COLUMNS)
+ backfill_conversion_of_integer_to_bigint(TABLE, COLUMNS, sub_batch_size: 200)
end
def down
diff --git a/doc/development/database/clickhouse/gitlab_activity_data.md b/doc/development/database/clickhouse/gitlab_activity_data.md
new file mode 100644
index 00000000000..6ba11b8afaf
--- /dev/null
+++ b/doc/development/database/clickhouse/gitlab_activity_data.md
@@ -0,0 +1,482 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
+---
+
+# Store GitLab activity data in ClickHouse
+
+## Overview of the existing implementation
+
+### What is GitLab activity data
+
+GitLab records activity data during its operation as users interact with the application. Most of these interactions revolve around the projects, issues, and merge requests domain objects. Users can perform several different actions and some of these actions are recorded in a separate PostgreSQL database table called `events`.
+
+Example events:
+
+- Issue opened
+- Issue reopened
+- User joined a project
+- Merge Request merged
+- Repository pushed
+- Snippet created
+
+### Where is the activity data used
+
+Several features use activity data:
+
+- The user's [contribution calendar](../../../user/profile/contributions_calendar.md) on the profile page.
+- Paginated list of the user's contributions.
+- Paginated list of user activity for a Project and a Group.
+- [Contribution analytics](../../../user/group/contribution_analytics/index.md).
+
+### How is the activity data created
+
+The activity data is usually generated in the service layer when the user executes a specific operation. The persistence characteristics of an `events` record depend on the implementation of the service. Two main approaches exist:
+
+1. In the database transaction where the actual event occurs.
+1. After the database transaction (which could be delayed).
+
+The above-mentioned mechanics provide a "mostly" consistent stream of `events`.
+
+For example, consistently recording an `events` record:
+
+```ruby
+ApplicationRecord.transaction do
+ issue.closed!
+ Event.create!(action: :closed, target: issue)
+end
+```
+
+For example, unsafe recording of an `events` record:
+
+```ruby
+ApplicationRecord.transaction do
+ issue.closed!
+end
+
+# If a crash happens here, the event will not be recorded.
+Event.create!(action: :closed, target: issue)
+```
+
+### Database table structure
+
+The `events` table uses [polymorphic association](https://guides.rubyonrails.org/association_basics.html#polymorphic-associations) to allow associating different database tables (issues, merge requests, etc.) with a record. A simplified database structure:
+
+```sql
+ Column | Type | Nullable | Default | Storage |
+-------------+--------------------------+-----------+----------+------------------------------------+
+ project_id | integer | | | plain |
+ author_id | integer | not null | | plain |
+ target_id | integer | | | plain |
+ created_at | timestamp with time zone | not null | | plain |
+ updated_at | timestamp with time zone | not null | | plain |
+ action | smallint | not null | | plain |
+ target_type | character varying | | | extended |
+ group_id | bigint | | | plain |
+ fingerprint | bytea | | | extended |
+ id | bigint | not null | nextval('events_id_seq'::regclass) | plain |
+```
+
+Some unexpected characteristics due to the evolving database design:
+
+- The `project_id` and the `group_id` columns are mutually exclusive; internally, we call them the resource parent.
+ - Example 1: for an issue opened event, the `project_id` field is populated.
+ - Example 2: for an epic-related event, the `group_id` field is populated (epic is always part of a group).
+- The `target_id` and `target_type` column pair identifies the target record.
+ - Example: `target_id=1` and `target_type=Issue`.
+ - When the columns are `null`, we refer to an event which has no representation in the database. For example, a repository `push` action.
+- The fingerprint is used in some cases to later alter the event based on some metadata change. This approach is mostly used for Wiki pages.
+
+### Database record modifications
+
+Most of the data is written once; however, the table is not strictly append-only. A few use cases where actual row updates and deletions happen:
+
+- Fingerprint-based update for certain Wiki page records.
+- When a user or an associated resource is deleted, the event rows are also deleted.
+ - The deletion of the associated `events` records happens in batches (a sketch follows this list).
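+
+A minimal sketch of what such a batched cleanup could look like with plain ActiveRecord; the actual
+implementation differs and the batch size is only illustrative:
+
+```ruby
+# Illustrative only: delete a user's events in batches of 1,000 rows
+# to avoid a single long-running DELETE and a large transaction.
+Event.where(author_id: user.id).in_batches(of: 1_000) do |batch|
+  batch.delete_all
+end
+```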
+
+### Current performance problems
+
+- The table uses significant disk space.
+- Adding new events may significantly increase the database record count.
+- Implementing data pruning logic is difficult.
+- Time-range-based aggregations are not performant enough; some features may break due to slow database queries.
+
+### Example queries
+
+NOTE:
+These queries have been significantly simplified from the actual queries from production.
+
+Database query for the user's contribution graph:
+
+```sql
+SELECT DATE(events.created_at), COUNT(*)
+FROM events
+WHERE events.author_id = 1
+AND events.created_at BETWEEN '2022-01-17 23:00:00' AND '2023-01-18 22:59:59.999999'
+AND (
+ (
+ events.action = 5
+ ) OR
+ (
+ events.action IN (1, 3) -- Enum values are documented in the Event model, see the ACTIONS constant in app/models/event.rb
+ AND events.target_type IN ('Issue', 'WorkItem')
+ ) OR
+ (
+ events.action IN (7, 1, 3)
+ AND events.target_type = 'MergeRequest'
+ ) OR
+ (
+ events.action = 6
+ )
+)
+GROUP BY DATE(events.created_at)
+```
+
+Query for group contributions for each user:
+
+```sql
+SELECT events.author_id, events.target_type, events.action, COUNT(*)
+FROM events
+WHERE events.created_at BETWEEN '2022-01-17 23:00:00' AND '2023-03-18 22:59:59.999999'
+AND events.project_id IN (1, 2, 3) -- list of project ids in the group
+GROUP BY events.author_id, events.target_type, events.action
+```
+
+## Storing activity data in ClickHouse
+
+### Data persistence
+
+At the moment, there is no consensus about the way we would replicate data from the PostgreSQL database to ClickHouse. A few ideas that might work for the `events` table:
+
+#### Record data immediately
+
+This approach provides a simple way to keep the existing `events` table working while we're also sending data to the ClickHouse database. When an event record is created, ensure that it's created outside of the transaction. After persisting the data in PostgreSQL, persist it in ClickHouse.
+
+```ruby
+ApplicationRecord.transaction do
+ issue.update!(state: :closed)
+end
+
+# could be a method to hide complexity
+Event.create!(action: :closed, target: issue)
+ClickHouse::Event.create(action: :closed, target: issue)
+```
+
+What's behind the implementation of `ClickHouse::Event` is not decided yet; it could be one of the following:
+
+- An ActiveRecord model connecting directly to the ClickHouse database (sketched below).
+- A REST API call to an intermediate service.
+- Enqueueing an event to an event-streaming tool (like Kafka).
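+
+For illustration, a minimal sketch of the first option could look like the following. It assumes a
+locally running ClickHouse instance and uses its HTTP interface directly; the class name and
+endpoint are placeholders, not a decided design:
+
+```ruby
+require 'net/http'
+require 'json'
+
+module ClickHouse
+  class Event
+    # Assumed local ClickHouse HTTP endpoint; the INSERT targets the proposed `events` table.
+    ENDPOINT = URI('http://localhost:8123/?query=INSERT%20INTO%20events%20FORMAT%20JSONEachRow')
+
+    def self.create(attributes)
+      # Send one JSON object per row in the request body.
+      Net::HTTP.post(ENDPOINT, attributes.to_json)
+    end
+  end
+end
+
+ClickHouse::Event.create(id: 1, author_id: 4, action: 3, target_type: 'Issue', target_id: 10)
+```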
+
+#### Replication of `events` rows
+
+Assuming that the creation of an `events` record is an integral part of the system, introducing another storage call might cause performance degradation in various code paths, or it could introduce significant complexity.
+
+Rather than sending data to ClickHouse at event creation time, we would move this processing into the background by iterating over the `events` table and sending the newly created database rows.
+
+By keeping track of which records have been sent over to ClickHouse, we could incrementally send data.
+
+```ruby
+last_updated_at = SyncProcess.last_updated_at
+last_row = nil
+
+# oversimplified loop, we would probably batch this...
+Event.where('updated_at > ?', last_updated_at).each do |row|
+  last_row = ClickHouse::Event.create(row)
+end
+
+SyncProcess.last_updated_at = last_row.updated_at if last_row
+```
+
+### ClickHouse database table structure
+
+When coming up with the initial database structure, we must look at the way the data is queried.
+
+We have two main use cases:
+
+- Query data for a certain user, within a time range.
+ - `WHERE author_id = 1 AND created_at BETWEEN '2021-01-01' AND '2021-12-31'`
+ - Additionally, there might be an extra `project_id` condition due to the access control check.
+- Query data for a project or group, within a time range.
+ - `WHERE project_id IN (1, 2) AND created_at BETWEEN '2021-01-01' AND '2021-12-31'`
+
+The `author_id` and `project_id` columns are considered high-selectivity columns: optimizing the filtering on these two columns is desirable for performant database queries.
+
+The most recent activity data is queried more often. At some point, we might just drop or relocate older data. Most of the features look back only a year.
+
+For these reasons, we could start with a database table storing low-level `events` data:
+
+```plantuml
+hide circle
+
+entity "events" as events {
+ id : UInt64 ("primary key")
+--
+ project_id : UInt64
+ group_id : UInt64
+ author_id : UInt64
+ target_id : UInt64
+ target_type : String
+ action : UInt8
+ fingerprint : UInt64
+ created_at : DateTime
+ updated_at : DateTime
+}
+```
+
+The SQL statement for creating the table:
+
+```sql
+CREATE TABLE events
+(
+ `id` UInt64,
+ `project_id` UInt64 DEFAULT 0 NOT NULL,
+ `group_id` UInt64 DEFAULT 0 NOT NULL,
+ `author_id` UInt64 DEFAULT 0 NOT NULL,
+ `target_id` UInt64 DEFAULT 0 NOT NULL,
+ `target_type` LowCardinality(String) DEFAULT '' NOT NULL,
+ `action` UInt8 DEFAULT 0 NOT NULL,
+ `fingerprint` UInt64 DEFAULT 0 NOT NULL,
+ `created_at` DateTime64(6, 'UTC') DEFAULT now() NOT NULL,
+ `updated_at` DateTime64(6, 'UTC') DEFAULT now() NOT NULL
+)
+ENGINE = ReplacingMergeTree(updated_at)
+ORDER BY id;
+```
+
+A few changes compared to the PostgreSQL version:
+
+- `target_type` uses [an optimization](https://clickhouse.com/docs/en/sql-reference/data-types/lowcardinality/) for low-cardinality column values.
+- `fingerprint` becomes an integer and leverages a performant integer-based hashing function such as xxHash64 (illustrated below).
+- All columns get a default value; the 0 default value for the integer columns means no value. See the related [best practices](https://clickhouse.com/docs/en/cloud/bestpractices/avoid-nullable-columns/).
+- `NOT NULL` to ensure that we always use the default values when data is missing (different behavior compared to PostgreSQL).
+- The "primary" key automatically becomes the `id` column due to the `ORDER BY` clause.
+
+Let's insert the same primary key value twice:
+
+```sql
+INSERT INTO events (id, project_id, target_id, author_id, target_type, action) VALUES (1, 2, 3, 4, 'Issue', null);
+INSERT INTO events (id, project_id, target_id, author_id, target_type, action) VALUES (1, 20, 30, 5, 'Issue', null);
+```
+
+Let's inspect the results:
+
+```sql
+SELECT * FROM events
+```
+
+- We have two rows with the same `id` value (primary key).
+- The `null` `action` becomes `0`.
+- The non-specified fingerprint column becomes `0`.
+- The `DateTime` columns have the insert timestamp.
+
+ClickHouse will eventually "replace" the rows with the same primary key in the background. When running this operation, the higher `updated_at` value takes precedence. The same behavior can be simulated with the `final` keyword:
+
+```sql
+SELECT * FROM events FINAL
+```
+
+Adding `FINAL` to a query can have significant performance consequences; some of the issues are documented in the [ClickHouse documentation](https://clickhouse.com/docs/en/sql-reference/statements/select/from/#final-modifier).
+
+We should always expect duplicated values in the table, so we must take care of deduplication at query time.
+
+### ClickHouse database queries
+
+ClickHouse uses SQL for querying the data. In some cases, a PostgreSQL query can be used in ClickHouse without major modifications, assuming that the underlying database structure is very similar.
+
+Query for group contributions for each user (PostgreSQL):
+
+```sql
+SELECT events.author_id, events.target_type, events.action, COUNT(*)
+FROM events
+WHERE events.created_at BETWEEN '2022-01-17 23:00:00' AND '2023-03-18 22:59:59.999999'
+AND events.project_id IN (1, 2, 3) -- list of project ids in the group
+GROUP BY events.author_id, events.target_type, events.action
+```
+
+The same query also works in ClickHouse; however, we might see duplicated values due to the way the table engine works. The deduplication can be achieved by using a nested `FROM` statement.
+
+```sql
+SELECT author_id, target_type, action, count(*)
+FROM (
+ SELECT
+ id,
+ argMax(events.project_id, events.updated_at) AS project_id,
+ argMax(events.group_id, events.updated_at) AS group_id,
+ argMax(events.author_id, events.updated_at) AS author_id,
+ argMax(events.target_type, events.updated_at) AS target_type,
+ argMax(events.target_id, events.updated_at) AS target_id,
+ argMax(events.action, events.updated_at) AS action,
+ argMax(events.fingerprint, events.updated_at) AS fingerprint,
+ FIRST_VALUE(events.created_at) AS created_at,
+ MAX(events.updated_at) AS updated_at
+ FROM events
+ WHERE events.created_at BETWEEN '2022-01-17 23:00:00' AND '2023-03-18 22:59:59.999999'
+ AND events.project_id IN (1, 2, 3) -- list of project ids in the group
+ GROUP BY id
+) AS events
+GROUP BY author_id, target_type, action
+```
+
+- Take the most recent column values based on the `updated_at` column.
+- Take the first value for `created_at`, assuming that the first `INSERT` contains the correct value. This is only an issue when we don't sync `created_at` at all and the default value (`now()`) is used.
+- Take the most recent `updated_at` value.
+
+The query looks more complicated now because of the deduplication logic. The complexity can be hidden behind a database view.
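+
+A possible sketch of such a view (the `events_deduplicated` name is illustrative). Note that
+filters applied on top of it operate on the already deduplicated rows:
+
+```sql
+CREATE VIEW events_deduplicated AS
+SELECT
+    id,
+    argMax(project_id, updated_at) AS project_id,
+    argMax(group_id, updated_at) AS group_id,
+    argMax(author_id, updated_at) AS author_id,
+    argMax(target_type, updated_at) AS target_type,
+    argMax(target_id, updated_at) AS target_id,
+    argMax(action, updated_at) AS action,
+    argMax(fingerprint, updated_at) AS fingerprint,
+    FIRST_VALUE(created_at) AS created_at,
+    MAX(updated_at) AS updated_at
+FROM events
+GROUP BY id;
+
+SELECT author_id, target_type, action, count(*)
+FROM events_deduplicated
+WHERE created_at BETWEEN '2022-01-17 23:00:00' AND '2023-03-18 22:59:59.999999'
+AND project_id IN (1, 2, 3)
+GROUP BY author_id, target_type, action
+```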
+
+### Optimizing the performance
+
+The aggregation query in the previous section might not be performant enough for production use due to the large volume of data.
+
+Let's add 1 million extra rows to the `events` table:
+
+```sql
+INSERT INTO events (id, project_id, author_id, target_id, target_type, action) SELECT id, project_id, author_id, target_id, 'Issue' AS target_type, action FROM generateRandom('id UInt64, project_id UInt64, author_id UInt64, target_id UInt64, action UInt64') LIMIT 1000000;
+```
+
+Running the previous aggregation query in the console prints out some performance data:
+
+```plaintext
+1 row in set. Elapsed: 0.122 sec. Processed 1.00 million rows, 42.00 MB (8.21 million rows/s., 344.96 MB/s.)
+```
+
+The query correctly returned 1 row; however, it had to process 1 million rows (a full table scan). We can optimize the query with an index on the `project_id` column:
+
+```sql
+ALTER TABLE events ADD INDEX project_id_index project_id TYPE minmax GRANULARITY 10;
+ALTER TABLE events MATERIALIZE INDEX project_id_index;
+```
+
+Executing the query returns much better figures:
+
+```plaintext
+Read 2 rows, 107.00 B in 0.005616811 sec., 356 rows/sec., 18.60 KiB/sec.
+```
+
+To optimize the date range filter on the `created_at` column, we could try adding another index on the `created_at` column.
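+
+Following the same pattern as above (the index name and granularity are only illustrative):
+
+```sql
+ALTER TABLE events ADD INDEX created_at_index created_at TYPE minmax GRANULARITY 10;
+ALTER TABLE events MATERIALIZE INDEX created_at_index;
+```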
+
+#### Query for the contribution graph
+
+Just to recap, this is the PostgreSQL query:
+
+```sql
+SELECT DATE(events.created_at), COUNT(*)
+FROM events
+WHERE events.author_id = 1
+AND events.created_at BETWEEN '2022-01-17 23:00:00' AND '2023-01-18 22:59:59.999999'
+AND (
+ (
+ events.action = 5
+ ) OR
+ (
+ events.action IN (1, 3) -- Enum values are documented in the Event model, see the ACTIONS constant in app/models/event.rb
+ AND events.target_type IN ('Issue', 'WorkItem')
+ ) OR
+ (
+ events.action IN (7, 1, 3)
+ AND events.target_type = 'MergeRequest'
+ ) OR
+ (
+ events.action = 6
+ )
+)
+GROUP BY DATE(events.created_at)
+```
+
+The filtering and the count aggregation are mainly done on the `author_id` and the `created_at` columns. Grouping the data by these two columns would probably give adequate performance.
+
+We could attempt adding an index on the `author_id` column; however, we would still need an additional index on the `created_at` column to properly cover this query. Besides, under the contribution graph, GitLab shows the user's ordered list of contributions, which we would like to fetch efficiently via a different query with an `ORDER BY` clause.
+
+For these reasons, it's probably better to use a ClickHouse projection, which stores the `events` rows redundantly but lets us specify a different sort order.
+
+The ClickHouse query would be the following (with a slightly adjusted date range):
+
+```sql
+SELECT DATE(events.created_at) AS date, COUNT(*) AS count
+FROM (
+ SELECT
+ id,
+ argMax(events.created_at, events.updated_at) AS created_at
+ FROM events
+ WHERE events.author_id = 4
+ AND events.created_at BETWEEN '2023-01-01 23:00:00' AND '2024-01-01 22:59:59.999999'
+ AND (
+ (
+ events.action = 5
+ ) OR
+ (
+ events.action IN (1, 3) -- Enum values are documented in the Event model, see the ACTIONS constant in app/models/event.rb
+ AND events.target_type IN ('Issue', 'WorkItem')
+ ) OR
+ (
+ events.action IN (7, 1, 3)
+ AND events.target_type = 'MergeRequest'
+ ) OR
+ (
+ events.action = 6
+ )
+ )
+ GROUP BY id
+) AS events
+GROUP BY DATE(events.created_at)
+```
+
+The query does a full table scan; let's optimize it with a projection:
+
+```sql
+ALTER TABLE events ADD PROJECTION events_by_authors (
+ SELECT * ORDER BY author_id, created_at -- different sort order for the table
+);
+
+ALTER TABLE events MATERIALIZE PROJECTION events_by_authors;
+```
+
+#### Pagination of contributions
+
+The contributions of a user can be listed with the following query:
+
+```sql
+SELECT events.*
+FROM (
+ SELECT
+ id,
+ argMax(events.project_id, events.updated_at) AS project_id,
+ argMax(events.group_id, events.updated_at) AS group_id,
+ argMax(events.author_id, events.updated_at) AS author_id,
+ argMax(events.target_type, events.updated_at) AS target_type,
+ argMax(events.target_id, events.updated_at) AS target_id,
+ argMax(events.action, events.updated_at) AS action,
+ argMax(events.fingerprint, events.updated_at) AS fingerprint,
+ FIRST_VALUE(events.created_at) AS created_at,
+ MAX(events.updated_at) AS updated_at
+ FROM events
+ WHERE events.author_id = 4
+ GROUP BY id
+ ORDER BY created_at DESC, id DESC
+) AS events
+LIMIT 20
+```
+
+ClickHouse supports the standard `LIMIT N OFFSET M` clauses, so we can request the next page:
+
+```sql
+SELECT events.*
+FROM (
+ SELECT
+ id,
+ argMax(events.project_id, events.updated_at) AS project_id,
+ argMax(events.group_id, events.updated_at) AS group_id,
+ argMax(events.author_id, events.updated_at) AS author_id,
+ argMax(events.target_type, events.updated_at) AS target_type,
+ argMax(events.target_id, events.updated_at) AS target_id,
+ argMax(events.action, events.updated_at) AS action,
+ argMax(events.fingerprint, events.updated_at) AS fingerprint,
+ FIRST_VALUE(events.created_at) AS created_at,
+ MAX(events.updated_at) AS updated_at
+ FROM events
+ WHERE events.author_id = 4
+ GROUP BY id
+ ORDER BY created_at DESC, id DESC
+) AS events
+LIMIT 20 OFFSET 20
+```
diff --git a/doc/development/database/clickhouse/index.md b/doc/development/database/clickhouse/index.md
new file mode 100644
index 00000000000..a26bac261fd
--- /dev/null
+++ b/doc/development/database/clickhouse/index.md
@@ -0,0 +1,85 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
+---
+
+# Introduction to ClickHouse use and table design
+
+## How it differs from PostgreSQL
+
+The [intro](https://clickhouse.com/docs/en/intro/) page gives a good overview of ClickHouse.
+
+ClickHouse has a lot of differences from traditional OLTP (online transaction processing) databases like PostgreSQL. The underlying architecture is a bit different, and the processing is a lot more CPU-bound than in traditional databases.
+ClickHouse is a log-centric database where immutability is a key component. The advantages of such approaches are well documented [[1]](https://www.odbms.org/2015/10/the-rise-of-immutable-data-stores/); however, immutability also makes updates much harder. See the ClickHouse [documentation](https://clickhouse.com/docs/en/guides/developer/mutations/) for operations that provide UPDATE/DELETE support. Note that these operations are expected to be infrequent.
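+
+For illustration, mutations use an `ALTER TABLE` syntax and run asynchronously in the background
+(the table and values here are only examples):
+
+```sql
+-- Heavyweight, asynchronous mutations; not meant for frequent use.
+ALTER TABLE events DELETE WHERE author_id = 42;
+ALTER TABLE events UPDATE action = 0 WHERE id = 1;
+```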
+
+This distinction is important while designing tables. Either:
+
+- The updates are not required (best case), or
+- If they are needed, they should not run during query execution.
+
+## ACID compatibility
+
+ClickHouse takes a slightly different approach to transactional support: the guarantees apply only to a single block of data inserted into a specific table. See [the Transactional (ACID) support](https://clickhouse.com/docs/en/guides/developer/transactional/) documentation for details.
+
+Multiple insertions in a single write should be avoided, as transactional support across multiple tables is only covered by materialized views.
+
+ClickHouse is heavily geared towards providing the best-in-class support for analytical queries. Operations like aggregation are very fast and there are several features to augment these capabilities.
+There are some good blog posts covering [details of aggregations](https://altinity.com/blog/clickhouse-aggregation-fun-part-1-internals-and-handy-tools).
+
+## Primary indexes, sorting index and dictionaries
+
+It is highly recommended to read ["A practical introduction to primary indexes in ClickHouse"](https://clickhouse.com/docs/en/guides/improving-query-performance/sparse-primary-indexes/sparse-primary-indexes-intro) to get an understanding of indexes in ClickHouse, particularly how database index design in ClickHouse [differs](https://clickhouse.com/docs/en/guides/improving-query-performance/sparse-primary-indexes/sparse-primary-indexes-design/#an-index-design-for-massive-data-scales) from that in transactional databases like PostgreSQL.
+
+Primary index design plays a very important role in query performance and should be chosen carefully. Almost all of the queries should rely on the primary index, as full data scans are bound to take longer.
+
+Read the documentation for [primary keys and indexes in queries](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#primary-keys-and-indexes-in-queries) to learn how indexes can affect query performance in MergeTree Table engines (default table engine in ClickHouse).
+
+Secondary indexes in ClickHouse are different from what is available in other systems. They are also called data-skipping indexes as they are used to skip over a block of data. See the documentation for [data-skipping indexes](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-data_skipping-indexes).
+
+ClickHouse also offers ["Dictionaries"](https://clickhouse.com/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts/), which can be used as external indexes. Dictionaries are held in memory and can be used to look up values at query runtime.
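+
+For illustration, assuming a hypothetical `project_names_dict` dictionary keyed by project ID and
+the `events` table from the activity data document, a lookup at query time could look like this:
+
+```sql
+SELECT
+    project_id,
+    dictGet('project_names_dict', 'name', toUInt64(project_id)) AS project_name
+FROM events
+LIMIT 10
+```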
+
+## Data types & Partitioning
+
+ClickHouse offers SQL-compatible [data types](https://clickhouse.com/docs/en/sql-reference/data-types/) and a few specialized data types, such as:
+
+- [`LowCardinality`](https://clickhouse.com/docs/en/sql-reference/data-types/lowcardinality)
+- [UUID](https://clickhouse.com/docs/en/sql-reference/data-types/uuid)
+- [Maps](https://clickhouse.com/docs/en/sql-reference/data-types/map)
+- [Nested](https://clickhouse.com/docs/en/sql-reference/data-types/nested-data-structures/nested), which is interesting because it simulates a table inside a column.
+
+One key design aspect that comes up front when designing a table is the partitioning key. Partitions can be any arbitrary expression, but usually they are time durations like months, days, or weeks. ClickHouse takes a best-effort approach to minimize the data read by using the smallest set of partitions.
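+
+For illustration, a monthly partitioning key on a timestamp column could look like this (the table
+and its columns are only examples):
+
+```sql
+CREATE TABLE events_partitioned
+(
+    `id` UInt64,
+    `author_id` UInt64,
+    `created_at` DateTime64(6, 'UTC')
+)
+ENGINE = MergeTree
+PARTITION BY toYYYYMM(created_at) -- one partition per month
+ORDER BY (author_id, created_at);
+```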
+
+Suggested reads:
+
+- [Choose a low cardinality partitioning key](https://clickhouse.com/docs/en/optimize/partitioning-key)
+- [Custom partitioning key](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/custom-partitioning-key).
+
+## Sharding and replication
+
+Sharding is a feature that allows splitting the data across multiple ClickHouse nodes to increase throughput and decrease latency. The sharding feature uses a distributed engine that is backed by local tables. The distributed engine is a "virtual" table that does not store any data. It is used as an interface to insert and query data.
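+
+For illustration, a distributed table fronting a local `events` table could be defined like this
+(the cluster and database names are placeholders):
+
+```sql
+-- Copies the structure of the local `events` table; rows are routed to shards by rand().
+CREATE TABLE events_distributed AS events
+ENGINE = Distributed('events_cluster', 'default', 'events', rand());
+```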
+
+See [the ClickHouse documentation](https://clickhouse.com/docs/en/engines/table-engines/special/distributed/) and this section on [replication and sharding](https://clickhouse.com/docs/en/manage/replication-and-sharding/). ClickHouse can use either Zookeeper or its own compatible API via a component called [ClickHouse Keeper](https://clickhouse.com/docs/en/operations/clickhouse-keeper) to maintain consensus.
+
+After nodes are set up, they can be transparent to the clients, and both write and read queries can be issued to any node.
+
+In most cases, clusters start with a fixed number of nodes (shards). [Rebalancing shards](https://clickhouse.com/docs/en/guides/sre/scaling-clusters) is operationally heavy and requires rigorous testing.
+
+Replication is supported by the MergeTree table engine; see the [replication section](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replication/) in the documentation for details on how to define it.
+ClickHouse relies on a distributed coordination component (either Zookeeper or ClickHouse Keeper) to track the participating nodes in the quorum. Replication is asynchronous and multi-leader. Inserts can be issued to any node, and they can appear on other nodes with some latency. If desired, stickiness to a specific node can be used to make sure that reads observe the latest written data.
+
+## Materialized views
+
+One of the defining features of ClickHouse is materialized views. Functionally they resemble insert triggers for ClickHouse.
+Materialized views can be used for a variety of use cases which are well [documented](https://www.polyscale.ai/blog/polyscale-metrics-materialized-views/) on the web.
+
+We recommend reading the [views](https://clickhouse.com/docs/en/sql-reference/statements/create/view#materialized-view) section from the official documentation to get a better understanding of how they work.
+
+Quoting the [documentation](https://clickhouse.com/docs/en/sql-reference/statements/create/view#materialized-view):
+
+> Materialized views in ClickHouse are implemented more like insert triggers.
+> If there's some aggregation in the view query, it's applied only to the batch
+> of freshly inserted data. Any changes to existing data of the source table
+> (like update, delete, drop a partition, etc.) do not change the materialized view.
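+
+For illustration, a materialized view that maintains per-author daily event counts at insert time
+could look like this. The names and the source `events` table are illustrative; query it with
+`sum(events_count)`, because partial counts are merged in the background:
+
+```sql
+CREATE MATERIALIZED VIEW events_daily_counts_mv
+ENGINE = SummingMergeTree
+ORDER BY (author_id, date)
+AS SELECT
+    author_id,
+    toDate(created_at) AS date,
+    count() AS events_count
+FROM events
+GROUP BY author_id, date;
+
+SELECT author_id, date, sum(events_count) AS events_count
+FROM events_daily_counts_mv
+GROUP BY author_id, date;
+```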
diff --git a/doc/development/database/clickhouse/optimization.md b/doc/development/database/clickhouse/optimization.md
new file mode 100644
index 00000000000..166bbf7d2b1
--- /dev/null
+++ b/doc/development/database/clickhouse/optimization.md
@@ -0,0 +1,60 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
+---
+
+# Optimizing query execution
+
+ClickHouse Inc has listed a [variety of optimization strategies](https://clickhouse.com/docs/en/optimize/).
+
+ClickHouse relies heavily on the structure of the primary index. However, in some cases, it's possible that queries rely on a column that's part of the primary index, but isn't the first column. See [Using multiple primary indexes](https://clickhouse.com/docs/en/guides/improving-query-performance/sparse-primary-indexes/sparse-primary-indexes-multiple) which offers several options in such cases. For example: using a data skipping index as a secondary index.
+
+In cases of compound primary indexes, understanding the data characteristics of the key columns is very helpful, because they can allow the index to be more efficient. [Ordering key columns efficiently](https://clickhouse.com/docs/en/guides/improving-query-performance/sparse-primary-indexes/sparse-primary-indexes-cardinality) goes into detail on these concepts.
+
+The ClickHouse blog also has a very good post, [Super charging your ClickHouse queries](https://clickhouse.com/blog/clickhouse-faster-queries-with-projections-and-primary-indexes), that outlines almost all of the approaches listed above.
+
+It is possible to use [`EXPLAIN`](https://clickhouse.com/docs/en/sql-reference/statements/explain/) statements with queries to get visible steps of the query pipeline. Note the different [types](https://clickhouse.com/docs/en/sql-reference/statements/explain/#explain-types) of `EXPLAIN`.
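+
+For example, assuming the `events` table from the activity data design, `EXPLAIN indexes = 1`
+shows how the primary key and any data-skipping indexes prune parts and granules:
+
+```sql
+EXPLAIN indexes = 1
+SELECT count(*)
+FROM events
+WHERE project_id = 1;
+```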
+
+Also, to get a detailed query execution pipeline, you can set the log level to `trace` in `clickhouse-client` and then execute the query.
+
+For example:
+
+```plaintext
+$ clickhouse-client :) SET send_logs_level = 'trace'
+$ clickhouse-client :) select count(traceID) from jaeger_index WHERE tenant = '12' AND service != 'jaeger-query' FORMAT Vertical ;
+
+SELECT count(traceID)
+FROM jaeger_index
+WHERE (tenant = '12') AND (service != 'jaeger-query')
+FORMAT Vertical
+
+Query id: 6ce40daf-e1b1-4714-ab02-268246f3c5c9
+
+[cluster-0-0-0] 2023.01.30 06:31:32.240819 [ 4991 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Debug> executeQuery: (from 127.0.0.1:53654) select count(traceID) from jaeger_index WHERE tenant = '12' AND service != 'jaeger-query' FORMAT Vertical ; (stage: Complete)
+....
+[cluster-0-0-0] 2023.01.30 06:31:32.244071 [ 4991 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Debug> InterpreterSelectQuery: MergeTreeWhereOptimizer: condition "service != 'jaeger-query'" moved to PREWHERE
+[cluster-0-0-0] 2023.01.30 06:31:32.244420 [ 4991 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Debug> InterpreterSelectQuery: MergeTreeWhereOptimizer: condition "service != 'jaeger-query'" moved to PREWHERE
+....
+[cluster-0-0-0] 2023.01.30 06:31:32.245153 [ 4991 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
+[cluster-0-0-0] 2023.01.30 06:31:32.245255 [ 4991 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Trace> InterpreterSelectQuery: Complete -> Complete
+[cluster-0-0-0] 2023.01.30 06:31:32.245590 [ 4991 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Debug> tracing_gcs.jaeger_index_local (66c6ca81-e20d-44dc-8101-92678fc24d99) (SelectExecutor): Key condition: (column 1 not in ['jaeger-query', 'jaeger-query']), unknown, (column 0 in ['12', '12']), and, and
+[cluster-0-0-0] 2023.01.30 06:31:32.245784 [ 4991 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Debug> tracing_gcs.jaeger_index_local (66c6ca81-e20d-44dc-8101-92678fc24d99) (SelectExecutor): MinMax index condition: unknown, unknown, and, unknown, and
+[cluster-0-0-0] 2023.01.30 06:31:32.246239 [ 1503 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Trace> tracing_gcs.jaeger_index_local (66c6ca81-e20d-44dc-8101-92678fc24d99) (SelectExecutor): Used generic exclusion search over index for part 202301_1512_21497_9164 with 4 steps
+[cluster-0-0-0] 2023.01.30 06:31:32.246293 [ 1503 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Trace> tracing_gcs.jaeger_index_local (66c6ca81-e20d-44dc-8101-92678fc24d99) (SelectExecutor): Used generic exclusion search over index for part 202301_21498_24220_677 with 1 steps
+[cluster-0-0-0] 2023.01.30 06:31:32.246488 [ 4991 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Debug> tracing_gcs.jaeger_index_local (66c6ca81-e20d-44dc-8101-92678fc24d99) (SelectExecutor): Selected 2/2 parts by partition key, 1 parts by primary key, 2/4 marks by primary key, 2 marks to read from 1 ranges
+[cluster-0-0-0] 2023.01.30 06:31:32.246591 [ 4991 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Trace> MergeTreeInOrderSelectProcessor: Reading 1 ranges in order from part 202301_1512_21497_9164, approx. 16384 rows starting from 0
+[cluster-0-0-0] 2023.01.30 06:31:32.642095 [ 348 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Trace> AggregatingTransform: Aggregating
+[cluster-0-0-0] 2023.01.30 06:31:32.642193 [ 348 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Trace> Aggregator: An entry for key=16426982211452591884 found in cache: sum_of_sizes=2, median_size=1
+[cluster-0-0-0] 2023.01.30 06:31:32.642210 [ 348 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Trace> Aggregator: Aggregation method: without_key
+[cluster-0-0-0] 2023.01.30 06:31:32.642330 [ 348 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Debug> AggregatingTransform: Aggregated. 3211 to 1 rows (from 50.18 KiB) in 0.395452983 sec. (8119.802 rows/sec., 126.89 KiB/sec.)
+[cluster-0-0-0] 2023.01.30 06:31:32.642343 [ 348 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Trace> Aggregator: Merging aggregated data
+Row 1:
+──────
+count(traceID): 3211
+[cluster-0-0-0] 2023.01.30 06:31:32.642887 [ 4991 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Information> executeQuery: Read 16384 rows, 620.52 KiB in 0.401978272 sec., 40758 rows/sec., 1.51 MiB/sec.
+[cluster-0-0-0] 2023.01.30 06:31:32.645232 [ 4991 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Debug> MemoryTracker: Peak memory usage (for query): 831.98 KiB.
+[cluster-0-0-0] 2023.01.30 06:31:32.645251 [ 4991 ] {6ce40daf-e1b1-4714-ab02-268246f3c5c9} <Debug> TCPHandler: Processed in 0.404908496 sec.
+
+1 row in set. Elapsed: 0.402 sec. Processed 16.38 thousand rows, 635.41 KB (40.71 thousand rows/s., 1.58 MB/s.)
+```
diff --git a/doc/development/database/index.md b/doc/development/database/index.md
index fca76c98715..8f22eaac496 100644
--- a/doc/development/database/index.md
+++ b/doc/development/database/index.md
@@ -104,6 +104,12 @@ including the major methods:
- [External PostgreSQL instances](../../administration/postgresql/external.md)
- [Replication and failover](../../administration/postgresql/replication_and_failover.md)
+## ClickHouse
+
+- [Introduction](clickhouse/index.md)
+- [Optimizing query execution](clickhouse/optimization.md)
+- [Rebuild GitLab features using ClickHouse 1: Activity data](clickhouse/gitlab_activity_data.md)
+
## Miscellaneous
- [Maintenance operations](maintenance_operations.md)
diff --git a/doc/development/testing_guide/end_to_end/package_and_test_pipeline.md b/doc/development/testing_guide/end_to_end/package_and_test_pipeline.md
index f93ba95ea61..b0257e7b02c 100644
--- a/doc/development/testing_guide/end_to_end/package_and_test_pipeline.md
+++ b/doc/development/testing_guide/end_to_end/package_and_test_pipeline.md
@@ -113,21 +113,21 @@ Considering example above, perform the following steps to create a new job:
### Parallel jobs
For selective execution to work correctly with job types that require running multiple parallel jobs,
-a job definition typically must be split into parallel and non parallel variants. Splitting is necessary so that when selective execution
+a job definition typically must be split into parallel and selective variants. Splitting is necessary so that when selective execution
executes only a single spec, multiple unnecessary jobs are not spawned. For example:
```yaml
-ee:my-new-job:
+ee:my-new-job-selective:
extends: .qa
variables:
QA_SCENARIO: Test::Integration::MyNewJob
rules:
- - !reference [.rules:test:qa-non-parallel, rules]
+ - !reference [.rules:test:qa-selective, rules]
- if: $QA_SUITES =~ /Test::Integration::MyNewJob/
-ee:instance-parallel:
+ee:my-new-job:
extends:
- .parallel
- - ee:my-new-job
+ - ee:my-new-job-selective
rules:
- !reference [.rules:test:qa-parallel, rules]
- if: $QA_SUITES =~ /Test::Integration::MyNewJob/
diff --git a/doc/user/application_security/policies/scan-execution-policies.md b/doc/user/application_security/policies/scan-execution-policies.md
index 4eec7b9f615..e679ed907e9 100644
--- a/doc/user/application_security/policies/scan-execution-policies.md
+++ b/doc/user/application_security/policies/scan-execution-policies.md
@@ -11,10 +11,11 @@ info: To determine the technical writer assigned to the Stage/Group associated w
> - Operational container scanning [introduced](https://gitlab.com/groups/gitlab-org/-/epics/3410) in GitLab 15.5
Group, subgroup, or project owners can use scan execution policies to require that security scans run on a specified
-schedule or with the project (or multiple projects if the policy is defined at a group or subgroup level) pipeline. Required scans are injected into the CI pipeline as new jobs
-with a long, random job name. In the unlikely event of a job name collision, the security policy job overwrites
-any pre-existing job in the pipeline. If a policy is created at the group-level, it applies to every child
-project or subgroup. A group-level policy cannot be edited from a child project or subgroup.
+schedule or with the project pipeline. The security scan runs with multiple project pipelines if you define the policy
+at a group or subgroup level. GitLab injects the required scans into the CI pipeline as new jobs. In the event
+of a job name collision, GitLab adds a dash and a number to the job name. GitLab increments the number until the name
+no longer conflicts with existing job names. If you create a policy at the group level, it applies to every child project
+or subgroup. You cannot edit a group-level policy from a child project or subgroup.
This feature has some overlap with [compliance framework pipelines](../../group/compliance_frameworks.md#configure-a-compliance-pipeline),
as we have not [unified the user experience for these two features](https://gitlab.com/groups/gitlab-org/-/epics/7312).
diff --git a/doc/user/group/epics/manage_epics.md b/doc/user/group/epics/manage_epics.md
index fd26e2bdb3a..a942f37c797 100644
--- a/doc/user/group/epics/manage_epics.md
+++ b/doc/user/group/epics/manage_epics.md
@@ -240,6 +240,7 @@ than 1000. The cached value is rounded to thousands or millions and updated ever
> - Filtering by milestone and confidentiality [introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/268372) in GitLab 14.2 [with a flag](../../../administration/feature_flags.md) named `vue_epics_list`. Disabled by default.
> - [Enabled on GitLab.com and self-managed](https://gitlab.com/gitlab-org/gitlab/-/issues/276189) in GitLab 14.7.
> - [Feature flag `vue_epics_list`](https://gitlab.com/gitlab-org/gitlab/-/issues/327320) removed in GitLab 14.8.
+> - Filtering by group was [introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/385191) in GitLab 15.9.
You can filter the list of epics by:
@@ -249,6 +250,7 @@ You can filter the list of epics by:
- Milestones
- Confidentiality
- Reaction emoji
+- Groups
![epics filter](img/epics_filter_v14_7.png)