Diffstat (limited to 'doc/architecture/blueprints')
-rw-r--r--  doc/architecture/blueprints/ai_gateway/index.md | 6
-rw-r--r--  doc/architecture/blueprints/cells/impacted_features/git-access.md | 37
-rw-r--r--  doc/architecture/blueprints/cells/impacted_features/personal-namespaces.md | 6
-rw-r--r--  doc/architecture/blueprints/cells/index.md | 3
-rw-r--r--  doc/architecture/blueprints/cells/proposal-stateless-router-with-buffering-requests.md | 5
-rw-r--r--  doc/architecture/blueprints/cells/proposal-stateless-router-with-routes-learning.md | 7
-rw-r--r--  doc/architecture/blueprints/cells/routing-service.md | 644
-rw-r--r--  doc/architecture/blueprints/ci_builds_runner_fleet_metrics/ci_insights.md | 154
-rw-r--r--  doc/architecture/blueprints/ci_builds_runner_fleet_metrics/img/current_page.png | Bin 0 -> 132200 bytes
-rw-r--r--  doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md | 4
-rw-r--r--  doc/architecture/blueprints/ci_pipeline_components/index.md | 26
-rw-r--r--  doc/architecture/blueprints/cloud_connector/index.md | 18
-rw-r--r--  doc/architecture/blueprints/gitaly_adaptive_concurrency_limit/index.md | 65
-rw-r--r--  doc/architecture/blueprints/gitlab_housekeeper/index.md | 133
-rw-r--r--  doc/architecture/blueprints/gitlab_steps/data.drawio.png | Bin 42192 -> 19270 bytes
-rw-r--r--  doc/architecture/blueprints/gitlab_steps/step-runner-sequence.drawio.png | Bin 70107 -> 32938 bytes
-rw-r--r--  doc/architecture/blueprints/runner_admission_controller/index.md | 2
-rw-r--r--  doc/architecture/blueprints/runner_tokens/index.md | 10
-rw-r--r--  doc/architecture/blueprints/runway/index.md | 2
-rw-r--r--  doc/architecture/blueprints/secret_detection/index.md | 188
-rw-r--r--  doc/architecture/blueprints/secret_manager/index.md | 6
-rw-r--r--  doc/architecture/blueprints/tailwindcss/index.md | 172
22 files changed, 1329 insertions, 159 deletions
diff --git a/doc/architecture/blueprints/ai_gateway/index.md b/doc/architecture/blueprints/ai_gateway/index.md
index c09f8aaa621..e40861139d6 100644
--- a/doc/architecture/blueprints/ai_gateway/index.md
+++ b/doc/architecture/blueprints/ai_gateway/index.md
@@ -103,7 +103,7 @@ GitLab instances, JSON API, and gRPC differ on these items:
| + A new Ruby-gRPC server for vscode: likely faster because we can limit dependencies to load ([modular monolith](https://gitlab.com/gitlab-org/gitlab/-/issues/365293)) | - Existing Grape API for vscode: meaning slow boot time and unneeded resources loaded |
| + Bi-directional streaming | - Straight forward way to stream requests and responses (could still be added) |
| - A new Python-gRPC server: we don't have experience running gRPC-Python servers | + Existing Python fastapi server, already running for Code Suggestions to extend |
-| - Hard to pass on unknown messages from vscode through GitLab to ai-gateway | + Easier support for newer vscode + newer ai-gatway, through old GitLab instance |
+| - Hard to pass on unknown messages from vscode through GitLab to ai-gateway | + Easier support for newer VS Code + newer AI-gateway, through old GitLab instance |
| - Unknown support for gRPC in other clients (vscode, jetbrains, other editors) | + Support in all external clients |
| - Possible protocol mismatch (VSCode --REST--> Rails --gRPC--> AI gateway) | + Same protocol across the stack |
@@ -264,7 +264,7 @@ Another example use case includes 2 versions of a prompt passed in the `prompt_c
a field in the gateway, and keep them around for at least 2 major
versions of GitLab.**
-A good practise that might help support backwards compatibility is to provide building blocks for the prompt inside the `prompt_components` rather then a complete prompt. By moving responsibility of compiling prompt out of building blocks on the AI-Gateway, one can achive more flexibility in terms of prompt adjustments in the future.
+A good practice that might help support backward compatibility: provide building blocks for the prompt inside the `prompt_components`, rather than a complete prompt. By moving the responsibility of compiling the prompt from the building blocks into the AI-Gateway, more flexible prompt adjustments are possible in the future.
#### Example feature: Code Suggestions
@@ -503,7 +503,7 @@ It is deployed to a Kubernetes cluster in it's own project. There is a
staging environment that is currently used directly by engineers for
testing.
-In the future, this will be deloyed using
+In the future, this will be deployed using
[Runway](https://gitlab.com/gitlab-com/gl-infra/platform/runway/). At
that time, there will be a production and staging deployment. The
staging deployment can be used for automated QA-runs that will have
diff --git a/doc/architecture/blueprints/cells/impacted_features/git-access.md b/doc/architecture/blueprints/cells/impacted_features/git-access.md
index 611b4db5f43..d2d357d4178 100644
--- a/doc/architecture/blueprints/cells/impacted_features/git-access.md
+++ b/doc/architecture/blueprints/cells/impacted_features/git-access.md
@@ -6,12 +6,10 @@ description: 'Cells: Git Access'
<!-- vale gitlab.FutureTense = NO -->
-This document is a work-in-progress and represents a very early state of the
-Cells design. Significant aspects are not documented, though we expect to add
-them in the future. This is one possible architecture for Cells, and we intend to
-contrast this with alternatives before deciding which approach to implement.
-This documentation will be kept even if we decide not to implement this so that
-we can document the reasons for not choosing this approach.
+This document is a work-in-progress and represents a very early state of the Cells design.
+Significant aspects are not documented, though we expect to add them in the future.
+This is one possible architecture for Cells, and we intend to contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that we can document the reasons for not choosing this approach.
# Cells: Git Access
@@ -146,11 +144,34 @@ Where:
Supporting Git repositories if a Cell can access only its own repositories does not appear to be complex.
The one major complication is supporting snippets, but this likely falls in the same category as for the approach to support a user's Personal Namespace.
-## 4.1. Pros
+### 4.1. Pros
1. The API used for supporting HTTPS/SSH and Hooks are well defined and can easily be made routable.
-## 4.2. Cons
+### 4.2. Cons
1. The sharing of repositories objects is limited to the given Cell and Gitaly node.
1. Cross-Cells forks are likely impossible to be supported (discover: How this works today across different Gitaly node).
+
+## 5. Forking and object pools
+
+One of the biggest struggles that needs to be addressed with the Cells architecture is how to handle forking. At present, Gitaly utilizes object pools to provide deduplication of fork storage. If forks are not created on the same storage node as the upstream repository that is being forked, we end up with significant storage inefficiencies as we will effectively have two complete copies of the repository and we will not be able to utilize object pools to improve performance.
+
+The storage nodes from one Cell cannot talk to the storage nodes of another Cell, making forking across Cells impossible. Therefore, it will be necessary to ensure that forked repositories end up in the same Cell (and on the same Gitaly nodes) as their upstream parent repository. This will also enable Gitaly to continue to utilize object pools to provide storage and performance efficiency.
+
+### 5.1. How this works today
+
+**Single Gitaly storage node**
+
+Currently, for a GitLab instance backed with a single Gitaly storage node, forking works just fine.
+Any forks must reside on the same storage node as there is only one, and therefore object deduplication (and object pools) all function as expected.
+
+**Sharded Gitaly storage**
+
+A sharded Gitaly storage is when multiple Gitaly storage nodes are attached to a single instance, and repositories are assigned based on a priority weighting between the nodes.
+
+Since Gitaly knows how to do cross-storage fetches, forking across shards works without issue.
+
+**Gitaly Cluster**
+
+For Gitaly cluster, we recently resolved [the issue](https://gitlab.com/gitlab-org/gitaly/-/issues/5094) of object pools not being created on the same storage nodes as the parent repository. This enables forking to work correctly from an efficiency perspective (can share an object pool) and from an object deduplication perspective (Git can properly deduplicate storage).
diff --git a/doc/architecture/blueprints/cells/impacted_features/personal-namespaces.md b/doc/architecture/blueprints/cells/impacted_features/personal-namespaces.md
index 757f83c32d3..d80f5c44b98 100644
--- a/doc/architecture/blueprints/cells/impacted_features/personal-namespaces.md
+++ b/doc/architecture/blueprints/cells/impacted_features/personal-namespaces.md
@@ -138,6 +138,6 @@ Cons:
## 4. Evaluation
-The most straightforward solution requiring the least engineering effort is to create [one personal Namespace in each Organization](#33-one-personal-namespace-in-each-organization).
-We recognize that this solution is not ideal for users working across multiple Organizations, but find this acceptable due to our expectation that most users will mainly work in one Organization.
-At a later point, this concept will be reviewed and possibly replaced with a better solution.
+We will begin by [making the personal namespace optional for Organizations](https://gitlab.com/groups/gitlab-org/-/epics/12179). The goal of this iteration is to disable personal namespaces for any Organization other than the default Organization, so that customers who do not want to use personal namespaces can already move to Organizations. The first phase will only change the Ruby on Rails model relationships in preparation for further changes at the user-facing level.
+
+We need to [split the concept of a User Profile and a personal namespace](https://gitlab.com/gitlab-org/gitlab/-/issues/432654) now that a User is cluster-wide and a User's personal namespace must be Cell-local. It is likely we will [discontinue personal namespaces](#34-discontinue-personal-namespaces) in favor of Groups.
diff --git a/doc/architecture/blueprints/cells/index.md b/doc/architecture/blueprints/cells/index.md
index 3b800a54781..6f00fe4e61e 100644
--- a/doc/architecture/blueprints/cells/index.md
+++ b/doc/architecture/blueprints/cells/index.md
@@ -305,8 +305,7 @@ It is expected that initial iterations will be rather slow, because they require
The Cells architecture has long lasting implications to data processing, location, scalability and the GitLab architecture.
This section links all different technical proposals that are being evaluated.
-- [Stateless Router That Uses a Cache to Pick Cell and Is Redirected When Wrong Cell Is Reached](proposal-stateless-router-with-buffering-requests.md)
-- [Stateless Router That Uses a Cache to Pick Cell and pre-flight `/api/v4/internal/cells/learn`](proposal-stateless-router-with-routes-learning.md)
+- [Routing Service](routing-service.md)
## Impacted features
diff --git a/doc/architecture/blueprints/cells/proposal-stateless-router-with-buffering-requests.md b/doc/architecture/blueprints/cells/proposal-stateless-router-with-buffering-requests.md
index 847532a36dc..699a41879a9 100644
--- a/doc/architecture/blueprints/cells/proposal-stateless-router-with-buffering-requests.md
+++ b/doc/architecture/blueprints/cells/proposal-stateless-router-with-buffering-requests.md
@@ -2,8 +2,11 @@
stage: enablement
group: Tenant Scale
description: 'Cells Stateless Router Proposal'
+status: rejected
---
+_This proposal was superseded by the [routing service proposal](routing-service.md)_
+
<!-- vale gitlab.FutureTense = NO -->
This document is a work-in-progress and represents a very early state of the
@@ -13,7 +16,7 @@ contrast this with alternatives before deciding which approach to implement.
This documentation will be kept even if we decide not to implement this so that
we can document the reasons for not choosing this approach.
-# Proposal: Stateless Router
+# Proposal: Stateless Router using Requests Buffering
We will decompose `gitlab_users`, `gitlab_routes` and `gitlab_admin` related
tables so that they can be shared between all cells and allow any cell to
diff --git a/doc/architecture/blueprints/cells/proposal-stateless-router-with-routes-learning.md b/doc/architecture/blueprints/cells/proposal-stateless-router-with-routes-learning.md
index cdcb5b8b21f..72b96e9ab8c 100644
--- a/doc/architecture/blueprints/cells/proposal-stateless-router-with-routes-learning.md
+++ b/doc/architecture/blueprints/cells/proposal-stateless-router-with-routes-learning.md
@@ -2,8 +2,11 @@
stage: enablement
group: Tenant Scale
description: 'Cells Stateless Router Proposal'
+status: rejected
---
+_This proposal was superseded by the [routing service proposal](routing-service.md)_
+
<!-- vale gitlab.FutureTense = NO -->
This document is a work-in-progress and represents a very early state of the
@@ -13,7 +16,7 @@ contrast this with alternatives before deciding which approach to implement.
This documentation will be kept even if we decide not to implement this so that
we can document the reasons for not choosing this approach.
-# Proposal: Stateless Router
+# Proposal: Stateless Router using Routes Learning
We will decompose `gitlab_users`, `gitlab_routes` and `gitlab_admin` related
tables so that they can be shared between all cells and allow any cell to
@@ -35,7 +38,7 @@ Organization can only be on a single Cell.
## Differences
The main difference between this proposal and one [with buffering requests](proposal-stateless-router-with-buffering-requests.md)
-is that this proposal uses a pre-flight API request (`/pi/v4/internal/cells/learn`) to redirect the request body to the correct Cell.
+is that this proposal uses a pre-flight API request (`/api/v4/internal/cells/learn`) to redirect the request body to the correct Cell.
This means that each request is sent exactly once to be processed, but the URI is used to decode which Cell it should be directed.
## Summary in diagrams
diff --git a/doc/architecture/blueprints/cells/routing-service.md b/doc/architecture/blueprints/cells/routing-service.md
index 9efdbdf3f91..bd5570b68f4 100644
--- a/doc/architecture/blueprints/cells/routing-service.md
+++ b/doc/architecture/blueprints/cells/routing-service.md
@@ -59,20 +59,23 @@ For example:
## Requirements
-| Requirement | Description | Priority |
-|---------------|-------------------------------------------------------------------|----------|
-| Discovery | needs to be able to discover and monitor the health of all Cells. | high |
-| Security | only authorized cells can be routed to | high |
-| Single domain | e.g. GitLab.com | high |
-| Caching | can cache routing information for performance | high |
-| [50 ms of increased latency](#low-latency) | | high |
-| Path-based | can make routing decision based on path | high |
-| Complexity | the routing service should be configuration-driven and small | high |
-| Stateless | does not need database, Cells provide all routing information | medium |
-| Secrets-based | can make routing decision based on secret (e.g. JWT) | medium |
-| Observability | can use existing observability tooling | low |
-| Self-managed | can be eventually used by [self-managed](goals.md#self-managed) | low |
-| Regional | can route requests to different [regions](goals.md#regions) | low |
+| Requirement | Description | Priority |
+| ------------------- | ----------------------------------------------------------------- | -------- |
+| Discovery | needs to be able to discover and monitor the health of all Cells. | high |
+| Security | only authorized cells can be routed to | high |
+| Single domain | for example GitLab.com | high |
+| Caching | can cache routing information for performance | high |
+| Low latency | [50 ms of increased latency](#low-latency) | high |
+| Path-based | can make routing decision based on path | high |
+| Complexity | the routing service should be configuration-driven and small | high |
+| Rolling | the routing service works with Cells running mixed versions | high |
+| Feature Flags | features can be turned on, off, and % rollout | high |
+| Progressive Rollout | we can slowly rollout a change | medium |
+| Stateless | does not need database, Cells provide all routing information | medium |
+| Secrets-based | can make routing decision based on secret (for example JWT) | medium |
+| Observability | can use existing observability tooling | low |
+| Self-managed | can be eventually used by [self-managed](goals.md#self-managed) | low |
+| Regional | can route requests to different [regions](goals.md#regions) | low |
### Low Latency
@@ -91,7 +94,7 @@ The main SLI we use is the [rails requests](../../../development/application_sli
It has multiple `satisfied` targets (apdex) depending on the [request urgency](../../../development/application_slis/rails_request.md#how-to-adjust-the-urgency):
| Urgency | Duration in ms |
-|------------|----------------|
+| ---------- | -------------- |
| `:high` | 250 _ms_ |
| `:medium` | 500 _ms_ |
| `:default` | 1000 _ms_ |
@@ -108,7 +111,7 @@ The way we calculate the headroom we have is by using the following:
**`web`**:
| Target Duration | Percentile | Headroom |
-|-----------------|------------|-----------|
+| --------------- | ---------- | --------- |
| 5000 _ms_ | p99 | 4000 _ms_ |
| 5000 _ms_ | p95 | 4500 _ms_ |
| 5000 _ms_ | p90 | 4600 _ms_ |
@@ -131,7 +134,7 @@ _Analysis was done in <https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note
**`api`**:
| Target Duration | Percentile | Headroom |
-|-----------------|------------|-----------|
+| --------------- | ---------- | --------- |
| 5000 _ms_ | p99 | 3500 _ms_ |
| 5000 _ms_ | p95 | 4300 _ms_ |
| 5000 _ms_ | p90 | 4600 _ms_ |
@@ -154,7 +157,7 @@ _Analysis was done in <https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note
**`git`**:
| Target Duration | Percentile | Headroom |
-|-----------------|------------|-----------|
+| --------------- | ---------- | --------- |
| 5000 _ms_ | p99 | 3760 _ms_ |
| 5000 _ms_ | p95 | 4280 _ms_ |
| 5000 _ms_ | p90 | 4430 _ms_ |
@@ -180,7 +183,585 @@ Not yet defined.
## Proposal
-TBD
+The Routing Service implements the following design guidelines:
+
+1. Simple:
+ - Routing service does not buffer requests.
+ - Routing service can only proxy to a single Cell based on request headers.
+1. Stateless:
+ - Routing service does not have permanent storage.
+ - Routing service uses multi-level cache: in-memory, external shared cache.
+1. Zero-trust:
+ - Routing service signs each request that is being proxied.
+   - The trust is established by using a JWT token or a mutual authentication scheme.
+   - Cells can be available over the public internet, as long as they follow the zero-trust model.
+1. Configuration-based:
+ - Routing service is configured with a static list of Cells.
+ - Routing service configuration is applied as part of service deployment.
+1. Rule-based:
+   - Routing service is deployed with routing rules gathered from all Cells.
+   - Routing service supports rules lists generated by different versions of GitLab.
+   - Rules allow matching by any criteria: header, content of the header, or route path.
+1. Agnostic:
+ - Routing service is not aware of high-level concepts like organizations.
+   - The classification is done per the specification provided in the rules, to find the sharding key.
+   - The sharding key result is cached.
+   - A single cached sharding key is used to handle many similar requests.
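A minimal sketch of the multi-level cache described above (all names hypothetical; the production service is expected to run as a Cloudflare Worker, so this Python sketch only illustrates the lookup order, not the implementation):

```python
class TwoLevelCache:
    """Illustrative two-level cache: in-memory first, then an external shared cache."""

    def __init__(self, external_store):
        self.memory = {}                # per-instance, fastest level
        self.external = external_store  # shared across instances (e.g. a KV store)

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        value = self.external.get(key)
        if value is not None:
            self.memory[key] = value    # promote to the in-memory level
        return value

    def put(self, key, value):
        self.memory[key] = value
        self.external[key] = value


shared = {}  # stand-in for the external shared cache
cache = TwoLevelCache(shared)
cache.put("project_id_or_path_encoded=1000", "cell_1")
assert cache.get("project_id_or_path_encoded=1000") == "cell_1"
assert cache.get("unknown-key") is None
```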
+
+The following diagram shows how a user request routes through DNS to the Routing Service deployed
+as a Cloudflare Worker, and how the router chooses a Cell to send the request to.
+
+```mermaid
+graph TD;
+ user((User));
+ router[Routing Service];
+ cell_us0{Cell US0};
+ cell_us1{Cell US1};
+ cell_eu0{Cell EU0};
+ cell_eu1{Cell EU1};
+ user-->router;
+ router-->cell_eu0;
+ router-->cell_eu1;
+ router-->cell_us0;
+ router-->cell_us1;
+ subgraph Europe
+ cell_eu0;
+ cell_eu1;
+ end
+ subgraph United States
+ cell_us0;
+ cell_us1;
+ end
+```
+
+### Routing rules
+
+Each Cell will publish a precompiled list of routing rules that will be consumed by the Routing Service:
+
+- The routing rules describe how to decode the request, find the sharding key, and make the routing decision.
+- The routing rules are compiled during the deployment of the Routing Service.
+  - The deployment process fetches the latest version of the routing rules from each Cell
+ that is part of Routing Service configuration.
+ - The compilation process merges the routing rules from all Cells.
+  - Conflicting rules prevent the routing service from being compiled or started.
+ - Each routing rule entry has a unique identifier to ease the merge.
+ - The Routing Service would be re-deployed only if the list of rules was changed,
+ which shouldn't happen frequently, because we expect the majority of newly added endpoints
+ to already adhere to the prior route rules.
+- The configuration describes from which Cells the routing rules need to be fetched during deploy.
+- The published routing rules might make a routing decision based on a secret. For example, if the session cookie
+  or authentication token has the prefix `c100-`, all requests are forwarded to the given Cell.
+- Each Cell publishes its routing rules at `/api/v4/internal/cells/route_rules.json`.
+- The rules published by a Cell only include endpoints that the particular Cell can process.
+- A Cell might request dynamic classification based on a sharding key, by configuring
+  routing rules to call `/api/v4/internal/cells/classify`.
+- The routing rules should use `prefix` as a way to speed up classification. During the compilation phase
+ the routing service transforms all found prefixes into a decision tree to speed up any subsequent regex matches.
+- The routing rules are ideally compiled into source code as part of deployment, to avoid
+  expensive dynamic parsing and evaluation of the rules.
+
+The routing rules JSON structure describes all matchers:
+
+```json
+{
+ "rules": [
+ {
+ "id": "<unique-identifier>",
+ "cookies": {
+ "<cookie_name>": {
+ "prefix": "<match-given-prefix>",
+ "match_regex": "<regex_match>"
+ },
+ "<cookie_name2>": {
+ "prefix": "<match-given-prefix>",
+ "match_regex": "<regex_match>"
+ }
+ },
+ "headers": {
+ "<header_name>": {
+ "prefix": "<match-given-prefix>",
+ "match_regex": "<regex_match>"
+ },
+ "<header_name2>": {
+ "prefix": "<match-given-prefix>",
+ "match_regex": "<regex_match>"
+      }
+    },
+ "path": {
+ "prefix": "<match-given-prefix>",
+ "match_regex": "<regex_match>"
+ },
+ "method": ["<list_of_accepted_methods>"],
+
+ // If many rules are matched, define which one wins
+ "priority": 1000,
+
+ // Accept request and proxy to the Cell in question
+ "action": "proxy",
+
+ // Classify request based on regex matching groups
+ "action": "classify",
+ "classify": {
+ "keys": ["list_of_regex_match_capture_groups"]
+ }
+ }
+ ]
+}
+```
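Given a rules list in the shape above, request matching could be sketched as follows (hypothetical helper; prefix checks act as a cheap pre-filter before any regex runs, and the highest-priority match wins):

```python
import re

def match_rules(rules, path, cookies=None, headers=None):
    """Return the matching rule with the highest priority, or None."""
    cookies = cookies or {}
    headers = headers or {}
    matched = []
    for rule in rules:
        spec = rule.get("path")
        if spec:
            # Cheap prefix pre-filter, mirroring the decision-tree optimization.
            if "prefix" in spec and not path.startswith(spec["prefix"]):
                continue
            if "match_regex" in spec and not re.match(spec["match_regex"], path):
                continue
        ok = True
        for name, cspec in rule.get("cookies", {}).items():
            if "prefix" in cspec and not cookies.get(name, "").startswith(cspec["prefix"]):
                ok = False
        for name, hspec in rule.get("headers", {}).items():
            if "prefix" in hspec and not headers.get(name, "").startswith(hspec["prefix"]):
                ok = False
        if ok:
            matched.append(rule)
    return max(matched, key=lambda r: r.get("priority", 0), default=None)


rules = [
    {"id": "catch-all", "path": {"prefix": "/"}, "action": "proxy", "priority": 1},
    {"id": "session", "cookies": {"_gitlab_session": {"prefix": "eu0_"}},
     "path": {"prefix": "/"}, "action": "proxy", "priority": 1000},
]
rule = match_rules(rules, "/my-company/my-project",
                   cookies={"_gitlab_session": "eu0_abc"})
assert rule["id"] == "session"
```

Without the prefixed cookie, only the catch-all rule matches, so the request would be proxied to the default Cell.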
+
+Example of the routing rules published by Cell 100, which makes a routing decision based on a session cookie and a secret.
+A high priority is assigned because the routing rules are secret-based and should take precedence over all other matchers:
+
+```json
+{
+ "rules": [
+ {
+ "id": "t4mkd5ndsk58si6uwwz7rdavil9m2hpq",
+ "cookies": {
+ "_gitlab_session": {
+ "prefix": "c100-" // accept `_gitlab_session` that are prefixed with `c100-`
+ }
+ },
+ "action": "proxy",
+ "priority": 1000
+ },
+ {
+ "id": "jcshae4d4dtykt8byd6zw1ecccl5dkts",
+ "headers": {
+ "GITLAB_TOKEN": {
+ "prefix": "C100_" // accept `GITLAB_TOKEN` that are prefixed with `C100_`
+ }
+ },
+ "action": "proxy",
+ "priority": 1000
+ }
+ ]
+}
+```
+
+Example of the routing rules, published by all Cells, that make a routing decision based on the path:
+
+```json
+{
+ "rules": [
+ {
+ "id": "c9scvaiwj51a75kzoh917uwtnw8z4ebl",
+ "path": {
+ "prefix": "/api/v4/projects/", // speed-up rule matching
+ "match_regex": "^/api/v4/projects/(?<project_id_or_path_encoded>[^/]+)(/.*)?$"
+ },
+ "action": "classify",
+ "classify": {
+ "keys": ["project_id_or_path_encoded"]
+ }
+ }
+ ]
+}
+```
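For the rule above, decoding the sharding key from the named capture group could be sketched as follows (hypothetical helper; note that Python spells the named group `(?P<...>)` where the JSON example uses the `(?<...>)` syntax):

```python
import re

RULE = {
    "path": {
        "prefix": "/api/v4/projects/",
        "match_regex": r"^/api/v4/projects/(?P<project_id_or_path_encoded>[^/]+)(/.*)?$",
    },
    "classify": {"keys": ["project_id_or_path_encoded"]},
}

def extract_keys(rule, path):
    """Return the sharding keys decoded from the path, or None when nothing matches."""
    spec = rule["path"]
    if not path.startswith(spec["prefix"]):  # cheap prefix pre-filter
        return None
    match = re.match(spec["match_regex"], path)
    if not match:
        return None
    return {key: match.group(key) for key in rule["classify"]["keys"]}


assert extract_keys(RULE, "/api/v4/projects/1000/issues") == {
    "project_id_or_path_encoded": "1000"
}
assert extract_keys(RULE, "/unrelated/path") is None
```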
+
+### Classification
+
+Each Cell implements a classification endpoint:
+
+- The classification endpoint is at `/api/v4/internal/cells/classify` (or gRPC endpoint).
+- The classification endpoint accepts a list of sharding keys. Sharding keys are decoded from the request,
+  based on the routing rules provided by the Cell.
+- The endpoint returns other equivalent sharding keys to pre-populate the cache for similar requests.
+  This ensures that all similar requests can be handled quickly without having to classify each time.
+- Routing Service tracks the health of Cells, and issues a `classify` request to Cells based on weights,
+ health of the Cell, or other defined criteria. Weights would indicate which Cell is preferred to perform the
+ classification of sharding keys.
+- Routing Service retries the `classify` call for a reasonable amount of time.
+  Repeated failure of a Cell to `classify` is indicative of the Cell being unhealthy.
+- The `classify` result is cached regardless of the returned `action` (proxy or reject).
+  Rejected classifications are cached to prevent an excessive number of
+  requests for sharding keys that are not found.
+- The cached response is kept for a time defined by `expiry` and `refresh`.
+  - The `expiry` defines when the item is removed from the cache unless used.
+  - The `refresh` defines when the item needs to be reclassified if used.
+  - The refresh is done asynchronously, because a previously classified request should be served without delay. Refreshing ensures that the cache is always hot and up to date.
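A sketch of the `expiry`/`refresh` semantics (illustrative only; the clock is injected so the states are deterministic):

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    value: str
    cached_at: float
    refresh_after: float   # seconds until a background reclassify is due
    expires_after: float   # seconds until the entry must no longer be served

    def state(self, now: float) -> str:
        age = now - self.cached_at
        if age >= self.expires_after:
            return "expired"   # drop the entry; classify synchronously
        if age >= self.refresh_after:
            return "stale"     # serve it, but trigger an asynchronous refresh
        return "fresh"         # serve directly from cache


entry = CacheEntry(value="cell_1", cached_at=0.0,
                   refresh_after=600, expires_after=3600)
assert entry.state(now=10) == "fresh"
assert entry.state(now=700) == "stale"    # served, refreshed in the background
assert entry.state(now=4000) == "expired"
```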
+
+For the above example:
+
+1. The router sees a request to `/api/v4/projects/1000/issues`.
+1. It selects the above `rule` for this request, which requests `classify` for `project_id_or_path_encoded`.
+1. It decodes `project_id_or_path_encoded` to be `1000`.
+1. It checks the cache to see whether `project_id_or_path_encoded=1000` is associated with any Cell.
+1. It sends the request to `/api/v4/internal/cells/classify` if no Cell was found in the cache.
+1. Rails responds with the Cell holding the given project, and also all other equivalent sharding keys
+   for the resource that should be put in the cache.
+1. Routing Service caches the result for the duration specified in the configuration, or in the response.
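These lookup steps can be sketched end to end with a stubbed classify call (all names hypothetical; the real call is the `POST /api/v4/internal/cells/classify` request shown below):

```python
def route(sharding_key, cache, classify):
    """Return the Cell for a sharding key, consulting the cache first.

    `classify` stands in for the classify endpoint call; it returns
    (cell, equivalent_keys) so equivalent keys can pre-populate the cache.
    """
    if sharding_key in cache:
        return cache[sharding_key]
    cell, equivalent_keys = classify(sharding_key)
    for key in equivalent_keys:  # warm the cache for similar requests
        cache[key] = cell
    return cell


cache = {}
calls = []

def fake_classify(key):
    calls.append(key)
    return "cell_1", [key, "project_full_path=gitlab-org/gitlab"]

assert route("project_id_or_path_encoded=1000", cache, fake_classify) == "cell_1"
# The equivalent key was cached, so this lookup never calls classify:
assert route("project_full_path=gitlab-org/gitlab", cache, fake_classify) == "cell_1"
assert calls == ["project_id_or_path_encoded=1000"]
```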
+
+```json
+# POST /api/v4/internal/cells/classify
+## Request:
+{
+ "metadata": {
+ "rule_id": "c9scvaiwj51a75kzoh917uwtnw8z4ebl",
+ "headers": {
+ "all_request_headers": "value"
+ },
+ "method": "GET",
+ "path": "/api/v4/projects/100/issues"
+ },
+ "keys": {
+ "project_id_or_path_encoded": 100
+ }
+}
+
+## Response:
+{
+ "action": "proxy",
+ "proxy": {
+ "name": "cell_1",
+ "url": "https://cell1.gitlab.com"
+ },
+ "ttl": "10 minutes",
+ "matched_keys": [ // list of all equivalent keys that should be put in the cache
+ { "project_id_or_path_encoded": 100 },
+ { "project_id_or_path_encoded": "gitlab-org%2Fgitlab" },
+ { "project_full_path": "gitlab-org/gitlab" },
+ { "namespace_full_path": "gitlab-org" },
+ { "namespace_id": 10 },
+ { "organization_full_path": "gitlab-inc" },
+    { "organization_id": 50 }
+ ]
+}
+```
+
+The following code represents a negative response when a sharding key was not found:
+
+```json
+# POST /api/v4/internal/cells/classify
+## Request:
+{
+ "metadata": {
+ "rule_id": "c9scvaiwj51a75kzoh917uwtnw8z4ebl",
+ "headers": {
+ "all_request_headers": "value"
+ },
+ "method": "GET",
+ "path": "/api/v4/projects/100/issues"
+ },
+ "keys": {
+ "project_id_or_path_encoded": 100
+ }
+}
+
+## Response:
+{
+ "action": "reject",
+ "reject": {
+ "http_status": 404
+ },
+ "cache": {
+ "refresh": "10 minutes",
+ "expiry": "10 minutes"
+ },
+ "matched_keys": [ // list of all equivalent keys that should be put in the cache
+    { "project_id_or_path_encoded": 100 }
+ ]
+}
+```
+
+### Configuration
+
+The Routing Service will use a configuration similar to this:
+
+```toml
+[[cells]]
+name = "cell_1"
+url = "https://cell1.gitlab.com"
+key = "ABC123"
+classify_weight = 100
+
+[[cells]]
+name = "cell_2"
+url = "https://cell2.gitlab.com"
+key = "CDE123"
+classify_weight = 1
+
+[cache.memory.classify]
+refresh_time = "10 minutes"
+expiry_time = "1 hour"
+
+[cache.external.classify]
+refresh_time = "30 minutes"
+expiry_time = "6 hours"
+```
+
+We assume that it is acceptable to provide a static list of Cells, because:
+
+1. Static: Cells are unlikely to be dynamically provisioned and decommissioned.
+1. Good enough: We can manage such a list even up to 100 Cells.
+1. Simple: We don't have to implement robust service discovery in the service,
+   and we have a guarantee that this list is always exhaustive.
+
+The configuration describes all Cells, URLs, zero-trust keys, and weights,
+and how long requests should be cached. The `classify_weight` defines how often
+the Cell should receive classification requests versus other Cells.
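The `classify_weight` selection could be sketched as a weighted random choice among Cells (illustrative; per the requirements above, the real criteria also factor in Cell health):

```python
import random

CELLS = [
    {"name": "cell_1", "classify_weight": 100},
    {"name": "cell_2", "classify_weight": 1},
]

def pick_cell_for_classify(cells, rng=random):
    """Pick a Cell to handle a classify request, proportionally to its weight."""
    weights = [cell["classify_weight"] for cell in cells]
    return rng.choices(cells, weights=weights, k=1)[0]


rng = random.Random(42)  # seeded for reproducibility
picks = [pick_cell_for_classify(CELLS, rng)["name"] for _ in range(1000)]
# With weights 100:1, cell_1 should receive the vast majority of requests.
assert picks.count("cell_1") > picks.count("cell_2")
```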
+
+## Request flows
+
+1. There are two Cells.
+1. `gitlab-org` is a top-level namespace and lives in `Cell US0` in the `GitLab.com Public` organization.
+1. `my-company` is a top-level namespace and lives in `Cell EU0` in the `my-organization` organization.
+
+### Router configured to perform static routing
+
+1. The Cell US0 supports all other public-facing projects.
+1. The Cells are configured to generate all secrets and session cookies with a prefix, like `eu0_` for Cell EU0.
+   1. The Personal Access Token is scoped to an Organization, and because the Organization is part of only a single Cell,
+      the PATs generated are prefixed with the Cell identifier.
+   1. The Session Cookie encodes the Organization in use, and because the Organization is part of only a single Cell,
+      the session cookie generated is prefixed with the Cell identifier.
+1. The Cell EU0 allows only private organizations, groups, and projects.
+1. The Cell US0 is the target Cell for all requests unless explicitly prefixed.
+
+Cell US0:
+
+```json
+{
+ "rules": [
+ {
+ "id": "tjh147se67wadjzum7onwqiad2b75uft",
+ "path": {
+ "prefix": "/"
+ },
+ "action": "proxy",
+ "priority": 1
+ }
+ ]
+}
+```
+
+Cell EU0:
+
+```json
+{
+ "rules": [
+ {
+ "id": "t4mkd5ndsk58si6uwwz7rdavil9m2hpq",
+ "cookies": {
+ "_gitlab_session": {
+ "prefix": "eu0_"
+ }
+ },
+ "path": {
+ "prefix": "/"
+ },
+ "action": "proxy",
+ "priority": 1000
+ },
+ {
+ "id": "jcshae4d4dtykt8byd6zw1ecccl5dkts",
+ "headers": {
+ "GITLAB_TOKEN": {
+ "prefix": "eu0_"
+ }
+ },
+ "path": {
+ "prefix": "/"
+ },
+ "action": "proxy",
+ "priority": 1000
+ }
+ ]
+}
+```
+
+#### Navigates to `/my-company/my-project` while logged in to Cell EU0
+
+1. Because the user switched the Organization to `my-company`, their session cookie is prefixed with `eu0_`.
+1. The user sends a request to `/my-company/my-project`, and because the cookie is prefixed with `eu0_`, it is directed to Cell EU0.
+1. `Cell EU0` returns the correct response.
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router as Router
+ participant cell_eu0 as Cell EU0
+ participant cell_eu1 as Cell EU1
+ user->>router: GET /my-company/my-project<br/>_gitlab_session=eu0_uwwz7rdavil9
+ router->>cell_eu0: GET /my-company/my-project
+ cell_eu0->>user: <h1>My Project...
+```
+
+#### Navigates to `/my-company/my-project` while not logged in
+
+1. The user visits `/my-company/my-project`, and because they do not have a session cookie, the request is forwarded to `Cell US0`.
+1. The user signs in.
+1. GitLab sees that the user's default Organization is `my-company`, so it assigns a session cookie prefixed with `eu0_` to indicate that
+   the user is meant to interact with `my-company`.
+1. The user sends a request to `/my-company/my-project` again, now with a session cookie that routes to `Cell EU0`.
+1. `Cell EU0` returns the correct response.
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router as Router
+ participant cell_us0 as Cell US0
+ participant cell_eu0 as Cell EU0
+ user->>router: GET /my-company/my-project
+ router->>cell_us0: GET /my-company/my-project
+ cell_us0->>user: HTTP 302 /users/sign_in?redirect=/my-company/my-project
+ user->>router: GET /users/sign_in?redirect=/my-company/my-project
+ router->>cell_us0: GET /users/sign_in?redirect=/my-company/my-project
+ cell_us0-->>user: <h1>Sign in...
+ user->>router: POST /users/sign_in?redirect=/my-company/my-project
+ router->>cell_us0: POST /users/sign_in?redirect=/my-company/my-project
+ cell_us0->>user: HTTP 302 /my-company/my-project<br/>_gitlab_session=eu0_uwwz7rdavil9
+ user->>router: GET /my-company/my-project<br/>_gitlab_session=eu0_uwwz7rdavil9
+ router->>cell_eu0: GET /my-company/my-project<br/>_gitlab_session=eu0_uwwz7rdavil9
+ cell_eu0->>user: <h1>My Project...
+```
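The cookie assignment at sign-in can be sketched as follows. The helper and the Organization-to-prefix mapping are hypothetical; the real implementation would derive the prefix from the Cell hosting the user's default Organization.

```ruby
# Hypothetical mapping from an Organization to the identifier prefix of
# the Cell that hosts it.
CELL_PREFIX_BY_ORGANIZATION = { "my-company" => "eu0_", "gitlab-org" => "us0_" }.freeze

# On sign-in, prefix the session identifier so that the Router can route
# all subsequent requests without any lookup.
def session_cookie_for(default_organization, session_id)
  prefix = CELL_PREFIX_BY_ORGANIZATION.fetch(default_organization, "")
  "#{prefix}#{session_id}"
end

session_cookie_for("my-company", "uwwz7rdavil9") # => "eu0_uwwz7rdavil9"
```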
+
+#### Navigates to `/gitlab-org/gitlab` after last step
+
+The user visits `/gitlab-org/gitlab`, but because the session cookie is still prefixed with `eu0_`, the request is directed to `Cell EU0`, which does not host the project and returns `HTTP 404`.
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router as Router
+ participant cell_eu0 as Cell EU0
+ participant cell_us0 as Cell US0
+ user->>router: GET /gitlab-org/gitlab<br/>_gitlab_session=eu0_uwwz7rdavil9
+ router->>cell_eu0: GET /gitlab-org/gitlab
+ cell_eu0->>user: HTTP 404
+```
+
+### Router configured to perform dynamic routing based on classification
+
+The Cells publish route rules that allow the Router to classify requests.
+
+Cell US0 and EU0:
+
+```json
+{
+ "rules": [
+ {
+ "id": "tjh147se67wadjzum7onwqiad2b75uft",
+ "path": {
+        "prefix": "/",
+        "regex": "^/(?<top_level_group>[^/]+)(/.*)?$"
+      },
+ "action": "classify",
+ "classify": {
+ "keys": ["top_level_group"]
+ }
+ },
+ {
+ "id": "jcshae4d4dtykt8byd6zw1ecccl5dkts",
+ "path": {
+ "prefix": "/"
+ },
+ "action": "proxy"
+ }
+ ]
+}
+```
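A Ruby sketch of how the `classify` action could behave, assuming a named-capture regex for the top-level group and a hypothetical in-memory cache; the call to `/api/v4/internal/cells/classify` is stubbed here.

```ruby
# Hypothetical in-memory cache of sharding key => Cell.
CLASSIFICATION_CACHE = {}

# Named-capture regex extracting the sharding key from the request path.
TOP_LEVEL_GROUP_REGEX = %r{\A/(?<top_level_group>[^/]+)(/.*)?\z}

def cell_for(path)
  match = TOP_LEVEL_GROUP_REGEX.match(path)
  return nil unless match

  key = match[:top_level_group]
  # On a cache miss, ask a random Cell's classify endpoint and memoize.
  CLASSIFICATION_CACHE[key] ||= classify_via_api(top_level_group: key)
end

# Stub for POST /api/v4/internal/cells/classify.
def classify_via_api(top_level_group:)
  { "my-company" => "cell_eu0", "gitlab-org" => "cell_us0" }.fetch(top_level_group)
end

cell_for("/my-company/my-project") # => "cell_eu0" (classified, then cached)
cell_for("/my-company/issues")     # => "cell_eu0" (cache hit)
```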
+
+#### Navigates to `/my-company/my-project` while logged in into Cell EU0
+
+1. The user visits `/my-company/my-project`.
+1. The Router decodes the sharding key `top_level_group=my-company`.
+1. The Router checks whether this sharding key is cached.
+1. Because it is not, a classification request is sent to the `/classify` endpoint of a random Cell.
+1. The `classify` response is cached.
+1. The request is then proxied to the Cell returned by the classification.
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router as Router
+ participant cache as Cache
+ participant cell_us0 as Cell US0
+ participant cell_eu0 as Cell EU0
+ user->>router: GET /my-company/my-project
+ router->>cache: CACHE_GET: top_level_group=my-company
+ cache->>router: CACHE_NOT_FOUND
+ router->>cell_us0: POST /api/v4/internal/cells/classify<br/>top_level_group=my-company
+ cell_us0->>router: CLASSIFY: top_level_group=my-company, cell=cell_eu0
+ router->>cache: CACHE_SET: top_level_group=my-company, cell=cell_eu0
+ router->>cell_eu0: GET /my-company/my-project
+ cell_eu0->>user: <h1>My Project...
+```
+
+#### Navigates to `/my-company/my-project` while not logged in
+
+1. The user visits `/my-company/my-project`.
+1. The Router decodes the sharding key `top_level_group=my-company`.
+1. The Router checks whether this sharding key is cached.
+1. Because it is not, a classification request is sent to the `/classify` endpoint of a random Cell.
+1. The `classify` response is cached.
+1. The request is then proxied to the Cell returned by the classification.
+1. Because the project is private, the user is redirected to sign in.
+1. The sign-in page is defined to be handled by all Cells, so the request is proxied to a random Cell.
+1. After logging in, the user visits `/my-company/my-project` again.
+1. The `top_level_group=my-company` key is found in the cache, so the request is proxied to the correct Cell.
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router as Router
+ participant cache as Cache
+ participant cell_us0 as Cell US0
+ participant cell_eu0 as Cell EU0
+ user->>router: GET /my-company/my-project
+ router->>cache: CACHE_GET: top_level_group=my-company
+ cache->>router: CACHE_NOT_FOUND
+ router->>cell_us0: POST /api/v4/internal/cells/classify<br/>top_level_group=my-company
+ cell_us0->>router: CLASSIFY: top_level_group=my-company, cell=cell_eu0
+ router->>cache: CACHE_SET: top_level_group=my-company, cell=cell_eu0
+ router->>cell_eu0: GET /my-company/my-project
+ cell_eu0->>user: HTTP 302 /users/sign_in?redirect=/my-company/my-project
+ user->>router: GET /users/sign_in?redirect=/my-company/my-project
+ router->>cell_us0: GET /users/sign_in?redirect=/my-company/my-project
+ cell_us0-->>user: <h1>Sign in...
+ user->>router: POST /users/sign_in?redirect=/my-company/my-project
+ router->>cell_eu0: POST /users/sign_in?redirect=/my-company/my-project
+ cell_eu0->>user: HTTP 302 /my-company/my-project
+ user->>router: GET /my-company/my-project
+ router->>cache: CACHE_GET: top_level_group=my-company
+ cache->>router: CACHE_FOUND: cell=cell_eu0
+ router->>cell_eu0: GET /my-company/my-project
+ cell_eu0->>user: <h1>My Project...
+```
+
+#### Navigates to `/gitlab-org/gitlab` after last step
+
+1. Because `gitlab-org` is not found in the cache, it is classified, and the request is then directed to the correct Cell.
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router as Router
+ participant cache as Cache
+ participant cell_us0 as Cell US0
+ participant cell_eu0 as Cell EU0
+ user->>router: GET /gitlab-org/gitlab
+ router->>cache: CACHE_GET: top_level_group=gitlab-org
+ cache->>router: CACHE_NOT_FOUND
+ router->>cell_us0: POST /api/v4/internal/cells/classify<br/>top_level_group=gitlab-org
+ cell_us0->>router: CLASSIFY: top_level_group=gitlab-org, cell=cell_us0
+ router->>cache: CACHE_SET: top_level_group=gitlab-org, cell=cell_us0
+ router->>cell_us0: GET /gitlab-org/gitlab
+ cell_us0->>user: <h1>My Project...
+```
+
+### Performance and reliability considerations
+
+- It is expected that each Cell can classify all sharding keys.
+- Alternatively, the classification could be done by the Cluster-wide Data Provider,
+  if it owned all the data required to classify.
+- The published routing rules allow static criteria to be defined, making it possible
+  to take a routing decision based only on a secret. As a result, the Routing Service
+  adds no latency to request processing and offers superior resiliency.
+- A penalty is expected when learning a new sharding key. However, the multi-layer
+  cache should provide a very high cache hit ratio, due to the low cardinality of
+  sharding keys. A sharding key effectively maps to a resource (organization, group,
+  or project), and there is a finite number of those.
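A minimal sketch of one cache layer with expiry, illustrating why the learning penalty is paid only once per sharding key; the class name and TTL are assumptions.

```ruby
# Hypothetical TTL cache for classified sharding keys. The block passed
# to #fetch performs the (expensive) classification and runs only on a
# miss or after expiry.
class ShardingKeyCache
  Entry = Struct.new(:cell, :expires_at)

  def initialize(ttl_seconds: 600)
    @ttl = ttl_seconds
    @entries = {}
  end

  def fetch(key)
    entry = @entries[key]
    return entry.cell if entry && entry.expires_at > Time.now

    cell = yield # classify on miss or expiry
    @entries[key] = Entry.new(cell, Time.now + @ttl)
    cell
  end
end

cache = ShardingKeyCache.new(ttl_seconds: 60)
cache.fetch("my-company") { "cell_eu0" } # miss: classification block runs
cache.fetch("my-company") { raise }      # hit: block is not called
```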
## Technology
@@ -188,7 +769,30 @@ TBD
## Alternatives
-TBD
+### Buffering requests
+
+The [Stateless Router using Requests Buffering](proposal-stateless-router-with-buffering-requests.md)
+describes an approach where Cell answers with `X-Gitlab-Cell-Redirect` to redirect request to another Cell:
+
+- This is based on a need to buffer the whole request (headers + body) which is very memory intensive.
+- This proposal does not provide an easy way to handle mixed deployment of Cells, where Cells might be running different versions.
+- This proposal likely requires caching significantly more information, since it is based on requests, rather than on decoded sharding keys.
+
+### Learn request
+
+The [Stateless Router using Routes Learning](proposal-stateless-router-with-routes-learning.md)
+describes an approach similar to the one in this document, except that route rules and classification
+are combined into a single pre-flight check against `/api/v4/internal/cells/learn`:
+
+- This makes route learning entirely dynamic and dependent on the availability of the Cells.
+- This proposal does not provide an easy way to handle mixed deployment of Cells, where Cells might be running different versions.
+- This proposal likely requires caching significantly more information, since it is based on requests, rather than on decoded sharding keys.
+
+## FAQ
+
+1. How and when will the Routing Service compile its set of rules?
+
+To be defined.
## Links
diff --git a/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/ci_insights.md b/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/ci_insights.md
new file mode 100644
index 00000000000..72d82558eb7
--- /dev/null
+++ b/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/ci_insights.md
@@ -0,0 +1,154 @@
+---
+status: proposed
+creation-date: "2023-01-25"
+authors: [ "@pedropombeiro", "@vshushlin"]
+coach: "@grzesiek"
+approvers: [ ]
+stage: Verify
+group: Runner
+participating-stages: []
+description: 'CI Insights design'
+---
+
+# CI Insights
+
+## Summary
+
+As part of the Fleet Metrics, we would like to have a section dedicated to CI insights to help users monitor pipelines and summarize findings about pipeline speed, common job failures, and more. It would eventually offer actionable insights to help users optimize and fix issues with their CI/CD.
+
+## Motivation
+
+We have a [page for CI/CD Analytics](https://gitlab.com/gitlab-org/gitlab/-/pipelines/charts?chart=pipelines) that contains some very basic analytics on pipelines. Most of this information relates to the **total** number of pipelines over time, which does not give any real value to customers: projects will always see an increase in the number of pipelines over time, so the total number of pipelines is of little consequence.
+
+![Current page](img/current_page.png)
+
+Because this page lacks real insights, understanding pipeline slowdowns or failures is hard and becomes a very manual task. We want to empower users to optimize their workflow in a centralized place, avoiding the manual labor associated with either querying the API for data and then parsing it by hand, or navigating the UI through dozens of pages until the insight or action required can be found.
+
+As we are going to process large quantities of data relating to a project's pipelines, there is potential to eventually summarize findings with an AI tool to give insights into job failures, pipeline slowdowns, and flaky specs. As AI has become a crucial part of our product roadmap and Verify lacks any promising lead in that area, this page could be the center of this new addition.
+
+### Goals
+
+- Deliver a new Pipelines Analysis Dashboard page
+- Have excellent data visualization to help digest information quickly
+- Flexible querying to let users get the information they want
+- Clear actionables based on information presented in the page
+- Show some default information on landing, like pipeline durations over time and the slowest jobs
+- Make the CI/CD Analytics page more accessible, liked, and remembered (AKA, more page views)
+
+### Non-Goals
+
+We do not aim to improve the GitLab project's pipeline speed. This feature could help us achieve this, but it is not a direct objective of this blueprint.
+
+We also are not aiming to have AI in the first iteration, and should instead focus on making as much information available and digestible as possible.
+
+## Proposal
+
+Revamp the [page for CI/CD Analytics](https://gitlab.com/gitlab-org/gitlab/-/pipelines/charts?chart=pipelines) to include more meaningful data so that users can troubleshoot their pipelines with ease. Here is a list of the main improvements:
+
+### Overall statistics
+
+The current "overall statistics" will become a one-line header in a smaller font to keep this information available without taking as much visual space. We will replace the pipelines chart with a stacked bar plot where each stack of a bar represents a status and each bar is a unit of time (a day in the daily view, a month in the monthly view, a year in the yearly view), so users can keep track of how many pipelines ran in that specific unit of time and what percentage of those pipelines failed or succeeded.
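The data behind such a stacked bar plot is a per-bucket status tally. A Ruby sketch with made-up records; in production, this aggregation would run in ClickHouse rather than in application code.

```ruby
# Made-up finished pipelines; each bar buckets pipelines by day.
pipelines = [
  { day: "2024-01-01", status: "success" },
  { day: "2024-01-01", status: "failed" },
  { day: "2024-01-02", status: "success" },
]

# One stacked bar per day: a status => count map for each time bucket.
stacks = pipelines
  .group_by { |p| p[:day] }
  .transform_values { |rows| rows.map { |r| r[:status] }.tally }

stacks
# => {"2024-01-01"=>{"success"=>1, "failed"=>1}, "2024-01-02"=>{"success"=>1}}
```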
+
+### Pipeline duration graph
+
+A new pipeline duration graph, customizable by type (MR pipelines, pipelines on a specific branch, and so on), number of runs, and status (success, failed, and so on), will replace the current `Pipeline durations for the last 30 commits` chart. The existing chart checks the latest 30 commits made on the repository with no filtering, so the results presented are not very valuable.
+
+We will also surface jobs that failed multiple times and the slowest jobs in the last x pipelines on the default branch. All of this supports the effort of allowing users to query their pipeline data to figure out what they need to improve or what kind of problems they are facing with their CI/CD configuration.
+
+### Visibility
+
+Add a link in the `pipelines` page to increase the visibility of this feature. We can add a new option alongside the `Run pipeline` primary button.
+
+### Master Broken
+
+Add an "Is master broken?" quick option that scans the last x pipelines on the main branch and checks for failed jobs. All jobs that failed multiple times will be listed in a table, with the option to create an incident from that list.
+
+### Color scheme
+
+Rethink our current color schemes for data visualization when it comes to pipeline statuses. We currently use the default visualization colors, but they don't match the colors users have grown accustomed to for pipeline and job statuses. There is an opportunity here to help users better understand their data through more relevant color schemes and better visualization.
+
+### Routing
+
+Change the routing from `pipelines/charts` to `pipelines/analytics` since `charts` is a really restrictive terminology when talking about data visualization. It also doesn't really convey what this page is, which is a way to get information, not just nice charts. Then we can also get rid of the query parameter for the tabs and instead support first-class routing.
+
+## Design and implementation details
+
+### New API for aggregated data
+
+This feature depends on having a new set of data available to us that aggregates job and pipeline insights and makes them available to the client.
+
+We'll start by aggregating data from ClickHouse, and probably only for `gitlab.com`, as the MVC. We will aggregate the data on the backend on the fly. So far ClickHouse has been very capable of such things.
+
+We won't store the aggregated data anywhere (we'll probably have materialized views in ClickHouse, but nothing more complex). Then, if the feature gets traction, we can explore ways to bring it to environments without ClickHouse.
+
+This way we can move fast, test our ideas with real users, and get feedback.
+
+### Feature flag
+
+To develop this new analytics page, we will gate it behind a `ci_insights` feature flag and conditionally render the old or the new analytics page. Potentially, we could even check the flag in the controller to decide which route to render: the new `/analytics` when the flag is on, and the old `/charts` when it isn't.
+
+### Add analytics on page view
+
+Make sure that we can get information on how often this page is viewed. If we do not have it, we should implement page-view tracking to know how visible this page is. The changes to this section should make the view count go up, and we want to track this as a measure of success.
+
+### Routing
+
+We are planning to add new routes for the page and set up some redirects. To read more about the routing proposal, see the [related issue](https://gitlab.com/gitlab-org/gitlab/-/issues/437556).
+
+### Pipelines duration graph
+
+We want a way for users to query pipeline data with many different criteria: most notably, querying only for pipelines with the scope `finished`, or by status `success` or `failed`. There is also the possibility of scoping this to a ref, so users could look at the main branch or even a branch that introduces a CI/CD change. We want branch comparison for pipeline speed.
+
+To get more accurate data, we want to increase the number of pipelines requested. In GraphQL, we have a limit of 100 items, and we will probably hit performance degradation quite quickly. We need to define how we could fetch larger data sets for more accurate data visualization.
+
+### Jobs insights
+
+Currently, there is no way to query a single job across multiple pipelines, and it prevents us from writing a query that would look like this:
+
+```graphql
+query getJob($projectPath: ID!, $jobName: String!) {
+  project(fullPath: $projectPath) {
+    job(name: $jobName, last: 100) {
+      nodes {
+        id
+        duration
+      }
+    }
+  }
+}
+```
+
+There are plans to create a new unified table to log job analytics, and it is not yet defined what this API will look like. Without committing yet to an API definition, we want some unified way to query information for analytics that may look roughly like so:
+
+```ruby
+get_jobs(project_id:, job_name: nil, stage: nil, stage_index: nil, *etc)
+# =>
+[{id: 1, duration: 134, status: 'failed'}, *etc]
+
+get_jobs_statistics(project_id:, job_name:, *etc)
+# =>
+[{time_bucket: '2024-01-01:00:00:00', avg_duration: 234, count: 123, statuses_count: {success: 123, failed: 45, cancelled: 45}}]
+```
+
+### Revamping our charts
+
+Explore a new color scheme and a nicer look for our charts. Collaborate with UX to determine whether this is something already on their minds, and support any initiative to have nicer, more modern-looking charts, as our charts are quite forgettable.
+
+## Alternative Solutions
+
+### New page
+
+We could create a brand new page and leave this section as it is. The pro would be that we could perhaps have a more prominent placement in the Navigation under `Build`, while the con is that we'd have a clear overlap with the existing section.
+
+### Pipeline analysis per pipeline
+
+There was an [experiment](https://gitlab.com/gitlab-org/gitlab/-/issues/365902) in the past to add performance insights **per pipeline**. The experiment was removed and deemed not viable. Some of the findings were that:
+
+- Users did not interact with the page as much as expected and would not click on the button to view insights.
+- Users who did click on the button did not try to get more insights into a job.
+- Users did not leave feedback in the issue.
+
+This experiment mostly reveals that users who visit the pipeline graph page `pipelines/:id` are **not** trying to improve the performance of pipelines. Instead, this page is most likely used to debug pipeline failures, which means those users are from the IC/developer persona, not DevOps engineers trying to improve the workflow. By having this section in a broader area, we expect much better adoption and more useful actionable insights.
+
+### Do nothing
+
+We could leave this section untouched and not add any new form of analytics. The pro here would be the saved resources and time. The con is that we currently have no way to help customers improve their CI/CD configuration speed except by reading our documentation. This revamped section would also be a great gateway for AI features and would help users iterate on their setup.
diff --git a/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/img/current_page.png b/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/img/current_page.png
new file mode 100644
index 00000000000..42b09d37785
--- /dev/null
+++ b/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/img/current_page.png
Binary files differ
diff --git a/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md b/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md
index 104a6ee2136..016db5f5766 100644
--- a/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md
+++ b/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md
@@ -61,6 +61,10 @@ The following customer problems should be solved when addressing this question.
#### Which runners have failures in the past hour?
+## CI Insights
+
+CI Insights is a page that would mostly expose data on pipeline and job durations, with a multitude of different filters, search options, and dynamic graphs. To read more, see [the related sub-section](ci_insights.md).
+
## Implementation
The current implementation plan is based on a
diff --git a/doc/architecture/blueprints/ci_pipeline_components/index.md b/doc/architecture/blueprints/ci_pipeline_components/index.md
index 9a225c9cd97..78d9401d5a5 100644
--- a/doc/architecture/blueprints/ci_pipeline_components/index.md
+++ b/doc/architecture/blueprints/ci_pipeline_components/index.md
@@ -232,7 +232,7 @@ The version of the component can be (in order of highest priority first):
1. A commit SHA - For example: `gitlab.com/gitlab-org/dast@e3262fdd0914fa823210cdb79a8c421e2cef79d8`
1. A tag - For example: `gitlab.com/gitlab-org/dast@1.0`
-1. A special moving target version that points to the most recent released tag - For example: `gitlab.com/gitlab-org/dast@~latest`
+1. A special moving target version that points to the most recent published release - For example: `gitlab.com/gitlab-org/dast@~latest`
1. A branch name - For example: `gitlab.com/gitlab-org/dast@master`
If a tag and branch exist with the same name, the tag takes precedence over the branch.
@@ -244,6 +244,30 @@ As we want to be able to reference any revisions (even those not released), a co
When referencing a component by local path (for example `./path/to/component`), its version is implicit and matches
the commit SHA of the current pipeline context.
+#### The `~latest` version
+
+The use of the `~latest` version qualifier is restricted to releases that are published in the Catalog.
+
+We debated whether `~latest` should be supported for projects that are not marked as catalog resources.
+
+There are various reasons for this decision:
+
+1. Versions could be unlisted from the Catalog, and `~latest` needs to reflect that.
+1. The Catalog will support private resources. There are currently no valid use cases for component projects that
+   have releases but are not published in the Catalog.
+1. In the future we will be separating the process of releasing and publishing a release, allowing users to choose
+ what release is published in the catalog.
+1. We could better enforce the use of semantic versioning when publishing a release, rejecting releases that don't follow
+   the standard. We can't enforce semantic versioning on the release model itself because that wouldn't be backwards
+   compatible.
+1. The latest version will likely be denormalized in the catalog resource data structure for more performant queries, both
+   when displaying Catalog resources and when fetching a component version.
+
+Given the points above, if we supported `~latest` for both catalog resources and unpublished component projects, we could
+introduce discrepancies and surprising behaviors. By starting with a stricter approach, supporting
+only published versions, we keep the freedom to expand this in the future to support unpublished component projects based on
+user demand.
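Under these rules, resolving `~latest` can be sketched as picking the highest published version and ignoring everything else. The data structures below are hypothetical illustrations, not the actual catalog model.

```ruby
# Hypothetical release record: a tagged release that may or may not be
# published in the Catalog.
Release = Struct.new(:tag, :published, keyword_init: true)

releases = [
  Release.new(tag: "1.0.0", published: true),
  Release.new(tag: "1.1.0", published: true),
  Release.new(tag: "2.0.0-rc1", published: false), # released, but not published
]

# `~latest` resolves only among published Catalog releases, ordered by
# semantic version.
def resolve_latest(releases)
  releases.select(&:published)
          .max_by { |release| Gem::Version.new(release.tag) }
          &.tag
end

resolve_latest(releases) # => "1.1.0"
```

The unpublished `2.0.0-rc1` is skipped even though it is the newest release, which is exactly the discrepancy the stricter approach avoids.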
+
### Note about future resource types
In the future, to support multiple types of resources in the Catalog we could
diff --git a/doc/architecture/blueprints/cloud_connector/index.md b/doc/architecture/blueprints/cloud_connector/index.md
index 9aef8bc7a98..50e233a6089 100644
--- a/doc/architecture/blueprints/cloud_connector/index.md
+++ b/doc/architecture/blueprints/cloud_connector/index.md
@@ -170,16 +170,16 @@ It will have the following responsibilities:
We suggest to use one of the following language stacks:
1. **Go.** There is substantial organizational knowledge in writing and running
-Go systems at GitLab, and it is a great systems language that gives us efficient ways to handle requests where
-they merely need to be forwarded (request proxying) and a powerful concurrency mechanism through goroutines. This makes the
-service easier to scale and cheaper to run than Ruby or Python, which scale largely at the process level due to their use
-of Global Interpreter Locks, and use inefficient memory models especially as regards byte stream handling and manipulation.
-A drawback of Go is that resource requirements such as memory use are less predictable because Go is a garbage collected language.
+ Go systems at GitLab, and it is a great systems language that gives us efficient ways to handle requests where
+ they merely need to be forwarded (request proxying) and a powerful concurrency mechanism through goroutines. This makes the
+ service easier to scale and cheaper to run than Ruby or Python, which scale largely at the process level due to their use
+ of Global Interpreter Locks, and use inefficient memory models especially as regards byte stream handling and manipulation.
+ A drawback of Go is that resource requirements such as memory use are less predictable because Go is a garbage collected language.
1. **Rust.** We are starting to build up knowledge in Rust at GitLab. Like Go, it is a great systems language that is
-also starting to see wider adoption in the Ruby ecosystem to write CRuby extensions. A major benefit is more predictable
-resource consumption because it is not garbage collected and allows for finer control of memory use.
-It is also very fast; we found that the Rust implementation for `prometheus-client-mmap` outperformed the original
-extension written in C.
+ also starting to see wider adoption in the Ruby ecosystem to write CRuby extensions. A major benefit is more predictable
+ resource consumption because it is not garbage collected and allows for finer control of memory use.
+ It is also very fast; we found that the Rust implementation for `prometheus-client-mmap` outperformed the original
+ extension written in C.
## Alternative solutions
diff --git a/doc/architecture/blueprints/gitaly_adaptive_concurrency_limit/index.md b/doc/architecture/blueprints/gitaly_adaptive_concurrency_limit/index.md
index f3335a0935e..7f451b4f92b 100644
--- a/doc/architecture/blueprints/gitaly_adaptive_concurrency_limit/index.md
+++ b/doc/architecture/blueprints/gitaly_adaptive_concurrency_limit/index.md
@@ -43,14 +43,14 @@ configurations, especially the value of the concurrency limit, are static. There
are some drawbacks to this:
- It's tedious to maintain a sane value for the concurrency limit. Looking at
-this [production configuration](https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/blob/db11ef95859e42d656bb116c817402635e946a32/roles/gprd-base-stor-gitaly-common.json),
-each limit is heavily calibrated based on clues from different sources. When the
-overall scene changes, we need to tweak them again.
+ this [production configuration](https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/blob/db11ef95859e42d656bb116c817402635e946a32/roles/gprd-base-stor-gitaly-common.json),
+ each limit is heavily calibrated based on clues from different sources. When the
+ overall scene changes, we need to tweak them again.
- Static limits are not good for all usage patterns. It's not feasible to pick a
-fit-them-all value. If the limit is too low, big users will be affected. If the
-value is too loose, the protection effect is lost.
+ fit-them-all value. If the limit is too low, big users will be affected. If the
+ value is too loose, the protection effect is lost.
- A request may be rejected even though the server is idle as the rate is not
-necessarily an indicator of the load induced on the server.
+ necessarily an indicator of the load induced on the server.
To overcome all of those drawbacks while keeping the benefits of concurrency
limiting, one promising solution is to make the concurrency limit adaptive to
@@ -78,18 +78,18 @@ occurs. There are various criteria for determining whether Gitaly is in trouble.
In this proposal, we focus on two things:
- Lack of resources, particularly memory and CPU, which are essential for
-handling Git processes.
+ handling Git processes.
- Serious latency degradation.
The proposed solution is heavily inspired by many materials about this subject
shared by folks from other companies in the industry, especially the following:
- TCP Congestion Control ([RFC-2581](https://www.rfc-editor.org/rfc/rfc2581), [RFC-5681](https://www.rfc-editor.org/rfc/rfc5681),
-[RFC-9293](https://www.rfc-editor.org/rfc/rfc9293.html#name-tcp-congestion-control), [Computer Networks: A Systems Approach](https://book.systemsapproach.org/congestion/tcpcc.html)).
+ [RFC-9293](https://www.rfc-editor.org/rfc/rfc9293.html#name-tcp-congestion-control), [Computer Networks: A Systems Approach](https://book.systemsapproach.org/congestion/tcpcc.html)).
- Netflix adaptive concurrency limit ([blog post](https://tech.olx.com/load-shedding-with-nginx-using-adaptive-concurrency-control-part-1-e59c7da6a6df)
-and [implementation](https://github.com/Netflix/concurrency-limits))
+ and [implementation](https://github.com/Netflix/concurrency-limits))
- Envoy Adaptive Concurrency
-([doc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/adaptive_concurrency_filter#config-http-filters-adaptive-concurrency))
+ ([doc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/adaptive_concurrency_filter#config-http-filters-adaptive-concurrency))
We cannot blindly apply a solution without careful consideration and expect it
to function flawlessly. The suggested approach considers Gitaly's specific
@@ -116,12 +116,11 @@ process functioning but quickly reducing it when an issue occurs.
During initialization, we configure the following parameters:
- `initialLimit`: Concurrency limit to start with. This value is essentially
-equal to the current static concurrency limit.
+ equal to the current static concurrency limit.
- `maxLimit`: Maximum concurrency limit.
- `minLimit`: Minimum concurrency limit so that the process is considered as
-functioning. If it's equal to 0, it rejects all upcoming requests.
-- `backoffFactor`: how fast the limit decreases when a backoff event occurs (`0
-< backoff < 1`, default to `0.75`)
+ functioning. If it's equal to 0, it rejects all upcoming requests.
+- `backoffFactor`: how fast the limit decreases when a backoff event occurs (`0 < backoff < 1`, default to `0.75`)
When the Gitaly process starts, it sets `limit = initialLimit`, in which `limit`
is the maximum in-flight requests allowed at a time.
@@ -130,9 +129,9 @@ Periodically, maybe once per 15 seconds, the value of the `limit` is
re-calibrated:
- `limit = limit + 1` if there is no backoff event since the last
-calibration. The new limit cannot exceed `maxLimit`.
+ calibration. The new limit cannot exceed `maxLimit`.
- `limit = limit * backoffFactor` otherwise. The new limit cannot be lower than
-`minLimit`.
+ `minLimit`.
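The recalibration described above is an additive-increase/multiplicative-decrease loop. A sketch of one calibration tick, using the parameter names defined in this document:

```ruby
# One calibration tick: grow the limit by 1 when no backoff event
# occurred, shrink it by backoffFactor otherwise, clamped to
# [minLimit, maxLimit].
def recalibrate(limit, backoff_event:, max_limit:, min_limit:, backoff_factor: 0.75)
  if backoff_event
    [(limit * backoff_factor).floor, min_limit].max
  else
    [limit + 1, max_limit].min
  end
end

recalibrate(100, backoff_event: false, max_limit: 100, min_limit: 10) # => 100 (already at maxLimit)
recalibrate(100, backoff_event: true,  max_limit: 100, min_limit: 10) # => 75
```

The asymmetry (slow growth, fast decay) is what lets the limit approach capacity gradually while reacting quickly to trouble.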
When a process can no longer handle requests or will not be able to handle them
soon, it is referred to as a back-off event. Ideally, we would love to see the
@@ -151,16 +150,16 @@ The concurrency limit restricts the total number of in-flight requests (IFR) at
a time.
- When `IFR < limit`, Gitaly handles new requests without waiting. After an
-increment, Gitaly immediately handles the subsequent request in the queue, if
-any.
+ increment, Gitaly immediately handles the subsequent request in the queue, if
+ any.
- When `IFR = limit`, it means the limit is reached. Subsequent requests are
-queued, waiting for their turn. If the queue length reaches a configured limit,
-Gitaly rejects new requests immediately. When a request stays in the queue long
-enough, it is also automatically dropped by Gitaly.
+ queued, waiting for their turn. If the queue length reaches a configured limit,
+ Gitaly rejects new requests immediately. When a request stays in the queue long
+ enough, it is also automatically dropped by Gitaly.
- When `IRF > limit`, it's appropriately a consequence of backoff events. It
-means Gitaly handles more requests than the newly appointed limits. In addition
-to queueing upcoming requests similarly to the above case, Gitaly may start
-load-shedding in-flight requests if this situation is not resolved long enough.
+ means Gitaly handles more requests than the newly appointed limits. In addition
+ to queueing upcoming requests similarly to the above case, Gitaly may start
+   load-shedding in-flight requests if this situation persists for long enough.
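
The three cases above amount to a small admission decision. A hedged Ruby sketch (names and the return symbols are illustrative):

```ruby
# Decide what happens to a new request given the current state; a sketch
# of the queueing rules described above, not Gitaly's implementation.
def admit(in_flight, limit, queue_length, max_queue_length)
  if in_flight < limit
    :handle  # a slot is free: handle immediately
  elsif queue_length < max_queue_length
    :queue   # limit reached: wait in the queue (dropped if waiting too long)
  else
    :reject  # queue is full: reject immediately
  end
end

admit(3, 5, 0, 10)  # => :handle
admit(5, 5, 2, 10)  # => :queue
admit(7, 5, 10, 10) # => :reject
```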
At several points in time we have discussed whether we want to change queueing
semantics. Right now we admit queued processes from the head of the queue
@@ -181,16 +180,16 @@ Each system has its own set of signals, and in the case of Gitaly, there are two
aspects to consider:
- Lack of resources, particularly memory and CPU, which are essential for
-handling Git processes like `git-pack-objects(1)`. When these resources are limited
-or depleted, it doesn't make sense for Gitaly to accept more requests. Doing so
-would worsen the saturation, and Gitaly addresses this issue by applying cgroups
-extensively. The following section outlines how accounting can be carried out
-using cgroup.
+ handling Git processes like `git-pack-objects(1)`. When these resources are limited
+ or depleted, it doesn't make sense for Gitaly to accept more requests. Doing so
+ would worsen the saturation, and Gitaly addresses this issue by applying cgroups
+ extensively. The following section outlines how accounting can be carried out
+ using cgroup.
- Serious latency degradation. Gitaly offers various RPCs for different purposes
-besides serving Git data that is hard to reason about latencies. A significant
-overall latency decline is an indication that Gitaly should not accept more
-requests. Another section below describes how to assert latency degradation
-reasonably.
+ besides serving Git data that is hard to reason about latencies. A significant
+ overall latency decline is an indication that Gitaly should not accept more
+ requests. Another section below describes how to assert latency degradation
+ reasonably.
Apart from the above signals, we can consider adding more signals in the future
to make the system smarter. Some examples are Go garbage collector statistics,
diff --git a/doc/architecture/blueprints/gitlab_housekeeper/index.md b/doc/architecture/blueprints/gitlab_housekeeper/index.md
new file mode 100644
index 00000000000..fcb5590772d
--- /dev/null
+++ b/doc/architecture/blueprints/gitlab_housekeeper/index.md
@@ -0,0 +1,133 @@
+---
+status: implemented
+creation-date: "2023-10-18"
+authors: [ "@DylanGriffith" ]
+coach:
+approvers: [ "@rymai", "@tigerwnz" ]
+owning-stage: "~devops::tenant scale"
+participating-stages: []
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+# GitLab Housekeeper - automating merge requests
+
+## Summary
+
+This blueprint documents the philosophy behind the
+["GitLab Housekeeper" gem](https://gitlab.com/gitlab-org/gitlab/-/tree/master/gems/gitlab-housekeeper)
+which was introduced in
+<https://gitlab.com/gitlab-org/gitlab/-/merge_requests/139492> and has already
+been used to create many merge requests.
+
+The tool should be used to save developers from mundane repetitive tasks that
+can be automated. It is scoped to any task that is known ahead of time and
+where a developer needs to create a straightforward merge request.
+
+This tool should be useful for at least the following kinds of mundane MRs
+we create:
+
+1. Remove a feature flag after X date
+1. Remove an unused index where the unused index is identified by some
+ automation
+1. Remove an `ignore_column` after X date (part of renaming/removing columns
+ multi-step procedure)
+1. Populate sharding keys for organizations/cells on tables that are missing a
+ sharding key
+
+## Motivation
+
+We've observed there are many cases where developers are doing a lot of
+manual work for tasks that are entirely predictable and automatable. Often
+these manual tasks are done after waiting some known period of time. As such we
+usually create an issue and set the future milestone. Then in the future the
+developer remembers to follow up on that issue and opens an MR to make the
+manual change.
+
+The biggest examples we've seen lately are:
+
+1. Feature flag removal: <https://gitlab.com/groups/gitlab-org/-/epics/5325>. We
+ have many opportunities for automation with feature flags but this blueprint
+   focuses on removing the feature flag after it's fully rolled out, a step
+   that is often forgotten, leading to growing technical debt.
+1. Removing duplicated or unused indexes in Postgres:
+ <https://gitlab.com/gitlab-org/gitlab/-/issues/385701>. For now we're
+ developing automation that creates issues and assigns them to groups to
+ follow up and manually open MRs to remove them. This blueprint would take it
+ a step further and the automation would just create the MRs to remove them
+ once we have identified them.
+1. Removing out of date `ignore_column` references:
+   <https://docs.gitlab.com/ee/development/database/avoiding_downtime_in_migrations.html#removing-the-ignore-rule-release-m2>.
+   For now we leave a note in our code telling us the date it needs to be
+ removed and often create an issue as a reminder. This blueprint proposes
+ that automation just reads this note and opens the MR to remove it after the
+ date.
+1. Adding and backfilling sharding keys for organizations for Cells:
+ <https://gitlab.com/gitlab-org/gitlab/-/merge_requests/133796>. The cells
+ architecture depends on all tables having a sharding key that is attributed
+ to an organization. We will need to backfill this for ~300 tables. Much of
+ this will be repetitive and mundane work that we can automate provided that
+ groups just identify what the name of the sharding key should be and how we
+ will backfill it. As such we can automate the creation of MRs that guess the
+ sharding key and owning groups can check and correct those MRs. Then we can
+ automate the MR creation for adding the columns and backfilling the data.
+ Some kind of automation like this will be necessary to finish this work in a
+ reasonable timeframe.
+
+### Goals
+
+1. Identify the common tasks that take development time and automate them.
+1. Focus on MR creation rather than issue creation as MRs are the results we
+ want and issues are a process for reminding us to get those results.
+1. Improve developer job satisfaction by knowing that automation is doing the
+ busy work while we get to do the challenging and creative work.
+1. Developers should be encouraged to contribute to the automation framework
+ when they see a pattern rather than documenting the manual work for future
+ developers to do it again.
+1. Automation MRs should be easily identified, and reviewed and merged much
+   more quickly than other MRs. If our automation MRs cause too much effort for
+   reviewers, the costs may outweigh the benefits. This might mean that some
+   automations get disabled when they are just noisy.
+
+## Solution
+
+The
+[GitLab Housekeeper gem](https://gitlab.com/gitlab-org/gitlab/-/tree/master/gems/gitlab-housekeeper)
+should be used to automate creation of mundane merge requests.
+
+Using this tool reflects our
+[bias for action](https://handbook.gitlab.com/handbook/values/#bias-for-action)
+subvalue. As such, developers should prefer contributing a new
+[keep](https://gitlab.com/gitlab-org/gitlab/-/tree/master/keeps) over the following:
+
+1. Documenting a process that involves creating several merge requests over a
+ period of time
+1. Setting up periodic reminders for developers (in Slack or issues) to create
+ some merge request
+
+The keeps may sometimes take more work to implement than documentation or
+reminders, so judgement should be used to assess the likely time savings from
+using automation. The `gitlab-housekeeper` gem will evolve over time with many
+utilities that make it simpler to contribute new keeps, and the cost of
+implementing a keep is expected to become small enough that we will mostly
+prefer this whenever developers need to do a repeatable task more than a few
+times.
+
+## Design and implementation details
+
+The key details of this architecture are:
+
+1. The design of this tool is like a combination of `rubocop -a` and Renovate
+   bot. It extends `rubocop -a` by understanding when things need to be removed
+   after certain deadlines, as well as creating a steady stream of manageable
+   merge requests for the reviewer rather than leaving those decisions to the
+   developer. Like Renovate bot, it attempts to create MRs periodically and
+   assign them to the right people to review.
+1. The keeps live in the GitLab repo which means that there are no
+ dependencies to update and the keeps can use code inside the
+ GitLab codebase.
+1. The script can be run locally by a developer or can be run periodically
+ in some automated way.
+1. The keeps are able to use any data sources (for example, local code, Prometheus,
+ Postgres database archive, logs) needed to determine whether and how to make
+ the change.
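
As an illustration of the last point, a keep for expired `ignore_column` rules might look roughly like the following sketch. The class shape, `Change` struct, and `each_change` interface are assumptions made for illustration, not the gem's actual API:

```ruby
require 'date'

# Hypothetical keep; the class shape, Change struct, and each_change
# interface are illustrative assumptions, not the gem's actual API.
class RemoveExpiredIgnoreColumns
  Change = Struct.new(:title, :changed_files, keyword_init: true)

  def initialize(ignore_rules, today: Date.today)
    @ignore_rules = ignore_rules
    @today = today
  end

  # Yield one proposed merge request per ignore_column rule whose
  # removal date has passed.
  def each_change
    @ignore_rules.each do |rule|
      next if rule[:remove_after] > @today

      yield Change.new(
        title: "Remove ignore_column for #{rule[:column]}",
        changed_files: [rule[:file]]
      )
    end
  end
end

keep = RemoveExpiredIgnoreColumns.new(
  [
    { column: :legacy_id, file: 'app/models/user.rb', remove_after: Date.new(2023, 1, 1) },
    { column: :new_field, file: 'app/models/project.rb', remove_after: Date.new(2024, 6, 1) }
  ],
  today: Date.new(2023, 10, 18)
)
keep.each_change { |change| puts change.title }
# Prints: Remove ignore_column for legacy_id
```

A runner (local script or scheduled job) would collect these yielded changes and open one MR per change.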
diff --git a/doc/architecture/blueprints/gitlab_steps/data.drawio.png b/doc/architecture/blueprints/gitlab_steps/data.drawio.png
index 59436093fb7..5ffe2964134 100644
--- a/doc/architecture/blueprints/gitlab_steps/data.drawio.png
+++ b/doc/architecture/blueprints/gitlab_steps/data.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/gitlab_steps/step-runner-sequence.drawio.png b/doc/architecture/blueprints/gitlab_steps/step-runner-sequence.drawio.png
index 9f6a6dcad9f..57029733b3c 100644
--- a/doc/architecture/blueprints/gitlab_steps/step-runner-sequence.drawio.png
+++ b/doc/architecture/blueprints/gitlab_steps/step-runner-sequence.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/runner_admission_controller/index.md b/doc/architecture/blueprints/runner_admission_controller/index.md
index 21dc1d53303..0a62b271901 100644
--- a/doc/architecture/blueprints/runner_admission_controller/index.md
+++ b/doc/architecture/blueprints/runner_admission_controller/index.md
@@ -140,7 +140,7 @@ Each runner has a tag identifier unique to that runner, e.g. `DiscoveryOne`, `tu
1. The `preparing` state will wait for a response from the webhook or until timeout.
1. The UI should be updated with the current status of the job prerequisites and admission
1. For jobs where the webhook times out (1 hour), their status should be set as though the admission was denied with a timeout reason. This should
-be rare in typical circumstances.
+ be rare in typical circumstances.
1. Jobs with denied admission can be retried. Retried jobs will be resent to the admission controller without tag mutations or runner filtering reset.
1. [`allow_failure`](../../../ci/yaml/index.md#allow_failure) should be updated to support jobs that fail on denied admissions, for example:
diff --git a/doc/architecture/blueprints/runner_tokens/index.md b/doc/architecture/blueprints/runner_tokens/index.md
index f2e9d624d20..c667a460f5c 100644
--- a/doc/architecture/blueprints/runner_tokens/index.md
+++ b/doc/architecture/blueprints/runner_tokens/index.md
@@ -284,11 +284,11 @@ not an issue per-se.
New records are created in 2 situations:
-- when the runner calls the `POST /api/v4/runners/verify` endpoint as part of the
-`gitlab-runner register` command, if the specified runner token is prefixed with `glrt-`.
-This allows the frontend to determine whether the user has successfully completed the registration and take an
-appropriate action;
-- when GitLab is pinged for new jobs and a record matching the `token`+`system_id` does not already exist.
+- When the runner calls the `POST /api/v4/runners/verify` endpoint as part of the
+ `gitlab-runner register` command, if the specified runner token is prefixed with `glrt-`.
+ This allows the frontend to determine whether the user has successfully completed the registration and take an
+ appropriate action;
+- When GitLab is pinged for new jobs and a record matching the `token`+`system_id` does not already exist.
Due to the time-decaying nature of the `ci_runner_machines` records, they are automatically
cleaned up 7 days after the last contact from the respective runner.
diff --git a/doc/architecture/blueprints/runway/index.md b/doc/architecture/blueprints/runway/index.md
index becb7914feb..af7f466cdc9 100644
--- a/doc/architecture/blueprints/runway/index.md
+++ b/doc/architecture/blueprints/runway/index.md
@@ -169,7 +169,7 @@ In order for runway to function, there are two JSON/YAML documents in use. They
1. The Runway Inventory Model. This covers what service projects are currently onboarded into Runway. It's located [here](https://gitlab.com/gitlab-com/gl-infra/platform/runway/provisioner/-/blob/main/inventory.json?ref_type=heads). The schema used to validate the document is located [here](https://gitlab.com/gitlab-com/gl-infra/platform/runway/runwayctl/-/blob/main/schemas/service-inventory/v1.0.0-beta/inventory.schema.json?ref_type=heads). There is no backwards compatibility guaranteed for changes to this document schema. This is because it's only used internally by the Runway team, and there is only a single document actually being used by Runway to provision/deprovision Runway services.
1. The runway Service Model. This is used by Runway users to pass through configuration needed to Runway in order to deploy their service. It's located inside their Service project, at `.runway/runway.yml`. [An example is here](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/blob/main/.runway/runway.yml?ref_type=heads). The schema used to validate the document is located [here](https://gitlab.com/gitlab-com/gl-infra/platform/runway/runwayctl/-/blob/main/schemas/service-manifest/v1.0.0-beta/manifest.schema.json?ref_type=heads). We aim to continue to make improvements and changes to the model, but all changes to the model within the same `kind/apiVersion` must be backwards compatible. In order to
-make breaking changes, a new `apiVersion` of the schema will be released. The overall goal is to copy the [Kubernetes model for making API changes](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md).
+ make breaking changes, a new `apiVersion` of the schema will be released. The overall goal is to copy the [Kubernetes model for making API changes](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md).
There are also [GitLab CI templates](https://gitlab.com/gitlab-com/gl-infra/platform/runway/ci-tasks) used by Runway users in order to automate deployments via Runway through GitLab CI. Users will be encouraged to use tools such as [Renovate bot](https://gitlab.com/gitlab-com/gl-infra/common-ci-tasks/-/blob/main/renovate-bot.md) in order to make sure the CI templates and
version of Runway they are using is up to date. The Runway team will support all released versions of Runway, with the exception of when a security issue is identified. When this happens, Runway users will be expected to update to a version of Runway that contains a fix for the issue as soon as possible (once notification is received).
diff --git a/doc/architecture/blueprints/secret_detection/index.md b/doc/architecture/blueprints/secret_detection/index.md
index fb77fffee40..3e9421539e6 100644
--- a/doc/architecture/blueprints/secret_detection/index.md
+++ b/doc/architecture/blueprints/secret_detection/index.md
@@ -29,19 +29,18 @@ job logs, and project management features such as issues, epics, and MRs.
- Support platform-wide detection of tokens to avoid secret leaks
- Prevent exposure by rejecting detected secrets
- Provide scalable means of detection without harming end user experience
+- Unified list of token patterns and masking
See [target types](#target-types) for scan target priorities.
### Non-Goals
-Initial proposal is limited to detection and alerting across platform, with rejection only
-during [preceive Git interactions and browser-based detection](#iterations).
+Phase 1 is limited to detection and alerting across the platform, with rejection only
+during [prereceive Git interactions and browser-based detection](#iterations).
Secret revocation and rotation is also beyond the scope of this new capability.
-Scanned object types beyond the scope of this MVC include:
-
-See [target types](#target-types) for scan target priorities.
+Scanned object types beyond the scope of this MVC are included within [target types](#target-types).
#### Management UI
@@ -67,7 +66,7 @@ Target object types refer to the scanning targets prioritized for detection of l
In order of priority this includes:
-1. non-binary Git blobs
+1. non-binary Git blobs under 1 megabyte
1. job logs
1. issuable creation (issues, MRs, epics)
1. issuable updates (issues, MRs, epics)
@@ -75,30 +74,60 @@ In order of priority this includes:
Targets out of scope for the initial phases include:
+- non-binary Git blobs over 1 megabyte
+- binary Git blobs
- Media types (JPEG, PDF, ...)
- Snippets
- Wikis
- Container images
+- External media (YouTube platform videos)
### Token types
-The existing Secret Detection configuration covers ~100 rules across a variety
+The existing Secret Detection configuration covers 100+ rules across a variety
of platforms. To reduce total cost of execution and likelihood of false positives
-the dedicated service targets only well-defined tokens. A well-defined token is
-defined as a token with a precise definition, most often a fixed substring prefix or
-suffix and fixed length.
+the dedicated service targets only well-defined, low-FP tokens.
Token types to identify in order of importance:
1. Well-defined GitLab tokens (including Personal Access Tokens and Pipeline Trigger Tokens)
1. Verified Partner tokens (including AWS)
-1. Remainder tokens currently included in Secret Detection CI configuration
+1. Well-defined low-FP third party tokens
+1. Remainder tokens currently included in Secret Detection analyzer configuration
-## Proposal
+A well-defined token is a token with a precise definition, most often a fixed
+substring prefix (or suffix) and fixed length.
-### Decisions
+For GitLab and partner tokens, we have good domain understanding of our own tokens
+and by collaborating with partners verified the accuracy of their provided patterns.
-- [001: Use Ruby Push Check approach within monolith](decisions/001_use_ruby_push_check_approach_within_monolith.md)
+Identifying a token as low-FP relies on user reports and dismissal reports. With delivery of
+[this data issue](https://gitlab.com/gitlab-data/product-analytics/-/issues/1225)
+we will have aggregates on FP rates, but at present this is primarily user-reported data.
+
+In order to minimize false positives, there are no plans to introduce or alert on high-entropy,
+arbitrary strings; i.e. patterns such as `3lsjkw3a22`.
+
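As a concrete illustration, a well-defined pattern pairs a fixed prefix with a fixed length, which is what keeps false positives low. A Ruby sketch (the exact character class and length here are illustrative assumptions, not the canonical GitLab token format):

```ruby
# A well-defined token: fixed `glpat-` prefix plus a fixed-length body.
GITLAB_PAT = /\bglpat-[0-9a-zA-Z_-]{20}\b/

'token: glpat-ABCDEF1234567890abcd'.match?(GITLAB_PAT) # => true
'random 3lsjkw3a22 string'.match?(GITLAB_PAT)          # => false
```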
+#### Uniformity of rule configuration
+
+Rule pattern configuration should remain centralized in the `secrets` analyzer's packaged `gitleaks.toml`
+configuration, vendored to the monolith for Phase 1, and checksum-checked to ensure it matches the
+specific release version to avoid drift. Each token can be filtered by `tags` to form both high-confidence
+and blocking groupings. For example:
+
+```ruby
+# Assumes the toml-rb gem; TOML tables parse to hashes, so tags are
+# read with ['tags'] rather than a method call.
+rules = Tomlrb.load_file('gitleaks.toml')['rules']
+prereceive_blocking_rules = rules.select do |r|
+  r['tags'].include?('gitlab_blocking_p1') &&
+    r['tags'].include?('gitlab_blocking')
+end
+```
+
+### Auditability
+
+A critical aspect of both secret detection and [suppression](#detection-suppression) is administrative visibility.
+With each phase we must include audit capabilities (events or logging) to enable event discovery.
+
+## Proposal
The first iteration of the experimental capability will feature a blocking
pre-receive hook implemented in the Rails application. This iteration
@@ -119,6 +148,10 @@ This service must be:
Platform-wide secret detection should be enabled by-default on GitLab SaaS as well
as self-managed instances.
+### Decisions
+
+- [001: Use Ruby Push Check approach within monolith](decisions/001_use_ruby_push_check_approach_within_monolith.md)
+
## Challenges
- Secure authentication to GitLab.com infrastructure
@@ -136,17 +169,15 @@ In expansion phases we must explore chunking or alternative strategies like the
## Design and implementation details
+The detection capability relies on a multiphase rollout, from an experimental component implemented directly in the monolith to a standalone service capable of scanning text blobs generically.
+
The implementation of the secret scanning service is highly dependent on the outcomes of our benchmarking
and capacity planning against both GitLab.com and our
[Reference Architectures](../../../administration/reference_architectures/index.md).
As the scanning capability must be an on-by-default component of both our SaaS and self-managed
-instances [the PoC](#iterations), the deployment characteristics must be considered to determine whether
-this is a standalone component or executed as a subprocess of the existing Sidekiq worker fleet
-(similar to the implementation of our Elasticsearch indexing service).
-
-Similarly, the scan target volume will require a robust and scalable enqueueing system to limit resource consumption.
-
-The detection capability relies on a multiphase rollout, from an experimental component implemented directly in the monolith to a standalone service capable of scanning text blobs generically.
+instances, [each iteration's](#iterations) deployment characteristics define whether
+the service will act as a standalone component or be executed as a subprocess of the Rails architecture
+(mirroring the implementation of our Elasticsearch indexing service).
See [technical discovery](https://gitlab.com/gitlab-org/gitlab/-/issues/376716)
for further background exploration.
@@ -154,14 +185,35 @@ for further background exploration.
See [this thread](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/105142#note_1194863310)
for past discussion around scaling approaches.
+### Detection engine
+
+Our current secret detection offering uses [Gitleaks](https://github.com/zricethezav/gitleaks/)
+for all secret scanning in pipeline contexts. By using its `--no-git` configuration
+we can scan arbitrary text blobs outside of a repository context and continue to
+use it for non-pipeline scanning.
+
+Changes to the detection engine are out of scope until benchmarking unveils performance concerns.
+
+For the long-term direction of GitLab Secret Detection, the scope is greater than that of the Gitleaks tool. As such, we should consider feature encapsulation to limit the Gitleaks domain to the relevant build context only.
+
+In the case of pre-receive detection, we rely on a combination of keyword/substring matches
+for pre-filtering and `re2` for regex detections. See [spike issue](https://gitlab.com/gitlab-org/gitlab/-/issues/423832) for initial benchmarks.
+
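The two-stage check can be sketched as follows; the keyword list and pattern are illustrative, and Ruby's stdlib `Regexp` stands in for `re2` here:

```ruby
# Cheap substring pre-filter first; only run the (more expensive) regex
# confirmation when a keyword is present in the blob.
KEYWORDS = %w[glpat- glrt-].freeze
PATTERNS = [/\bglpat-[0-9a-zA-Z_-]{20}\b/].freeze

def likely_contains_secret?(blob)
  return false unless KEYWORDS.any? { |keyword| blob.include?(keyword) }

  PATTERNS.any? { |pattern| blob.match?(pattern) }
end
```

Most blobs fail the substring check and never reach the regex engine, which is the point of the pre-filter.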
+Notable alternatives include high-performance regex engines such as [Hyperscan](https://github.com/intel/hyperscan) or its portable fork [Vectorscan](https://github.com/VectorCamp/vectorscan).
+These systems may be worth exploring in the future if our performance characteristics show a need to grow beyond the existing stack; however, the team's velocity in building an independently scalable and generic scanning engine was prioritized. See [ADR 001](decisions/001_use_ruby_push_check_approach_within_monolith.md) for more on the implementation language considerations.
+
+### Organization-level Controls
+
+Configuration and workflows should be oriented around [Organizations](../organization/index.md). Detection controls and governance patterns should support configuration across multiple projects and groups in a uniform way that emphasizes shared allowlists, organization-wide policies (for example, disablement of push option bypass), and auditability.
+
+Each phase documents the paradigm used as we iterate from Instance-level to Organization-level controls.
+
### Phase 1 - Ruby pushcheck pre-receive integration
The critical paths as outlined under [goals above](#goals) cover two major object
types: Git text blobs (corresponding to push events) and arbitrary text blobs. In Phase 1,
we focus entirely on Git text blobs.
-This phase will be considered "Experimental" with limited availability for customer opt-in, through instance level application settings.
-
The detection flow for push events relies on subscribing to the PreReceive hook
to scan commit data using the [PushCheck interface](https://gitlab.com/gitlab-org/gitlab/blob/3f1653f5706cd0e7bbd60ed7155010c0a32c681d/lib/gitlab/checks/push_check.rb). This `SecretScanningService`
service fetches the specified blob contents from Gitaly, scans
@@ -170,6 +222,10 @@ See [Push event detection flow](#push-event-detection-flow) for sequence.
In the case of a push detection, the commit is rejected inline and error returned to the end user.
+#### Configuration
+
+This phase will be considered "Experimental" with limited availability for customer opt-in, through instance level application settings.
+
#### High-Level Architecture
The Phase 1 architecture involves no additional components and is entirely encapsulated in the Rails application server. This provides a rapid deployment with tight integration within auth boundaries and no distribution coordination.
@@ -204,7 +260,7 @@ sidekiq .[#ff8dd1]----> postgres
@enduml
```
-#### Push event detection flow
+#### Push Event Detection Flow
```mermaid
sequenceDiagram
@@ -237,7 +293,7 @@ sequenceDiagram
The critical paths as outlined under [goals above](#goals) cover two major object
types: Git text blobs (corresponding to push events) and arbitrary text blobs. In Phase 2,
-we focus entirely on Git text blobs.
+we continue to focus on Git text blobs.
This phase emphasizes scaling the service outside of the monolith for general availability and to allow
an on-by-default behavior. The architecture is adapted to provide an isolated and independently
@@ -245,13 +301,17 @@ scalable service outside of the Rails monolith.
In the case of a push detection, the commit is rejected inline and error returned to the end user.
+#### Configuration
+
+This phase will be considered "Generally Available" and on-by-default, with disablement configuration through organization-level settings.
+
#### High-Level Architecture
The Phase 2 architecture involves extracting the secret detection logic into a standalone service
which communicates directly with both the Rails application and Gitaly. This provides a means to scale
the secret detection nodes independently, and reduce resource usage overhead on the rails application.
-Scans still runs synchronously as a (potentially) blocking pre-receive transaction.
+Scans still run synchronously as a (potentially) blocking pre-receive transaction. The blob size remains limited to 1 MB.
Note that the node count is purely illustrative, but serves to emphasize the independent scaling requirements for the scanning service.
@@ -308,7 +368,7 @@ consul .[#e76a9b]-> prsd_cluster
@enduml
```
-#### Push event detection flow
+#### Push Event Detection Flow
```mermaid
sequenceDiagram
@@ -345,7 +405,7 @@ sequenceDiagram
Rails->>User: accepted
```
-### Phase 3 - Expansion beyond pre-
+### Phase 3 - Expansion beyond pre-receive service
The detection flow for arbitrary text blobs, such as issue comments, relies on
subscribing to `Notes::PostProcessService` (or equivalent service) to enqueue
@@ -364,11 +424,15 @@ In any other case of detection, the Rails application manually creates a vulnera
using the `Vulnerabilities::ManuallyCreateService` to surface the finding in the
existing Vulnerability Management UI.
-#### Architecture
+#### Configuration
+
+This phase will be considered "Generally Available" and on-by-default, with disablement configuration through organization-level settings.
+
+#### High-Level Architecture
There is no change to the architecture defined in Phase 2; however, the individual load requirements may require scaling up the node counts for the detection service.
-#### Detection flow
+#### Push Event Detection Flow
There is no change to the push event detection flow defined in Phase 2; however, the added capability to scan
arbitrary text blobs directly from Rails allows us to emulate a pre-receive behavior for issuable creations,
@@ -403,52 +467,42 @@ sequenceDiagram
Rails->>User: rejected: secret found
```
-### Target types
+### Future Phases
-Target object types refer to the scanning targets prioritized for detection of leaked secrets.
+These are key items for delivering a feature-complete always-on experience but have not yet been prioritized into phases.
-In order of priority this includes:
+### Large blob sizes (1 MB+)
-1. non-binary Git blobs
-1. job logs
-1. issuable creation (issues, MRs, epics)
-1. issuable updates (issues, MRs, epics)
-1. issuable comments (issues, MRs, epics)
+Current phases do not include expansion of blob sizes beyond 1 MB. While this limit was chosen [to conform to RPC transfer limits for future iterations](#transfer-optimizations-for-large-git-data-blobs), we should expand support to larger blob sizes. This can be achieved in two ways:
-Targets out of scope for the initial phases include:
+1. *Post-receive processing*
-- Media types (JPEG, PDF, ...)
-- Snippets
-- Wikis
-- Container images
+   Accept blobs in a non-blocking fashion, process scanning as a background job, and alert passively on detection of a given secret.
-### Token types
+1. *Improvements to scanning logic batching*
-The existing Secret Detection configuration covers ~100 rules across a variety
-of platforms. To reduce total cost of execution and likelihood of false positives
-the dedicated service targets only well-defined tokens. A well-defined token is
-defined as a token with a precise definition, most often a fixed substring prefix or
-suffix and fixed length.
+   Maintaining the 1 MB constraint is primarily future-proofing to match an expected transport protocol. This can be mitigated by using a separate transport (HTTP, reads from disk, ...) or by slicing blobs into smaller chunks.
-Token types to identify in order of importance:
+### Detection Suppression
-1. Well-defined GitLab tokens (including Personal Access Tokens and Pipeline Trigger Tokens)
-1. Verified Partner tokens (including AWS)
-1. Remainder tokens included in Secret Detection CI configuration
+Suppression of detection and action on leaked secrets will be supported at several levels.
-### Detection engine
+1. *Global suppression* - If a secret is highly likely to be a false token (for example, `EXAMPLE`) it should be suppressed in workflow contexts where the user would be seriously inconvenienced.
-Our current secret detection offering uses [Gitleaks](https://github.com/zricethezav/gitleaks/)
-for all secret scanning in pipeline contexts. By using its `--no-git` configuration
-we can scan arbitrary text blobs outside of a repository context and continue to
-use it for non-pipeline scanning.
+ We should still provide some means of triaging these results, whether via [audit events](#auditability) or as [automatic vulnerability resolution](../../../user/application_security/sast/index.md#automatic-vulnerability-resolution).
-In the case of pre-receive detection, we rely on a combination of keyword/substring matches
-for pre-filtering and `re2` for regex detections. See [spike issue](https://gitlab.com/gitlab-org/gitlab/-/issues/423832) for initial benchmarks
+1. *Organization suppression* - If a secret matches an organization's allowlist (or was previously flagged and remediated as irrelevant) it should not reoccur. See [Organization-level controls](#organization-level-controls).
-Changes to the detection engine are out of scope until benchmarking unveils performance concerns.
+1. *Inline suppression* - Inline annotations should be supported in later phases with the Organization-level configuration to ignore annotations.
-Notable alternatives include high-performance regex engines such as [Hyperscan](https://github.com/intel/hyperscan) or it's portable fork [Vectorscan](https://github.com/VectorCamp/vectorscan).
+### External Token Verification
+
+As a post-processing step for detection we should explore verification of detected secrets. This requires processors per supported token type in which we can distinguish tokens that are valid leaks from false positives. Similar to our [automatic response to leaked secrets](../../../user/application_security/secret_detection/automatic_response.md), we must externally verify a given token to give a high degree of confidence in our alerting.
+
+There are two token types, internal and external:
+
+- Internal tokens are verifiable and revocable as part of the `ScanSecurityReportSecretsWorker` worker
+- External tokens require external verification, in which [the architecture](../../../user/application_security/secret_detection/automatic_response.md#high-level-architecture) will closely match the [Secret Revocation Service](https://gitlab.com/gitlab-com/gl-security/engineering-and-research/automation-team/secret-revocation-service/)
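To make the distinction concrete, here is a minimal sketch (purely illustrative; the token types, checks, and function names are hypothetical and not from the GitLab codebase) of dispatching detected secrets to a verification processor per supported token type:

```javascript
// Hypothetical sketch of per-token-type verification dispatch. Internal tokens
// could be checked against GitLab's own data; external tokens would need a call
// to the third-party provider's API. All names and checks here are placeholders.
const verifiers = {
  // Internal token type: we own the backing store, so validity is checked directly.
  gitlab_pat: (token) => token.startsWith('glpat-'), // placeholder prefix check
  // External token type: stubbed out; a real processor would call the provider.
  external_api_key: () => false,
};

function verifyFinding(finding) {
  const verify = verifiers[finding.type];
  // Unknown token types are still reported, but left unverified (verified: null).
  const verified = verify ? verify(finding.token) : null;
  return { ...finding, verified };
}
```

A processor returning `true` would raise the finding's confidence; token types without a processor fall back to unverified findings rather than being dropped.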
## Iterations
@@ -459,14 +513,14 @@ Notable alternatives include high-performance regex engines such as [Hyperscan](
- [Pre-Production Performance Profiling for pre-receive PoCs](https://gitlab.com/gitlab-org/gitlab/-/issues/428499)
- Profiling service capabilities
- ✓ [Benchmarking regex performance between Ruby and Go approaches](https://gitlab.com/gitlab-org/gitlab/-/issues/423832)
- - gRPC commit retrieval from Gitaly
- transfer latency, CPU, and memory footprint
-- Implementation of secret scanning service MVC (targeting individual commits)
+- ✓ Implementation of secret scanning gem integration MVC (targeting individual commits)
+- Phase 1 - Deployment and monitoring
- Capacity planning for addition of service component to Reference Architectures headroom
- Security and readiness review
-- Deployment and monitoring
-- Implementation of secret scanning service MVC (targeting arbitrary text blobs)
-- Deployment and monitoring
+- Phase 2 - Deployment and monitoring
+- Implementation of secret scanning service (targeting arbitrary text blobs)
+- Phase 3 - Deployment and monitoring
- High priority domain object rollout (priority `TBD`)
- Issuable comments
- Issuable bodies
diff --git a/doc/architecture/blueprints/secret_manager/index.md b/doc/architecture/blueprints/secret_manager/index.md
index ac30f3399d8..3a538f58dde 100644
--- a/doc/architecture/blueprints/secret_manager/index.md
+++ b/doc/architecture/blueprints/secret_manager/index.md
@@ -114,10 +114,10 @@ the data keys mentioned above.
### Further investigations required
1. Management of identities stored in GCP Key Management.
-We need to investigate how we can correlate and de-multiplex GitLab identities into
-GCP identities that are used to allow access to cryptographic operations on GCP Key Management.
+ We need to investigate how we can correlate and de-multiplex GitLab identities into
+ GCP identities that are used to allow access to cryptographic operations on GCP Key Management.
1. Authentication of clients. Clients to the Secrets Manager could be GitLab Runner or external clients.
-For each of these, we need a secure and reliable method to authenticate requests to decrypt a secret.
+ For each of these, we need a secure and reliable method to authenticate requests to decrypt a secret.
1. Assignment of GCP backed private keys to each identity.
### Availability on SaaS and Self-Managed
diff --git a/doc/architecture/blueprints/tailwindcss/index.md b/doc/architecture/blueprints/tailwindcss/index.md
new file mode 100644
index 00000000000..0409f802038
--- /dev/null
+++ b/doc/architecture/blueprints/tailwindcss/index.md
@@ -0,0 +1,172 @@
+---
+status: proposed
+creation-date: "2023-12-21"
+authors: [ "@peterhegman", "@svedova", "@pgascouvaillancourt" ]
+approvers: [ "@samdbeckham" ]
+owning-stage: "~devops::manage"
+participating-stages: []
+---
+
+<!-- Blueprints often contain forward-looking statements -->
+<!-- vale gitlab.FutureTense = NO -->
+
+# Delegating CSS utility classes generation to Tailwind CSS
+
+## Summary
+
+Styling elements in GitLab primarily relies on CSS utility classes. Those are classes that
+generally define a single CSS property and that can be applied additively to change an element's look.
+We have developed our own tooling in the [GitLab UI](https://gitlab.com/gitlab-org/gitlab-ui) project
+to generate the utils we need, but our approach has demonstrated a number of flaws that can be
+circumvented by delegating that task to the [Tailwind CSS](https://tailwindcss.com/) framework.
+
+This initiative requires that we deprecate existing utilities so that Tailwind CSS can replace them.
+
+## Motivation
+
+In June 2019, we consolidated our usage of CSS utility classes through [RFC#4](https://gitlab.com/gitlab-org/frontend/rfcs/-/issues/4)
+which introduced the concept of silent classes, where utilities would be generated from a collection
+of manually defined SCSS mixins.
+
+This has served us well, but came with some caveats:
+
+- **Increased development overhead:** whenever a new utility is needed, it has to be manually added
+ to the [GitLab UI](https://gitlab.com/gitlab-org/gitlab-ui) project. One then needs to wait on a
+ new version of `@gitlab/ui` to be released and installed in the consumer project.
+- **Inconsistencies:** Without any tooling in place to check how utilities are named, we have seen
+  many inconsistencies make their way into the library, making it quite unpredictable. The most striking
+  example of this was the introduction of desktop-first utilities among a majority of mobile-first
+  utils, without any way of distinguishing the former from the latter other than looking at the source.
+- **Disconnection between the utilities library and its consumers:** When a utility is added to the
+ library, it is made available to _any_ project that uses `@gitlab/ui`. As a result, some utils are
+ included in projects that don't need them. Conversely, if all consumers stop using a given util,
+ it could potentially be removed to decrease the CSS bundle size, but we have no visibility over this.
+- **Limited autocompletion:** Although it's possible to configure autocomplete for the existing
+ library, it is restricted to the utilities bundle. In contrast, Tailwind CSS autocomplete aligns
+ with an on-demand approach, ensuring that all utilities are readily available. Additionally, IDE
+ extensions can enhance understanding by revealing the values applied by a specific utility.
+
+As part of this architectural change, we are alleviating these issues by dropping our custom built
+solution for generating CSS utils, and delegating this task to [Tailwind CSS](https://tailwindcss.com/).
+
+It is worth noting that this was previously debated in [RFC#107](https://gitlab.com/gitlab-org/frontend/rfcs/-/issues/107).
+The RFC was well received. The few concerns that were raised were about the CSS utility approach as
+a whole, not the way we implemented it. This initiative's purpose _is not_ to question our reliance
+on utility classes, but to consolidate its implementation to improve engineers' efficiency when working
+with CSS utils.
+
+### Why Tailwind CSS?
+
+Here are a few reasons that led us to choosing Tailwind CSS over similar tools:
+
+- It is a long-standing project that has been battle-tested in many production apps and has a
+ healthy community around it.
+- Tailwind CSS is well maintained and keeps evolving without getting bloated.
+- It integrates well with all of our tech stacks:
+ - Ruby on Rails projects can leverage the [`tailwindcss-rails` Gem](https://tailwindcss.com/docs/guides/ruby-on-rails).
+  - Nuxt apps can set up the [`tailwindcss` module](https://nuxt.com/modules/tailwindcss).
+ - More generic frontend stacks can use the [`tailwindcss` Node module](https://tailwindcss.com/docs/installation).
+
+### Goals
+
+This blueprint's goal is to improve the developer experience (DX) when working with CSS utility classes.
+As a result of this initiative, frontend engineers' efficiency should be increased thanks to a much
+lower development overhead.
+
+### Non-Goals
+
+As stated in the motivations above, this focuses on improving an existing architectural decision,
+not on replacing it with a new design. This initiative therefore:
+
+- _Is not_ aimed at revisiting the way we write CSS or how we apply styles within our projects.
+- _Does not_ focus on user-facing improvements. This change is mostly a developer experience enhancement.
+ The resulting increase in efficiency could certainly indirectly improve user experience, but that
+ is not our primary intent.
+
+## Proposal
+
+We will be setting up Tailwind CSS in GitLab UI _and_ GitLab. The intent is to keep the main
+Tailwind CSS configuration in GitLab UI, where we'll maintain the Pajamas-compliant
+configuration properties (color, spacing scale, and so on). The Tailwind CSS setup in GitLab will inherit from
+GitLab UI's setup. The subtlety here is that, in GitLab, we will be scanning both the GitLab codebase
+and the `@gitlab/ui` Node module. This ensures that GitLab UI no longer needs to expose any CSS
+utilities, while the ones it relies on are still generated in GitLab. A similar setup will
+need to be introduced in other projects that use CSS utilities and need to upgrade to the Tailwind
+CSS-based version.
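As a rough illustration of that inheritance, the GitLab-side configuration could look like the following sketch. The preset path and content globs are assumptions for illustration, not the actual file:

```javascript
// Hypothetical sketch of a GitLab tailwind.config.js that inherits from
// GitLab UI. The preset path and the content globs below are assumptions.
const config = {
  presets: [
    // require('@gitlab/ui/tailwind.defaults') -- hypothetical shared preset
    // carrying the Pajamas-compliant theme (colors, spacing scale, ...).
  ],
  // Scan both the GitLab codebase and the installed @gitlab/ui package, so
  // utilities used only by GitLab UI components are still generated here.
  content: [
    './app/assets/javascripts/**/*.{vue,js}',
    './node_modules/@gitlab/ui/dist/**/*.{js,mjs}',
  ],
};

module.exports = config;
```

Other projects adopting the Tailwind CSS-based version would reuse the same preset and add their own `content` globs.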
+
+### Pros
+
+- We are removing the cumbersome workflow for adding new utilities. One should be able to use any
+ utility right away without contributing to another project and waiting through the release cycle.
+- We are introducing a predictable library, where the naming is decided upon in the overarching
+ Tailwind CSS project. As engineers know, naming things is difficult, and it's best that we defer
+ this to a well-established project.
+- Engineers should be able to refer to Tailwind CSS documentation to know what utils are available
+ and how to use them. No need to read through GitLab UI's source code anymore.
+- Because Tailwind CSS generates the required utils by scanning the consumer's codebase, we'll be
+ sure to only generate the utilities we actually need, keeping CSS bundle sizes under control. This
+ must be taken with a grain of salt though: Tailwind CSS is extremely flexible and makes it possible
+ to generate all sorts of utils, sometimes with developer-defined values, which could result in
+ large utils bundles depending on how we'll adopt Tailwind CSS' features.
+- We'll benefit from a robust IDE integration providing auto-completion and previews for the utils
+ we support.
+
+### Cons
+
+- More setup: each project that requires CSS utils would need to have Tailwind CSS set up,
+ which might be more or less tedious depending on the environment.
+- One more dev dependency in each project.
+- Inability to use string interpolation to build class names dynamically (Tailwind CSS
+ needs to see the full names to generate the required classes).
+- A migration is required: we'll need to ensure usages of the existing CSS utilities library
+  don't break, which implies a deprecation/migration process.
+
+## Design and implementation details
+
+To prevent breakages, we are taking an iterative approach to moving away from the current
+library. The proposed path here is purposefully rough around the edges. We acknowledge that it's
+not a one-size-fits-all solution and that we might need to adjust to some cases along the way.
+
+Here's the basic process:
+
+1. Deprecate a collection of utility mixins in GitLab UI. This entails replacing the `gl-` prefix
+ with `gl-deprecated-` in the mixin's name, and updating all usages in both GitLab UI _and_ GitLab
+ accordingly. We will typically focus on a single mixins file at a time, though we might want to
+ deprecate several files at once if they are small enough. Conversely, some files might be too big
+ to be deprecated in one go and would require several iterations.
+1. Enable the corresponding [Tailwind CSS core plugins](https://tailwindcss.com/docs/configuration#core-plugins) so that we can immediately start using the
+ newer utilities.
+1. Migrate deprecated utilities to their Tailwind CSS equivalents.
+
+```mermaid
+flowchart TD
+ RequiresDeprecation(Is the mixins collection widely used in GitLab?)
+  DeprecateMixins["Mark the mixins as deprecated with the gl-deprecated- prefix"]
+
+ HasTailwindEq(Does Tailwind CSS have equivalents?)
+ EnableCorePlugin["Enable the corresponding Tailwind CSS core plugin(s)"]
+ WriteCustomUtil[Write a custom Tailwind CSS utility]
+
+ MigrateUtils[Migrate legacy utils to Tailwind CSS]
+
+ RequiresDeprecation -- Yes --> DeprecateMixins
+ DeprecateMixins --> HasTailwindEq
+ RequiresDeprecation -- No --> HasTailwindEq
+ HasTailwindEq -- Yes --> EnableCorePlugin
+ HasTailwindEq -- No --> WriteCustomUtil
+ EnableCorePlugin --> MigrateUtils
+ WriteCustomUtil --> MigrateUtils
+```
+
+The deprecation step gives us some margin to evaluate each migration without risking breakages in
+production. It does have some drawbacks:
+
+- We might cause merge conflicts for others as we will be touching several areas of the product in
+  our deprecation MRs. We will make sure to communicate these changes clearly to avoid confusion.
+  We will also use our best judgement to split MRs when we feel like their scope gets
+  too large.
+- Deprecation MRs might require approval from several departments, which is another reason to
+ be transparent and iterative throughout the process.
+- We are purposefully introducing technical debt which we are committed to paying in a reasonable time frame.
+ We acknowledge that the actual duration of this initiative may be affected by a number of factors (uncovering
+ edge-cases, DRIs' capacity, department-wide involvement, etc.), but we expect to have it completed in 6-12 months.