gitlab.com/gitlab-org/gitlab-foss.git
author     GitLab Bot <gitlab-bot@gitlab.com>  2023-08-18 13:50:51 +0300
committer  GitLab Bot <gitlab-bot@gitlab.com>  2023-08-18 13:50:51 +0300
commit     db384e6b19af03b4c3c82a5760d83a3fd79f7982 (patch)
tree       34beaef37df5f47ccbcf5729d7583aae093cffa0 /doc/architecture
parent     54fd7b1bad233e3944434da91d257fa7f63c3996 (diff)

Add latest changes from gitlab-org/gitlab@16-3-stable-ee (v16.3.0-rc42)
Diffstat (limited to 'doc/architecture')
-rw-r--r--  doc/architecture/blueprints/ai_gateway/img/architecture.png | bin 378194 -> 142929 bytes
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-admin-area.md | 58
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-backups.md | 29
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-ci-runners.md | 161
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-container-registry.md | 72
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-contributions-forks.md | 127
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-data-migration.md | 100
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-database-sequences.md | 67
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-explore.md | 71
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-git-access.md | 38
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-global-search.md | 23
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-graphql.md | 28
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-organizations.md | 45
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-personal-access-tokens.md (renamed from doc/architecture/blueprints/cells/cells-feature-dashboard.md) | 9
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-router-endpoints-classification.md | 21
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-schema-changes.md | 36
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-secrets.md | 26
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-snippets.md | 28
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-user-profile.md | 52
-rw-r--r--  doc/architecture/blueprints/cells/cells-feature-your-work.md | 58
-rw-r--r--  doc/architecture/blueprints/cells/diagrams/cells-and-fulfillment.drawio.png | bin 0 -> 192221 bytes
-rw-r--r--  doc/architecture/blueprints/cells/diagrams/index.md | 35
-rw-r--r--  doc/architecture/blueprints/cells/diagrams/term-cell.drawio.png | bin 0 -> 93379 bytes
-rw-r--r--  doc/architecture/blueprints/cells/diagrams/term-cluster.drawio.png | bin 0 -> 436724 bytes
-rw-r--r--  doc/architecture/blueprints/cells/diagrams/term-organization.drawio.png | bin 0 -> 169719 bytes
-rw-r--r--  doc/architecture/blueprints/cells/diagrams/term-top-level-group.drawio.png | bin 0 -> 65137 bytes
-rw-r--r--  doc/architecture/blueprints/cells/glossary.md | 8
-rw-r--r--  doc/architecture/blueprints/cells/goals.md | 6
-rw-r--r--  doc/architecture/blueprints/cells/images/pods-and-fulfillment.png | bin 20899 -> 0 bytes
-rw-r--r--  doc/architecture/blueprints/cells/images/term-cell.png | bin 26613 -> 0 bytes
-rw-r--r--  doc/architecture/blueprints/cells/images/term-cluster.png | bin 91814 -> 0 bytes
-rw-r--r--  doc/architecture/blueprints/cells/images/term-organization.png | bin 29527 -> 0 bytes
-rw-r--r--  doc/architecture/blueprints/cells/images/term-top-level-group.png | bin 15122 -> 0 bytes
-rw-r--r--  doc/architecture/blueprints/cells/impact.md | 2
-rw-r--r--  doc/architecture/blueprints/cells/index.md | 265
-rw-r--r--  doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md | 132
-rw-r--r--  doc/architecture/blueprints/ci_pipeline_processing/index.md | 448
-rw-r--r--  doc/architecture/blueprints/container_registry_metadata_database/index.md | 2
-rw-r--r--  doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md | 2
-rw-r--r--  doc/architecture/blueprints/git_data_offloading/index.md | 221
-rw-r--r--  doc/architecture/blueprints/gitaly_adaptive_concurrency_limit/adaptive_concurrency_limit_flow.png | bin 0 -> 129675 bytes
-rw-r--r--  doc/architecture/blueprints/gitaly_adaptive_concurrency_limit/index.md | 372
-rw-r--r--  doc/architecture/blueprints/gitlab_ci_events/index.md | 32
-rw-r--r--  doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md | 2
-rw-r--r--  doc/architecture/blueprints/modular_monolith/proof_of_concepts.md | 2
-rw-r--r--  doc/architecture/blueprints/observability_tracing/index.md | 2
-rw-r--r--  doc/architecture/blueprints/organization/index.md | 128
-rw-r--r--  doc/architecture/blueprints/rate_limiting/index.md | 18
-rw-r--r--  doc/architecture/blueprints/remote_development/index.md | 34
-rw-r--r--  doc/architecture/blueprints/runner_admission_controller/index.md | 2
-rw-r--r--  doc/architecture/blueprints/ssh_certificates/index.md | 211
51 files changed, 2278 insertions, 695 deletions
diff --git a/doc/architecture/blueprints/ai_gateway/img/architecture.png b/doc/architecture/blueprints/ai_gateway/img/architecture.png
index dea8b5ddb45..e63b4ba45d1 100644
--- a/doc/architecture/blueprints/ai_gateway/img/architecture.png
+++ b/doc/architecture/blueprints/ai_gateway/img/architecture.png
Binary files differ
diff --git a/doc/architecture/blueprints/cells/cells-feature-admin-area.md b/doc/architecture/blueprints/cells/cells-feature-admin-area.md
index 31d5388d40b..a9cd170b2a7 100644
--- a/doc/architecture/blueprints/cells/cells-feature-admin-area.md
+++ b/doc/architecture/blueprints/cells/cells-feature-admin-area.md
@@ -15,21 +15,16 @@ we can document the reasons for not choosing this approach.
# Cells: Admin Area
-In our Cells architecture proposal we plan to share all admin related tables in
-GitLab. This allows simpler management of all Cells in one interface and reduces
-the risk of settings diverging in different Cells. This introduces challenges
-with admin pages that allow you to manage data that will be spread across all
-Cells.
+In our Cells architecture proposal we plan to share all admin related tables in GitLab.
+This allows for simpler management of all Cells in one interface and reduces the risk of settings diverging in different Cells.
+This introduces challenges with Admin Area pages that allow you to manage data that will be spread across all Cells.
## 1. Definition
-There are consequences for admin pages that contain data that spans "the whole
-instance" as the Admin pages may be served by any Cell or possibly just 1 cell.
-There are already many parts of the Admin interface that will have data that
-spans many cells. For example lists of all Groups, Projects, Topics, Jobs,
-Analytics, Applications and more. There are also administrative monitoring
-capabilities in the Admin page that will span many cells such as the "Background
-Jobs" and "Background Migrations" pages.
+There are consequences for Admin Area pages that contain data that span "the whole instance" as the Admin Area pages may be served by any Cell or possibly just one Cell.
+There are already many parts of the Admin Area that will have data that span many Cells.
+For example lists of all Groups, Projects, Topics, Jobs, Analytics, Applications and more.
+There are also administrative monitoring capabilities in the Admin Area that will span many Cells such as the "Background Jobs" and "Background Migrations" pages.
## 2. Data flow
@@ -38,20 +33,47 @@ Jobs" and "Background Migrations" pages.
We will need to decide how to handle these exceptions with a few possible
options:
-1. Move all these pages out into a dedicated per-cell Admin section. Probably
+1. Move all these pages out into a dedicated per-Cell admin section. Probably
the URL will need to be routable to a single Cell like `/cells/<cell_id>/admin`,
- then we can display this data per Cell. These pages will be distinct from
- other Admin pages which control settings that are shared across all Cells. We
+ then we can display these data per Cell. These pages will be distinct from
+ other Admin Area pages which control settings that are shared across all Cells. We
will also need to consider how this impacts self-managed customers and
- whether, or not, this should be visible for single-cell instances of GitLab.
+   whether or not this should be visible for single-Cell instances of GitLab.
1. Build some aggregation interfaces for this data so that it can be fetched
from all Cells and presented in a single UI. This may be beneficial to an
administrator that needs to see and filter all data at a glance, especially
when they don't know which Cell the data is on. The downside, however, is
- that building this kind of aggregation is very tricky when all the Cells are
- designed to be totally independent, and it does also enforce more strict
+ that building this kind of aggregation is very tricky when all Cells are
+ designed to be totally independent, and it does also enforce stricter
requirements on compatibility between Cells.
+The following overview describes at what level each feature contained in the current Admin Area will be managed:
+
+| Feature | Cluster | Cell | Organization |
+| --- | --- | --- | --- |
+| Abuse reports | | | |
+| Analytics | | | |
+| Applications | | | |
+| Deploy keys | | | |
+| Labels | | | |
+| Messages | ✓ | | |
+| Monitoring | | ✓ | |
+| Subscription | | | |
+| System hooks | | | |
+| Overview | | | |
+| Settings - General | ✓ | | |
+| Settings - Integrations | ✓ | | |
+| Settings - Repository | ✓ | | |
+| Settings - CI/CD (1) | ✓ | ✓ | |
+| Settings - Reporting | ✓ | | |
+| Settings - Metrics | ✓ | | |
+| Settings - Service usage data | | ✓ | |
+| Settings - Network | ✓ | | |
+| Settings - Appearance | ✓ | | |
+| Settings - Preferences | ✓ | | |
+
+(1) Depending on the specific setting, some will be managed at the cluster-level, and some at the Cell-level.
+
## 4. Evaluation
## 4.1. Pros
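
For illustration, proposal option 1 above routes per-Cell Admin Area pages under a Cell-scoped path such as `/cells/<cell_id>/admin`. A minimal sketch of how a stateless router could resolve that prefix is shown below; the `cellBackends` map, hostnames, and ports are hypothetical assumptions, not part of this blueprint.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// Hypothetical mapping of cell_id -> backend base URL. In a real deployment
// this would come from cluster-wide routing data, not a hard-coded map.
var cellBackends = map[string]string{
	"1":   "http://cell-1.internal:8080",
	"100": "http://cell-100.internal:8080",
}

// routeCellAdmin proxies /cells/<cell_id>/admin/... to the owning Cell.
func routeCellAdmin(w http.ResponseWriter, r *http.Request) {
	parts := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/cells/"), "/", 2)
	backend, ok := cellBackends[parts[0]]
	if !ok {
		http.NotFound(w, r)
		return
	}
	target, err := url.Parse(backend)
	if err != nil {
		http.Error(w, "bad backend", http.StatusBadGateway)
		return
	}
	// Forward the request unchanged; the Cell serves its own Admin Area pages.
	httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
}

func main() {
	http.HandleFunc("/cells/", routeCellAdmin)
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```
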
diff --git a/doc/architecture/blueprints/cells/cells-feature-backups.md b/doc/architecture/blueprints/cells/cells-feature-backups.md
index b5d5d7afdcf..3d20d6e2caa 100644
--- a/doc/architecture/blueprints/cells/cells-feature-backups.md
+++ b/doc/architecture/blueprints/cells/cells-feature-backups.md
@@ -15,47 +15,38 @@ we can document the reasons for not choosing this approach.
# Cells: Backups
-Each cells will take its own backups, and consequently have its own isolated
-backup / restore procedure.
+Each Cell will take its own backups, and consequently have its own isolated backup/restore procedure.
## 1. Definition
-GitLab Backup takes a backup of the PostgreSQL database used by the application,
-and also Git repository data.
+GitLab backup takes a backup of the PostgreSQL database used by the application, and also Git repository data.
## 2. Data flow
-Each cell has a number of application databases to back up (for example, `main`, and `ci`).
-
-Additionally, there may be cluster-wide metadata tables (for example, `users` table)
-which is directly accessible via PostgreSQL.
+Each Cell has a number of application databases to back up (for example, `main`, and `ci`).
+Additionally, there may be cluster-wide metadata tables (for example, the `users` table) which are directly accessible via PostgreSQL.
## 3. Proposal
### 3.1. Cluster-wide metadata
-It is currently unknown how cluster-wide metadata tables will be accessible. We
-may choose to have cluster-wide metadata tables backed up separately, or have
-each cell back up its copy of cluster-wide metdata tables.
+It is currently unknown how cluster-wide metadata tables will be accessible.
+We may choose to have cluster-wide metadata tables backed up separately, or have each Cell back up its copy of cluster-wide metadata tables.
### 3.2 Consistency
#### 3.2.1 Take backups independently
-As each cell will communicate with each other via API, and there will be no joins
-to the users table, it should be acceptable for each cell to take a backup
-independently of each other.
+As Cells will communicate with each other via API, and there will be no joins to the `users` table, it should be acceptable for each Cell to take a backup independently of the others.
#### 3.2.2 Enforce snapshots
-We can require that each cell take a snapshot for the PostgreSQL databases at
-around the same time to allow for a consistent-enough backup.
+We can require that each Cell take a snapshot for the PostgreSQL databases at around the same time to allow for a consistent enough backup.
## 4. Evaluation
-As the number of cells increases, it will likely not be feasible to take a
-snapshot at the same time for all cells. Hence taking backups independently is
-the better option.
+As the number of Cells increases, it will likely not be feasible to take a snapshot at the same time for all Cells.
+Hence taking backups independently is the better option.
## 4.1. Pros
diff --git a/doc/architecture/blueprints/cells/cells-feature-ci-runners.md b/doc/architecture/blueprints/cells/cells-feature-ci-runners.md
index 8a6790ae49f..4e7cea5bfd5 100644
--- a/doc/architecture/blueprints/cells/cells-feature-ci-runners.md
+++ b/doc/architecture/blueprints/cells/cells-feature-ci-runners.md
@@ -15,156 +15,129 @@ we can document the reasons for not choosing this approach.
# Cells: CI Runners
-GitLab in order to execute CI jobs [GitLab Runner](https://gitlab.com/gitlab-org/gitlab-runner/),
-very often managed by customer in their infrastructure.
-
-All CI jobs created as part of CI pipeline are run in a context of project
-it poses a challenge how to manage GitLab Runners.
+GitLab executes CI jobs via [GitLab Runner](https://gitlab.com/gitlab-org/gitlab-runner/), very often managed by customers in their infrastructure.
+All CI jobs created as part of the CI pipeline are run in the context of a Project.
+This poses a challenge of how to manage GitLab Runners.
## 1. Definition
There are 3 different types of runners:
-- instance-wide: runners that are registered globally with specific tags (selection criteria)
-- group runners: runners that execute jobs from a given top-level group or subprojects of that group
-- project runners: runners that execute jobs from projects or many projects: some runners might
- have projects assigned from projects in different top-level groups.
+- Instance-wide: Runners that are registered globally with specific tags (selection criteria)
+- Group runners: Runners that execute jobs from a given top-level Group or Projects in that Group
+- Project runners: Runners that execute jobs from one Project or many Projects: some runners might
+ have Projects assigned from Projects in different top-level Groups.
-This alongside with existing data structure where `ci_runners` is a table describing
-all types of runners poses a challenge how the `ci_runners` should be managed in a Cells environment.
+This, alongside the existing data structure where `ci_runners` is a table describing all types of runners, poses a challenge as to how `ci_runners` should be managed in a Cells environment.
## 2. Data flow
-GitLab Runners use a set of globally scoped endpoints to:
+GitLab runners use a set of globally scoped endpoints to:
-- registration of a new runner via registration token `https://gitlab.com/api/v4/runners`
+- Register a new runner via registration token `https://gitlab.com/api/v4/runners`
([subject for removal](../runner_tokens/index.md)) (`registration token`)
-- creation of a new runner in the context of a user `https://gitlab.com/api/v4/user/runners` (`runner token`)
-- requests jobs via an authenticated `https://gitlab.com/api/v4/jobs/request` endpoint (`runner token`)
-- upload job status via `https://gitlab.com/api/v4/jobs/:job_id` (`build token`)
-- upload trace via `https://gitlab.com/api/v4/jobs/:job_id/trace` (`build token`)
-- download and upload artifacts via `https://gitlab.com/api/v4/jobs/:job_id/artifacts` (`build token`)
+- Create a new runner in the context of a user `https://gitlab.com/api/v4/user/runners` (`runner token`)
+- Request jobs via an authenticated `https://gitlab.com/api/v4/jobs/request` endpoint (`runner token`)
+- Upload job status via `https://gitlab.com/api/v4/jobs/:job_id` (`build token`)
+- Upload trace via `https://gitlab.com/api/v4/jobs/:job_id/trace` (`build token`)
+- Download and upload artifacts via `https://gitlab.com/api/v4/jobs/:job_id/artifacts` (`build token`)
Currently three types of authentication tokens are used:
-- runner registration token ([subject for removal](../runner_tokens/index.md))
-- runner token representing an registered runner in a system with specific configuration (`tags`, `locked`, etc.)
-- build token representing an ephemeral token giving a limited access to updating a specific
- job, uploading artifacts, downloading dependent artifacts, downloading and uploading
- container registry images
+- Runner registration token ([subject for removal](../runner_tokens/index.md))
+- Runner token representing a registered runner in a system with specific configuration (`tags`, `locked`, etc.)
+- Build token representing an ephemeral token giving limited access to updating a specific job, uploading artifacts, downloading dependent artifacts, downloading and uploading container registry images
-Each of those endpoints do receive an authentication token via header (`JOB-TOKEN` for `/trace`)
-or body parameter (`token` all other endpoints).
+Each of those endpoints receives an authentication token via header (`JOB-TOKEN` for `/trace`) or body parameter (`token` for all other endpoints).
-Since the CI pipeline would be created in a context of a specific Cell it would be required
-that pick of a build would have to be processed by that particular Cell. This requires
-that build picking depending on a solution would have to be either:
+Since the CI pipeline would be created in the context of a specific Cell, it would be required that the pick of a build be processed by that particular Cell.
+This requires that build picking depending on a solution would have to be either:
-- routed to correct Cell for a first time
-- be made to be two phase: request build from global pool, claim build on a specific Cell using a Cell specific URL
+- Routed to the correct Cell for the first time
+- Be two-phased: Request build from global pool, claim build on a specific Cell using a Cell specific URL
## 3. Proposal
-This section describes various proposals. Reader should consider that those
-proposals do describe solutions for different problems. Many or some aspects
-of those proposals might be the solution to the stated problem.
-
### 3.1. Authentication tokens
-Even though the paths for CI Runners are not routable they can be made routable with
-those two possible solutions:
+Even though the paths for CI runners are not routable, they can be made routable with these two possible solutions:
- The `https://gitlab.com/api/v4/jobs/request` uses a long polling mechanism with
- a ticketing mechanism (based on `X-GitLab-Last-Update` header). Runner when first
- starts sends a request to GitLab to which GitLab responds with either a build to pick
+ a ticketing mechanism (based on `X-GitLab-Last-Update` header). When the runner first
+ starts, it sends a request to GitLab to which GitLab responds with either a build to pick
by runner. This value is completely controlled by GitLab. This allows GitLab
- to use JWT or any other means to encode `cell` identifier that could be easily
+ to use JWT or any other means to encode a `cell` identifier that could be easily
decodable by Router.
-- The majority of communication (in terms of volume) is using `build token` making it
- the easiest target to change since GitLab is sole owner of the token that Runner later
- uses for specific job. There were prior discussions about not storing `build token`
- but rather using `JWT` token with defined scopes. Such token could encode the `cell`
- to which router could easily route all requests.
+- The majority of communication (in terms of volume) is using `build token`, making it
+ the easiest target to change since GitLab is the sole owner of the token that the runner later
+ uses for a specific job. There were prior discussions about not storing the `build token`
+ but rather using a `JWT` token with defined scopes. Such a token could encode the `cell`
+ to which the Router could route all requests.
### 3.2. Request body
-- The most of used endpoints pass authentication token in request body. It might be desired
- to use HTTP Headers as an easier way to access this information by Router without
+- The most used endpoints pass the authentication token in the request body. It might be desired
+ to use HTTP headers as an easier way to access this information by Router without
a need to proxy requests.
-### 3.3. Instance-wide are Cell local
+### 3.3. Instance-wide are Cell-local
We can pick a design where all runners are always registered and local to a given Cell:
-- Each Cell has it's own set of instance-wide runners that are updated at it's own pace
-- The project runners can only be linked to projects from the same organization
- creating strong isolation.
+- Each Cell has its own set of instance-wide runners that are updated at its own pace
+- The Project runners can only be linked to Projects from the same Organization, creating strong isolation.
- In this model the `ci_runners` table is local to the Cell.
-- In this model we would require the above endpoints to be scoped to a Cell in some way
- or made routable. It might be via prefixing them, adding additional Cell parameter,
- or providing much more robust way to decode runner token and match it to Cell.
-- If routable token is used, we could move away from cryptographic random stored in
- database to rather prefer to use JWT tokens that would encode
-- The Admin Area showing registered Runners would have to be scoped to a Cell
-
-This model might be desired since it provides strong isolation guarantees.
-This model does significantly increase maintenance overhead since each Cell is managed
-separately.
+- In this model we would require the above endpoints to be scoped to a Cell in some way, or be made routable. It might be via prefixing them, adding additional Cell parameters, or providing much more robust ways to decode runner tokens and match it to a Cell.
+- If a routable token is used, we could move away from a cryptographic random value stored in the database and instead use JWT tokens.
+- The Admin Area showing registered runners would have to be scoped to a Cell.
-This model may require adjustments to runner tags feature so that projects have consistent runner experience across cells.
+This model might be desired because it provides strong isolation guarantees.
+This model does significantly increase maintenance overhead because each Cell is managed separately.
+This model may require adjustments to the runner tags feature so that Projects have a consistent runner experience across Cells.
### 3.4. Instance-wide are cluster-wide
-Contrary to proposal where all runners are Cell local, we can consider that runners
+Contrary to the proposal where all runners are Cell-local, we can consider that runners
are global, or just instance-wide runners are global.
-However, this requires significant overhaul of system and to change the following aspects:
+However, this requires significant overhaul of the system and we would have to change the following aspects:
-- `ci_runners` table would likely have to be split decomposed into `ci_instance_runners`, ...
-- all interfaces would have to be adopted to use correct table
-- build queuing would have to be reworked to be two phase where each Cell would know of all pending
- and running builds, but the actual claim of a build would happen against a Cell containing data
-- likely `ci_pending_builds` and `ci_running_builds` would have to be made `cluster-wide` tables
- increasing likelihood of creating hotspots in a system related to CI queueing
+- The `ci_runners` table would likely have to be decomposed into `ci_instance_runners`, ...
+- All interfaces would have to be adapted to use the correct table.
+- Build queuing would have to be reworked to be two-phased where each Cell would know of all pending and running builds, but the actual claim of a build would happen against a Cell containing data.
+- It is likely that `ci_pending_builds` and `ci_running_builds` would have to be made `cluster-wide` tables, increasing the likelihood of creating hotspots in a system related to CI queueing.
-This model makes it complex to implement from engineering side. Does make some data being shared
-between Cells. Creates hotspots / scalability issues in a system (ex. during abuse) that
-might impact experience of organizations on other Cells.
+This model is complex to implement from an engineering perspective.
+Some data are shared between Cells.
+It creates hotspots/scalability issues in a system that might impact the experience of Organizations on other Cells, for instance during abuse.
### 3.5. GitLab CI Daemon
-Another potential solution to explore is to have a dedicated service responsible for builds queueing
-owning it's database and working in a model of either sharded or celled service. There were prior
-discussions about [CI/CD Daemon](https://gitlab.com/gitlab-org/gitlab/-/issues/19435).
+Another potential solution to explore is to have a dedicated service responsible for builds queueing, owning its database and working in a model of either sharded or Cell-ed service.
+There were prior discussions about [CI/CD Daemon](https://gitlab.com/gitlab-org/gitlab/-/issues/19435).
-If the service would be sharded:
+If the service is sharded:
-- depending on a model if runners are cluster-wide or cell-local this service would have to fetch
- data from all Cells
-- if the sharded service would be used we could adapt a model of either sharing database containing
- `ci_pending_builds/ci_running_builds` with the service
-- if the sharded service would be used we could consider a push model where each Cell pushes to CI/CD Daemon
- builds that should be picked by Runner
-- the sharded service would be aware which Cell is responsible for processing the given build and could
- route processing requests to designated Cell
+- Depending on the model, if runners are cluster-wide or Cell-local, this service would have to fetch data from all Cells.
+- If the sharded service is used, we could adopt a model of sharing a database containing `ci_pending_builds/ci_running_builds` with the service.
+- If the sharded service is used, we could consider a push model where each Cell pushes builds that should be picked by a runner to the CI/CD Daemon.
+- The sharded service would be aware which Cell is responsible for processing the given build and could route processing requests to the designated Cell.
-If the service would be celled:
+If the service is Cell-ed:
-- all expectations of routable endpoints are still valid
+- All expectations of routable endpoints are still valid.
-In general usage of CI Daemon does not help significantly with the stated problem. However, this offers
-a few upsides related to more efficient processing and decoupling model: push model and it opens a way
-to offer stateful communication with GitLab Runners (ex. gRPC or Websockets).
+In general usage of CI Daemon does not help significantly with the stated problem.
+However, it offers a few upsides related to more efficient processing and a decoupled model: a push model, and it opens a way to offer stateful communication with GitLab runners (for example gRPC or WebSockets).
## 4. Evaluation
-Considering all solutions it appears that solution giving the most promise is:
+Considering all options it appears that the most promising solution is to:
-- use "instance-wide are Cell local"
-- refine endpoints to have routable identities (either via specific paths, or better tokens)
+- Use [Instance-wide are Cell-local](#33-instance-wide-are-cell-local)
+- Refine endpoints to have routable identities (either via specific paths, or better tokens)
-Other potential upsides is to get rid of `ci_builds.token` and rather use a `JWT token`
-that can much better and easier encode wider set of scopes allowed by CI runner.
+Another potential upside is to get rid of `ci_builds.token` and rather use a `JWT token` that can much more easily encode a wider set of scopes allowed by the CI runner.
## 4.1. Pros
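
For illustration, the evaluation above favors routable identities, for example a signed build token that encodes the Cell so the Router can pick the target Cell without a database lookup. The sketch below is a hypothetical stand-alone illustration using a plain HMAC-signed payload rather than GitLab's actual token format; the claim names and the signing scheme are assumptions.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// buildClaims is an illustrative payload for a routable build token.
type buildClaims struct {
	CellID string `json:"cell_id"`
	JobID  int64  `json:"job_id"`
}

// mintToken produces "<base64(claims)>.<base64(hmac)>".
func mintToken(secret []byte, c buildClaims) string {
	payload, _ := json.Marshal(c)
	body := base64.RawURLEncoding.EncodeToString(payload)
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(body))
	sig := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return body + "." + sig
}

// cellFromToken lets the Router read the cell_id claim; the target Cell
// would still verify the signature before trusting the token.
func cellFromToken(token string) (string, error) {
	body, _, ok := strings.Cut(token, ".")
	if !ok {
		return "", fmt.Errorf("malformed token")
	}
	raw, err := base64.RawURLEncoding.DecodeString(body)
	if err != nil {
		return "", err
	}
	var c buildClaims
	if err := json.Unmarshal(raw, &c); err != nil {
		return "", err
	}
	return c.CellID, nil
}

func main() {
	secret := []byte("router-demo-secret")
	token := mintToken(secret, buildClaims{CellID: "cell-3", JobID: 12345})
	cell, _ := cellFromToken(token)
	fmt.Println(token, "=> routed to", cell)
}
```
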
diff --git a/doc/architecture/blueprints/cells/cells-feature-container-registry.md b/doc/architecture/blueprints/cells/cells-feature-container-registry.md
index a5761808941..25af65a8700 100644
--- a/doc/architecture/blueprints/cells/cells-feature-container-registry.md
+++ b/doc/architecture/blueprints/cells/cells-feature-container-registry.md
@@ -15,46 +15,37 @@ we can document the reasons for not choosing this approach.
# Cells: Container Registry
-GitLab Container Registry is a feature allowing to store Docker Container Images
-in GitLab. You can read about GitLab integration [here](../../../user/packages/container_registry/index.md).
+GitLab [Container Registry](../../../user/packages/container_registry/index.md) is a feature that allows storing Docker container images in GitLab.
## 1. Definition
-GitLab Container Registry is a complex service requiring usage of PostgreSQL, Redis
-and Object Storage dependencies. Right now there's undergoing work to introduce
-[Container Registry Metadata](../container_registry_metadata_database/index.md)
-to optimize data storage and image retention policies of Container Registry.
+GitLab Container Registry is a complex service requiring usage of PostgreSQL, Redis and Object Storage dependencies.
+Right now there is ongoing work to introduce [Container Registry Metadata](../container_registry_metadata_database/index.md) to optimize data storage and image retention policies of the Container Registry.
-GitLab Container Registry is serving as a container for stored data,
-but on it's own does not authenticate `docker login`. The `docker login`
-is executed with user credentials (can be `personal access token`)
-or CI build credentials (ephemeral `ci_builds.token`).
+GitLab Container Registry is serving as a container for stored data, but on its own does not authenticate `docker login`.
+The `docker login` is executed with user credentials (can be `personal access token`) or CI build credentials (ephemeral `ci_builds.token`).
-Container Registry uses data deduplication. It means that the same blob
-(image layer) that is shared between many projects is stored only once.
+Container Registry uses data deduplication.
+It means that the same blob (image layer) that is shared between many Projects is stored only once.
Each layer is hashed by `sha256`.
-The `docker login` does request JWT time-limited authentication token that
-is signed by GitLab, but validated by Container Registry service. The JWT
-token does store all authorized scopes (`container repository images`)
-and operation types (`push` or `pull`). A single JWT authentication token
-can be have many authorized scopes. This allows container registry and client
-to mount existing blobs from another scopes. GitLab responds only with
-authorized scopes. Then it is up to GitLab Container Registry to validate
-if the given operation can be performed.
+The `docker login` requests a JWT time-limited authentication token that is signed by GitLab, but validated by the Container Registry service.
+The JWT token stores all authorized scopes (`container repository images`) and operation types (`push` or `pull`).
+A single JWT authentication token can have many authorized scopes.
+This allows Container Registry and client to mount existing blobs from other scopes.
+GitLab responds only with authorized scopes.
+Then it is up to GitLab Container Registry to validate if the given operation can be performed.
-The GitLab.com pages are always scoped to project. Each project can have many
-container registry images attached.
+The GitLab.com pages are always scoped to a Project.
+Each Project can have many container registry images attached.
-Currently in case of GitLab.com the actual registry service is served
-via `https://registry.gitlab.com`.
+Currently, on GitLab.com the actual registry service is served via `https://registry.gitlab.com`.
The main identifiable problems are:
-- the authentication request (`https://gitlab.com/jwt/auth`) that is processed by GitLab.com
-- the `https://registry.gitlab.com` that is run by external service and uses it's own data store
-- the data deduplication, the Cells architecture with registry run in a Cell would reduce
- efficiency of data storage
+- The authentication request (`https://gitlab.com/jwt/auth`) that is processed by GitLab.com.
+- The `https://registry.gitlab.com` that is run by an external service and uses its own data store.
+- Data deduplication. The Cells architecture with registry run in a Cell would reduce efficiency of data storage.
## 2. Data flow
@@ -99,33 +90,24 @@ curl \
### 3.1. Shard Container Registry separately to Cells architecture
-Due to it's architecture it extensive architecture and in general highly scalable
-horizontal architecture it should be evaluated if the GitLab Container Registry
-should be run not in Cell, but in a Cluster and be scaled independently.
-
+Due to its extensive and in general highly scalable horizontal architecture, it should be evaluated if the GitLab Container Registry should be run not in a Cell, but in a Cluster, and be scaled independently.
This might be easier, but would definitely not offer the same amount of data isolation.
### 3.2. Run Container Registry within a Cell
-It appears that except `/jwt/auth` which would likely have to be processed by Router
-(to decode `scope`) the container registry could be run as a local service of a Cell.
-
-The actual data at least in case of GitLab.com is not forwarded via registry,
-but rather served directly from Object Storage / CDN.
+It appears that, except for `/jwt/auth` which would likely have to be processed by the Router (to decode `scope`), the Container Registry could be run as a local service of a Cell.
+The actual data, at least in the case of GitLab.com, is not forwarded via the registry, but rather served directly from Object Storage / CDN.
Its design encodes container repository image in a URL that is easily routable.
-It appears that we could re-use the same stateless Router service in front of Container Registry
-to serve manifests and blobs redirect.
+It appears that we could re-use the same stateless Router service in front of Container Registry to serve manifests and blobs redirect.
-The only downside is increased complexity of managing standalone registry for each Cell,
-but this might be desired approach.
+The only downside is the increased complexity of managing a standalone registry for each Cell, but this might be the desired approach.
## 4. Evaluation
-There do not seem any theoretical problems with running GitLab Container Registry in a Cell.
-Service seems that can be easily made routable to work well.
-
-The practical complexities are around managing complex service from infrastructure side.
+There do not seem to be any theoretical problems with running GitLab Container Registry in a Cell.
+It seems that the service can be easily made routable to work well.
+The practical complexities are around managing a complex service from an infrastructure side.
## 4.1. Pros
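
For illustration, proposal 3.2 notes that `/jwt/auth` would likely be processed by the Router to decode `scope`, and that the container repository image path is easily routable. A rough sketch of deriving a Cell from a Docker token-auth `scope` parameter could look like the following; the routing table and lookup callback are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// cellForScope extracts the repository path from a Docker token-auth scope
// (for example "repository:gitlab-org/gitlab:pull,push") and resolves the
// top-level namespace to a Cell. The lookup itself is hypothetical and
// represented by a callback; real data would come from cluster routing state.
func cellForScope(scope string, lookup func(namespace string) (string, bool)) (string, error) {
	parts := strings.Split(scope, ":")
	if len(parts) != 3 || parts[0] != "repository" {
		return "", fmt.Errorf("unsupported scope %q", scope)
	}
	namespace := strings.SplitN(parts[1], "/", 2)[0]
	cell, ok := lookup(namespace)
	if !ok {
		return "", fmt.Errorf("no cell found for namespace %q", namespace)
	}
	return cell, nil
}

func main() {
	routes := map[string]string{"gitlab-org": "cell-2"} // hypothetical routing table
	cell, err := cellForScope("repository:gitlab-org/gitlab:pull,push",
		func(ns string) (string, bool) { c, ok := routes[ns]; return c, ok })
	fmt.Println(cell, err)
}
```
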
diff --git a/doc/architecture/blueprints/cells/cells-feature-contributions-forks.md b/doc/architecture/blueprints/cells/cells-feature-contributions-forks.md
index 8a67383c5e4..8e144386908 100644
--- a/doc/architecture/blueprints/cells/cells-feature-contributions-forks.md
+++ b/doc/architecture/blueprints/cells/cells-feature-contributions-forks.md
@@ -15,37 +15,33 @@ we can document the reasons for not choosing this approach.
# Cells: Contributions: Forks
-[Forking workflow](../../../user/project/repository/forking_workflow.md) allows users
-to copy existing project sources into their own namespace of choice (personal or group).
+The [Forking workflow](../../../user/project/repository/forking_workflow.md) allows users to copy existing Project sources into their own namespace of choice (Personal or Group).
## 1. Definition
-[Forking workflow](../../../user/project/repository/forking_workflow.md) is common workflow
-with various usage patterns:
+The [Forking workflow](../../../user/project/repository/forking_workflow.md) is a common workflow with various usage patterns:
-- allows users to contribute back to upstream project
-- persist repositories into their personal namespace
-- copy to make changes and release as modified project
+- It allows users to contribute back to upstream Project.
+- It persists repositories into their Personal Namespace.
+- Users can copy to make changes and release as modified Project.
-Forks allow users not having write access to parent project to make changes. The forking workflow
-is especially important for the Open Source community which is able to contribute back
-to public projects. However, it is equally important in some companies which prefer the strong split
-of responsibilities and tighter access control. The access to project is restricted
-to designated list of developers.
+Forks allow users not having write access to a parent Project to make changes.
+The forking workflow is especially important for the open source community to contribute back to public Projects.
+However, it is equally important in some companies that prefer a strong split of responsibilities and tighter access control.
+The access to a Project is restricted to a designated list of developers.
Forks enable:
-- tighter control of who can modify the upstream project
-- split of the responsibilities: parent project might use CI configuration connecting to production systems
-- run CI pipelines in context of fork in much more restrictive environment
-- consider all forks to be unveted which reduces risks of leaking secrets, or any other information
- tied with the project
+- Tighter control of who can modify the upstream Project.
+- Split of responsibilities: Parent Project might use CI configuration connecting to production systems.
+- To run CI pipelines in the context of a fork in a much more restrictive environment.
+- To consider all forks to be unvetted which reduces risks of leaking secrets, or any other information tied to the Project.
-The forking model is problematic in Cells architecture for following reasons:
+The forking model is problematic in a Cells architecture for the following reasons:
-- Forks are clones of existing repositories, forks could be created across different organizations, Cells and Gitaly shards.
-- User can create merge request and contribute back to upstream project, this upstream project might in a different organization and Cell.
-- The merge request CI pipeline is to executed in a context of source project, but presented in a context of target project.
+- Forks are clones of existing repositories. Forks could be created across different Organizations, Cells and Gitaly shards.
+- Users can create merge requests and contribute back to an upstream Project. This upstream Project might be in a different Organization and Cell.
+- The merge request CI pipeline is executed in the context of the source Project, but presented in the context of the target Project.
## 2. Data flow
@@ -53,66 +49,55 @@ The forking model is problematic in Cells architecture for following reasons:
### 3.1. Intra-Cluster forks
-This proposal makes us to implement forks as a intra-ClusterCell forks where communication is done via API
-between all trusted Cells of a cluster:
-
-- Forks when created, they are created always in context of user choice of group.
-- Forks are isolated from Organization.
-- Organization or group owner could disable forking across organizations or forking in general.
-- When a Merge Request is created it is created in context of target project, referencing
- external project on another Cell.
-- To target project the merge reference is transfered that is used for presenting information
- in context of target project.
-- CI pipeline is fetched in context of source project as it-is today, the result is fetched into
- Merge Request of target project.
-- The Cell holding target project internally uses GraphQL to fetch status of source project
- and include in context of the information for merge request.
+This proposal implements forks as intra-Cluster forks where communication is done via API between all trusted Cells of a cluster:
+
+- Forks are created always in the context of a user's choice of Group.
+- Forks are isolated from the Organization.
+- Organization or Group owner could disable forking across Organizations, or forking in general.
+- A merge request is created in the context of the target Project, referencing the external Project on another Cell.
+- The merge reference is transferred to the target Project and is used for presenting information in the context of the target Project.
+- The CI pipeline is fetched in the context of the source Project as it is today; the result is fetched into the merge request of the target Project.
+- The Cell holding the target Project internally uses GraphQL to fetch the status of the source Project and includes it in the context of the information for the merge request.
Upsides:
-- All existing forks continue to work as-is, as they are treated as intra-Cluster forks.
+- All existing forks continue to work as they are, as they are treated as intra-Cluster forks.
Downsides:
-- The purpose of Organizations is to provide strong isolation between organizations
- allowing to fork across does break security boundaries.
-- However, this is no different to ability of users today to clone repository to local computer
- and push it to any repository of choice.
-- Access control of source project can be lower than those of target project. System today
- requires that in order to contribute back the access level needs to be the same for fork and upstream.
-
-### 3.2. Forks are created in a personal namespace of the current organization
-
-Instead of creating projects across organizations, the forks are created in a user personal namespace
-tied with the organization. Example:
-
-- Each user that is part of organization receives their personal namespace. For example for `GitLab Inc.`
- it could be `gitlab.com/organization/gitlab-inc/@ayufan`.
-- The user has to fork into it's own personal namespace of the organization.
-- The user has that many personal namespaces as many organizations it belongs to.
-- The personal namespace behaves similar to currently offered personal namespace.
-- The user can manage and create projects within a personal namespace.
-- The organization can prevent or disable usage of personal namespaces disallowing forks.
-- All current forks are migrated into personal namespace of user in Organization.
-- All forks are part of to the organization.
-- The forks are not federated features.
-- The personal namespace and forked project do not share configuration with parent project.
-
-### 3.3. Forks are created as internal projects under current project
-
-Instead of creating projects across organizations, the forks are attachments to existing projects.
-Each user forking a project receives their unique project. Example:
-
-- For project: `gitlab.com/gitlab-org/gitlab`, forks would be created in `gitlab.com/gitlab-org/gitlab/@kamil-gitlab`.
-- Forks are created in a context of current organization, they do not cross organization boundaries
- and are managed by the organization.
+- The purpose of Organizations is to provide strong isolation between Organizations. Allowing forking across them does break security boundaries.
+- However, this is no different to the ability of users today to clone a repository to a local computer and push it to any repository of choice.
+- Access control of source Project can be lower than those of target Project. Today, the system requires that in order to contribute back, the access level needs to be the same for fork and upstream.
+
+### 3.2. Forks are created in a Personal Namespace of the current Organization
+
+Instead of creating Projects across Organizations, forks are created in a user's Personal Namespace tied to the Organization. Example:
+
+- Each user that is part of an Organization receives their Personal Namespace. For example for `GitLab Inc.` it could be `gitlab.com/organization/gitlab-inc/@ayufan`.
+- The user has to fork into their own Personal Namespace of the Organization.
+- The user has as many Personal Namespaces as Organizations they belong to.
+- The Personal Namespace behaves similar to the currently offered Personal Namespace.
+- The user can manage and create Projects within a Personal Namespace.
+- The Organization can prevent or disable usage of Personal Namespaces, disallowing forks.
+- All current forks are migrated into the Personal Namespace of user in an Organization.
+- All forks are part of the Organization.
+- Forks are not federated features.
+- The Personal Namespace and forked Project do not share configuration with the parent Project.
+
+### 3.3. Forks are created as internal Projects under current Projects
+
+Instead of creating Projects across Organizations, forks are attachments to existing Projects.
+Each user forking a Project receives their unique Project. Example:
+
+- For Project: `gitlab.com/gitlab-org/gitlab`, forks would be created in `gitlab.com/gitlab-org/gitlab/@kamil-gitlab`.
+- Forks are created in the context of the current Organization, they do not cross Organization boundaries and are managed by the Organization.
- Tied to the user (or any other user-provided name of the fork).
-- The forks are not federated features.
+- Forks are not federated features.
Downsides:
-- Does not answer how to handle and migrate all exisiting forks.
-- Might share current group / project settings - breaking some security boundaries.
+- Does not answer how to handle and migrate all existing forks.
+- Might share current Group/Project settings, which could break some security boundaries.
## 4. Evaluation
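
For illustration, the intra-Cluster forks proposal (3.1) has the Cell holding the target Project fetch the source Project's pipeline status over GraphQL. A rough sketch of such a cross-Cell call is shown below; the endpoint path, field names, and query shape are assumptions for illustration rather than a confirmed API contract.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// fetchPipelineStatus asks another Cell for the status of a pipeline in the
// source Project of a cross-Cell merge request. The query below is only
// illustrative; real field names would follow the GitLab GraphQL schema.
func fetchPipelineStatus(cellURL, projectPath, pipelineIID, token string) (string, error) {
	query := `query($path: ID!, $iid: ID!) {
	  project(fullPath: $path) { pipeline(iid: $iid) { status } }
	}`
	body, _ := json.Marshal(map[string]any{
		"query":     query,
		"variables": map[string]string{"path": projectPath, "iid": pipelineIID},
	})
	req, err := http.NewRequest(http.MethodPost, cellURL+"/api/graphql", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	// Decode only the field we care about for this sketch.
	var out struct {
		Data struct {
			Project struct {
				Pipeline struct {
					Status string `json:"status"`
				} `json:"pipeline"`
			} `json:"project"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Data.Project.Pipeline.Status, nil
}

func main() {
	// Hypothetical Cell URL, Project path, pipeline IID, and token.
	status, err := fetchPipelineStatus("https://cell-2.example.com", "gitlab-org/gitlab", "42", "example-token")
	fmt.Println(status, err)
}
```
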
diff --git a/doc/architecture/blueprints/cells/cells-feature-data-migration.md b/doc/architecture/blueprints/cells/cells-feature-data-migration.md
index ef0865b4081..9ff661ddf68 100644
--- a/doc/architecture/blueprints/cells/cells-feature-data-migration.md
+++ b/doc/architecture/blueprints/cells/cells-feature-data-migration.md
@@ -6,15 +6,6 @@ description: 'Cells: Data migration'
<!-- vale gitlab.FutureTense = NO -->
-DISCLAIMER:
-This page may contain information related to upcoming products, features and
-functionality. It is important to note that the information presented is for
-informational purposes only, so please do not rely on the information for
-purchasing or planning purposes. Just like with all projects, the items
-mentioned on the page are subject to change or delay, and the development,
-release, and timing of any products, features, or functionality remain at the
-sole discretion of GitLab Inc.
-
This document is a work-in-progress and represents a very early state of the
Cells design. Significant aspects are not documented, though we expect to add
them in the future. This is one possible architecture for Cells, and we intend to
@@ -24,26 +15,18 @@ we can document the reasons for not choosing this approach.
# Cells: Data migration
-It is essential for Cells architecture to provide a way to migrate data out of big Cells
-into smaller ones. This describes various approaches to provide this type of split.
-
-We also need to handle for cases where data is already violating the expected
-isolation constraints of Cells (ie. references cannot span multiple
-organizations). We know that existing features like linked issues allowed users
-to link issues across any projects regardless of their hierarchy. There are many
-similar features. All of this data will need to be migrated in some way before
-it can be split across different cells. This may mean some data needs to be
-deleted, or the feature changed and modelled slightly differently before we can
-properly split or migrate the organizations between cells.
-
-Having schema deviations across different Cells, which is a necessary
-consequence of different databases, will also impact our ability to migrate
-data between cells. Different schemas impact our ability to reliably replicate
-data across cells and especially impact our ability to validate that the data is
-correctly replicated. It might force us to only be able to move data between
-cells when the schemas are all in sync (slowing down deployments and the
-rebalancing process) or possibly only migrate from newer to older schemas which
-would be complex.
+It is essential for a Cells architecture to provide a way to migrate data out of big Cells into smaller ones.
+This document describes various approaches to provide this type of split.
+
+We also need to handle cases where data is already violating the expected isolation constraints of Cells, for example references cannot span multiple Organizations.
+We know that existing features like linked issues allowed users to link issues across any Projects regardless of their hierarchy.
+There are many similar features.
+All of this data will need to be migrated in some way before it can be split across different Cells.
+This may mean some data needs to be deleted, or the feature needs to be changed and modelled slightly differently before we can properly split or migrate Organizations between Cells.
+
+Having schema deviations across different Cells, which is a necessary consequence of different databases, will also impact our ability to migrate data between Cells.
+Different schemas impact our ability to reliably replicate data across Cells and especially impact our ability to validate that the data is correctly replicated.
+It might force us to only be able to move data between Cells when the schemas are all in sync (slowing down deployments and the rebalancing process) or possibly only migrate from newer to older schemas which would be complex.
## 1. Definition
@@ -53,34 +36,27 @@ would be complex.
### 3.1. Split large Cells
-A single Cell can only be divided into many Cells. This is based on principle
-that it is easier to create exact clone of an existing Cell in many replicas
-out of which some will be made authoritative once migrated. Keeping those
-replicas up-to date with Cell 0 is also much easier due to pre-existing
-replication solutions that can replicate the whole systems: Geo, PostgreSQL
-physical replication, etc.
+A single Cell can only be divided into many Cells.
+This is based on the principle that it is easier to create an exact clone of an existing Cell in many replicas out of which some will be made authoritative once migrated.
+Keeping those replicas up-to-date with Cell 0 is also much easier due to pre-existing replication solutions that can replicate the whole systems: Geo, PostgreSQL physical replication, etc.
-1. All data of an organization needs to not be divided across many Cells.
+1. All data of an Organization must not be divided across many Cells.
1. Split should be doable online.
1. New Cells cannot contain pre-existing data.
1. N Cells contain exact replica of Cell 0.
1. The data of Cell 0 is live replicated to as many Cells it needs to be split.
-1. Once consensus is achieved between Cell 0 and N-Cells the organizations to be migrated away
- are marked as read-only cluster-wide.
-1. The `routes` is updated on for all organizations to be split to indicate an authoritative
- Cell holding the most recent data, like `gitlab-org` on `cell-100`.
-1. The data for `gitlab-org` on Cell 0, and on other non-authoritative N-Cells are dormant
- and will be removed in the future.
-1. All accesses to `gitlab-org` on a given Cell are validated about `cell_id` of `routes`
- to ensure that given Cell is authoritative to handle the data.
+1. Once consensus is achieved between Cell 0 and N-Cells, the Organizations to be migrated away are marked as read-only cluster-wide.
+1. The `routes` table is updated for all Organizations to be split to indicate the authoritative Cell holding the most recent data, like `gitlab-org` on `cell-100`.
+1. The data for `gitlab-org` on Cell 0, and on other non-authoritative N-Cells are dormant and will be removed in the future.
+1. All accesses to `gitlab-org` on a given Cell are validated against the `cell_id` of `routes` to ensure that the given Cell is authoritative to handle the data.
#### More challenges of this proposal
1. There is no streaming replication capability for Elasticsearch, but you could
snapshot the whole Elasticsearch index and recreate, but this takes hours.
- It could be handled by pausing Elasticsearch indexing on the initial cell during
+ It could be handled by pausing Elasticsearch indexing on the initial Cell during
the migration as indexing downtime is not a big issue, but this still needs
- to be coordinated with the migration process
+ to be coordinated with the migration process.
1. Syncing Redis, Gitaly, CI Postgres, Main Postgres, registry Postgres, other
new data stores snapshots in an online system would likely lead to gaps
without a long downtime. You need to choose a sync point and at the sync
@@ -88,39 +64,31 @@ physical replication, etc.
there are to migrate at the same time the longer the write downtime for the
failover. We would also need to find a reliable place in the application to
actually block updates to all these systems with a high degree of
- confidence. In the past we've only been confident by shutting down all rails
- services because any rails process could write directly to any of these at
+ confidence. In the past we've only been confident by shutting down all Rails
+ services because any Rails process could write directly to any of these at
any time due to async workloads or other surprising code paths.
1. How to efficiently delete all the orphaned data. Locating all `ci_builds`
- associated with half the organizations would be very expensive if we have to
+ associated with half the Organizations would be very expensive if we have to
do joins. We haven't yet determined if we'd want to store an `organization_id`
column on every table, but this is the kind of thing it would be helpful for.
-### 3.2. Migrate organization from an existing Cell
-
-This is different to split, as we intend to perform logical and selective replication
-of data belonging to a single organization.
+### 3.2. Migrate Organization from an existing Cell
-Today this type of selective replication is only implemented by Gitaly where we can migrate
-Git repository from a single Gitaly node to another with minimal downtime.
+This is different to split, as we intend to perform logical and selective replication of data belonging to a single Organization.
+Today this type of selective replication is only implemented by Gitaly where we can migrate Git repository from a single Gitaly node to another with minimal downtime.
-In this model we would require identifying all resources belonging to a given organization:
-database rows, object storage files, Git repositories, etc. and selectively copy them over
-to another (likely) existing Cell importing data into it. Ideally ensuring that we can
-perform logical replication live of all changed data, but change similarly to split
-which Cell is authoritative for this organization.
+In this model we would require identifying all resources belonging to a given Organization: database rows, object storage files, Git repositories, etc., and selectively copying them over to another (likely existing) Cell, importing the data into it.
+Ideally we would ensure that we can perform live logical replication of all changed data but, similarly to the split, change which Cell is authoritative for this Organization.
-1. It is hard to identify all resources belonging to organization.
-1. It requires either downtime for organization or a robust system to identify
- live changes made.
-1. It likely will require a full database structure analysis (more robust than project import/export)
- to perform selective PostgreSQL logical replication.
+1. It is hard to identify all resources belonging to an Organization.
+1. It requires either downtime for the Organization or a robust system to identify live changes made.
+1. It likely will require a full database structure analysis (more robust than Project import/export) to perform selective PostgreSQL logical replication.
#### More challenges of this proposal
1. Logical replication is still not performant enough to keep up with our
scale. Even if we could use logical replication we still don't have an
- efficient way to filter data related to a single organization without
+ efficient way to filter data related to a single Organization without
joining all the way to the `organizations` table which will slow down
logical replication dramatically.
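
For illustration, the split proposal (3.1) marks Organizations read-only cluster-wide during migration and validates every access against the `cell_id` recorded in `routes`. A minimal sketch of that guard, with an assumed shape for the routes lookup, might look as follows.

```go
package main

import (
	"errors"
	"fmt"
)

// Route is the subset of a routes entry relevant to the authority check.
// The real table layout and lookup mechanism are assumptions for this sketch.
type Route struct {
	Path     string
	CellID   string
	ReadOnly bool // set cluster-wide while the Organization is being migrated
}

var errNotAuthoritative = errors.New("organization is served by another cell")

// checkAuthority rejects requests that this Cell is not allowed to serve.
func checkAuthority(localCell string, r Route, write bool) error {
	if r.CellID != localCell {
		return fmt.Errorf("%w: %s -> %s", errNotAuthoritative, r.Path, r.CellID)
	}
	if write && r.ReadOnly {
		return errors.New("organization is read-only during migration")
	}
	return nil
}

func main() {
	route := Route{Path: "gitlab-org", CellID: "cell-100", ReadOnly: false}
	fmt.Println(checkAuthority("cell-0", route, true))   // no longer authoritative
	fmt.Println(checkAuthority("cell-100", route, true)) // allowed
}
```
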
diff --git a/doc/architecture/blueprints/cells/cells-feature-database-sequences.md b/doc/architecture/blueprints/cells/cells-feature-database-sequences.md
index d94dc3be864..2aeaaed7d64 100644
--- a/doc/architecture/blueprints/cells/cells-feature-database-sequences.md
+++ b/doc/architecture/blueprints/cells/cells-feature-database-sequences.md
@@ -6,15 +6,6 @@ description: 'Cells: Database Sequences'
<!-- vale gitlab.FutureTense = NO -->
-DISCLAIMER:
-This page may contain information related to upcoming products, features and
-functionality. It is important to note that the information presented is for
-informational purposes only, so please do not rely on the information for
-purchasing or planning purposes. Just like with all projects, the items
-mentioned on the page are subject to change or delay, and the development,
-release, and timing of any products, features, or functionality remain at the
-sole discretion of GitLab Inc.
-
This document is a work-in-progress and represents a very early state of the
Cells design. Significant aspects are not documented, though we expect to add
them in the future. This is one possible architecture for Cells, and we intend to
@@ -24,14 +15,10 @@ we can document the reasons for not choosing this approach.
# Cells: Database Sequences
-GitLab today ensures that every database row create has unique ID, allowing
-to access Merge Request, CI Job or Project by a known global ID.
-
-Cells will use many distinct and not connected databases, each of them having
-a separate IDs for most of entities.
-
-It might be desirable to retain globally unique IDs for all database rows
-to allow migrating resources between Cells in the future.
+GitLab today ensures that every created database row has a unique ID, allowing access to a merge request, CI Job or Project by a known global ID.
+Cells will use many distinct and unconnected databases, each of them having separate IDs for most entities.
+At a minimum, any ID referenced between a Cell and the shared schema will need to be unique across the cluster to avoid ambiguous references.
+Beyond those required global IDs, it might also be desirable to retain globally unique IDs for all database rows to allow migrating resources between Cells in the future.
## 1. Definition
@@ -39,54 +26,46 @@ to allow migrating resources between Cells in the future.
## 3. Proposal
-This are some preliminary ideas how we can retain unique IDs across the system.
+These are some preliminary ideas for how we can retain unique IDs across the system.
### 3.1. UUID
-Instead of using incremental sequences use UUID (128 bit) that is stored in database.
+Instead of using incremental sequences, use UUIDs (128 bit) stored in the database.
-- This might break existing IDs and requires adding UUID column for all existing tables.
+- This might break existing IDs and require adding a UUID column to all existing tables.
- This makes all indexes larger, as it requires storing 128 bits instead of 32/64 bits in the index.
### 3.2. Use Cell index encoded in ID
-Since significant number of tables already use 64 bit ID numbers we could use MSB to encode
-Cell ID effectively enabling
+Because a significant number of tables already use 64-bit ID numbers, we could use the most significant bits (MSB) to encode the Cell ID:
-- This might limit amount of Cells that can be enabled in system, as we might decide to only
- allocate 1024 possible Cell numbers.
-- This might make IDs to be migratable between Cells, since even if entity from Cell 1 is migrated to Cell 100
- this ID would still be unique.
-- If resources are migrated the ID itself will not be enough to decode Cell number and we would need
- lookup table.
+- This might limit the number of Cells that can be enabled in a system, as we might decide to only allocate 1024 possible Cell numbers.
+- This would make it possible to migrate IDs between Cells, because even if an entity from Cell 1 is migrated to Cell 100, its ID would still be unique.
+- If resources are migrated, the ID itself will not be enough to decode the Cell number and we would need a lookup table.
- This requires updating all IDs to 64 bits.
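+
+A minimal sketch of this option, assuming an illustrative 10-bit Cell index within the 63 usable bits of a signed PostgreSQL `bigint`:
+
+```python
+# Illustrative bit layout only; the number of bits reserved for the Cell index
+# (and therefore the maximum number of Cells) has not been decided.
+CELL_BITS = 10                  # 1024 possible Cells
+LOCAL_BITS = 63 - CELL_BITS     # bigint is signed, so 63 bits are usable
+
+def encode_id(cell_index, local_id):
+    assert 0 <= cell_index < (1 << CELL_BITS)
+    assert 0 <= local_id < (1 << LOCAL_BITS)
+    return (cell_index << LOCAL_BITS) | local_id
+
+def decode_cell(global_id):
+    return global_id >> LOCAL_BITS
+
+# An ID minted on Cell 3 stays globally unique even if the record later moves to
+# Cell 100, but after such a migration the embedded Cell index no longer tells
+# us where the record lives - hence the lookup table mentioned above.
+assert decode_cell(encode_id(3, 42)) == 3
+```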
### 3.3. Allocate sequence ranges from central place
-Each Cell might receive its own range of the sequences as they are consumed from a centrally managed place.
-Once Cell consumes all IDs assigned for a given table it would be replenished and a next range would be allocated.
+Each Cell might receive its own range of sequences, allocated from a centrally managed place as they are consumed.
+Once a Cell consumes all IDs assigned for a given table, it would be replenished and the next range would be allocated.
Ranges would be tracked to provide a faster lookup table if a random access pattern is required.
-- This might make IDs to be migratable between Cells, since even if entity from Cell 1 is migrated to Cell 100
- this ID would still be unique.
-- If resources are migrated the ID itself will not be enough to decode Cell number and we would need
- much more robust lookup table as we could be breaking previously assigned sequence ranges.
+- This might make IDs migratable between Cells, because even if an entity from Cell 1 is migrated to Cell 100, its ID would still be unique.
+- If resources are migrated, the ID itself will not be enough to decode the Cell number and we would need a much more robust lookup table, as we could be breaking previously assigned sequence ranges.
- This does not require updating all IDs to 64 bits.
-- This adds some performance penalty to all `INSERT` statements in Postgres or at least from Rails as we need to check for the sequence number and potentially wait for our range to be refreshed from the ID server
+- This adds some performance penalty to all `INSERT` statements in Postgres or at least from Rails as we need to check for the sequence number and potentially wait for our range to be refreshed from the ID server.
- The available range will need to be stored and incremented in a centralized place so that concurrent transactions cannot possibly get the same value.
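+
+A minimal sketch of the allocation flow, with illustrative names and an in-memory stand-in for that centralized store:
+
+```python
+# Illustrative only: the "ID server" hands out disjoint ranges per table; each
+# Cell consumes its current range locally and calls back only when it runs out.
+import threading
+
+class CentralRangeAllocator:
+    """Cluster-wide component. In practice `next_start` must be persisted and
+    incremented transactionally so two Cells can never receive the same range."""
+
+    def __init__(self, range_size=100_000):
+        self.range_size = range_size
+        self.next_start = {}
+        self.lock = threading.Lock()
+
+    def allocate(self, table):
+        with self.lock:
+            start = self.next_start.get(table, 1)
+            self.next_start[table] = start + self.range_size
+            return range(start, start + self.range_size)
+
+class CellSequence:
+    """Cell-local component: serves IDs from its range, refills when exhausted."""
+
+    def __init__(self, allocator, table):
+        self.allocator, self.table = allocator, table
+        self.current = iter(())
+
+    def next_id(self):
+        try:
+            return next(self.current)
+        except StopIteration:
+            self.current = iter(self.allocator.allocate(self.table))
+            return next(self.current)
+
+allocator = CentralRangeAllocator()
+cell_1 = CellSequence(allocator, "ci_builds")
+cell_2 = CellSequence(allocator, "ci_builds")
+assert cell_1.next_id() == 1 and cell_2.next_id() == 100_001  # ranges never overlap
+```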
### 3.4. Define only some tables to require unique IDs
-Maybe this is acceptable only for some tables to have a globally unique IDs. It could be projects, groups
-and other top-level entities. All other tables like `merge_requests` would only offer Cell-local ID,
-but when referenced outside it would rather use IID (an ID that is monotonic in context of a given resource, like project).
+Maybe it is acceptable for only some tables to have globally unique IDs. These could be Projects, Groups and other top-level entities.
+All other tables like `merge_requests` would only offer a Cell-local ID, but when referenced outside they would use an IID (an ID that is monotonic in the context of a given resource, like a Project).
-- This makes the ID 10000 for `merge_requests` be present on all Cells, which might be sometimes confusing
- as for uniqueness of the resource.
-- This might make random access by ID (if ever needed) be impossible without using composite key, like: `project_id+merge_request_id`.
-- This would require us to implement a transformation/generation of new ID if we need to migrate records to another cell. This can lead to very difficult migration processes when these IDs are also used as foreign keys for other records being migrated.
-- If IDs need to change when moving between cells this means that any links to records by ID would no longer work even if those links included the `project_id`.
-- If we plan to allow these ids to not be unique and change the unique constraint to be based on a composite key then we'd need to update all foreign key references to be based on the composite key
+- This means the ID 10000 for `merge_requests` can be present on all Cells, which might sometimes be confusing regarding the uniqueness of the resource.
+- This might make random access by ID (if ever needed) impossible without using a composite key, like: `project_id+merge_request_id`.
+- This would require us to implement a transformation/generation of new IDs if we need to migrate records to another Cell. This can lead to very difficult migration processes when these IDs are also used as foreign keys for other records being migrated.
+- If IDs need to change when moving between Cells this means that any links to records by ID would no longer work even if those links included the `project_id`.
+- If we plan to allow these IDs to not be unique and change the unique constraint to be based on a composite key then we'd need to update all foreign key references to be based on the composite key.
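+
+For illustration, referencing a record under this option relies on a composite key rather than a bare ID (all names and numbers below are made up):
+
+```python
+# A bare `merge_requests.id` such as 10000 may exist on every Cell, so only the
+# composite key (project_id, iid) is meaningful as a cluster-wide reference.
+from dataclasses import dataclass
+
+@dataclass(frozen=True)
+class MergeRequestRef:
+    project_id: int   # routable part: identifies the Cell that owns the Project
+    iid: int          # monotonic within the Project, stable across Cells
+
+ref = MergeRequestRef(project_id=278964, iid=123)
+cell_for_project = {278964: "cell-3"}  # illustrative routing lookup
+print(f"route to {cell_for_project[ref.project_id]} for merge request !{ref.iid}")
+```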
## 4. Evaluation
diff --git a/doc/architecture/blueprints/cells/cells-feature-explore.md b/doc/architecture/blueprints/cells/cells-feature-explore.md
new file mode 100644
index 00000000000..4eab99d63e7
--- /dev/null
+++ b/doc/architecture/blueprints/cells/cells-feature-explore.md
@@ -0,0 +1,71 @@
+---
+stage: enablement
+group: Tenant Scale
+description: 'Cells: Explore'
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+This document is a work-in-progress and represents a very early state of the
+Cells design. Significant aspects are not documented, though we expect to add
+them in the future. This is one possible architecture for Cells, and we intend to
+contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that
+we can document the reasons for not choosing this approach.
+
+# Cells: Explore
+
+Explore may not play a critical role in GitLab as it functions today, but GitLab today is not isolated. It is the isolation introduced by Cells that makes Explore, or some viable replacement, necessary.
+
+The existing Group and Project Explore will initially be scoped to an Organization. However, there is a need for a global Explore that spans across Organizations to support the discoverability of public Groups and Projects, in particular in the context of discovering open source Projects. See user feedback [here](https://gitlab.com/gitlab-org/gitlab/-/issues/21582#note_1458298192) and [here](https://gitlab.com/gitlab-org/gitlab/-/issues/418228#note_1470045468).
+
+## 1. Definition
+
+The Explore functionality helps Users discover Groups and Projects. Unauthenticated Users are only able to explore public Groups and Projects; authenticated Users can see all the Groups and Projects that they have access to, including private and internal Groups and Projects.
+
+## 2. Data flow
+
+## 3. Proposal
+
+The Explore feature problem falls under the broader umbrella of solving inter-Cell communication. [This topic warrants deeper research](index.md#can-different-cells-communicate-with-each-other).
+
+Below are possible directions for further investigation.
+
+### 3.1. Read only table mirror
+
+- Create a `shared_projects` table in the shared cluster-wide database.
+- The model for this table is read-only. No inserts/updates/deletes are allowed.
+- The table is filled with data (or a subset of data) from the Cell-local Projects table.
+ - The write model Project (which is Cell-local) writes to the local database. We will primarily use this model for anything Cell-local.
+ - This data is synchronized with `shared_projects` via a background job any time something changes.
+ - The data in `shared_projects` is stored normalized, so that all the information necessary to display the Project Explore is there.
+- The Project Explore (as of today) is an instance-wide functionality, since it's not namespaced to any Organizations or Groups.
+ - This section will read data using the read model for `shared_projects`.
+- Once the user clicks on a Project, they are redirected to the Cell containing the Organization.
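+
+A rough sketch of the synchronization step under this idea, assuming illustrative table and column names; a real implementation would likely be a background (Sidekiq) job and would also handle deletes and retries:
+
+```python
+# Illustrative CQRS write path: copy a denormalized view of one Project from the
+# Cell-local database into the cluster-wide `shared_projects` table.
+import psycopg2
+
+def sync_project_to_shared(cell_dsn, cluster_dsn, project_id):
+    with psycopg2.connect(cell_dsn) as cell, psycopg2.connect(cluster_dsn) as cluster:
+        with cell.cursor() as cur:
+            cur.execute(
+                """
+                SELECT p.id, p.name, p.visibility_level, n.path AS namespace_path
+                FROM projects p
+                JOIN namespaces n ON n.id = p.namespace_id
+                WHERE p.id = %s
+                """,
+                (project_id,),
+            )
+            row = cur.fetchone()
+        with cluster.cursor() as cur:
+            cur.execute(
+                """
+                INSERT INTO shared_projects (id, name, visibility_level, namespace_path)
+                VALUES (%s, %s, %s, %s)
+                ON CONFLICT (id) DO UPDATE SET
+                  name = EXCLUDED.name,
+                  visibility_level = EXCLUDED.visibility_level,
+                  namespace_path = EXCLUDED.namespace_path
+                """,
+                row,
+            )
+```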
+
+Downsides:
+
+- Need to have an explicit pattern to access instance-wide data. This however may be useful for admin functionalities too.
+- The Project Explore may not be as rich in features as it is today (various filtering options, role you have on that Project, etc.).
+- Extra complexity in managing CQRS.
+
+### 3.2. Explore scoped to an Organization
+
+The Project Explore and Group Explore are scoped to an Organization.
+
+Downsides:
+
+- No global discoverability of Groups and Projects.
+
+## 4. Evaluation
+
+The existing Group and Project Explore will initially be scoped to an Organization. Considering the [current usage of the Explore feature](https://gitlab.com/gitlab-data/product-analytics/-/issues/1302#note_1491215521), we deem this acceptable. Since all existing Users, Groups and Projects will initially be part of the default Organization, Groups and Projects will remain explorable and accessible as they are today. Only once existing Groups and Projects are moved out of the default Organization into different Organizations will this become a noticeable problem. Solutions to mitigate this are discussed in [issue #418228](https://gitlab.com/gitlab-org/gitlab/-/issues/418228). Ultimately, Explore could be replaced with a better search experience altogether.
+
+## 4.1. Pros
+
+- Initially the lack of discoverability will not be a problem.
+- Only around [1.5% of all existing Users are using the Explore functionality on a monthly basis](https://gitlab.com/gitlab-data/product-analytics/-/issues/1302#note_1491215521).
+
+## 4.2. Cons
+
+- The GitLab owned top-level Groups would be some of the first to be moved into their own Organization and thus be detached from the explorability of the default Organization.
diff --git a/doc/architecture/blueprints/cells/cells-feature-git-access.md b/doc/architecture/blueprints/cells/cells-feature-git-access.md
index 70b3f136904..611b4db5f43 100644
--- a/doc/architecture/blueprints/cells/cells-feature-git-access.md
+++ b/doc/architecture/blueprints/cells/cells-feature-git-access.md
@@ -15,35 +15,30 @@ we can document the reasons for not choosing this approach.
# Cells: Git Access
-This document describes impact of Cells architecture on all Git access (over HTTPS and SSH)
-patterns providing explanation of how potentially those features should be changed
-to work well with Cells.
+This document describes the impact of the Cells architecture on all Git access patterns (over HTTPS and SSH), explaining how those features might need to change to work well with Cells.
## 1. Definition
-Git access is done through out the application. It can be an operation performed by the system
-(read Git repository) or by user (create a new file via Web IDE, `git clone` or `git push` via command line).
+Git access is done throughout the application.
+It can be an operation performed by the system (read Git repository) or by a user (create a new file via Web IDE, `git clone` or `git push` via command line).
+The Cells architecture defines that all Git repositories will be local to the Cell, so no repository could be shared with another Cell.
-The Cells architecture defines that all Git repositories will be local to the Cell,
-so no repository could be shared with another Cell.
-
-The Cells architecture will require that any Git operation done can only be handled by a Cell holding
-the data. It means that any operation either via Web interface, API, or GraphQL needs to be routed
-to the correct Cell. It means that any `git clone` or `git push` operation can only be performed
-in a context of a Cell.
+The Cells architecture will require that any Git operation can only be handled by a Cell holding the data.
+It means that any operation either via Web interface, API, or GraphQL needs to be routed to the correct Cell.
+It means that any `git clone` or `git push` operation can only be performed in the context of a Cell.
## 2. Data flow
-The are various operations performed today by the GitLab on a Git repository. This describes
-the data flow how they behave today to better represent the impact.
+There are various operations performed today by GitLab on a Git repository.
+This section describes the data flow of how they behave today to better represent the impact.
-It appears that Git access does require changes only to a few endpoints that are scoped to project.
+It appears that Git access does require changes only to a few endpoints that are scoped to a Project.
There appear to be different types of repositories:
- Project: assigned to Group
- Wiki: additional repository assigned to Project
- Design: similar to Wiki, additional repository assigned to Project
-- Snippet: creates a virtual project to hold repository, likely tied to the User
+- Snippet: creates a virtual Project to hold repository, likely tied to the User
### 2.1. Git clone over HTTPS
@@ -131,9 +126,8 @@ sequenceDiagram
## 3. Proposal
-The Cells stateless router proposal requires that any ambiguous path (that is not routable)
-will be made to be routable. It means that at least the following paths will have to be updated
-do introduce a routable entity (project, group, or organization).
+The Cells stateless router proposal requires that any ambiguous path (that is not routable) will be made routable.
+It means that at least the following paths will have to be updated to introduce a routable entity (Project, Group, or Organization).
Change:
@@ -150,9 +144,7 @@ Where:
## 4. Evaluation
Supporting Git repositories if a Cell can access only its own repositories does not appear to be complex.
-
-The one major complication is supporting snippets, but this likely falls in the same category as for the approach
-to support user's personal namespaces.
+The one major complication is supporting snippets, but this likely falls into the same category as the approach to supporting a User's Personal Namespace.
## 4.1. Pros
@@ -161,4 +153,4 @@ to support user's personal namespaces.
## 4.2. Cons
1. The sharing of repository objects is limited to the given Cell and Gitaly node.
-1. The across-Cells forks are likely impossible to be supported (discover: how this work today across different Gitaly node).
+1. Cross-Cell forks are likely impossible to support (to discover: how this works today across different Gitaly nodes).
diff --git a/doc/architecture/blueprints/cells/cells-feature-global-search.md b/doc/architecture/blueprints/cells/cells-feature-global-search.md
index c1e2b93bc2d..475db381ff5 100644
--- a/doc/architecture/blueprints/cells/cells-feature-global-search.md
+++ b/doc/architecture/blueprints/cells/cells-feature-global-search.md
@@ -6,15 +6,6 @@ description: 'Cells: Global search'
<!-- vale gitlab.FutureTense = NO -->
-DISCLAIMER:
-This page may contain information related to upcoming products, features and
-functionality. It is important to note that the information presented is for
-informational purposes only, so please do not rely on the information for
-purchasing or planning purposes. Just like with all projects, the items
-mentioned on the page are subject to change or delay, and the development,
-release, and timing of any products, features, or functionality remain at the
-sole discretion of GitLab Inc.
-
This document is a work-in-progress and represents a very early state of the
Cells design. Significant aspects are not documented, though we expect to add
them in the future. This is one possible architecture for Cells, and we intend to
@@ -24,12 +15,9 @@ we can document the reasons for not choosing this approach.
# Cells: Global search
-When we introduce multiple Cells we intend to isolate all services related to
-those Cells. This will include Elasticsearch which means our current global
-search functionality will not work. It may be possible to implement aggregated
-search across all cells, but it is unlikely to be performant to do fan-out
-searches across all cells especially once you start to do pagination which
-requires setting the correct offset and page number for each search.
+When we introduce multiple Cells we intend to isolate all services related to those Cells.
+This will include Elasticsearch which means our current global search functionality will not work.
+It may be possible to implement aggregated search across all Cells, but it is unlikely to be performant to do fan-out searches across all Cells, especially once you start to do pagination, which requires setting the correct offset and page number for each search.
## 1. Definition
@@ -37,9 +25,8 @@ requires setting the correct offset and page number for each search.
## 3. Proposal
-Likely first versions of Cells will simply not support global searches and then
-we may later consider if building global searches to support popular use cases
-is worthwhile.
+Likely the first versions of Cells will not support global searches.
+Later, we may consider if building global searches to support popular use cases is worthwhile.
## 4. Evaluation
diff --git a/doc/architecture/blueprints/cells/cells-feature-graphql.md b/doc/architecture/blueprints/cells/cells-feature-graphql.md
index d936a1b81ba..e8850dfbee3 100644
--- a/doc/architecture/blueprints/cells/cells-feature-graphql.md
+++ b/doc/architecture/blueprints/cells/cells-feature-graphql.md
@@ -6,15 +6,6 @@ description: 'Cells: GraphQL'
<!-- vale gitlab.FutureTense = NO -->
-DISCLAIMER:
-This page may contain information related to upcoming products, features and
-functionality. It is important to note that the information presented is for
-informational purposes only, so please do not rely on the information for
-purchasing or planning purposes. Just like with all projects, the items
-mentioned on the page are subject to change or delay, and the development,
-release, and timing of any products, features, or functionality remain at the
-sole discretion of GitLab Inc.
-
This document is a work-in-progress and represents a very early state of the
Cells design. Significant aspects are not documented, though we expect to add
them in the future. This is one possible architecture for Cells, and we intend to
@@ -25,9 +16,8 @@ we can document the reasons for not choosing this approach.
# Cells: GraphQL
GitLab extensively uses GraphQL to perform efficient data query operations.
-GraphQL due to it's nature is not directly routable. The way how GitLab uses
-it calls the `/api/graphql` endpoint, and only query or mutation of body request
-might define where the data can be accessed.
+GraphQL, due to its nature, is not directly routable.
+The way GitLab uses it calls the `/api/graphql` endpoint, and only the query or mutation in the request body might define where the data can be accessed.
## 1. Definition
@@ -35,21 +25,19 @@ might define where the data can be accessed.
## 3. Proposal
-There are at least two main ways to implement GraphQL in Cells architecture.
+There are at least two main ways to implement GraphQL in a Cells architecture.
### 3.1. GraphQL routable by endpoint
Change `/api/graphql` to `/api/organization/<organization>/graphql`.
-- This breaks all existing usages of `/api/graphql` endpoint
- since the API URI is changed.
+- This breaks all existing usages of the `/api/graphql` endpoint because the API URI is changed.
### 3.2. GraphQL routable by body
As part of the router, parse the GraphQL body to find a routable entity, like `project`.
-- This still makes the GraphQL query be executed only in context of a given Cell
- and not allowing the data to be merged.
+- This still means the GraphQL query is executed only in the context of a given Cell and does not allow the data to be merged.
```json
# Good example
@@ -71,11 +59,9 @@ As part of router parse GraphQL body to find a routable entity, like `project`.
### 3.3. Merging GraphQL Proxy
-Implement as part of router GraphQL Proxy which can parse body
-and merge results from many Cells.
+Implement, as part of the router, a GraphQL Proxy which can parse the body and merge results from many Cells.
-- This might make pagination hard to achieve, or we might assume that
- we execute many queries of which results are merged across all Cells.
+- This might make pagination hard to achieve, or we might assume that we execute many queries whose results are merged across all Cells.
```json
{
diff --git a/doc/architecture/blueprints/cells/cells-feature-organizations.md b/doc/architecture/blueprints/cells/cells-feature-organizations.md
index 03178d9e6ce..f1527b40ef4 100644
--- a/doc/architecture/blueprints/cells/cells-feature-organizations.md
+++ b/doc/architecture/blueprints/cells/cells-feature-organizations.md
@@ -6,15 +6,6 @@ description: 'Cells: Organizations'
<!-- vale gitlab.FutureTense = NO -->
-DISCLAIMER:
-This page may contain information related to upcoming products, features and
-functionality. It is important to note that the information presented is for
-informational purposes only, so please do not rely on the information for
-purchasing or planning purposes. Just like with all projects, the items
-mentioned on the page are subject to change or delay, and the development,
-release, and timing of any products, features, or functionality remain at the
-sole discretion of GitLab Inc.
-
This document is a work-in-progress and represents a very early state of the
Cells design. Significant aspects are not documented, though we expect to add
them in the future. This is one possible architecture for Cells, and we intend to
@@ -24,36 +15,22 @@ we can document the reasons for not choosing this approach.
# Cells: Organizations
-One of the major designs of Cells architecture is strong isolation between Groups.
-Organizations as described by this blueprint provides a way to have plausible UX
-for joining together many Groups that are isolated from the rest of systems.
+One of the major design goals of a Cells architecture is strong isolation between Groups.
+Organizations, as described by the [Organization blueprint](../organization/index.md), provide a way to have a plausible UX for joining together many Groups that are isolated from the rest of the system.
## 1. Definition
-Cells do require that all groups and projects of a single organization can
-only be stored on a single Cell since a Cell can only access data that it holds locally
-and has very limited capabilities to read information from other Cells.
-
-Cells with Organizations do require strong isolation between organizations.
-
-It will have significant implications on various user-facing features,
-like Todos, dropdowns allowing to select projects, references to other issues
-or projects, or any other social functions present at GitLab. Today those functions
-were able to reference anything in the whole system. With the introduction of
-organizations such will be forbidden.
-
-This problem definition aims to answer effort and implications required to add
-strong isolation between organizations to the system. Including features affected
-and their data processing flow. The purpose is to ensure that our solution when
-implemented consistently avoids data leakage between organizations residing on
-a single Cell.
+Cells do require that all Groups and Projects of a single Organization can only be stored on a single Cell because a Cell can only access data that it holds locally and has very limited capabilities to read information from other Cells.
-## 2. Data flow
+Cells with Organizations do require strong isolation between Organizations.
-## 3. Proposal
+It will have significant implications for various user-facing features, like Todos, dropdowns for selecting Projects, references to other issues or Projects, or any other social functions present in GitLab.
+Today those functions were able to reference anything in the whole system.
+With the introduction of Organizations this will be forbidden.
-## 4. Evaluation
+This problem definition aims to assess the effort and implications of adding strong isolation between Organizations to the system, including the features affected and their data processing flow.
+The purpose is to ensure that our solution, when implemented, consistently avoids data leakage between Organizations residing on a single Cell.
-## 4.1. Pros
+## 2. Proposal
-## 4.2. Cons
+See the [Organization blueprint](../organization/index.md).
diff --git a/doc/architecture/blueprints/cells/cells-feature-dashboard.md b/doc/architecture/blueprints/cells/cells-feature-personal-access-tokens.md
index 135f69c6ed3..3aca9f1e116 100644
--- a/doc/architecture/blueprints/cells/cells-feature-dashboard.md
+++ b/doc/architecture/blueprints/cells/cells-feature-personal-access-tokens.md
@@ -1,7 +1,7 @@
---
stage: enablement
group: Tenant Scale
-description: 'Cells: Dashboard'
+description: 'Cells: Personal Access Tokens'
---
<!-- vale gitlab.FutureTense = NO -->
@@ -13,12 +13,13 @@ contrast this with alternatives before deciding which approach to implement.
This documentation will be kept even if we decide not to implement this so that
we can document the reasons for not choosing this approach.
-# Cells: Dashboard
-
-> TL;DR
+# Cells: Personal Access Tokens
## 1. Definition
+Personal Access Tokens associated with a User are a way for Users to interact with the GitLab API to perform operations.
+Personal Access Tokens today are scoped to the User, and can access all Groups that a User has access to.
+
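+For illustration, this is how a Personal Access Token is presented to the existing REST API today (the token value below is a placeholder):
+
+```python
+# A PAT is scoped to the User, not to an Organization or Cell, so the same token
+# may accompany requests that touch data on any Cell the User can access.
+import requests
+
+token = "glpat-xxxxxxxxxxxxxxxxxxxx"  # placeholder, not a real token
+
+response = requests.get(
+    "https://gitlab.example.com/api/v4/projects",
+    headers={"PRIVATE-TOKEN": token},
+    params={"membership": True},
+)
+response.raise_for_status()
+print(len(response.json()), "projects visible to this token")
+```
+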
## 2. Data flow
## 3. Proposal
diff --git a/doc/architecture/blueprints/cells/cells-feature-router-endpoints-classification.md b/doc/architecture/blueprints/cells/cells-feature-router-endpoints-classification.md
index 7c2974ca258..d403d6ff963 100644
--- a/doc/architecture/blueprints/cells/cells-feature-router-endpoints-classification.md
+++ b/doc/architecture/blueprints/cells/cells-feature-router-endpoints-classification.md
@@ -6,15 +6,6 @@ description: 'Cells: Router Endpoints Classification'
<!-- vale gitlab.FutureTense = NO -->
-DISCLAIMER:
-This page may contain information related to upcoming products, features and
-functionality. It is important to note that the information presented is for
-informational purposes only, so please do not rely on the information for
-purchasing or planning purposes. Just like with all projects, the items
-mentioned on the page are subject to change or delay, and the development,
-release, and timing of any products, features, or functionality remain at the
-sole discretion of GitLab Inc.
-
This document is a work-in-progress and represents a very early state of the
Cells design. Significant aspects are not documented, though we expect to add
them in the future. This is one possible architecture for Cells, and we intend to
@@ -24,15 +15,11 @@ we can document the reasons for not choosing this approach.
# Cells: Router Endpoints Classification
-Classification of all endpoints is essential to properly route request
-hitting load balancer of a GitLab installation to a Cell that can serve it.
-
-Each Cell should be able to decode each request and classify for which Cell
-it belongs to.
+Classification of all endpoints is essential to properly route requests hitting the load balancer of a GitLab installation to a Cell that can serve them.
+Each Cell should be able to decode each request and classify which Cell it belongs to.
-GitLab currently implements hundreds of endpoints. This document tries
-to describe various techniques that can be implemented to allow the Rails
-to provide this information efficiently.
+GitLab currently implements hundreds of endpoints.
+This document tries to describe various techniques that can be implemented to allow Rails to provide this information efficiently.
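+
+A toy sketch of such a classification step; the patterns below are illustrative assumptions, not the actual routing rules:
+
+```python
+# Classify a request path into a routable entity the router could use to pick a
+# Cell. Paths that cannot be classified (for example `/api/graphql`) are exactly
+# the ambiguous endpoints this blueprint is concerned with.
+import re
+
+ROUTABLE_PATTERNS = [
+    ("organization", re.compile(r"^/-/organizations/(?P<value>[^/]+)")),
+    ("project", re.compile(r"^/(?P<value>[^-/][^/]*(?:/[^-/][^/]*)+)/-/")),
+]
+
+def classify(path):
+    for kind, pattern in ROUTABLE_PATTERNS:
+        match = pattern.match(path)
+        if match:
+            return kind, match.group("value")
+    return None  # ambiguous: must be made routable or pinned to a default Cell
+
+print(classify("/gitlab-org/gitlab/-/issues/1"))     # ('project', 'gitlab-org/gitlab')
+print(classify("/-/organizations/acme/snippets/1"))  # ('organization', 'acme')
+print(classify("/api/graphql"))                      # None
+```
+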
## 1. Definition
diff --git a/doc/architecture/blueprints/cells/cells-feature-schema-changes.md b/doc/architecture/blueprints/cells/cells-feature-schema-changes.md
index d712b24a8a0..dd0f6c0705c 100644
--- a/doc/architecture/blueprints/cells/cells-feature-schema-changes.md
+++ b/doc/architecture/blueprints/cells/cells-feature-schema-changes.md
@@ -6,15 +6,6 @@ description: 'Cells: Schema changes'
<!-- vale gitlab.FutureTense = NO -->
-DISCLAIMER:
-This page may contain information related to upcoming products, features and
-functionality. It is important to note that the information presented is for
-informational purposes only, so please do not rely on the information for
-purchasing or planning purposes. Just like with all projects, the items
-mentioned on the page are subject to change or delay, and the development,
-release, and timing of any products, features, or functionality remain at the
-sole discretion of GitLab Inc.
-
This document is a work-in-progress and represents a very early state of the
Cells design. Significant aspects are not documented, though we expect to add
them in the future. This is one possible architecture for Cells, and we intend to
@@ -24,24 +15,15 @@ we can document the reasons for not choosing this approach.
# Cells: Schema changes
-When we introduce multiple Cells that own their own databases this will
-complicate the process of making schema changes to Postgres and Elasticsearch.
-Today we already need to be careful to make changes comply with our zero
-downtime deployments. For example,
-[when removing a column we need to make changes over 3 separate deployments](../../../development/database/avoiding_downtime_in_migrations.md#dropping-columns).
-We have tooling like `post_migrate` that helps with these kinds of changes to
-reduce the number of merge requests needed, but these will be complicated when
-we are dealing with deploying multiple rails applications that will be at
-different versions at any one time. This problem will be particularly tricky to
-solve for shared databases like our plan to share the `users` related tables
-among all Cells.
-
-A key benefit of Cells may be that it allows us to run different
-customers on different versions of GitLab. We may choose to update our own cell
-before all our customers giving us even more flexibility than our current
-canary architecture. But doing this means that schema changes need to have even
-more versions of backward compatibility support which could slow down
-development as we need extra steps to make schema changes.
+When we introduce multiple Cells that own their own databases this will complicate the process of making schema changes to Postgres and Elasticsearch.
+Today we already need to be careful to make changes comply with our zero downtime deployments.
+For example, [when removing a column we need to make changes over 3 separate deployments](../../../development/database/avoiding_downtime_in_migrations.md#dropping-columns).
+We have tooling like `post_migrate` that helps with these kinds of changes to reduce the number of merge requests needed, but these will be complicated when we are dealing with deploying multiple Rails applications that will be at different versions at any one time.
+This problem will be particularly tricky to solve for shared databases like our plan to share the `users` related tables among all Cells.
+
+A key benefit of Cells may be that it allows us to run different customers on different versions of GitLab.
+We may choose to update our own Cell before all our customers, giving us even more flexibility than our current canary architecture.
+But doing this means that schema changes need to have even more versions of backward compatibility support, which could slow down development as we need extra steps to make schema changes.
## 1. Definition
diff --git a/doc/architecture/blueprints/cells/cells-feature-secrets.md b/doc/architecture/blueprints/cells/cells-feature-secrets.md
index 50ccf926b4d..681c229711d 100644
--- a/doc/architecture/blueprints/cells/cells-feature-secrets.md
+++ b/doc/architecture/blueprints/cells/cells-feature-secrets.md
@@ -15,32 +15,26 @@ we can document the reasons for not choosing this approach.
# Cells: Secrets
-Where possible, each cell should have its own distinct set of secrets.
-However, there will be some secrets that will be required to be the same for all
-cells in the cluster
+Where possible, each Cell should have its own distinct set of secrets.
+However, there will be some secrets that will be required to be the same for all Cells in the cluster.
## 1. Definition
-GitLab has a lot of
-[secrets](https://docs.gitlab.com/charts/installation/secrets.html) that needs
-to be configured.
-
-Some secrets are for inter-component communication, for example, `GitLab Shell secret`,
-and used only within a cell.
-
+GitLab has a lot of [secrets](https://docs.gitlab.com/charts/installation/secrets.html) that need to be configured.
+Some secrets are for inter-component communication, for example, `GitLab Shell secret`, and used only within a Cell.
Some secrets are used for features, for example, `ci_jwt_signing_key`.
## 2. Data flow
## 3. Proposal
-1. Secrets used for features will need to be consistent across all cells, so that the UX is consistent.
+1. Secrets used for features will need to be consistent across all Cells, so that the UX is consistent.
1. This is especially true for the `db_key_base` secret which is used for
- encrypting data at rest in the database - so that projects that are
- transferred to another cell will continue to work. We do not want to have
- to re-encrypt such rows when we move projects/groups between cells.
-1. Secrets which are used for intra-cell communication only should be uniquely generated
- per-cell.
+ encrypting data at rest in the database - so that Projects that are
+ transferred to another Cell will continue to work. We do not want to have
+ to re-encrypt such rows when we move Projects/Groups between Cells.
+1. Secrets which are used for intra-Cell communication only should be uniquely generated
+ per Cell.
## 4. Evaluation
diff --git a/doc/architecture/blueprints/cells/cells-feature-snippets.md b/doc/architecture/blueprints/cells/cells-feature-snippets.md
index f5e72c0e3a0..bde0b098609 100644
--- a/doc/architecture/blueprints/cells/cells-feature-snippets.md
+++ b/doc/architecture/blueprints/cells/cells-feature-snippets.md
@@ -15,16 +15,42 @@ we can document the reasons for not choosing this approach.
# Cells: Snippets
-> TL;DR
+Snippets will be scoped to an Organization. Initially it will not be possible to aggregate snippet collections across Organizations. See also [issue #416954](https://gitlab.com/gitlab-org/gitlab/-/issues/416954).
## 1. Definition
+Two different types of snippets exist:
+
+- [Project snippets](../../../api/project_snippets.md). These snippets have URLs
+ like `/<group>/<project>/-/snippets/123`
+- [Personal snippets](../../../user/snippets.md). These snippets have URLs like
+ `/-/snippets/123`
+
+Snippets are backed by a Git repository.
+
## 2. Data flow
## 3. Proposal
+### 3.1. Scoped to an Organization
+
+Both project and personal snippets will be scoped to an Organization.
+
+- Project snippets URLs will remain unchanged, as the URLs are routable.
+- Personal snippets URLs will need to change to be `/-/organizations/<organization>/snippets/123`,
+  so that the URL is routable.
+
+Creation of snippets will also be scoped to a User's current Organization. Because of that, we recommend renaming `personal snippets` to `organization snippets` once the Organization is rolled out. A User can create many independent snippet collections across multiple Organizations.
+
## 4. Evaluation
+Snippets are scoped to an Organization because Gitaly is confined to a Cell.
+
## 4.1. Pros
+- No need to have cluster-wide Gitaly.
+
## 4.2. Cons
+
+- We will break [snippet discovery](../../../user/snippets.md#discover-snippets).
+- Snippet access may become subordinate to the visibility of the Organization.
diff --git a/doc/architecture/blueprints/cells/cells-feature-user-profile.md b/doc/architecture/blueprints/cells/cells-feature-user-profile.md
new file mode 100644
index 00000000000..fc02548f371
--- /dev/null
+++ b/doc/architecture/blueprints/cells/cells-feature-user-profile.md
@@ -0,0 +1,52 @@
+---
+stage: enablement
+group: Tenant Scale
+description: 'Cells: User Profile'
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+This document is a work-in-progress and represents a very early state of the
+Cells design. Significant aspects are not documented, though we expect to add
+them in the future. This is one possible architecture for Cells, and we intend to
+contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that
+we can document the reasons for not choosing this approach.
+
+# Cells: User Profile
+
+The existing User Profiles will initially be scoped to an Organization. Long-term, we should consider aggregating parts of the User activity across Organizations to give Users a global view of their contributions.
+
+## 1. Definition
+
+Each GitLab account has a [User Profile](../../../user/profile/index.md), which contains information about the User and their GitLab activity.
+
+## 2. Data flow
+
+## 3. Proposal
+
+User Profiles will be scoped to an Organization.
+
+- Users can set a Home Organization as their main Organization.
+- Accessing the User Profile of a User who does not exist in the database at all displays a 404 not found error.
+- Users who haven't contributed to an Organization have their User Profile displayed with an empty state.
+- When displaying a User Profile empty state, if the profile has a Home Organization set to another Organization, we display a call-to-action allowing navigation to the main Organization.
+- User Profile URLs will not reference the Organization and remain as: `/<username>`. We follow the same pattern as is used for `Your Work`, meaning that profiles are always seen in the context of an Organization.
+- Breadcrumbs on the User Profile will present as `[Organization Name] / [Username]`.
+
+See [issue #411931](https://gitlab.com/gitlab-org/gitlab/-/issues/411931) for design proposals.
+
+## 4. Evaluation
+
+We expect the [majority of Users to perform most of their activity in one single Organization](../organization/index.md#data-exploration).
+This is why we deem it acceptable to scope the User Profile to an Organization at first.
+More discovery is necessary to understand which aspects of the current User Profile are relevant to showcase contributions in a global context.
+
+## 4.1. Pros
+
+- Viewing a User Profile scoped to an Organization allows you to focus on contributions that are most relevant to your Organization, filtering out the User's other activities.
+- Existing User Profile URLs do not break.
+
+## 4.2. Cons
+
+- Users will lose the ability to display their entire activity, which may lessen the effectiveness of using their User Profile as a resume of achievements when working across multiple Organizations.
diff --git a/doc/architecture/blueprints/cells/cells-feature-your-work.md b/doc/architecture/blueprints/cells/cells-feature-your-work.md
new file mode 100644
index 00000000000..08bb0bed709
--- /dev/null
+++ b/doc/architecture/blueprints/cells/cells-feature-your-work.md
@@ -0,0 +1,58 @@
+---
+stage: enablement
+group: Tenant Scale
+description: 'Cells: Your Work'
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+This document is a work-in-progress and represents a very early state of the
+Cells design. Significant aspects are not documented, though we expect to add
+them in the future. This is one possible architecture for Cells, and we intend to
+contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that
+we can document the reasons for not choosing this approach.
+
+# Cells: Your Work
+
+Your Work will be scoped to an Organization.
+Counts presented in the individual dashboards will relate to the selected Organization.
+
+## 1. Definition
+
+When accessing `gitlab.com/dashboard/`, users can find a [focused view of items that they have access to](../../../tutorials/left_sidebar/index.md#use-a-more-focused-view).
+This overview contains dashboards relating to:
+
+- Projects
+- Groups
+- Issues
+- Merge requests
+- To-Do list
+- Milestones
+- Snippets
+- Activity
+- Workspaces
+- Environments
+- Operations
+- Security
+
+## 2. Data flow
+
+## 3. Proposal
+
+Your Work will be scoped to an Organization, giving the user an overview of all the items they can access in the Organization they are currently viewing.
+
+- Issue, Merge request and To-Do list counts will refer to the selected Organization.
+
+## 4. Evaluation
+
+Scoping Your Work to an Organization makes sense in the context of the [proposed Organization navigation](https://gitlab.com/gitlab-org/gitlab/-/issues/417778).
+Considering that [we expect most users to work in a single Organization](../organization/index.md#data-exploration), we deem this impact acceptable.
+
+## 4.1. Pros
+
+- Viewing Your Work scoped to an Organization allows Users to focus on content that is most relevant to their currently selected Organization.
+
+## 4.2. Cons
+
+- Users working across multiple Organizations will have to navigate to each Organization to access all of their work items.
diff --git a/doc/architecture/blueprints/cells/diagrams/cells-and-fulfillment.drawio.png b/doc/architecture/blueprints/cells/diagrams/cells-and-fulfillment.drawio.png
new file mode 100644
index 00000000000..c5fff9dbca5
--- /dev/null
+++ b/doc/architecture/blueprints/cells/diagrams/cells-and-fulfillment.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/cells/diagrams/index.md b/doc/architecture/blueprints/cells/diagrams/index.md
new file mode 100644
index 00000000000..77d12612819
--- /dev/null
+++ b/doc/architecture/blueprints/cells/diagrams/index.md
@@ -0,0 +1,35 @@
+---
+stage: enablement
+group: Tenant Scale
+description: 'Cells: Diagrams'
+---
+
+# Diagrams
+
+Diagrams used in the Cells blueprint are created with [draw.io](https://draw.io).
+
+## Edit existing diagrams
+
+Load the `.drawio.png` or `.drawio.svg` file directly into **draw.io**, which you can use in several ways:
+
+- Best: Use the [draw.io integration in VSCode](https://marketplace.visualstudio.com/items?itemName=hediet.vscode-drawio).
+- Good: Install on MacOS with `brew install drawio` or download the [draw.io desktop](https://github.com/jgraph/drawio-desktop/releases).
+- Good: Install on Linux by downloading the [draw.io desktop](https://github.com/jgraph/drawio-desktop/releases).
+- Discouraged: Use the [draw.io website](https://draw.io) to load and save files.
+
+## Create a diagram
+
+To create a diagram from a file:
+
+1. Copy an existing file and rename it. Ensure that the extension is `.drawio.png` or `.drawio.svg`.
+1. Edit the diagram.
+1. Save the file.
+
+To create a diagram from scratch using [draw.io desktop](https://github.com/jgraph/drawio-desktop/releases):
+
+1. In **File > New > Create new diagram**, select **Blank diagram**.
+1. In **File > Save As**, select **Editable Bitmap .png**, and save with `.drawio.png` extension.
+1. To improve image quality, in **File > Properties**, set **Zoom** to **400%**.
+1. To save the file with the new zoom setting, select **File > Save**.
+
+DO NOT use the **File > Export** function. The diagram should be embedded into `.png` for easy editing.
diff --git a/doc/architecture/blueprints/cells/diagrams/term-cell.drawio.png b/doc/architecture/blueprints/cells/diagrams/term-cell.drawio.png
new file mode 100644
index 00000000000..84a6d6d1745
--- /dev/null
+++ b/doc/architecture/blueprints/cells/diagrams/term-cell.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/cells/diagrams/term-cluster.drawio.png b/doc/architecture/blueprints/cells/diagrams/term-cluster.drawio.png
new file mode 100644
index 00000000000..a6fd790ba5e
--- /dev/null
+++ b/doc/architecture/blueprints/cells/diagrams/term-cluster.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/cells/diagrams/term-organization.drawio.png b/doc/architecture/blueprints/cells/diagrams/term-organization.drawio.png
new file mode 100644
index 00000000000..f1cb7cd92fe
--- /dev/null
+++ b/doc/architecture/blueprints/cells/diagrams/term-organization.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/cells/diagrams/term-top-level-group.drawio.png b/doc/architecture/blueprints/cells/diagrams/term-top-level-group.drawio.png
new file mode 100644
index 00000000000..f5535409945
--- /dev/null
+++ b/doc/architecture/blueprints/cells/diagrams/term-top-level-group.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/cells/glossary.md b/doc/architecture/blueprints/cells/glossary.md
index c3ec5fd12e4..11a1fc5acc9 100644
--- a/doc/architecture/blueprints/cells/glossary.md
+++ b/doc/architecture/blueprints/cells/glossary.md
@@ -14,7 +14,7 @@ We use the following terms to describe components and properties of the Cells ar
A Cell is a set of infrastructure components that contains multiple top-level groups that belong to different organizations. The components include both datastores (PostgreSQL, Redis etc.) and stateless services (web etc.). The infrastructure components provided within a Cell are shared among organizations and their top-level groups but not shared with other Cells. This isolation of infrastructure components means that Cells are independent from each other.
-<img src="images/term-cell.png" height="200">
+<img src="diagrams/term-cell.drawio.png" height="200">
### Cell properties
@@ -32,7 +32,7 @@ Discouraged synonyms: GitLab instance, cluster, shard
A cluster is a collection of Cells.
-<img src="images/term-cluster.png" height="300">
+<img src="diagrams/term-cluster.drawio.png" height="300">
### Cluster properties
@@ -56,7 +56,7 @@ Organizations work under the following assumptions:
1. Users understand that the majority of pages they view are only scoped to a single organization at a time.
1. Organizations are located on a single cell.
-![Term Organization](images/term-organization.png)
+![Term Organization](diagrams/term-organization.drawio.png)
### Organization properties
@@ -83,7 +83,7 @@ Over time there won't be a distinction between a top-level group and a group. Al
Discouraged synonyms: Root-level namespace
-![Term Top-level Group](images/term-top-level-group.png)
+![Term Top-level Group](diagrams/term-top-level-group.drawio.png)
### Top-level group properties
diff --git a/doc/architecture/blueprints/cells/goals.md b/doc/architecture/blueprints/cells/goals.md
index 67dc25625c7..3f3923aa255 100644
--- a/doc/architecture/blueprints/cells/goals.md
+++ b/doc/architecture/blueprints/cells/goals.md
@@ -8,7 +8,11 @@ description: 'Cells: Goals'
## Scalability
-The main goal of this new shared-infrastructure architecture is to provide additional scalability for our SaaS Platform. GitLab.com is largely monolithic and we have estimated (internal) that the current architecture has scalability limitations, even when database partitioning and decomposition are taken into account.
+The main goal of this new shared-infrastructure architecture is to provide additional scalability for our SaaS Platform.
+GitLab.com is largely monolithic and we have estimated (internally) that the current architecture has scalability limitations,
+particularly for non-horizontally scalable resources such as the [PostgreSQL database](https://gitlab-com.gitlab.io/gl-infra/tamland/patroni.html)
+and [Redis](https://gitlab-com.gitlab.io/gl-infra/tamland/redis.html),
+even when database partitioning and decomposition are taken into account.
Cells provide a horizontally scalable solution because additional Cells can be created based on demand. Cells can be provisioned and tuned as needed for optimal scalability.
diff --git a/doc/architecture/blueprints/cells/images/pods-and-fulfillment.png b/doc/architecture/blueprints/cells/images/pods-and-fulfillment.png
deleted file mode 100644
index fea32d1800e..00000000000
--- a/doc/architecture/blueprints/cells/images/pods-and-fulfillment.png
+++ /dev/null
Binary files differ
diff --git a/doc/architecture/blueprints/cells/images/term-cell.png b/doc/architecture/blueprints/cells/images/term-cell.png
deleted file mode 100644
index 799b2eccd95..00000000000
--- a/doc/architecture/blueprints/cells/images/term-cell.png
+++ /dev/null
Binary files differ
diff --git a/doc/architecture/blueprints/cells/images/term-cluster.png b/doc/architecture/blueprints/cells/images/term-cluster.png
deleted file mode 100644
index 03c92850b64..00000000000
--- a/doc/architecture/blueprints/cells/images/term-cluster.png
+++ /dev/null
Binary files differ
diff --git a/doc/architecture/blueprints/cells/images/term-organization.png b/doc/architecture/blueprints/cells/images/term-organization.png
deleted file mode 100644
index dd6367ad84a..00000000000
--- a/doc/architecture/blueprints/cells/images/term-organization.png
+++ /dev/null
Binary files differ
diff --git a/doc/architecture/blueprints/cells/images/term-top-level-group.png b/doc/architecture/blueprints/cells/images/term-top-level-group.png
deleted file mode 100644
index 4af2468f50d..00000000000
--- a/doc/architecture/blueprints/cells/images/term-top-level-group.png
+++ /dev/null
Binary files differ
diff --git a/doc/architecture/blueprints/cells/impact.md b/doc/architecture/blueprints/cells/impact.md
index 878af4d1a5e..30c70dca0cc 100644
--- a/doc/architecture/blueprints/cells/impact.md
+++ b/doc/architecture/blueprints/cells/impact.md
@@ -51,7 +51,7 @@ We synced with Fulfillment ([recording](https://youtu.be/FkQF3uF7vTY)) to discus
A rough representation of this is:
-![Cells and Fulfillment](images/pods-and-fulfillment.png)
+![Cells and Fulfillment](diagrams/cells-and-fulfillment.drawio.png)
### Potential conflicts with Cells
diff --git a/doc/architecture/blueprints/cells/index.md b/doc/architecture/blueprints/cells/index.md
index dcd28707890..0e93b9d5d3b 100644
--- a/doc/architecture/blueprints/cells/index.md
+++ b/doc/architecture/blueprints/cells/index.md
@@ -1,7 +1,7 @@
---
status: accepted
creation-date: "2022-09-07"
-authors: [ "@ayufan", "@fzimmer", "@DylanGriffith", "@lohrc" ]
+authors: [ "@ayufan", "@fzimmer", "@DylanGriffith", "@lohrc", "@tkuah" ]
coach: "@ayufan"
approvers: [ "@lohrc" ]
owning-stage: "~devops::enablement"
@@ -14,7 +14,7 @@ participating-stages: []
This document is a work-in-progress and represents a very early state of the Cells design. Significant aspects are not documented, though we expect to add them in the future.
-Cells is a new architecture for our Software as a Service platform. This architecture is horizontally-scalable, resilient, and provides a more consistent user experience. It may also provide additional features in the future, such as data residency control (regions) and federated features.
+Cells is a new architecture for our software as a service platform. This architecture is horizontally scalable, resilient, and provides a more consistent user experience. It may also provide additional features in the future, such as data residency control (regions) and federated features.
For more information about Cells, see also:
@@ -28,8 +28,7 @@ We can't ship the entire Cells architecture in one go - it is too large.
Instead, we are defining key work streams required by the project.
Not all objectives need to be fulfilled to reach production readiness.
-It is expected that some objectives will not be completed for General Availability (GA),
-but will be enough to run Cells in production.
+It is expected that some objectives will not be completed for General Availability (GA), but will be enough to run Cells in production.
### 1. Data access layer
@@ -45,8 +44,7 @@ Under this objective the following steps are expected:
1. **Allow to share cluster-wide data with database-level data access layer.**
- Cells can connect to a database containing shared data. For example:
- application settings, users, or routing information.
+ Cells can connect to a database containing shared data. For example: application settings, users, or routing information.
1. **Evaluate the efficiency of database-level access vs. API-oriented access layer.**
@@ -54,7 +52,7 @@ Under this objective the following steps are expected:
1. **Cluster-unique identifiers**
- Every object has a unique identifier that can be used to access data across the cluster. The IDs for allocated projects, issues and any other objects are cluster-unique.
+ Every object has a unique identifier that can be used to access data across the cluster. The IDs for allocated Projects, issues and any other objects are cluster-unique.
1. **Cluster-wide deletions**
@@ -62,7 +60,7 @@ Under this objective the following steps are expected:
1. **Data access layer**
- Ensure that a stable data-access (versioned) layer that allows to share cluster-wide data is implemented.
+ Ensure that a stable data access (versioned) layer is implemented that allows to share cluster-wide data.
1. **Database migration**
@@ -70,48 +68,38 @@ Under this objective the following steps are expected:
### 2. Essential workflows
-To make Cells viable we require to define and support
-essential workflows before we can consider the Cells
-to be of Beta quality. Essential workflows are meant
-to cover the majority of application functionality
-that makes the product mostly useable, but with some caveats.
+To make Cells viable we need to define and support essential workflows before we can consider Cells to be of Beta quality.
+Essential workflows are meant to cover the majority of application functionality that makes the product mostly useable, but with some caveats.
The current approach is to define workflows from top to bottom.
The order defines the presumed priority of the items.
-This list is not exhaustive as we would be expecting
-other teams to help and fix their workflows after
-the initial phase, in which we fix the fundamental ones.
-
-To consider a project ready for the Beta phase, it is expected
-that all features defined below are supported by Cells.
-In the cases listed below, the workflows define a set of tables
-to be properly attributed to the feature. In some cases,
-a table with an ambiguous usage has to be broken down.
-For example: `uploads` are used to store user avatars,
-as well as uploaded attachments for comments. It would be expected
-that `uploads` is split into `uploads` (describing group/project-level attachments)
-and `global_uploads` (describing, for example, user avatars).
-
-Except for initial 2-3 quarters this work is highly parallel.
-It would be expected that **group::tenant scale** would help other
-teams to fix their feature set to work with Cells. The first 2-3 quarters
-would be required to define a general split of data and build required tooling.
+This list is not exhaustive, as we expect other teams to help fix their workflows after the initial phase, in which we fix the fundamental ones.
+
+To consider a project ready for the Beta phase, it is expected that all features defined below are supported by Cells.
+In the cases listed below, the workflows define a set of tables to be properly attributed to the feature.
+In some cases, a table with an ambiguous usage has to be broken down.
+For example: `uploads` are used to store user avatars, as well as uploaded attachments for comments.
+It would be expected that `uploads` is split into `uploads` (describing Group/Project-level attachments) and `global_uploads` (describing, for example, user avatars).
+
+Except for the initial 2-3 quarters this work is highly parallel.
+It is expected that **group::tenant scale** will help other teams to fix their feature set to work with Cells.
+The first 2-3 quarters are required to define a general split of data and build the required tooling.
1. **Instance-wide settings are shared across cluster.**
- The Admin Area section for most part is shared across a cluster.
+ The Admin Area section for the most part is shared across a cluster.
1. **User accounts are shared across cluster.**
The purpose is to make `users` cluster-wide.
-1. **User can create group.**
+1. **User can create Group.**
- The purpose is to perform a targeted decomposition of `users` and `namespaces`, because the `namespaces` will be stored locally in the Cell.
+ The purpose is to perform a targeted decomposition of `users` and `namespaces`, because `namespaces` will be stored locally in the Cell.
-1. **User can create project.**
+1. **User can create Project.**
- The purpose is to perform a targeted decomposition of `users` and `projects`, because the `projects` will be stored locally in the Cell.
+ The purpose is to perform a targeted decomposition of `users` and `projects`, because `projects` will be stored locally in the Cell.
1. **User can change profile avatar that is shared in cluster.**
@@ -119,8 +107,7 @@ would be required to define a general split of data and build required tooling.
1. **User can push to Git repository.**
- The purpose is to ensure that essential joins from the projects table are properly attributed to be
- Cell-local, and as a result the essential Git workflow is supported.
+ The purpose is to ensure that essential joins from the Projects table are properly attributed to be Cell-local, and as a result the essential Git workflow is supported.
1. **User can run CI pipeline.**
@@ -130,26 +117,26 @@ would be required to define a general split of data and build required tooling.
The purpose is to ensure that `issues` and `merge requests` are properly attributed to be `Cell-local`.
-1. **User can manage group and project members.**
+1. **User can manage Group and Project members.**
The `members` table is properly attributed to be either `Cell-local` or `cluster-wide`.
1. **User can manage instance-wide runners.**
- The purpose is to scope all CI Runners to be Cell-local. Instance-wide runners in fact become Cell-local runners. The expectation is to provide a user interface view and manage all runners per Cell, instead of per cluster.
+ The purpose is to scope all CI runners to be Cell-local. Instance-wide runners in fact become Cell-local runners. The expectation is to provide a user interface view and manage all runners per Cell, instead of per cluster.
-1. **User is part of organization and can only see information from the organization.**
+1. **User is part of Organization and can only see information from the Organization.**
- The purpose is to have many organizations per Cell, but never have a single organization spanning across many Cells. This is required to ensure that information shown within an organization is isolated, and does not require fetching information from other Cells.
+ The purpose is to have many Organizations per Cell, but never have a single Organization spanning across many Cells. This is required to ensure that information shown within an Organization is isolated, and does not require fetching information from other Cells.
### 3. Additional workflows
Some of these additional workflows might need to be supported, depending on the group decision.
This list is not exhaustive of work needed to be done.
-1. **User can use all group-level features.**
-1. **User can use all project-level features.**
-1. **User can share groups with other groups in an organization.**
+1. **User can use all Group-level features.**
+1. **User can use all Project-level features.**
+1. **User can share Groups with other Groups in an Organization.**
1. **User can create system webhook.**
1. **User can upload and manage packages.**
1. **User can manage security detection features.**
@@ -158,13 +145,11 @@ This list is not exhaustive of work needed to be done.
### 4. Routing layer
-The routing layer is meant to offer a consistent user experience where all Cells are presented
-under a single domain (for example, `gitlab.com`), instead of
-having to navigate to separate domains.
+The routing layer is meant to offer a consistent user experience where all Cells are presented under a single domain (for example, `gitlab.com`), instead of having to navigate to separate domains.
-The user will able to use `https://gitlab.com` to access Cell-enabled GitLab. Depending
-on the URL access, it will be transparently proxied to the correct Cell that can serve this particular
-information. For example:
+The user will be able to use `https://gitlab.com` to access Cell-enabled GitLab.
+Depending on the URL access, it will be transparently proxied to the correct Cell that can serve this particular information.
+For example:
- All requests going to `https://gitlab.com/users/sign_in` are randomly distributed to all Cells.
- All requests going to `https://gitlab.com/gitlab-org/gitlab/-/tree/master` are always directed to Cell 5, for example.
@@ -173,9 +158,8 @@ information. For example:
1. **Technology.**
We decide what technology the routing service is written in.
- The choice is dependent on the best performing language, and the expected way
- and place of deployment of the routing layer. If it is required to make
- the service multi-cloud it might be required to deploy it to the CDN provider.
+ The choice is dependent on the best performing language, and the expected way and place of deployment of the routing layer.
+ If the service must be multi-cloud, it might need to be deployed to the CDN provider.
Then the service needs to be written using a technology compatible with the CDN provider.
1. **Cell discovery.**
@@ -184,35 +168,29 @@ information. For example:
1. **Router endpoints classification.**
- The stateless routing service will fetch and cache information about endpoints
- from one of the Cells. We need to implement a protocol that will allow us to
- accurately describe the incoming request (its fingerprint), so it can be classified
- by one of the Cells, and the results of that can be cached. We also need to implement
- a mechanism for negative cache and cache eviction.
+ The stateless routing service will fetch and cache information about endpoints from one of the Cells.
+ We need to implement a protocol that will allow us to accurately describe the incoming request (its fingerprint), so it can be classified by one of the Cells, and the results of that can be cached.
+ We also need to implement a mechanism for negative cache and cache eviction; a rough sketch of this flow follows this list.
1. **GraphQL and other ambiguous endpoints.**
- Most endpoints have a unique sharding key: the organization, which directly
- or indirectly (via a group or project) can be used to classify endpoints.
- Some endpoints are ambiguous in their usage (they don't encode the sharding key),
- or the sharding key is stored deep in the payload. In these cases, we need to decide how to handle endpoints like `/api/graphql`.
+ Most endpoints have a unique sharding key: the Organization, which directly or indirectly (via a Group or Project) can be used to classify endpoints.
+ Some endpoints are ambiguous in their usage (they don't encode the sharding key), or the sharding key is stored deep in the payload.
+ In these cases, we need to decide how to handle endpoints like `/api/graphql`.
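+
+As a rough illustration of the router endpoints classification item above, the Go sketch below shows one way a stateless routing service could ask a Cell to classify a request fingerprint (here simply a path prefix) and cache both positive and negative answers with a TTL. This is a non-authoritative sketch: the classification URL, the response shape, and all type and function names are assumptions made for illustration, not an agreed protocol.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"errors"
+	"fmt"
+	"net/http"
+	"net/url"
+	"sync"
+	"time"
+)
+
+// cacheEntry remembers which Cell owns a path prefix, or that no Cell claimed it (negative cache).
+type cacheEntry struct {
+	cellURL   string
+	notFound  bool
+	expiresAt time.Time
+}
+
+// Classifier is a small positive/negative cache in front of a hypothetical
+// classification endpoint exposed by one of the Cells.
+type Classifier struct {
+	classifyURL string
+	ttl         time.Duration
+
+	mu    sync.Mutex
+	cache map[string]cacheEntry
+}
+
+func NewClassifier(classifyURL string, ttl time.Duration) *Classifier {
+	return &Classifier{classifyURL: classifyURL, ttl: ttl, cache: map[string]cacheEntry{}}
+}
+
+// CellForPrefix returns the Cell that should serve the given request fingerprint.
+func (c *Classifier) CellForPrefix(prefix string) (string, error) {
+	c.mu.Lock()
+	entry, ok := c.cache[prefix]
+	c.mu.Unlock()
+
+	if !ok || time.Now().After(entry.expiresAt) {
+		fresh, err := c.classify(prefix) // cache miss or expired entry: ask a Cell once
+		if err != nil {
+			return "", err
+		}
+		entry = fresh
+		c.mu.Lock()
+		c.cache[prefix] = entry // eviction is purely TTL-based in this sketch
+		c.mu.Unlock()
+	}
+
+	if entry.notFound {
+		return "", errors.New("no Cell owns this prefix")
+	}
+	return entry.cellURL, nil
+}
+
+func (c *Classifier) classify(prefix string) (cacheEntry, error) {
+	resp, err := http.Get(c.classifyURL + "?path=" + url.QueryEscape(prefix))
+	if err != nil {
+		return cacheEntry{}, err
+	}
+	defer resp.Body.Close()
+
+	entry := cacheEntry{expiresAt: time.Now().Add(c.ttl)}
+	if resp.StatusCode == http.StatusNotFound {
+		entry.notFound = true // negative caching: remember that no Cell claimed the prefix
+		return entry, nil
+	}
+
+	var body struct {
+		CellURL string `json:"cell_url"` // assumed response field
+	}
+	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
+		return cacheEntry{}, err
+	}
+	entry.cellURL = body.CellURL
+	return entry, nil
+}
+
+func main() {
+	c := NewClassifier("https://cell1.example.com/-/cells/classify", time.Minute)
+	if cell, err := c.CellForPrefix("/gitlab-org/gitlab"); err == nil {
+		fmt.Println("proxy to", cell)
+	}
+}
+```
+
+In a real router the fingerprint would carry more than the path, and eviction would also need explicit invalidation when routing information changes, for example after a Cell split.
+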
### 5. Cell deployment
-We will run many Cells. To manage them easier, we need to have consistent
-deployment procedures for Cells, including a way to deploy, manage, migrate,
-and monitor.
+We will run many Cells.
+To manage them more easily, we need consistent deployment procedures for Cells, including a way to deploy, manage, migrate, and monitor them.
-We are very likely to use tooling made for [GitLab Dedicated](https://about.gitlab.com/dedicated/)
-with its control planes.
+We are very likely to use tooling made for [GitLab Dedicated](https://about.gitlab.com/dedicated/) with its control planes.
1. **Extend GitLab Dedicated to support GCP.**
1. TBD
### 6. Migration
-When we reach production and are able to store new organizations on new Cells, we need
-to be able to divide big Cells into many smaller ones.
+When we reach production and are able to store new Organizations on new Cells, we need to be able to divide big Cells into many smaller ones.
1. **Use GitLab Geo to clone Cells.**
@@ -220,14 +198,13 @@ to be able to divide big Cells into many smaller ones.
1. **Split Cells by cloning them.**
- Once Cell is cloned we change routing information for organizations.
- Organization will encode `cell_id`. When we update `cell_id` it will automatically
- make the given Cell to be authoritative to handle the traffic for the given organization.
+ Once a Cell is cloned we change the routing information for Organizations.
+ Organizations will encode a `cell_id`.
+ When we update the `cell_id` it will automatically make the given Cell authoritative to handle traffic for the given Organization.
1. **Delete redundant data from previous Cells.**
- Since the organization is now stored on many Cells, once we change `cell_id`
- we will have to remove data from all other Cells based on `organization_id`.
+ Since the Organization is now stored on many Cells, once we change `cell_id` we will have to remove data from all other Cells based on `organization_id`.
## Availability of the feature
@@ -237,11 +214,10 @@ We are following the [Support for Experiment, Beta, and Generally Available feat
Expectations:
-- We can deploy a Cell on staging or another testing environment by using a separate domain (ex. `cell2.staging.gitlab.com`)
- using [Cell deployment](#5-cell-deployment) tooling.
-- User can create organization, group and project, and run some of the [essential workflows](#2-essential-workflows).
+- We can deploy a Cell on staging or another testing environment on a separate domain (for example `cell2.staging.gitlab.com`) using [Cell deployment](#5-cell-deployment) tooling.
+- User can create Organization, Group and Project, and run some of the [essential workflows](#2-essential-workflows).
- It is not expected to be able to run a router to serve all requests under a single domain.
-- We expect data-loss of data stored on additional Cells.
+- We expect loss of data stored on additional Cells.
- We expect to tear down and create many new Cells to validate tooling.
### 2. Beta
@@ -250,7 +226,7 @@ Expectations:
- We can run many Cells under a single domain (ex. `staging.gitlab.com`).
- All features defined in [essential workflows](#2-essential-workflows) are supported.
-- Not all aspects of [Routing layer](#4-routing-layer) are finalized.
+- Not all aspects of the [routing layer](#4-routing-layer) are finalized.
- We expect additional Cells to be stable with minimal data loss.
### 3. GA
@@ -259,119 +235,137 @@ Expectations:
- We can run many Cells under a single domain (for example, `staging.gitlab.com`).
- All features defined in [essential workflows](#2-essential-workflows) are supported.
-- All features of [routing layer](#4-routing-layer) are supported.
-- Most of [additional workflows](#3-additional-workflows) are supported.
-- We don't expect to support any of [migration](#6-migration) aspects.
+- All features of the [routing layer](#4-routing-layer) are supported.
+- Most of the [additional workflows](#3-additional-workflows) are supported.
+- We don't expect to support any of the [migration](#6-migration) aspects.
### 4. Post GA
Expectations:
- We support all [additional workflows](#3-additional-workflows).
-- We can [migrate](#6-migration) existing organizations onto new Cells.
+- We can [migrate](#6-migration) existing Organizations onto new Cells.
## Iteration plan
-The delivered iterations will focus on solving particular steps of a given
-key work stream.
-
-It is expected that initial iterations will rather
-be slow, because they require substantially more
-changes to prepare the codebase for data split.
+The delivered iterations will focus on solving particular steps of a given key work stream.
+It is expected that initial iterations will be rather slow, because they require substantially more changes to prepare the codebase for the data split.
One iteration describes one quarter's worth of work.
-1. [Iteration 1](https://gitlab.com/groups/gitlab-org/-/epics/9667) - FY24Q1
+1. [Iteration 1](https://gitlab.com/groups/gitlab-org/-/epics/9667) - FY24Q1 - Complete
- Data access layer: Initial Admin Area settings are shared across cluster.
- Essential workflows: Allow to share cluster-wide data with database-level data access layer
-1. [Iteration 2](https://gitlab.com/groups/gitlab-org/-/epics/9813) - FY24Q2
+1. [Iteration 2](https://gitlab.com/groups/gitlab-org/-/epics/9813) - FY24Q2 - In progress
- Essential workflows: User accounts are shared across cluster.
- - Essential workflows: User can create group.
+ - Essential workflows: User can create Group.
-1. [Iteration 3](https://gitlab.com/groups/gitlab-org/-/epics/10997) - FY24Q3
+1. [Iteration 3](https://gitlab.com/groups/gitlab-org/-/epics/10997) - FY24Q3 - Planned
- - Essential workflows: User can create project.
- - Essential workflows: User can push to Git repository.
- - Cell deployment: Extend GitLab Dedicated to support GCP
+ - Essential workflows: User can create Project.
- Routing: Technology.
+ - Data access layer: Evaluate the efficiency of database-level access vs. API-oriented access layer
1. [Iteration 4](https://gitlab.com/groups/gitlab-org/-/epics/10998) - FY24Q4
- - Essential workflows: User can run CI pipeline.
+ - Essential workflows: User can push to Git repository.
- Essential workflows: User can create issue, merge request, and merge it after it is green.
- - Data access layer: Evaluate the efficiency of database-level access vs. API-oriented access layer
- Data access layer: Cluster-unique identifiers.
- Routing: Cell discovery.
- Routing: Router endpoints classification.
+ - Cell deployment: Extend GitLab Dedicated to support GCP
1. Iteration 5 - FY25Q1
+ - Essential workflows: User can run CI pipeline.
+ - Essential workflows: Instance-wide settings are shared across cluster.
+ - Essential workflows: User can change profile avatar that is shared in cluster.
+ - Essential workflows: User can create issue, merge request, and merge it after it is green.
+ - Essential workflows: User can manage Group and Project members.
+ - Essential workflows: User can manage instance-wide runners.
+ - Essential workflows: User is part of Organization and can only see information from the Organization.
+ - Routing: GraphQL and other ambiguous endpoints.
+ - Data access layer: Allow to share cluster-wide data with database-level data access layer.
+ - Data access layer: Cluster-wide deletions.
+ - Data access layer: Data access layer.
+ - Data access layer: Database migrations.
+
+1. Iteration 6 - FY25Q2
+ - TBD
+
+1. Iteration 7 - FY25Q3
+ - TBD
+
+1. Iteration 8 - FY25Q4
- TBD
## Technical Proposals
-The Cells architecture do have long lasting implications to data processing, location, scalability and the GitLab architecture.
+The Cells architecture has long-lasting implications for data processing, location, scalability, and the GitLab architecture.
This section links all different technical proposals that are being evaluated.
- [Stateless Router That Uses a Cache to Pick Cell and Is Redirected When Wrong Cell Is Reached](proposal-stateless-router-with-buffering-requests.md)
-
- [Stateless Router That Uses a Cache to Pick Cell and pre-flight `/api/v4/cells/learn`](proposal-stateless-router-with-routes-learning.md)
## Impacted features
The Cells architecture will impact many features requiring some of them to be rewritten, or changed significantly.
-This is the list of known affected features with the proposed solutions.
+Below is a list of known affected features with preliminary proposed solutions.
-- [Cells: Git Access](cells-feature-git-access.md)
-- [Cells: Data Migration](cells-feature-data-migration.md)
-- [Cells: Database Sequences](cells-feature-database-sequences.md)
-- [Cells: GraphQL](cells-feature-graphql.md)
-- [Cells: Organizations](cells-feature-organizations.md)
-- [Cells: Router Endpoints Classification](cells-feature-router-endpoints-classification.md)
-- [Cells: Schema changes (Postgres and Elasticsearch migrations)](cells-feature-schema-changes.md)
+- [Cells: Admin Area](cells-feature-admin-area.md)
- [Cells: Backups](cells-feature-backups.md)
-- [Cells: Global Search](cells-feature-global-search.md)
- [Cells: CI Runners](cells-feature-ci-runners.md)
-- [Cells: Admin Area](cells-feature-admin-area.md)
-- [Cells: Secrets](cells-feature-secrets.md)
- [Cells: Container Registry](cells-feature-container-registry.md)
- [Cells: Contributions: Forks](cells-feature-contributions-forks.md)
-- [Cells: Personal Namespaces](cells-feature-personal-namespaces.md)
-- [Cells: Dashboard: Projects, Todos, Issues, Merge Requests, Activity, ...](cells-feature-dashboard.md)
+- [Cells: Database Sequences](cells-feature-database-sequences.md)
+- [Cells: Data Migration](cells-feature-data-migration.md)
+- [Cells: Explore](cells-feature-explore.md)
+- [Cells: Git Access](cells-feature-git-access.md)
+- [Cells: Global Search](cells-feature-global-search.md)
+- [Cells: GraphQL](cells-feature-graphql.md)
+- [Cells: Organizations](cells-feature-organizations.md)
+- [Cells: Secrets](cells-feature-secrets.md)
- [Cells: Snippets](cells-feature-snippets.md)
-- [Cells: Uploads](cells-feature-uploads.md)
-- [Cells: GitLab Pages](cells-feature-gitlab-pages.md)
+- [Cells: User Profile](cells-feature-user-profile.md)
+- [Cells: Your Work](cells-feature-your-work.md)
+
+### Impacted features: Placeholders
+
+The following impacted features are placeholders that still require work to estimate the impact of Cells and to develop solution proposals.
+
- [Cells: Agent for Kubernetes](cells-feature-agent-for-kubernetes.md)
+- [Cells: GitLab Pages](cells-feature-gitlab-pages.md)
+- [Cells: Personal Access Tokens](cells-feature-personal-access-tokens.md)
+- [Cells: Personal Namespaces](cells-feature-personal-namespaces.md)
+- [Cells: Router Endpoints Classification](cells-feature-router-endpoints-classification.md)
+- [Cells: Schema changes (Postgres and Elasticsearch migrations)](cells-feature-schema-changes.md)
+- [Cells: Uploads](cells-feature-uploads.md)
+- ...
## Frequently Asked Questions
### What's the difference between Cells architecture and GitLab Dedicated?
-The new Cells architecture is meant to scale GitLab.com. And the way to achieve this is by moving
-organizations into cells, but different organizations can still share each other server resources, even
-if the application provides isolation from other organizations. But all of them still operate under the
-existing GitLab SaaS domain name `gitlab.com`. Also, cells still share some common data, like `users`, and
-routing information of groups and projects. For example, no two users can have the same username
-even if they belong to different organizations that exist on different cells.
+The new Cells architecture is meant to scale GitLab.com.
+The way to achieve this is by moving Organizations into Cells, but different Organizations can still share server resources, even if the application provides isolation from other Organizations.
+But all of them still operate under the existing GitLab SaaS domain name `gitlab.com`.
+Also, Cells still share some common data, like `users`, and routing information of Groups and Projects.
+For example, no two users can have the same username even if they belong to different Organizations that exist on different Cells.
-With the aforementioned differences, GitLab Dedicated is still offered at higher costs due to the fact
-that it's provisioned via dedicated server resources for each customer, while Cells use shared resources. Which
-makes GitLab Dedicated more suited for bigger customers, and GitLab Cells more suitable for small to mid size
-companies that are starting on GitLab.com.
+With the aforementioned differences, [GitLab Dedicated](https://about.gitlab.com/dedicated/) is still offered at a higher cost because it's provisioned with dedicated server resources for each customer, while Cells use shared resources.
+This makes GitLab Dedicated better suited to bigger customers, and GitLab Cells better suited to small to mid-size companies that are starting on GitLab.com.
-On the other hand, [GitLab Dedicated](https://about.gitlab.com/dedicated/) is meant to provide completely
-isolated GitLab instance for any organization. Where this instance is running on its own custom domain name, and
-totally isolated from any other GitLab instance, including GitLab SaaS. For example, users on GitLab dedicated
-don't have to have a different and unique username that was already taken on GitLab.com.
+On the other hand, GitLab Dedicated is meant to provide a completely isolated GitLab instance for any Organization.
+This instance runs on its own custom domain name and is completely isolated from any other GitLab instance, including GitLab SaaS.
+For example, users on GitLab Dedicated can use a username even if it is already taken on GitLab.com.
-### Can different cells communicate with each other?
+### Can different Cells communicate with each other?
-Up until iteration 3, cells communicate with each other only via a shared database that contains common
-data. In iteration 4 we are going to evaluate the option of cells calling each other via API to provide more
-isolation and reliability.
+Up until iteration 3, Cells communicate with each other only via a shared database that contains common data.
+In iteration 4 we are going to evaluate the option of Cells calling each other via API to provide more isolation and reliability.
## Decision log
@@ -380,8 +374,7 @@ isolation and reliability.
## Links
- [Internal Pods presentation](https://docs.google.com/presentation/d/1x1uIiN8FR9fhL7pzFh9juHOVcSxEY7d2_q4uiKKGD44/edit#slide=id.ge7acbdc97a_0_155)
-- [Internal link to all diagrams](https://drive.google.com/file/d/13NHzbTrmhUM-z_Bf0RjatUEGw5jWHSLt/view?usp=sharing)
- [Cells Epic](https://gitlab.com/groups/gitlab-org/-/epics/7582)
-- [Database Group investigation](https://about.gitlab.com/handbook/engineering/development/enablement/data_stores/database/doc/root-namespace-sharding.html)
+- [Database group investigation](https://about.gitlab.com/handbook/engineering/development/enablement/data_stores/database/doc/root-namespace-sharding.html)
- [Shopify Pods architecture](https://shopify.engineering/a-pods-architecture-to-allow-shopify-to-scale)
- [Opstrace architecture](https://gitlab.com/gitlab-org/opstrace/opstrace/-/blob/main/docs/architecture/overview.md)
diff --git a/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md b/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md
new file mode 100644
index 00000000000..29b2bd0fd28
--- /dev/null
+++ b/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md
@@ -0,0 +1,132 @@
+---
+status: proposed
+creation-date: "2023-01-25"
+authors: [ "@pedropombeiro", "@vshushlin"]
+coach: "@grzesiek"
+approvers: [ ]
+stage: Verify
+group: Runner
+participating-stages: []
+---
+
+# CI Builds and Runner Fleet metrics database architecture
+
+The CI section envisions new value-added features in GitLab for CI Builds and Runner Fleet focused on observability and automation. However, implementing these features and delivering on the product vision of observability, automation, and AI optimization using the current database architecture in PostgreSQL is very hard because:
+
+- CI-related transactional tables are huge, so any modification to them can increase the load on the database and subsequently cause incidents.
+- PostgreSQL is not optimized for running aggregation queries.
+- We also want to add more information from the build environment, making CI tables even larger.
+- We also need a data model to aggregate data sets for the GitLab CI efficiency machine learning models - the basis of the Runner Fleet AI solution.
+
+We want to create a new flexible database architecture which:
+
+- will support known reporting requirements for CI builds and Runner Fleet.
+- can be used to ingest data from the CI build environment.
+
+We may also use this database architecture to facilitate development of AI features in the future.
+
+Our recent usability research on navigation and other areas suggests that the GitLab UI is overloaded with information and navigational elements.
+This results from trying to add as much information as possible and attempting to place features in the most discoverable places.
+Therefore, while developing these new observability features, we will rely on jobs-to-be-done research and solution validation to ensure that the features deliver the most value.
+
+## Runner Fleet
+
+### Metrics - MVC
+
+#### What is the estimated wait time in queue for an instance runner?
+
+The following customer problems should be solved when addressing this question. Most of them are quotes from our usability research:
+
+**UI**
+
+- "There is no visibility for expected Runner queue wait times."
+- "I got here looking for a view that makes it more obvious if I have a bottleneck on my specific runner."
+
+**Types of metrics**
+
+- "Is it possible to get metrics out of GitLab to check for the runners availability & pipeline wait times?
+ Goal - we need the data to evaluate the data to determine if to scale up the Runner fleet so that there is no waiting times for developer’s pipelines."
+- "What is the estimated time in the Runner queue before a job can start?"
+
+**Interpreting metrics**
+
+- "What metrics for Runner queue performance should I look at and how do I interpret the metrics and take action?"
+- "I want to be able to analyze data on Runner queue performance over time so that I can determine if the reports are from developers are really just rare cases regarding availability."
+
+#### What is the estimated wait time in queue on a group runner?
+
+#### What is the mean estimated wait time in queue for all instance runners?
+
+#### What is the mean estimated wait time in queue for all group runners?
+
+#### Which runners have failures in the past hour?
+
+## Implementation
+
+The current implementation plan is based on a
+[Proof of Concept](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/126863).
+For an up to date status, see [epic 10682](https://gitlab.com/groups/gitlab-org/-/epics/10682).
+
+### Database selection
+
+In FY23, ClickHouse [was selected as the GitLab standard datastore](https://about.gitlab.com/company/team/structure/working-groups/clickhouse-datastore/#context)
+for features with big data and insert-heavy requirements.
+So we have chosen it for our CI analytics as well.
+
+### Scope of data
+
+We're starting with the denormalized version of the `ci_builds` table in the main database,
+which will include fields from some other tables. For example, `ci_runners` and `ci_runner_machines`.
+
+[Immutability is a key constraint in ClickHouse](../../../development/database/clickhouse/index.md#how-it-differs-from-postgresql),
+so we only use `finished` builds.
+
+### Developing behind feature flags
+
+It's hard to fully test data ingestion and query performance in development/staging environments.
+That's why we plan to deliver those features to production behind feature flags and test the performance on real data.
+Feature flags for data ingestion and APIs will be separate.
+
+### Data ingestion
+
+A background worker will push `ci_builds` sorted by `(finished_at, id)` from PostgreSQL to ClickHouse.
+Every time the worker starts, it will find the most recently inserted build and continue from there.
+
+At some point we most likely will need to
+[parallelize this worker because of the number of processed builds](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/126863#note_1494922639).
+
+We will start with most recent builds and will not upload all historical data.
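+
+The following Go sketch illustrates this cursor-based loop. It is only an illustration, not the actual worker (which runs inside the Rails application): it uses generic `database/sql` handles instead of concrete drivers, and the target table name `ci_finished_builds` and its columns are placeholders derived from the description above.
+
+```go
+package main
+
+import (
+	"context"
+	"database/sql"
+	"time"
+)
+
+// syncFinishedBuilds copies one batch of finished builds from PostgreSQL into ClickHouse,
+// resuming after the most recently inserted (finished_at, id) pair.
+func syncFinishedBuilds(ctx context.Context, pg, ch *sql.DB, batchSize int) error {
+	// Find the cursor position in ClickHouse; the zero values mean "start from the beginning".
+	var lastFinishedAt time.Time
+	var lastID int64
+	err := ch.QueryRowContext(ctx,
+		`SELECT coalesce(max(finished_at), toDateTime(0)), coalesce(max(id), 0) FROM ci_finished_builds`,
+	).Scan(&lastFinishedAt, &lastID)
+	if err != nil {
+		return err
+	}
+
+	// Read the next batch from PostgreSQL in (finished_at, id) order.
+	rows, err := pg.QueryContext(ctx,
+		`SELECT id, project_id, runner_id, status, finished_at
+		   FROM ci_builds
+		  WHERE finished_at IS NOT NULL AND (finished_at, id) > ($1, $2)
+		  ORDER BY finished_at, id
+		  LIMIT $3`,
+		lastFinishedAt, lastID, batchSize)
+	if err != nil {
+		return err
+	}
+	defer rows.Close()
+
+	for rows.Next() {
+		var (
+			id, projectID int64
+			runnerID      sql.NullInt64
+			status        string
+			finishedAt    time.Time
+		)
+		if err := rows.Scan(&id, &projectID, &runnerID, &status, &finishedAt); err != nil {
+			return err
+		}
+		// A production worker would batch these inserts; duplicates are tolerated because
+		// the ClickHouse table uses the ReplacingMergeTree engine (see the next section).
+		if _, err := ch.ExecContext(ctx,
+			`INSERT INTO ci_finished_builds (id, project_id, runner_id, status, finished_at) VALUES (?, ?, ?, ?, ?)`,
+			id, projectID, runnerID, status, finishedAt); err != nil {
+			return err
+		}
+	}
+	return rows.Err()
+}
+
+func main() {} // connection setup omitted; pg and ch would come from driver-specific sql.Open calls
+```
+
+Because ClickHouse strongly prefers fewer, larger inserts, the per-row insert above would be replaced by batched writes, and as noted above the worker itself may need to be parallelized.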
+
+### "Raw data", materialized views and queries
+
+Ingested data will go to the "raw data" table in ClickHouse.
+This table will use the `ReplacingMergeTree` engine to deduplicate rows in case the data ingestion mechanism accidentally submits the same batch twice.
+
+Raw data can be used directly to execute queries, but most of the time we will create specialized materialized views
+using the `AggregatingMergeTree` engine.
+This will allow us to read significantly less data when performing queries.
+
+### Limitations and open questions
+
+The topics below require further investigation.
+
+#### Efficient way of querying data for namespaces
+
+We start with the PoC available only for administrators,
+but very soon we will need to implement features on the group level.
+
+We can't just put denormalized "path" in the source table because it can be changed when groups or projects are moved.
+
+The simplest way of solving this is to always filter builds by `project_id`,
+but this may be inefficient and require reading a significant portion of all data because ClickHouse stores data in big batches.
+
+#### Keeping the database schema up to date
+
+Right now we don't have any mechanism equivalent to the migrations we use for PostgreSQL.
+While developing our first features we will maintain the database schema by hand and
+continue developing mechanisms for migrations.
+
+#### Re-uploading data after changing the schema
+
+If we need to modify the database schema, old data may be incomplete.
+In that case we can simply truncate the ClickHouse tables and re-upload (part of) the data.
diff --git a/doc/architecture/blueprints/ci_pipeline_processing/index.md b/doc/architecture/blueprints/ci_pipeline_processing/index.md
new file mode 100644
index 00000000000..a1e3092905c
--- /dev/null
+++ b/doc/architecture/blueprints/ci_pipeline_processing/index.md
@@ -0,0 +1,448 @@
+---
+status: proposed
+creation-date: "2023-05-15"
+authors: [ "@furkanayhan" ]
+coach: "@ayufan"
+approvers: [ "@jreporter", "@cheryl.li" ]
+owning-stage: "~devops::verify"
+participating-stages: []
+---
+
+# Future of CI Pipeline Processing
+
+## Summary
+
+GitLab CI is one of the oldest and most complex features in GitLab.
+Over the years its YAML syntax has considerably grown in size and complexity.
+In order to keep the syntax highly stable over the years, we have primarily been making additive changes
+on top of the existing design and patterns.
+Our user base has grown exponentially over the past years, and with it the need to support
+their use cases and workflow customizations.
+
+While delivering huge value over the years, the various additive changes to the syntax have also caused
+some surprising behaviors in the pipeline processing logic.
+Some keywords accumulated a number of responsibilities, and some ambiguous overlaps were discovered among
+keywords, and subtle differences in behavior were introduced over time.
+The current implementation and YAML syntax also make it challenging to implement new features.
+
+In this design document, we will discuss the problems and propose
+a new architecture for pipeline processing. Most of these problems have been discussed before in the
+["Restructure CI job when keyword"](https://gitlab.com/groups/gitlab-org/-/epics/6788) epic.
+
+## Goals
+
+- We want to make the pipeline processing more understandable, predictable and consistent.
+- We want to unify the behaviors of DAG and STAGE. STAGE can be written as DAG and vice versa.
+- We want to decouple the manual jobs' blocking behavior from the `allow_failure` keyword.
+- We want to clarify the responsibilities of the `when` keyword.
+
+## Non-Goals
+
+We will not discuss how to avoid breaking changes for now.
+
+## Motivation
+
+The list of problems is the main motivation for this design document.
+
+### Problem 1: The responsibility of the `when` keyword
+
+Right now, the [`when`](../../../ci/yaml/index.md#when) keyword has many responsibilities;
+
+> - `on_success` (default): Run the job only when no jobs in earlier stages fail or have `allow_failure: true`.
+> - `on_failure`: Run the job only when at least one job in an earlier stage fails. A job in an earlier stage
+> with `allow_failure: true` is always considered successful.
+> - `never`: Don't run the job regardless of the status of jobs in earlier stages.
+> Can only be used in a [`rules`](../../../ci/yaml/index.md#rules) section or `workflow: rules`.
+> - `always`: Run the job regardless of the status of jobs in earlier stages. Can also be used in `workflow:rules`.
+> - `manual`: Run the job only when [triggered manually](../../../ci/jobs/job_control.md#create-a-job-that-must-be-run-manually).
+> - `delayed`: [Delay the execution of a job](../../../ci/jobs/job_control.md#run-a-job-after-a-delay)
+> for a specified duration.
+
+It answers three questions;
+
+- What's required to run? => `on_success`, `on_failure`, `always`
+- How to run? => `manual`, `delayed`
+- Add to the pipeline? => `never`
+
+As a result, for example, we cannot create a `manual` job with `when: on_failure`.
+This can be useful when a user wants to create a job that is only available on failure, but needs to be manually played.
+For example, publishing failures to a dedicated page or a dedicated external service.
+
+### Problem 2: Abuse of the `allow_failure` keyword
+
+We control the blocker behavior of a manual job with the [`allow_failure`](../../../ci/yaml/index.md#allow_failure) keyword.
+Actually, it has another responsibility: _"determine whether a pipeline should continue running when a job fails"_.
+
+Currently, a [manual job](../../../ci/jobs/job_control.md#create-a-job-that-must-be-run-manually);
+
+- is not a blocker when it has `allow_failure: true` (by default)
+- is a blocker when it has `allow_failure: false`.
+
+As a result, for example; we cannot create a `manual` job that is `allow_failure: false` and not a blocker.
+
+```yaml
+job1:
+ stage: test
+ when: manual
+ allow_failure: true # default
+
+job2:
+ stage: deploy
+```
+
+Currently;
+
+- `job1` is skipped.
+- `job2` runs because `job1` is ignored since it has `allow_failure: true`.
+- When we run/play `job1`;
+ - if it fails, it's marked as "success with warning".
+
+#### `allow_failure` with `rules`
+
+`allow_failure` becomes more confusing when using `rules`.
+
+From [docs](../../../ci/yaml/index.md#when):
+
+> The default behavior of `allow_failure` changes to true with `when: manual`.
+> However, if you use `when: manual` with `rules`, `allow_failure` defaults to `false`.
+
+From [docs](../../../ci/yaml/index.md#allow_failure):
+
+> The default value for `allow_failure` is:
+>
+> - `true` for manual jobs.
+> - `false` for jobs that use `when: manual` inside `rules`.
+> - `false` in all other cases.
+
+For example;
+
+```yaml
+job1:
+ script: ls
+ when: manual
+
+job2:
+ script: ls
+ rules:
+ - if: $ALWAYS_TRUE
+ when: manual
+```
+
+`job1` and `job2` behave differently;
+
+- `job1` is not a blocker because it has `allow_failure: true` by default.
+- `job2` is a blocker because `rules: when: manual` does not return `allow_failure: true` by default.
+
+### Problem 3: Different behaviors in DAG/needs
+
+The main behavioral difference between DAG and STAGE is about the "skipped" and "ignored" states.
+
+**Background information:**
+
+- skipped:
+ - When a job is `when: on_success` and its previous status is failed, it's skipped.
+ - When a job is `when: on_failure` and its previous status is not "failed", it's skipped.
+- ignored:
+ - When a job is `when: manual` with `allow_failure: true`, it's ignored.
+
+**Problem:**
+
+The `skipped` and `ignored` states are considered successful in the STAGE processing but not in the DAG processing.
+
+#### Problem 3.1. Handling of ignored status with manual jobs
+
+**Example 1:**
+
+```yaml
+build:
+ stage: build
+ script: exit 0
+ when: manual
+ allow_failure: true # by default
+
+test:
+ stage: test
+ script: exit 0
+ needs: [build]
+```
+
+- `build` is ignored (skipped) because it's `when: manual` with `allow_failure: true`.
+- `test` is skipped because "ignored" is not a successful state in the DAG processing.
+
+**Example 2:**
+
+```yaml
+build:
+ stage: build
+ script: exit 0
+ when: manual
+ allow_failure: true # by default
+
+test:
+ stage: test
+ script: exit 0
+```
+
+- `build` is ignored (skipped) because it's `when: manual` with `allow_failure: true`.
+- `test` runs and succeeds.
+
+#### Problem 3.2. Handling of skipped status with when: on_failure
+
+**Example 1:**
+
+```yaml
+build_job:
+ stage: build
+ script: exit 1
+
+test_job:
+ stage: test
+ script: exit 0
+
+rollback_job:
+ stage: deploy
+ needs: [build_job, test_job]
+ script: exit 0
+ when: on_failure
+```
+
+- `build_job` runs and fails.
+- `test_job` is skipped.
+- Even though `rollback_job` is `when: on_failure` and there is a failed job, it is skipped because the `needs` list has a "skipped" job.
+
+**Example 2:**
+
+```yaml
+build_job:
+ stage: build
+ script: exit 1
+
+test_job:
+ stage: test
+ script: exit 0
+
+rollback_job:
+ stage: deploy
+ script: exit 0
+ when: on_failure
+```
+
+- `build_job` runs and fails.
+- `test_job` is skipped.
+- `rollback_job` runs because there is a failed job before.
+
+### Problem 4: The skipped and ignored states
+
+Let's assume that we solved problem 3 and the "skipped" and "ignored" states are no longer different in DAG and STAGE.
+How should they behave in general? Are they successful or not? Should "skipped" and "ignored" be different?
+Let's examine some examples;
+
+**Example 4.1. The ignored status with manual jobs**
+
+```yaml
+build:
+ stage: build
+ script: exit 0
+ when: manual
+ allow_failure: true # by default
+
+test:
+ stage: test
+ script: exit 0
+```
+
+- `build` is in the "manual" state but considered as "skipped" (ignored) for the pipeline processing.
+- `test` runs because "skipped" is a successful state.
+
+Alternatively;
+
+```yaml
+build1:
+ stage: build
+ script: exit 0
+ when: manual
+ allow_failure: true # by default
+
+build2:
+ stage: build
+ script: exit 0
+
+test:
+ stage: test
+ script: exit 0
+```
+
+- `build1` is in the "manual" state but considered as "skipped" (ignored) for the pipeline processing.
+- `build2` runs and succeeds.
+- `test` runs because "success" + "skipped" is a successful state.
+
+**Example 4.2. The skipped status with when: on_failure**
+
+```yaml
+build:
+ stage: build
+ script: exit 0
+ when: on_failure
+
+test:
+ stage: test
+ script: exit 0
+```
+
+- `build` is skipped because it's `when: on_failure` and its previous status is not "failed".
+- `test` runs because "skipped" is a successful state.
+
+Alternatively;
+
+```yaml
+build1:
+ stage: build
+ script: exit 0
+ when: on_failure
+
+build2:
+ stage: build
+ script: exit 0
+
+test:
+ stage: test
+ script: exit 0
+```
+
+- `build1` is skipped because it's `when: on_failure` and its previous status is not "failed".
+- `build2` runs and succeeds.
+- `test` runs because "success" + "skipped" is a successful state.
+
+### Problem 5: The `dependencies` keyword
+
+The [`dependencies`](../../../ci/yaml/index.md#dependencies) keyword is used to define a list of jobs to fetch
+[artifacts](../../../ci/yaml/index.md#artifacts) from. This responsibility is shared with the `needs` keyword.
+Moreover, they can be used together in the same job. We may not need to discuss all possible scenarios, but this example
+is enough to show the confusion;
+
+```yaml
+test2:
+ script: exit 0
+ dependencies: [test1]
+ needs:
+ - job: test1
+ artifacts: false
+```
+
+### Information 1: Canceled jobs
+
+Are a canceled job and a failed job the same? They have many differences so we could easily say "no".
+However, they have one similarity; they can be "allowed to fail".
+
+Let's define their differences first;
+
+- A canceled job;
+ - It is not a finished job.
+ - Canceled is a user-requested interruption of the job. The intent is to abort the job or stop pipeline processing as soon as possible.
+ - We don't know the result, there are no artifacts, etc.
+ - Since it's never run, the `after_script` is not run.
+ - Its eventual state is "canceled" so no job can run after it.
+ - There is no `when: on_canceled`.
+ - Even `when: always` is not run.
+- A failed job;
+ - It is a machine response of the CI system to executing the job content. It indicates that execution failed for some reason.
+ - It is the system's answer on equal footing with success. The fact that something failed is relative,
+ and might be the desired outcome of a CI execution, for example when executing tests where some failures are expected.
+ - We know the result and [there can be artifacts](../../../ci/yaml/index.md#artifactswhen).
+ - `after_script` is run.
+ - Its eventual state is "failed" so subsequent jobs can run depending on their `when` values.
+ - `when: on_failure` and `when: always` are run.
+
+**The one similarity is; they can be "allowed to fail".**
+
+```yaml
+build:
+ stage: build
+ script: sleep 10
+ allow_failure: true
+
+test:
+ stage: test
+ script: exit 0
+ when: on_success
+```
+
+- If `build` runs and gets `canceled`, then `test` runs.
+- If `build` runs and gets `failed`, then `test` runs.
+
+#### An idea on using `canceled` instead of `failed` for some cases
+
+There is another aspect. We often drop jobs with a `failure_reason` before they get executed,
+for example when the namespace ran out of Compute Credits (CI minutes) or when limits are exceeded.
+Dropping jobs in the `failed` state has been handy because we could communicate to the user the `failure_reason`
+for better feedback. When canceling jobs for various reasons we don't have a way to indicate that.
+We cancel jobs because the user ran out of Compute Credits while the pipeline was running,
+or because the pipeline is auto-canceled by another pipeline or other reasons.
+If we had a `stop_reason` instead of `failure_reason`, we could use that for both canceled and failed jobs
+and we could also use the `canceled` status more appropriately.
+
+### Information 2: Empty state
+
+We [recently updated](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/117856) the documentation of
+[the `when` keyword](../../../ci/yaml/index.md#when) for clarification;
+
+> - `on_success`: Run the job only when no jobs in earlier stages fail or have `allow_failure: true`.
+> - `on_failure`: Run the job only when at least one job in an earlier stage fails.
+
+For example;
+
+```yaml
+test1:
+ when: on_success
+ script: exit 0
+ # needs: [] would lead to the same result
+
+test2:
+ when: on_failure
+ script: exit 0
+ # needs: [] would lead to the same result
+```
+
+- `test1` runs because there is no job failed in the previous stages.
+- `test2` does not run because there is no job failed in the previous stages.
+
+`on_success` means that "nothing failed"; it does not mean that everything succeeded.
+The same goes for `on_failure`: it does not mean that everything failed, but it does mean that "something failed".
+These semantics are based on the expectation that your pipeline succeeds, which is the happy path,
+not that your pipeline fails, because a failing pipeline requires user intervention to fix it.
+
+## Technical expectations
+
+All proposals or future decisions must follow these goals;
+
+1. The `allow_failure` keyword must only be responsible for marking **failed** jobs as "success with warning".
+ - Why: It should not have another responsibility, such as determining whether a manual job is a blocker or not.
+ - How: Another keyword will be introduced to control the blocker behavior of a manual job.
+1. With `allow_failure`, **canceled** jobs must not be marked as "success with warning".
+ - Why: "canceled" is a different state than "failed".
+ - How: Canceled with `allow_failure: true` jobs will not be marked as "success with warning".
+1. The `when` keyword must only answer the question "What's required to run?". And it must be the only source of truth
+ for deciding if a job should run or not.
+1. The `when` keyword must not control if a job is added to the pipeline or not.
+ - Why: It is not its responsibility.
+ - How: Another keyword will be introduced to control if a job is added to the pipeline or not.
+1. The "skipped" and "ignored" states must be reconsidered.
+ - TODO: We need to discuss this more.
+1. A new keyword structure must be introduced to specify if a job is an "automatic", "manual", or "delayed" job.
+ - Why: It is not the responsibility of the `when` keyword.
+ - How: A new keyword will be introduced to control the behavior of a job.
+1. The `needs` keyword must only control the order of the jobs. It must not be used to control the behavior of the jobs
+ or to decide if a job should run or not. The DAG and STAGE behaviors must be the same.
+ - Why: It leads to different behaviors and confuses users.
+ - How: The `needs` keyword will only define previous jobs, like stage does.
+1. The `needs` and `dependencies` keywords must not be used together in the same job.
+ - Why: It is confusing.
+ - How: The `needs` and `dependencies` keywords will be mutually exclusive.
+
+## Proposal
+
+N/A
+
+## Design and implementation details
+
+N/A
diff --git a/doc/architecture/blueprints/container_registry_metadata_database/index.md b/doc/architecture/blueprints/container_registry_metadata_database/index.md
index a538910f553..243270afdb2 100644
--- a/doc/architecture/blueprints/container_registry_metadata_database/index.md
+++ b/doc/architecture/blueprints/container_registry_metadata_database/index.md
@@ -174,7 +174,7 @@ The diagram below illustrates the architecture of the database cluster:
[Rate](https://gitlab.com/gitlab-org/container-registry/-/issues/94) and [size](https://gitlab.com/gitlab-org/container-registry/-/issues/61#note_446609886) requirements for the GitLab.com database were extrapolated based on the `dev.gitlab.org` registry and are available in the linked issues.
-#### Self-Managed Instances
+#### Self-managed instances
By default, for self-managed instances, the registry will have a separate logical database in the same PostgreSQL instance/cluster as the GitLab database. However, it will be possible to configure the registry to use a separate instance/cluster if needed.
diff --git a/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md b/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md
index a73f6335218..0987b317af8 100644
--- a/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md
+++ b/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md
@@ -135,7 +135,7 @@ drivers, we could have the importer retry more time and for more errors. There's
a risk we would retry several times on non-retryable errors, but since no writes
are being made to object storage, this should not ultimately be harmful.
-Additionally, implementing [Validate Self-Managed Imports](https://gitlab.com/gitlab-org/container-registry/-/issues/938)
+Additionally, implementing [Validate self-managed imports](https://gitlab.com/gitlab-org/container-registry/-/issues/938)
would perform a consistency check against a sample of images before and after
import which would lead to greater consistency across all storage driver implementations.
diff --git a/doc/architecture/blueprints/git_data_offloading/index.md b/doc/architecture/blueprints/git_data_offloading/index.md
new file mode 100644
index 00000000000..44df2c9a09d
--- /dev/null
+++ b/doc/architecture/blueprints/git_data_offloading/index.md
@@ -0,0 +1,221 @@
+---
+status: proposed
+creation-date: "2023-05-19"
+authors: [ "@jcai-gitlab", "@toon" ]
+coach: [ ]
+approvers: [ ]
+owning-stage: "~devops::systems"
+---
+
+# Offload data to cheaper storage
+
+## Summary
+
+Managing Git data storage costs is a critical part of our business. Offloading
+Git data to cheaper storage can save on storage cost.
+
+## Motivation
+
+At GitLab, we keep most Git data stored on SSDs to keep data access fast. This
+makes sense for data that we frequently need to access. However, given that
+storage costs scale with data growth, we can be a lot smarter about what kinds
+of data we keep on SSDs and what kinds of data we can afford to offload to
+cheaper storage.
+
+For example, large files (or in Git nomenclature, "blobs") are not frequently
+modified since they are usually non-text files (images, videos, binaries, etc).
+Often, [Git LFS](https://git-lfs.com/) is used for repositories that contain
+these large blobs in order to avoid having to push a large file onto the Git
+server. However, this relies on client-side setup.
+
+Or, if a project is "stale" and hasn't been accessed in a long time, there is no
+need to keep paying for fast storage for that project.
+
+Instead, we can choose to put **all** blobs of that stale project onto cheaper
+storage. This way, the application still has access to the commit history and
+trees so the project browsing experience is not affected, but all files are on
+slower storage since they are rarely accessed.
+
+If we had a way to separate Git data into different categories, we could then
+offload certain data to a secondary location that is cheaper. For example, we
+could separate large files that may not be accessed as frequently from the rest
+of the Git data and save it to an HDD rather than an SSD mount.
+
+## Requirements
+
+There is a set of requirements and invariants that must hold for any
+particular solution.
+
+### Saves storage cost
+
+Ultimately, this solution needs to save on storage cost. Separating out certain
+Git data for cheaper storage can contribute to these savings.
+
+We need to evaluate the solution's added cost against the projected savings from
+offloading data to cheaper storage. Here are some criteria to consider:
+
+- How much money would we save if all large blobs larger than X were put on HDD?
+- How much money would we save if all stale projects had their blobs on HDD?
+- What's the operational overhead for running the offloading mechanism in terms
+ of additional CPU/Memory cost?
+- What's the network overhead? e.g. is there an extra roundtrip to a different
+ node via the network to retrieve large blobs.
+- Access cost, e.g. when blobs would be stored in an object store.
+
+### Opaque to downstream consumers of Gitaly
+
+This feature is purely a storage optimization and, except for a potential
+performance slowdown, shouldn't affect downstream consumers of Gitaly. For
+example, the GitLab application should not have to change any of its logic in
+order to support this feature.
+
+This feature should be completely invisible to any callers of Gitaly. Rails or
+any consumer should not need to know about this or manage it in any way.
+
+### Operationally simple
+
+When working with Git data, we want to keep things as simple as possible to
+minimize risk of repository corruption. Keep things operationally simple and
+keep moving pieces outside of Git itself to a minimum. Any logic that modifies
+repository data should be upstreamed in Git itself.
+
+## Proposal
+
+We will maintain a separate object database for each repository connected
+through the [Git alternates mechanism](https://git-scm.com/docs/gitrepository-layout#Documentation/gitrepository-layout.txt-objects).
+We can choose to filter out certain Git objects for this secondary object
+database (ODB).
+
+Place Git data into this secondary ODB based on a filter. We have
+options based on [filters in Git](https://git-scm.com/docs/git-rev-list#Documentation/git-rev-list.txt---filterltfilter-specgt).
+
+We can choose to place large blobs based on some limit into a secondary ODB, or
+we can choose to place all blobs onto the secondary ODB.
+
+## Design and implementation details
+
+### Git
+
+We need to add a feature to `git-repack(1)` that will allow us to segment
+different kinds of blobs into different object databases. We're tracking this
+effort in [this issue](https://gitlab.com/gitlab-org/git/-/issues/159).
+
+### Gitaly
+
+During Gitaly housekeeping, we can do the following:
+
+1. Use `git-repack(1)` to write packfiles into both the main repository's object
+ database, and a secondary object database. Each repository has its own
+ secondary object database for offloading blobs based on some criteria.
+1. Ensure the `.git/objects/info/alternates` file points to the secondary
+ object database from step (1).
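+
+As a minimal sketch of step 2 only (not Gitaly's actual housekeeping code), the Go snippet below writes the alternates file so that object lookups in the repository fall through to a secondary object database. The paths and the function name are hypothetical.
+
+```go
+package main
+
+import (
+	"fmt"
+	"os"
+	"path/filepath"
+)
+
+// linkSecondaryODB points a bare repository at a secondary object database by writing
+// that database's objects directory into <repo>/objects/info/alternates. Git then
+// searches this directory whenever an object is not found in the repository itself.
+func linkSecondaryODB(repoPath, secondaryODBPath string) error {
+	infoDir := filepath.Join(repoPath, "objects", "info")
+	if err := os.MkdirAll(infoDir, 0o755); err != nil {
+		return err
+	}
+	// The alternates file lists one object directory per line. This sketch overwrites
+	// any existing entries, such as a link to an object pool; a real implementation
+	// would preserve them.
+	line := filepath.Join(secondaryODBPath, "objects") + "\n"
+	return os.WriteFile(filepath.Join(infoDir, "alternates"), []byte(line), 0o644)
+}
+
+func main() {
+	if err := linkSecondaryODB("/var/opt/gitlab/git-data/repo.git", "/mnt/hdd/offload/repo.git"); err != nil {
+		fmt.Println("failed to link secondary ODB:", err)
+	}
+}
+```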
+
+### Criteria
+
+Whether objects are offloaded to another object database can be determined based
+on one or more of the following criteria.
+
+#### By tier
+
+Free projects might have many blobs offloaded to cheaper storage, while Ultimate
+projects have all their objects placed on the fastest storage.
+
+#### By history
+
+If a blob was added a long time ago and is not referred to by any recent commit, it
+can be offloaded, while new blobs remain on the main ODB.
+
+#### By size
+
+Large blobs are a quick win to reduce the expensive storage size, so they might
+get prioritized to move to cheaper storage.
+
+#### Frequency of access
+
+Frequently used projects might remain fully on fast storage, while inactive
+projects might have their blobs offloaded.
+
+### Open questions
+
+#### How do we delete objects?
+
+When we want to delete an unreachable object, the repack would need to be aware
+of both ODBs and be able to evict unreachable objects regardless of whether
+the objects are in the main ODB or in the secondary ODB. This picture is
+complicated if the main ODB also has an [object pool](https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/object_pools.md)
+ODB, since we wouldn't ever want to delete an object from the pool ODB.
+
+#### Potential solution: Modify Git to delete an object from alternates
+
+We would need to modify repack to give it the ability to delete unreachable
+objects in alternate ODBs. We could add repack configs `repack.alternates.*`
+that tell it how to behave with alternate directories. For example, we could
+have `repack.alternates.explodeUnreachable`, which indicates to repack that it
+should behave like `-A` in any alternate ODB it is linked to.
+
+#### How does this work with object pools?
+
+When we use alternates, how does this interact with object pools? Does the
+object pool also offload data to secondary storage? Do the object pool members?
+In the most complex case this means that a single repository has four different
+object databases, which may increase complexity.
+
+Possibly we can mark some packfiles as "keep", using the
+[--keep-pack](https://git-scm.com/docs/git-pack-objects#Documentation/git-pack-objects.txt---keep-packltpack-namegt)
+and
+[--honor-pack-keep](https://git-scm.com/docs/git-pack-objects#Documentation/git-pack-objects.txt---honor-pack-keep)
+options.
+
+#### Potential Solution: Do not allow object pools to offload their blobs
+
+For the sake of not adding too much complexity, we could decide that object
+pools will not offload their blobs. Instead, we can design housekeeping to
+offload blobs from the repository before deduplicating with the object pool.
+Theoretically, this means that offloaded blobs will not end up in the object
+pool.
+
+#### How will this work with Raft + WAL?
+
+How will this mechanism interact with Raft and the write-ahead log?
+
+The WAL uses hard links and copy-free moves to avoid slow copy operations, but
+that does not work across different file systems. At some point repacks and
+similar operations will likely also go through the log. Transferring data
+between file systems can lead to delays in transaction processing.
+
+Ideally we keep the use of an alternate internal to the node and not have to
+leak this complexity to the rest of the cluster. This is a challenge, given we
+have to consider available space when making placement decisions. It's possible
+to keep this internal by only showing the lower capacity of the two storages,
+but that could also lead to inefficient storage use.
+
+## Problems with the design
+
+### Added complexity
+
+Adding another object database to the mix adds complexity to the system,
+especially with repository replication, since we are adding yet another place
+to replicate data to.
+
+### Possible change in cost over time
+
+The cost of the different storage types might change over time. To anticipate
+this, it should be easy to adapt to such changes.
+
+### More points of failure
+
+Having some blobs on a separate storage device adds one more failure scenario
+where the device hosting the large blobs may fail.
+
+## Alternative Solutions
+
+### Placing entire projects onto cheaper storage
+
+Instead of placing Git data onto cheaper storage, the Rails application could
+choose to move a project in its entirety to a mounted HDD.
+
+#### Possible optimization
+
+Giving the machines with cheaper storage extra RAM might help to deal with the
+slow read/write speeds, because of the page cache. It's not certain, though,
+that this will turn out to be cheaper overall.
diff --git a/doc/architecture/blueprints/gitaly_adaptive_concurrency_limit/adaptive_concurrency_limit_flow.png b/doc/architecture/blueprints/gitaly_adaptive_concurrency_limit/adaptive_concurrency_limit_flow.png
new file mode 100644
index 00000000000..0475a32e933
--- /dev/null
+++ b/doc/architecture/blueprints/gitaly_adaptive_concurrency_limit/adaptive_concurrency_limit_flow.png
Binary files differ
diff --git a/doc/architecture/blueprints/gitaly_adaptive_concurrency_limit/index.md b/doc/architecture/blueprints/gitaly_adaptive_concurrency_limit/index.md
new file mode 100644
index 00000000000..89606cdc8fa
--- /dev/null
+++ b/doc/architecture/blueprints/gitaly_adaptive_concurrency_limit/index.md
@@ -0,0 +1,372 @@
+---
+status: proposed
+creation-date: "2023-05-30"
+authors: [ "@qmnguyen0711" ]
+approvers: [ ]
+owning-stage: "~devops::enablement"
+---
+
+# Gitaly Adaptive Concurrency Limit
+
+## Summary
+
+Gitaly, a Git server, needs to push back on its clients to reduce the risk of
+incidents. In the past, we introduced a per-RPC concurrency limit and a pack-objects
+concurrency limit. Both systems were successful, but the configurations were
+static, leading to some serious drawbacks. This blueprint proposes an adaptive
+concurrency limit system to overcome the drawbacks of static limits. The
+algorithm primarily uses the [Additive Increase/Multiplicative Decrease](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease)
+approach to gradually increase the limit during normal processing but quickly
+reduce it during an incident. The algorithm focuses on lack of resources and
+serious latency degradation as criteria for determining when Gitaly is in
+trouble.
+
+## Motivation
+
+To reduce the risk of incidents and protect itself, Gitaly should be able to
+push back on its clients when it determines some limits have been reached. In
+the [prior attempt](https://gitlab.com/groups/gitlab-org/-/epics/7891), we laid
+out some foundations for [backpressure](https://gitlab.com/gitlab-org/gitaly/-/blob/382d1e57b2cf02763d3d65e31ff4d38f467b797c/doc/backpressure.md)
+by introducing two systems: per-RPC concurrency limits and pack-objects
+concurrency limits.
+
+Per-RPC concurrency limits allow us to configure a maximum number of in-flight
+requests that are handled simultaneously. The limit is scoped by RPC and
+repository. The pack-objects concurrency limit restricts concurrent Git data
+transfer requests by IP, and is applied on cache misses only. If a limit is
+exceeded, the request is either put in a queue, or rejected if the queue is
+full. If the request remains in the queue for too long, it is also rejected.
+
+Although both of them yielded promising results on GitLab.com, the
+configurations, especially the value of the concurrency limit, are static. There
+are some drawbacks to this:
+
+- It's tedious to maintain a sane value for the concurrency limit. Looking at
+this [production configuration](https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/blob/db11ef95859e42d656bb116c817402635e946a32/roles/gprd-base-stor-gitaly-common.json),
+each limit is heavily calibrated based on clues from different sources. When the
+overall situation changes, we need to tweak them again.
+- Static limits are not good for all usage patterns. It's not feasible to pick a
+one-size-fits-all value. If the limit is too low, big users are affected. If the
+value is too loose, the protection effect is lost.
+- A request may be rejected even though the server is idle, because the request
+rate is not necessarily an indicator of the load induced on the server.
+
+To overcome all of those drawbacks while keeping the benefits of concurrency
+limiting, one promising solution is to make the concurrency limit adaptive to
+the currently available processing capacity of the node. We call this proposed
+new mode "Adaptive Concurrency Limit".
+
+## Goals
+
+- Make Gitaly smarter in pushing back on traffic when it's under heavy load, thus enhancing the reliability and resiliency of Gitaly.
+- Minimize the occurrences of Gitaly saturation incidents.
+- Decrease the possibility of clients hitting the concurrency limit while the server still has capacity, thereby reducing the `ResourceExhausted` error rate.
+- Facilitate seamless or fully automated calibration of the concurrency limit.
+
+## Non-goals
+
+- Increase the workload or complexity of the system for users or administrators. The adaptiveness proposed here aims for the opposite.
+
+## Proposal
+
+The proposed Adaptive Concurrency Limit algorithm primarily uses the Additive
+Increase/Multiplicative Decrease ([AIMD](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease))
+approach. This method involves gradually increasing the limit during normal
+process functioning but quickly reducing it when an issue (backoff event)
+occurs. There are various criteria for determining whether Gitaly is in trouble.
+In this proposal, we focus on two things:
+
+- Lack of resources, particularly memory and CPU, which are essential for
+handling Git processes.
+- Serious latency degradation.
+
+The proposed solution is heavily inspired by many materials about this subject
+shared by folks from other companies in the industry, especially the following:
+
+- TCP Congestion Control ([RFC-2581](https://www.rfc-editor.org/rfc/rfc2581), [RFC-5681](https://www.rfc-editor.org/rfc/rfc5681),
+[RFC-9293](https://www.rfc-editor.org/rfc/rfc9293.html#name-tcp-congestion-control), [Computer Networks: A Systems Approach](https://book.systemsapproach.org/congestion/tcpcc.html)).
+- Netflix adaptive concurrency limit ([blog post](https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581)
+and [implementation](https://github.com/Netflix/concurrency-limits))
+- Envoy Adaptive Concurrency
+([doc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/adaptive_concurrency_filter#config-http-filters-adaptive-concurrency))
+
+We cannot blindly apply a solution without careful consideration and expect it
+to function flawlessly. The suggested approach considers Gitaly's specific
+constraints and distinguishing features, including cgroup utilization and
+upload-pack RPC, among others.
+
+The proposed solution does not aim to replace the existing
+[Gitaly concurrency limit](https://gitlab.com/gitlab-org/gitaly/-/blob/382d1e57b2cf02763d3d65e31ff4d38f467b797c/doc/backpressure.md),
+but to automatically tweak its parameters. This means
+that other aspects, such as queuing, in-queue timeout, queue length,
+partitioning, and scoping, will remain unchanged. The proposed solution only
+focuses on modifying the current **value** of the concurrency limit.
+
+## Design and implementation details
+
+### AIMD Algorithm
+
+The Adaptive Concurrency Limit algorithm primarily uses the Additive
+Increase/Multiplicative Decrease ([AIMD](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease))
+approach. This method involves gradually increasing the limit during normal
+process functioning but quickly reducing it when an issue occurs.
+
+During initialization, we configure the following parameters:
+
+- `initialLimit`: Concurrency limit to start with. This value is essentially
+equal to the current static concurrency limit.
+- `maxLimit`: Maximum concurrency limit.
+- `minLimit`: Minimum concurrency limit for the process to be considered
+functioning. If it's equal to 0, Gitaly rejects all upcoming requests.
+- `backoffFactor`: How fast the limit decreases when a backoff event occurs
+(`0 < backoffFactor < 1`, defaults to `0.75`).
+
+When the Gitaly process starts, it sets `limit = initialLimit`, where `limit`
+is the maximum number of in-flight requests allowed at a time.
+
+Periodically, for example once every 15 seconds, the value of the `limit` is
+re-calibrated, as sketched after this list:
+
+- `limit = limit + 1` if there is no backoff event since the last
+calibration. The new limit cannot exceed `maxLimit`.
+- `limit = limit * backoffFactor` otherwise. The new limit cannot be lower than
+`minLimit`.
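+
+The following Go sketch illustrates this re-calibration loop. The names mirror
+the parameters above and the 15-second interval is the example mentioned
+earlier; this is illustrative only, not actual Gitaly code.
+
+```go
+// Package adaptivelimit sketches the AIMD re-calibration described above.
+package adaptivelimit
+
+import (
+	"sync"
+	"time"
+)
+
+type AdaptiveLimit struct {
+	mu            sync.Mutex
+	limit         int
+	minLimit      int
+	maxLimit      int
+	backoffFactor float64 // 0 < backoffFactor < 1, for example 0.75
+	backoff       bool    // set when a backoff event occurred since the last calibration
+}
+
+func New(initial, min, max int, factor float64) *AdaptiveLimit {
+	return &AdaptiveLimit{limit: initial, minLimit: min, maxLimit: max, backoffFactor: factor}
+}
+
+// RecordBackoff is called whenever a backoff event (resource or latency signal) fires.
+func (l *AdaptiveLimit) RecordBackoff() {
+	l.mu.Lock()
+	defer l.mu.Unlock()
+	l.backoff = true
+}
+
+// Calibrate runs once per period: additive increase when healthy,
+// multiplicative decrease after a backoff event, clamped to [minLimit, maxLimit].
+func (l *AdaptiveLimit) Calibrate() {
+	l.mu.Lock()
+	defer l.mu.Unlock()
+	if l.backoff {
+		l.limit = int(float64(l.limit) * l.backoffFactor)
+		if l.limit < l.minLimit {
+			l.limit = l.minLimit
+		}
+	} else if l.limit < l.maxLimit {
+		l.limit++
+	}
+	l.backoff = false
+}
+
+// Run re-calibrates periodically, for example every 15 seconds.
+func (l *AdaptiveLimit) Run(done <-chan struct{}) {
+	ticker := time.NewTicker(15 * time.Second)
+	defer ticker.Stop()
+	for {
+		select {
+		case <-ticker.C:
+			l.Calibrate()
+		case <-done:
+			return
+		}
+	}
+}
+```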
+
+When a process can no longer handle requests, or will not be able to handle
+them soon, this is referred to as a backoff event. Ideally, we would like to
+stay in the efficient state, where Gitaly runs at its maximum capacity, for as
+long as possible.
+
+![Adaptive Concurrency Limit Flow](adaptive_concurrency_limit_flow.png)
+
+Ideally, min/max values are safeguards that aren't ever meant to be hit during
+operation, even during overload. In fact, hitting either probably means that something
+is wrong and the dynamic algorithms aren't working well enough.
+
+### How requests are handled
+
+The concurrency limit restricts the total number of in-flight requests (IFR) at
+a time.
+
+- When `IFR < limit`, Gitaly handles new requests without waiting. After the
+limit is increased, Gitaly immediately handles the next request in the queue,
+if any.
+- When `IFR = limit`, the limit is reached. Subsequent requests are queued,
+waiting for their turn. If the queue length reaches a configured limit, Gitaly
+rejects new requests immediately. When a request stays in the queue for too
+long, it is also automatically dropped by Gitaly.
+- When `IFR > limit`, this is typically a consequence of a backoff event: Gitaly
+is handling more requests than the newly appointed limit allows. In addition to
+queueing upcoming requests as in the previous case, Gitaly may start
+load-shedding in-flight requests if this situation is not resolved soon enough.
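+
+A rough Go sketch of this admission flow is shown below. The queue length,
+in-queue timeout, and error mapping are simplified placeholders, and a
+fixed-capacity channel stands in for the limit (a real implementation would
+track the adaptive limit, which changes over time).
+
+```go
+package adaptivelimit
+
+import (
+	"context"
+	"errors"
+	"time"
+)
+
+// ErrResourceExhausted stands in for the gRPC ResourceExhausted response code.
+var ErrResourceExhausted = errors.New("resource exhausted")
+
+type admitter struct {
+	slots chan struct{} // capacity == concurrency limit (simplified as fixed here)
+	queue chan struct{} // capacity == configured queue length
+}
+
+// Admit implements the three cases above: handle immediately while under the
+// limit, queue when the limit is reached, and reject when the queue is full or
+// the in-queue wait exceeds its timeout.
+func (a *admitter) Admit(ctx context.Context, inQueueTimeout time.Duration) (release func(), err error) {
+	select {
+	case a.queue <- struct{}{}: // take a queue ticket
+	default:
+		return nil, ErrResourceExhausted // queue full: reject immediately
+	}
+	defer func() { <-a.queue }() // leave the queue whether admitted or not
+
+	wait, cancel := context.WithTimeout(ctx, inQueueTimeout)
+	defer cancel()
+	select {
+	case a.slots <- struct{}{}: // a slot is free: start handling the request
+		return func() { <-a.slots }, nil
+	case <-wait.Done(): // waited in the queue for too long: drop the request
+		return nil, ErrResourceExhausted
+	}
+}
+```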
+
+At several points in time we have discussed whether we want to change queueing
+semantics. Right now we admit queued processes from the head of the queue
+(FIFO), whereas it was proposed several times that it might be preferable to
+admit processes from the back (LIFO).
+
+Regardless of the rejection reason, the client receives a `ResourceExhausted`
+response code as a signal that it should back off and retry later. Since most
+direct clients of Gitaly are internal, especially GitLab Shell and Workhorse,
+the actual users receive some friendly messages. Gitaly can attach
+[exponential pushback headers](https://gitlab.com/gitlab-org/gitaly/-/issues/5023)
+to force internal clients to back off. However, that's a bit brutal and may lead
+to unexpected results. We can consider that later.
+
+### Backoff events
+
+Each system has its own set of signals, and in the case of Gitaly, there are two
+aspects to consider:
+
+- Lack of resources, particularly memory and CPU, which are essential for
+handling Git processes like `git-pack-objects(1)`. When these resources are limited
+or depleted, it doesn't make sense for Gitaly to accept more requests. Doing so
+would worsen the saturation, and Gitaly addresses this issue by applying cgroups
+extensively. The following section outlines how accounting can be carried out
+using cgroup.
+- Serious latency degradation. Gitaly offers various RPCs for different
+purposes besides serving Git data, which makes it hard to reason about
+latencies. A significant overall latency degradation is an indication that
+Gitaly should not accept more requests. Another section below describes how to
+assess latency degradation reasonably.
+
+Apart from the above signals, we can consider adding more signals in the future
+to make the system smarter. Some examples are Go garbage collector statistics,
+networking stats, file descriptors, etc. Some companies have clever tricks, such
+as [using time drifting to estimate CPU saturation](https://engineering.linkedin.com/blog/2022/hodor--detecting-and-addressing-overload-in-linkedin-microservic).
+
+#### Backoff events of Upload Pack RPCs
+
+Upload Pack RPCs and their sibling PackObjects RPC are unique to Gitaly. They
+are for the heaviest operations: transferring large volumes of Git data. Each
+operation may take minutes or even hours to finish. The time span of each
+operation depends on multiple factors, most notably the number of requested
+objects and the internet speed of clients.
+
+Thus, latency is a poor signal for determining the backoff event. This type of
+RPC should only depend on resource accounting at this stage.
+
+#### Backoff events of other RPCs
+
+As stated above, Gitaly serves various RPCs for different purposes. They can
+also vary in terms of acceptable latency as well as when to recognize latency
+degradation. Fortunately, the current RPC concurrency limits implementation
+scopes the configuration by RPC and repository individually. The latency signal
+makes sense in this setting.
+
+Apart from latency, resource usage also plays an important role. Hence, other
+RPCs should use both latency measurement and resource accounting signals.
+
+### Resource accounting with cgroup
+
+The issue with saturation is typically not caused by Gitaly itself, but rather by the
+spawned Git processes that handle most of the work. These processes are contained
+within a [cgroup](https://gitlab.com/gitlab-org/gitaly/-/blob/382d1e57b2cf02763d3d65e31ff4d38f467b797c/doc/cgroups.md),
+and the algorithm for bucketing cgroups can be
+found [here](https://gitlab.com/gitlab-org/gitaly/-/blob/382d1e57b2cf02763d3d65e31ff4d38f467b797c/internal/cgroups/v1_linux.go#L166-166).
+Typically, Gitaly selects the appropriate cgroup for a request based on the
+target repository. There is also a parent cgroup to which all repository-level
+cgroups belong.
+
+Cgroup statistics are widely accessible. Gitaly can trivially fetch both
+resource capacity and current resource consumption via the following
+[cgroup control files](https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt):
+
+- `memory.limit_in_bytes`
+- `memory.usage_in_bytes`
+- `cpu.cfs_period_us`
+- `cpu.cfs_quota_us`
+- `cpuacct.usage`
+
+Fetching those statistics may imply some overhead. It's not necessary to keep
+them updated in real time, so they can be refreshed periodically in the limit
+adjustment cycle.
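+
+As an illustration, the Go sketch below samples two of the memory control files
+listed above for a single cgroup. The paths are illustrative for a cgroup v1
+hierarchy; CPU utilization would be derived similarly from `cpuacct.usage`
+deltas against the configured quota (`cpu.cfs_quota_us` and `cpu.cfs_period_us`).
+
+```go
+package adaptivelimit
+
+import (
+	"os"
+	"path/filepath"
+	"strconv"
+	"strings"
+)
+
+func readInt(dir, file string) (int64, error) {
+	b, err := os.ReadFile(filepath.Join(dir, file))
+	if err != nil {
+		return 0, err
+	}
+	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
+}
+
+// memoryUtilization returns usage/limit for one memory cgroup, for example the
+// parent Gitaly cgroup or a repository-level cgroup.
+func memoryUtilization(memoryCgroupDir string) (float64, error) {
+	limit, err := readInt(memoryCgroupDir, "memory.limit_in_bytes")
+	if err != nil {
+		return 0, err
+	}
+	usage, err := readInt(memoryCgroupDir, "memory.usage_in_bytes")
+	if err != nil {
+		return 0, err
+	}
+	return float64(usage) / float64(limit), nil
+}
+```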
+
+In the past, cgroup has been reliable in preventing spawned processes from
+exceeding their limits. It is generally safe to trust cgroup and allow processes
+to run without interference. However, when the limits set by cgroup are reached
+(at 100%), overloading can occur. This often leads to a range of issues such as
+an increase in page faults, slow system calls, memory allocation problems, and
+even out-of-memory kills. The consequences of such incidents are
+highlighted in
+[this example](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8713#note_1352403481).
+In-flight requests are significantly impacted, resulting in unacceptable delays,
+timeouts, and even cancellations.
+
+Besides, we have observed in the past that some Git processes, such as
+`git-pack-objects(1)`, build up memory over time. When a wave of `git-pull(1)`
+requests comes, the node can be easily filled up with various memory-hungry
+processes. It's much better to stop this accumulation in the first place.
+
+As a result, to avoid overloading, Gitaly employs a set of soft limits, such as
+utilizing only 75% of memory capacity and 90% of CPU capacity instead of relying
+on hard limits. Once these soft limits are reached, the concurrency adjuster
+reduces the concurrency limit in a multiplicative manner. This strategy ensures
+that the node has enough headroom to handle potential overloading events.
+
+In theory, the cgroup hierarchy allows us to determine the overloading status
+individually. Thus, Gitaly can adjust the concurrency limit for each repository
+separately. However, this approach would be unnecessarily complicated in
+practice. In addition, it may lead to confusion for operators later.
+
+As a good start, Gitaly recognizes an overloading event when _either_ condition
+is met:
+
+- Soft limits of the parent cgroup are reached.
+- Soft limits of **any** of the repository cgroups are reached.
+
+It is logical for the second condition to be in place since a repository's
+capacity limit can be significant relative to the parent cgroup's capacity. This means
+that when the repository cgroup reaches its limit, fewer resources are available
+for other cgroups. As a result, reducing the concurrency limit delays the
+occurrence of overloading.
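+
+A small Go sketch of this decision rule is shown below, assuming the 75% memory
+and 90% CPU soft limits described earlier. Its result could feed the
+`RecordBackoff` call from the earlier calibration sketch.
+
+```go
+package adaptivelimit
+
+const (
+	memorySoftLimit = 0.75 // fraction of memory.limit_in_bytes
+	cpuSoftLimit    = 0.90 // fraction of the configured CPU quota
+)
+
+type utilization struct {
+	memory float64 // 0.0 - 1.0
+	cpu    float64 // 0.0 - 1.0
+}
+
+func (u utilization) overloaded() bool {
+	return u.memory >= memorySoftLimit || u.cpu >= cpuSoftLimit
+}
+
+// backoffEvent reports whether the parent cgroup, or any repository cgroup,
+// has reached its soft limits since the last calibration.
+func backoffEvent(parent utilization, repositories []utilization) bool {
+	if parent.overloaded() {
+		return true
+	}
+	for _, repo := range repositories {
+		if repo.overloaded() {
+			return true
+		}
+	}
+	return false
+}
+```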
+
+#### Latency measurement
+
+When re-calibrating the concurrency limit, latency is taken into account for
+RPCs other than Upload Pack. Two things to consider when measuring latencies:
+
+- How to record latencies
+- How to recognize a latency degradation
+
+It is clear that a powerful gRPC server such as Gitaly has the capability to
+manage numerous requests per second per node. A production server can serve up
+to thousands of requests per second. Keeping track of and storing response
+times in a precise manner is not practical.
+
+The heuristic determining whether the process is facing latency degradation is
+interesting. The most naive solution is to pre-define a static latency
+threshold. Each RPC may have a different threshold. Unfortunately, similar to
+static concurrency limiting, it's challenging and tedious to pick a reasonable
+up-to-date value.
+
+Fortunately, there are some well-known algorithms for this class of problems, mainly
+applied in the world of TCP Congestion Control:
+
+- Vegas Algorithm ([CN: ASA - Chapter 6.4](https://book.systemsapproach.org/congestion/avoidance.html), [Reference implementation](https://github.com/Netflix/concurrency-limits/blob/master/concurrency-limits-core/src/main/java/com/netflix/concurrency/limits/limit/VegasLimit.java))
+- Gradient Algorithm ([Paper](https://link.springer.com/chapter/10.1007/978-3-642-20798-3_25), [Reference implementation](https://github.com/Netflix/concurrency-limits/blob/master/concurrency-limits-core/src/main/java/com/netflix/concurrency/limits/limit/Gradient2Limit.java))
+
+The two algorithms are capable of automatically determining the latency
+threshold without any pre-defined configuration. They are highly efficient and
+statistically reliable for real-world scenarios. In my opinion, both algorithms
+are equally suitable for our specific use case.
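+
+As a deliberately simplified illustration, the Go sketch below flags latency
+degradation when a fast-moving average of observed latencies exceeds a
+slow-moving baseline by a tolerance factor. It is a stand-in for the
+Vegas/Gradient-style algorithms above, not a faithful implementation of either.
+
+```go
+package adaptivelimit
+
+import "time"
+
+type latencyTracker struct {
+	baseline  float64 // slow-moving average of latency, in seconds
+	recent    float64 // fast-moving average of latency, in seconds
+	tolerance float64 // for example 2.0: recent latency at twice the baseline counts as degradation
+}
+
+func (t *latencyTracker) observe(d time.Duration) {
+	sample := d.Seconds()
+	if t.baseline == 0 {
+		t.baseline, t.recent = sample, sample
+		return
+	}
+	// The baseline reacts slowly, the recent window reacts quickly.
+	t.baseline = 0.99*t.baseline + 0.01*sample
+	t.recent = 0.8*t.recent + 0.2*sample
+}
+
+// degraded reports whether the recent latency exceeds the baseline by the
+// configured tolerance, which would count as a latency backoff event.
+func (t *latencyTracker) degraded() bool {
+	return t.baseline > 0 && t.recent > t.tolerance*t.baseline
+}
+```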
+
+### Load-shedding
+
+Gitaly being stuck in an overloaded situation for too long can be indicated by
+two signs:
+
+- A certain number of consecutive backoff events
+- More in-flight requests than the concurrency limit for a certain amount of time
+
+In such cases, a particular cgroup or the whole Gitaly node may become
+temporarily unavailable. In-flight requests are likely to either be canceled or
+time out. On GitLab.com production, an incident is triggered and calls for human
+intervention. We can improve this situation by load-shedding.
+
+This mechanism deliberately starts to kill in-flight requests selectively. The
+main purpose is to prevent a cascading failure of all in-flight requests.
+Hopefully, after some of them are dropped, the cgroup/node can quickly recover
+to a normal situation without human intervention. As a result, it leads to a
+net improvement in availability and resilience.
+
+Picking which request to kill is tricky. In many systems, request criticality
+is considered: a request from downstream is assigned a criticality score, and
+requests with lower scores are targeted first. Unfortunately, GitLab doesn't
+have a similar system. We have an
+[Urgency system](https://docs.gitlab.com/ee/development/application_slis/rails_request.html),
+but it is used for committing to response times rather than for criticality.
+
+As a replacement, we can prioritize requests harming the system the most. Some
+criteria to consider:
+
+- Requests consuming a significant percentage of memory
+- Requests consuming a significant amount of CPU over time
+- Slow clients
+- Requests from IPs dominating the traffic recently
+- In-queue requests/requests at an early stage. We don’t want to reject requests that are almost finished.
+
+To get started, we can pick the first two criteria. The list can be
+extended as we learn from production later.
+
+## References
+
+- LinkedIn HODOR system
+  - [https://www.youtube.com/watch?v=-haM4ZpYNko](https://www.youtube.com/watch?v=-haM4ZpYNko)
+  - [https://engineering.linkedin.com/blog/2022/hodor--detecting-and-addressing-overload-in-linkedin-microservic](https://engineering.linkedin.com/blog/2022/hodor--detecting-and-addressing-overload-in-linkedin-microservic)
+  - [https://engineering.linkedin.com/blog/2023/hodor--overload-scenarios-and-the-evolution-of-their-detection-a](https://engineering.linkedin.com/blog/2023/hodor--overload-scenarios-and-the-evolution-of-their-detection-a)
+- Google SRE chapters about load balancing and overload:
+ - [https://sre.google/sre-book/load-balancing-frontend/](https://sre.google/sre-book/load-balancing-frontend/)
+ - [https://sre.google/sre-book/load-balancing-datacenter/](https://sre.google/sre-book/load-balancing-datacenter/)
+ - [https://sre.google/sre-book/handling-overload/](https://sre.google/sre-book/handling-overload/)
+ - [https://sre.google/sre-book/addressing-cascading-failures/](https://sre.google/sre-book/addressing-cascading-failures/)
+ - [https://sre.google/workbook/managing-load/](https://sre.google/workbook/managing-load/)
+- [Netflix Performance Under Load](https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581)
+- [Netflix Adaptive Concurrency Limit](https://github.com/Netflix/concurrency-limits)
+- [Load Shedding with NGINX using adaptive concurrency control](https://tech.olx.com/load-shedding-with-nginx-using-adaptive-concurrency-control-part-1-e59c7da6a6df)
+- [Overload Control for Scaling WeChat Microservices](http://web1.cs.columbia.edu/~junfeng/papers/dagor-socc18.pdf)
+- [ReactiveConf 2019 - Jay Phelps: Backpressure: Resistance is NOT Futile](https://www.youtube.com/watch?v=I6eZ4ZyI1Zg)
+- [AWS re:Invent 2021 - Keeping Netflix reliable using prioritized load shedding](https://www.youtube.com/watch?v=TmNiHbh-6Wg)
+- [AWS Using load shedding to avoid overload](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/)
+- ["Stop Rate Limiting! Capacity Management Done Right" by Jon Moore](https://www.youtube.com/watch?v=m64SWl9bfvk)
+- [Using load shedding to survive a success disaster—CRE life lessons](https://cloud.google.com/blog/products/gcp/using-load-shedding-to-survive-a-success-disaster-cre-life-lessons)
+- [Load Shedding in Web Services](https://medium.com/helpshift-engineering/load-shedding-in-web-services-9fa8cfa1ffe4)
+- [Load Shedding in Distributed Systems](https://blog.sofwancoder.com/load-shedding-in-distributed-systems)
diff --git a/doc/architecture/blueprints/gitlab_ci_events/index.md b/doc/architecture/blueprints/gitlab_ci_events/index.md
index fb78c0f5d9d..51d65869dfb 100644
--- a/doc/architecture/blueprints/gitlab_ci_events/index.md
+++ b/doc/architecture/blueprints/gitlab_ci_events/index.md
@@ -44,7 +44,37 @@ Events" blueprint is about making it possible to:
1. Describe technology required to match subscriptions with events at GitLab.com scale and beyond.
1. Describe technology we could use to reduce the cost of running automation jobs significantly.
-## Proposals
+## Proposal
+
+### Requirements
+
+Any accepted proposal should take into consideration the following requirements and characteristics:
+
+1. Defining events should be done in separate files.
+ - If we define all events in a single file, then the single file gets too complicated and hard to
+ maintain for users. Then, users need to separate their configs with the `include` keyword again and we end up
+ with the same solution.
+ - The structure of the pipelines, the personas and the jobs will be different depending on the events being
+ subscribed to and the goals of the subscription.
+1. A single subscription configuration file should define a single pipeline that is created when an event is triggered.
+ - The pipeline config can include other files with the `include` keyword.
+ - The pipeline can have many jobs and trigger child pipelines or multi-project pipelines.
+1. The events and handling syntax should use the existing CI config syntax where it is pragmatic to do so.
+ - It'll be easier for users to adapt. It'll require less work to implement.
+1. Subscribing to and emitting events should be performant, scalable, and non-blocking.
+ - Reading from the database is usually faster than reading from files.
+ - A CI event can potentially have many subscriptions.
+ This also includes evaluating the right YAML files to create pipelines.
+ - The main business logic (e.g. creating an issue) should not be affected
+ by any subscriptions to the given CI event (e.g. issue created).
+1. The CI events design should be implemented in a maintainable and extensible way.
+   - If there is an `issues/create` event, then any new event (`merge_request/created`) can be added without
+ much effort.
+ - We expect that many events will be added. It should be trivial for developers to
+ register domain events (e.g. 'issue closed') as GitLab-defined CI events.
+ - Also, we should consider the opportunity of supporting user-defined CI events long term (e.g. 'order shipped').
+
+### Options
For now, we have technical 5 proposals;
diff --git a/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md b/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
index 6ac67dd0f18..6b1b4d452c9 100644
--- a/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
+++ b/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
@@ -8,7 +8,7 @@ owning-stage: "~devops::configure"
participating-stages: []
---
-# GitLab to Kubernetes communication **(FREE)**
+# GitLab to Kubernetes communication **(FREE ALL)**
The goal of this document is to define how GitLab can communicate with Kubernetes
and in-cluster services through the GitLab agent.
diff --git a/doc/architecture/blueprints/modular_monolith/proof_of_concepts.md b/doc/architecture/blueprints/modular_monolith/proof_of_concepts.md
index c215ffafbe4..7169e12a7b5 100644
--- a/doc/architecture/blueprints/modular_monolith/proof_of_concepts.md
+++ b/doc/architecture/blueprints/modular_monolith/proof_of_concepts.md
@@ -36,7 +36,7 @@ gRPC would carry messages between modules.
## Use Packwerk to enforce module boundaries
Packwerk is a static analyzer that helps defining and enforcing module boundaries
-in Ruby.
+in Ruby.
[In this PoC merge request](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/98801)
we demonstrate a possible directory structure of the monolith broken down into separate
diff --git a/doc/architecture/blueprints/observability_tracing/index.md b/doc/architecture/blueprints/observability_tracing/index.md
index 4291683f83f..71e03d81bcf 100644
--- a/doc/architecture/blueprints/observability_tracing/index.md
+++ b/doc/architecture/blueprints/observability_tracing/index.md
@@ -135,7 +135,7 @@ All requests from GitLab.com will then include the GOB session cookie for observ
The new UI will be built using the Pajamas Design System in accordance with GitLab UX design standards. The UI will interact with the GOB query service directly from vue.js (see architecture diagram above) by sending a fetch to the subdomain `observe.gitLab.com/v1/query` with `{withCredentials: true}`. See the Authentication and Authorization section above for more details on how this is enabled.
-[**TODO Figma UI designs and commentary**]
+**TODO Figma UI designs and commentary**
## Iterations
diff --git a/doc/architecture/blueprints/organization/index.md b/doc/architecture/blueprints/organization/index.md
index 2cfaf33ff50..09448d6d90c 100644
--- a/doc/architecture/blueprints/organization/index.md
+++ b/doc/architecture/blueprints/organization/index.md
@@ -34,7 +34,7 @@ Organizations solve the following problems:
1. Allows integration with Cells. Isolating Organizations makes it possible to allocate and distribute them across different Cells.
1. Removes the need to define hierarchies. An Organization is a container that could be filled with whatever hierarchy/entity set makes sense (Organization, top-level Groups, etc.)
1. Enables centralized control of user profiles. With an Organization-specific user profile, administrators can control the user's role in a company, enforce user emails, or show a graphical indicator that a user as part of the Organization. An example could be adding a "GitLab employee" stamp on comments.
-1. Organizations bring an on-premise-like experience to SaaS (GitLab.com). The Organization admin will have access to instance-equivalent Admin Area settings with most of the configuration controlled on Organization level.
+1. Organizations bring an on-premise-like experience to SaaS (GitLab.com). The Organization admin will have access to instance-equivalent Admin Area settings with most of the configuration controlled at the Organization level.
## Motivation
@@ -68,13 +68,13 @@ graph TD
ns[Namespace] -. has many .- ns[Namespace]
```
-Self-managed instances would set a default Organization.
+All instances would set a default Organization.
### Benefits
- No changes to URL's for Groups moving under an Organization, which makes moving around top-level Groups very easy.
- Low risk rollout strategy, as there is no conversion process for existing top-level Groups.
-- Organization becomes the key for identifying what is part of an Organization, which is likely on its own table for performance and clarity.
+- Organization becomes the key for identifying what is part of an Organization, which is on its own table for performance and clarity.
### Drawbacks
@@ -93,7 +93,7 @@ From an initial [data exploration](https://gitlab.com/gitlab-data/analytics/-/is
- Most top-level Groups that are matched to organizations with more than one top-level Group are assumed to be intended to be combined into a single organization (82%).
- Most top-level Groups that are matched to organizations with more than one top-level Group are using only a single pricing tier (59%).
- Most of the current top-level Groups are set to public visibility (85%).
-- Less than 0.5% of top-level Groups share Groups with another top-level Group. However, this means we could potentially break 76,000 links between top-level Groups by introducing the Organization.
+- Less than 0.5% of top-level Groups share Groups with another top-level Group. However, this means we could potentially break 76,000 existing links between top-level Groups by introducing the Organization.
Based on this analysis we expect to see similar behavior when rolling out Organizations.
@@ -104,15 +104,15 @@ Based on this analysis we expect to see similar behavior when rolling out Organi
The Organization MVC will contain the following functionality:
- Instance setting to allow the creation of multiple Organizations. This will be enabled by default on GitLab.com, and disabled for self-managed GitLab.
-- Every instance will have a default organization. Initially, all Users will be managed by this default Organization.
+- Every instance will have a default Organization named `Default Organization`. Initially, all Users will be managed by this default Organization.
- Organization Owner. The creation of an Organization appoints that User as the Organization Owner. Once established, the Organization Owner can appoint other Organization Owners.
- Organization Users. A User is managed by one Organization, but can be part of multiple Organizations. Users are able to navigate between the different Organizations they are part of.
- Setup settings. Containing the Organization name, ID, description, and avatar. Settings are editable by the Organization Owner.
-- Setup flow. Users are able to build an Organization on top of an existing top-level Group. New Users are able to create an Organization from scratch and to start building top-level Groups from there.
-- Visibility. Options will be `public` and `private`. A Non-User of a specific Organization will not see private Organizations in the explore section. Visibility is editable by the Organization Owner.
+- Setup flow. Users are able to build new Organizations and transfer existing top-level Groups into them. They can also create new top-level Groups in an Organization.
+- Visibility. Initially, Organizations can only be `public`. Public Organizations can be seen by everyone. They can contain public and private Groups and Projects.
- Organization settings page with the added ability to remove an Organization. Deletion of the default Organization is prevented.
-- Groups. This includes the ability to create, edit, and delete Groups, as well as a Groups overview that can be accessed by the Organization Owner.
-- Projects. This includes the ability to create, edit, and delete Projects, as well as a Projects overview that can be accessed by the Organization Owner.
+- Groups. This includes the ability to create, edit, and delete Groups, as well as a Groups overview that can be accessed by the Organization Owner and Users.
+- Projects. This includes the ability to create, edit, and delete Projects, as well as a Projects overview that can be accessed by the Organization Owner and Users.
### Organization Access
@@ -127,7 +127,7 @@ Organization Users can get access to Groups and Projects as:
Organization Users can be managed in the following ways:
- As [Enterprise Users](../../../user/enterprise_user/index.md), managed by the Organization. This includes control over their User account and the ability to block the User.
-- As Non-Enterprise Users, managed by the Default Organization. Non-Enterprise Users can be removed from an Organization, but the User keeps ownership of their User account.
+- As Non-Enterprise Users, managed by the default Organization. Non-Enterprise Users can be removed from an Organization, but the User keeps ownership of their User account.
Enterprise Users are only available to Organizations with a Premium or Ultimate subscription. Organizations on the free tier will only be able to host Non-Enterprise Users.
@@ -141,10 +141,33 @@ Users are visible across all Organizations. This allows Users to move between Or
- Being invited by email address
- Requesting access. This requires visibility of the Organization and Namespace and must be accepted by the owner of the Namespace. Access cannot be requested to private Groups or Projects.
-1. Becoming an Enterprise Users of an Organization. Bringing Enterprise Users to the Organization level is planned post MVC.
+1. Becoming an Enterprise User of an Organization. Bringing Enterprise Users to the Organization level is planned post-MVC. For the Organization MVC, Enterprise Users will remain at the top-level Group.
+
+The creator of an Organization automatically becomes the Organization Owner. It is not necessary to become a User of a specific Organization to comment on or create public issues, for example. All existing Users can create and comment on all public issues.
##### When can Users see an Organization?
+For the MVC, an Organization can only be public. Public Organizations can be seen by everyone. They can contain public and private Groups and Projects.
+
+In the future, Organizations will get an additional internal visibility setting for Groups and Projects. This will allow us to introduce internal Organizations that can only be seen by the Users it contains. This would mean that only Users that are part of the Organization will see:
+
+- The Organization front page, instead of a 404 when navigating the Organization URL
+- Name of the organization
+- Description of the organization
+- Organization pages, such as the Activity page, Groups, Projects and Users overview
+
+Content of these pages will be determined by each User's access to specific Groups and Projects. For instance, private Projects would only be seen by the members of this Project in the Project overview.
+
+As an end goal, we plan to offer the following scenarios:
+
+| Organization visibility | Group/Project visibility | Who sees the Organization? | Who sees Groups/Projects? |
+| ------ | ------ | ------ | ------ |
+| public | public | Everyone | Everyone |
+| public | internal | Everyone | Organization Users |
+| public | private | Everyone | Group/Project members |
+| internal | internal | Organization Users | Organization Users |
+| internal | private | Organization Users | Group/Project members |
+
##### What can Users see in an Organization?
Users can see the things that they have access to in an Organization. For instance, an Organization User would be able to access only the private Groups and Projects that they are a Member of, but could see all public Groups and Projects. Actionable items such as issues, merge requests and the to-do list are seen in the context of the Organization. This means that a User might see 10 merge requests they created in `Organization A`, and 7 in `Organization B`, when in total they have created 17 merge requests across both Organizations.
@@ -176,9 +199,9 @@ graph TD
F[ProjectMember] <-.type of.- D
G -.has many.-> E -.belongs to.-> A
- GGL[GroupGroupLink<br/> See note 1] -.belongs_to.->A
- PGL[ProjectGroupLink<br/> See note 2] -.belongs_to.->A
- PGL -.belongs_to.->C
+ GGL[GroupGroupLink] -.belongs to.->A
+ PGL[ProjectGroupLink] -.belongs to.->A
+ PGL -.belongs to.->C
```
GroupGroupLink is the join table between two Group records, indicating that one Group has invited the other.
@@ -207,43 +230,72 @@ Actions such as banning and deleting a User will be added to the Organization at
Non-Users are external to the Organization and can only access the public resources of an Organization, such as public Projects.
+### Roles and Permissions
+
+Organizations will have an Owner role. Compared to Organization Users, Organization Owners can perform the following actions:
+
+| Action | Owner | User |
+| ------ | ------ | ----- |
+| View Organization settings | ✓ | |
+| Edit Organization settings | ✓ | |
+| Delete Organization | ✓ | |
+| Remove Users | ✓ | |
+| View Organization front page | ✓ | ✓ |
+| View Groups overview | ✓ | ✓ (1) |
+| View Projects overview | ✓ | ✓ (1) |
+| View Users overview | ✓ | ✓ (2) |
+| Transfer top-level Group into Organization if Owner of both | ✓ | |
+
+(1) Users can only see what they have access to.
+(2) Users can only see Users from Groups and Projects they have access to.
+
+[Roles](../../../user/permissions.md) at the Group and Project level remain as they currently are.
+
### Routing
Today only Users, Projects, Namespaces and container images are considered routable entities which require global uniqueness on `https://gitlab.com/<path>/-/`. Initially, Organization routes will be [unscoped](../../../development/routing.md). Organizations will follow the path `https://gitlab.com/-/organizations/org-name/` as one of the design goals is that the addition of Organizations should not change existing Group and Project paths.
### Impact of the Organization on Other Features
-We want a minimal amount of infrequently written tables in the shared database. If we have high write volume or large amounts of data in the shared database then this can become a single bottleneck for scaling and we lose the horizontal scalability objective of Cells.
+We want a minimal number of infrequently written tables in the shared database. If we have high write volume or large amounts of data in the shared database, then this can become a single bottleneck for scaling and we lose the horizontal scalability objective of Cells. With isolation being one of the main requirements to make Cells work, this means that existing features will mostly be scoped to an Organization rather than work across Organizations. One exception to this is Users, which are stored in the cluster-wide shared database. For a deeper exploration of the impact on select features, see the [list of features impacted by Cells](../cells/index.md#impacted-features).
## Iteration Plan
The following iteration plan outlines how we intend to arrive at the Organization MVC. We are following the guidelines for [Experiment, Beta, and Generally Available features](../../../policy/experiment-beta-support.md).
-### Iteration 1: Organization Prototype (FY24Q2)
+### Iteration 1: Organization Prototype (FY24Q3)
-In iteration 1, we introduce the concept of an Organization as a way to Group top-level Groups together. Support for Organizations does not require any [Cells](../cells/index.md) work, but having them will make all subsequent iterations of Cells simpler. The goal of iteration 1 will be to generate a prototype that can be used by GitLab teams to test moving functionality to the Organization. It contains everything that is necessary to move an Organization to a Cell:
+In iteration 1, we introduce the concept of an Organization as a way to group top-level Groups together. Support for Organizations does not require any [Cells](../cells/index.md) work, but having them will make all subsequent iterations of Cells simpler. The goal of iteration 1 will be to generate a prototype that can be used by GitLab teams to test basic functionality within an Organization. The prototype contains the following functionality:
-- The Organization can be named, has an ID and an avatar.
-- Only a Non-Enterprise User can be part of an Organization.
+- A new Organization can be created.
+- The Organization contains a name, ID, description and avatar.
+- The creator of the Organization is assigned as the Organization Owner.
+- Groups can be created in an Organization. Groups are listed in the Groups overview. Every Organization User can access the Groups overview and see the Groups they have access to.
+- Projects can be created in a Group. Projects are listed in the Projects overview. Every Organization User can access the Projects overview and see the Projects they have access to.
+- Both Enterprise and Non-Enterprise Users can be part of an Organization.
+- Enterprise Users are still managed by top-level Groups.
- A User can be part of multiple Organizations.
-- A single Organization Owner can be assigned.
-- Groups can be created in an Organization. Groups are listed in the Groups overview.
-- Projects can be created in a Group. Projects are listed in the Projects overview.
+- Users can navigate between the different Organizations they are part of.
+- Any User within or outside of an Organization can be invited to Groups and Projects contained by the Organization.
-### Iteration 2: Organization MVC Experiment (FY24Q3)
+### Iteration 2: Organization MVC Experiment (FY24Q4)
-In iteration 2, an Organization MVC Experiment will be released. We will test the functionality with a select set of customers and improve the MVC based on these learnings. Users will be able to build an Organization on top of their existing top-level Group.
+In iteration 2, an Organization MVC Experiment will be released. We will test the functionality with a select set of customers and improve the MVC based on these learnings. The MVC Experiment contains the following functionality:
-- The Organization has a description.
+- Users are listed in the User overview. Every Organization User can access the User overview and see Users that are part of the Groups and Projects they have access to.
- Organizations can be deleted.
-- Users can navigate between the different Organizations they are part of.
+- Forking across Organizations will be defined.
### Iteration 3: Organization MVC Beta (FY24Q4)
-In iteration 3, the Organization MVC Beta will be released.
+In iteration 3, the Organization MVC Beta will be released. Users will be able to transfer existing top-level Groups into an Organization.
- Multiple Organization Owners can be assigned.
-- Organization Owners can change the visibility of an organization between `public` and `private`. A Non-User of a specific Organization will not see private Organizations in the explore section.
+- Organization avatars can be changed in the Organization settings.
+- Organization Owners can create, edit and delete Groups from the Groups overview.
+- Organization Owners can create, edit and delete Projects from the Projects overview.
+- Top-level Groups can be transferred into an Organization.
+- The Organization URL path can be changed.
### Iteration 4: Organization MVC GA (FY25Q1)
@@ -253,7 +305,9 @@ In iteration 4, the Organization MVC will be rolled out.
After the initial rollout of Organizations, the following functionality will be added to address customer needs relating to their implementation of GitLab:
+1. [Organizations can invite Users](https://gitlab.com/gitlab-org/gitlab/-/issues/420166).
1. Internal visibility will be made available on Organizations that are part of GitLab.com.
+1. Restrict inviting Users outside of the Organization.
1. Enterprise Users will be made available at the Organization level.
1. Organizations are able to ban and delete Users.
1. Projects can be created from the Organization-level Projects overview.
@@ -270,6 +324,21 @@ After the initial rollout of Organizations, the following functionality will be
## Organization Rollout
+We propose the following steps to successfully roll out Organizations:
+
+- Phase 1: Rollout
+ - Organizations will be rolled out using the concept of a `default Organization`. All existing top-level groups on GitLab.com are already part of this `default Organization`. The Organization UI is feature flagged and can be enabled for a specific set of users initially, and the global user pool at the end of this phase. This way, users will already become familiar with the concept of an Organization and the Organization UI. No features would be impacted by enabling the `default Organization`. See issue [#418225](https://gitlab.com/gitlab-org/gitlab/-/issues/418225) for more details.
+- Phase 2: Migrations
+ - GitLab, the organization, will be the first one to bud off into a separate Organization. We move all top-level groups that belong to GitLab into the new GitLab Organization, including the `gitlab-org` and `gitlab-com` top-level Groups. See issue [#418228](https://gitlab.com/gitlab-org/gitlab/-/issues/418228) for more details.
+ - Existing customers can create their own Organization. Creation of an Organization remains optional.
+- Phase 3: Onboarding changes
+ - New customers will only have the option to start their journey by creating an Organization.
+- Phase 4: Targeted efforts
+ - Organizations are promoted, for example via a banner message and targeted conversations with large customers via the CSMs. Creating a separate Organization will remain a voluntary action.
+ - We increase the value proposition of the Organization, for instance by moving billing to the Organization level to provide incentives for more customers to move to a separate Organization. Adoption will be monitored.
+
+A force-option will only be considered if we do not achieve the load distribution we are aiming for with Cells.
+
## Alternative Solutions
An alternative approach to building Organizations is to convert top-level Groups into Organizations. The main advantage of this approach is that features could be built on top of the Namespace framework and therewith leverage functionality that is already available at the Group level. We would avoid building the same feature multiple times. However, Organizations have been identified as a critical driver of Cells. Due to the urgency of delivering Cells, we decided to opt for the quickest and most straightforward solution to deliver an Organization, which is the lightweight design described above. More details on comparing the two Organization proposals can be found [here](https://gitlab.com/gitlab-org/tenant-scale-group/group-tasks/-/issues/56).
@@ -282,5 +351,8 @@ An alternative approach to building Organizations is to convert top-level Groups
## Links
- [Organization epic](https://gitlab.com/groups/gitlab-org/-/epics/9265)
+- [Organization MVC design](https://gitlab.com/groups/gitlab-org/-/epics/10068)
- [Enterprise Users](../../../user/enterprise_user/index.md)
- [Cells blueprint](../cells/index.md)
+- [Cells epic](https://gitlab.com/groups/gitlab-org/-/epics/7582)
+- [Namespaces](../../../user/namespace/index.md)
diff --git a/doc/architecture/blueprints/rate_limiting/index.md b/doc/architecture/blueprints/rate_limiting/index.md
index f42a70aa97a..7af50097e97 100644
--- a/doc/architecture/blueprints/rate_limiting/index.md
+++ b/doc/architecture/blueprints/rate_limiting/index.md
@@ -186,6 +186,24 @@ Things we want to build and support by default:
1. Logging that will expose limits applied in Kibana.
1. An automatically generated documentation page describing all the limits.
+### Support rate limits based on resources used
+
+One of the problems of our rate limiting system is that values are static
+(for example, 100 requests per minute), irrespective of the complexity or
+resources used by the operation. For example:
+
+- Firing 100 requests per minute to fetch a simple resource can have very different
+ implications than creating a CI pipeline.
+- Each pipeline creation action can perform very differently depending on the
+ pipeline being created (small MR pipeline VS large scheduled pipeline).
+- Paginating resources after an offset of 1000 starts to become expensive on the database.
+
+We should allow some rate limits to be defined as `computing score / period`, where
+the computing score is the number of milliseconds accumulated (for all requests executed
+and in-flight) within a given period (for example: 1 minute). As an illustration, if the
+allowance were 60,000 milliseconds per minute, requests that each consume 500 milliseconds
+of processing time would effectively be limited to about 120 per minute, while cheap
+50 millisecond requests would allow about 1,200 per minute.
+
+This way, if a user is sending expensive requests, they are likely to hit the rate limit earlier.
+
### API to expose limits and policies
Once we have an established a consistent way to define application limits we
diff --git a/doc/architecture/blueprints/remote_development/index.md b/doc/architecture/blueprints/remote_development/index.md
index ce55f23f828..c7d1ec29add 100644
--- a/doc/architecture/blueprints/remote_development/index.md
+++ b/doc/architecture/blueprints/remote_development/index.md
@@ -483,6 +483,40 @@ RestartRequested : User has requested a workspace restart.\n**desired_state** wi
RestartRequested -left-> Running : status=Running
```
+## Injecting environment variables and files into a workspace
+
+Like CI, there is a need to inject environment variables and files into a workspace. These environment variables and files will be frozen in time during workspace creation to ensure the same values are injected into the workspace every time it starts/restarts. Thus, a new database table, along the lines of `ci_job_variables`, will be required. This table will contain the following columns:
+
+- `key` - To store the name of the environment variable or the file.
+- `encrypted_value` - To store the encrypted value of the environment variable or the file.
+- `encrypted_value_iv` - To store the initialization vector used for encryption.
+- `workspace_id` - To reference the workspace the environment variable or the file is to be injected into.
+- `variable_type` - To store whether this data is to be injected as an environment variable or a file.
+
+To perform the encryption, a secret key would be required. This would be uniquely generated for each workspace upon creation.
+Having a unique secret key for encrypting the corresponding workspace's environment variable and file data improves the security profile.
+
+Because of the nature of the reconciliation loop between the Agent and Rails, it is not scalable to decrypt these values on the Rails side for each request.
+Instead, the `key`, `encrypted_value` and `encrypted_value_iv` of each environment variable of the workspace are sent to the Agent along with the workspace's `secret_key`
+for the Agent to decrypt them in place.
+
+To optimize this further, the data about the environment variables and files, along with the secret key, will only be sent when required, that is:
+
+- When a new workspace creation request has been received from the user and an Agent initiates a Partial Reconciliation request
+- When an Agent initiates a Full Reconciliation request
+
+When a workspace is created from a project, it will inherit all the variables from the group/subgroup/project hierarchy which are defined under
+[`Settings > CI/CD > Variables`](../../../ci/variables/index.md#define-a-cicd-variable-in-the-ui). This aspect will be generalized to allow for defining `Variables`
+which will be inherited in both CI/CD and Workspaces.
+
+A user will also be able to define, at a user level, environment variables and files to be injected into each workspace created by them.
+
+When a new workspace is created, a new personal access token associated to the user who created the workspace will be generated.
+This personal access token will be tied to the lifecycle of the workspace and will be injected into the workspace as an environment variable or a file
+to allow for cloning private projects and supporting transparent Git operations from within the workspace out-of-the-box among other things.
+
+More details about the implementation can be found in this [epic](https://gitlab.com/groups/gitlab-org/-/epics/10882).
+
## Workspace user traffic authentication and authorization
We need to only allow certain users to access workspaces. Currently, we are restricting this to the creator/owner of the workspace. After the workspace is created, it needs to be exposed to the network so that the user can connect to it.
diff --git a/doc/architecture/blueprints/runner_admission_controller/index.md b/doc/architecture/blueprints/runner_admission_controller/index.md
index d73ffb21ef3..92c824527ec 100644
--- a/doc/architecture/blueprints/runner_admission_controller/index.md
+++ b/doc/architecture/blueprints/runner_admission_controller/index.md
@@ -229,7 +229,7 @@ be rare in typical circumstances.
### Implementation Details
-1. [placeholder for steps required to code the admissions controller MVC]
+1. _placeholder for steps required to code the admissions controller MVC_
## Technical issues to resolve
diff --git a/doc/architecture/blueprints/ssh_certificates/index.md b/doc/architecture/blueprints/ssh_certificates/index.md
new file mode 100644
index 00000000000..3cbe6711028
--- /dev/null
+++ b/doc/architecture/blueprints/ssh_certificates/index.md
@@ -0,0 +1,211 @@
+---
+status: ongoing
+creation-date: "2023-07-28"
+authors: [ "@igor.drozdov" ]
+coach: "@stanhu"
+approvers: [ ]
+owning-stage: "~devops::create"
+---
+
+# SSH certificates
+
+## Summary
+
+On GitLab.com, customers obtain their own top-level group (later, organization). In comparison to self-managed instances, they have to manage organization-wide settings at this level.
+
+Currently, the provided Git access control options on SaaS (SSH, HTTPS) rely on credentials (access tokens, SSH keys) set up in the user profile. As the user profile is outside the organization's control, there is no way for a customer to assess whether the key is kept confidential or whether the expiry date meets policies. Also, there's very little that can be done for damage control in case the keys are leaked, and a customer cannot enforce MFA on Git access flows.
+
+Customers may have processes in place where developers can, on a daily basis and via MFA, request a temporary SSH certificate that gives them access to internal systems. To enable the same way of working on SaaS, we need a way to share public Certificate Authority (`CA`) files with GitLab.com SaaS for the purpose of Git access control.
+
+## Motivation
+
+- Enable users to share public Certificate Authority (`CA`) files with GitLab.com SaaS for the purpose of Git access control.
+- Fill the product gap between GitLab and competitive products that already support authentication via SSH certificates.
+
+### Goals
+
+This document proposes an architectural design to implement functionality to satisfy the following requirements:
+
+- The public key of the `CA` file (`CA.pub`) that is used to issue certificates can be added to a group.
+- A certificate issued by the `CA` can be used to get Git access to projects of the group and its ancestors.
+- The certificate cannot be used to get Git access to projects outside the group and its ancestors.
+
+### Non-goals
+
+This document focuses on providing core functionality for supporting authentication via SSH certificates. The potential improvements are described in [Follow Ups](#follow-ups).
+
+## Proposal
+
+### MVC
+
+A group admin generates an SSH key pair to be used as a Certificate Authority file (`ssh-keygen -f CA`):
+
+- The private key is used to issue user certificates.
+- The public key is added to a group in order to grant access to the group via the user certificates.
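+
+For example, a dedicated key pair for the group could be generated as follows; the key type and comment are illustrative choices:
+
+```shell
+# Generate the key pair that acts as the Certificate Authority for the group.
+# CA is the private key used to sign user certificates; CA.pub is added to the group.
+ssh-keygen -t ed25519 -f CA -C "SSH CA for example-group"
+```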
+
+#### User certificate
+
+A group admin issues user certificates using the `CA` private key and specifies either a GitLab username or a user's primary email as the key identity:
+
+```shell
+ssh-keygen -s CA -I user@example.com -V +1d user-key.pub
+```
+
+As a result, a user certificate of the following structure is generated:
+
+```shell
+ssh-keygen -L -f ssh_host_ed25519_key-cert.pub
+
+ssh_host_ed25519_key-cert.pub:
+ Type: ssh-ed25519-cert-v01@openssh.com user certificate
+ Public key: ED25519-CERT SHA256:dRVV49XJHt85X1seqr9xXyxyuuGTbtFV6Lbwlrx6BIQ
+ Signing CA: RSA SHA256:UAcgUeGoXrs8WOT/N+bmqY2vB9145Mc5NaN1Y977NCI (using rsa-sha2-512)
+ Key ID: "user@example.com"
+ Serial: 1
+ Valid: from 2023-07-31T18:20:00 to 2023-08-01T18:21:34
+ Principals: (none)
+ Critical Options: (none)
+ Extensions:
+ permit-X11-forwarding
+ permit-agent-forwarding
+ permit-port-forwarding
+ permit-pty
+ permit-user-rc
+```
+
+- `Type` is the type of the user certificate. Only user certificates are accepted by GitLab Shell; all other types are rejected.
+- `Public Key` is the public key of the user.
+- `Signing CA` is the public key of `CA`. Its fingerprint is used to find a group associated with the user certificate.
+- `Key ID` is a user's username or primary email. It is used to associate a GitLab user with the user certificate.
+- `Serial` is a serial number of the user certificate. It can be used to distinguish between different certificates created by the same `CA`.
+- `Valid` indicates the period of validity. This value is validated by GitLab Shell: expired and not yet valid user certificates are rejected.
+- `Principals`, `Critical Options`, and `Extensions` are used to embed additional information into the user certificate. These fields can potentially be used in future iterations to apply additional restrictions on a user certificate.
+
+#### Application behavior
+
+[GitLab Shell](https://gitlab.com/gitlab-org/gitlab-shell) is the project responsible for handling [commands](../../../development/gitlab_shell/features.md) sent to a GitLab instance via SSH.
+When a user tries to establish an SSH connection and authenticate via a public key, GitLab Shell sends an internal API request to the `/authorized_keys` endpoint to detect whether the key is associated with a GitLab user. If a certificate is used for authentication, GitLab Shell recognizes it and performs a request to `/authorized_certs` instead.
+
+1. A group admin adds `CA.pub` file to a group.
+1. A user tries to authenticate using a certificate signed by the `CA`.
+1. GitLab Shell sends the fingerprint of the `CA` and the user identity (either the GitLab username or the primary email) to `/authorized_certs`.
+1. GitLab Rails finds the group to which a `CA.pub` with that fingerprint has been added, as well as the user matching the identity. A `CA.pub` fingerprint is unique per instance, which defines the one-to-one relationship between a `CA` and a group.
+1. GitLab Shell remembers the namespace full path for the established connection.
+1. Every time GitLab Shell needs to check whether a user has access to a particular project, it sends a request to the `/allowed` endpoint and passes the namespace full path along.
+1. GitLab Rails checks whether the namespace matches the project namespace or one of its ancestors to determine whether the user has access to this project via the certificate.
+1. If all the checks above are successful, the user gets access to the project.
+
+```mermaid
+sequenceDiagram
+ User->>+GitLab Shell: Auth using SSH Certificate
+ GitLab Shell->>+GitLab Rails: /authorized_certs?key=fingerprint-of-signing-key&user_identity=username-or-primary-email
+ GitLab Rails-->>-GitLab Shell: responds with the namespace that configures the CA and the username of the user
+ GitLab Shell-->>User: Authenticated successfully
+ User->>+GitLab Shell: Git command to a specific project
+ GitLab Shell->>+GitLab Rails: /allowed [namespace=namespace]
+ GitLab Rails-->>-GitLab Shell: responds that the user is allowed to access a project in this namespace
+ GitLab Shell-->>User: success
+```
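+
+A minimal sketch of the two internal API calls from the diagram above, written as illustrative `curl` commands; the internal authentication headers and most of the parameters that GitLab Shell actually sends to `/allowed` are omitted here:
+
+```shell
+# Resolve the signing CA fingerprint and the user identity to a namespace (first call in the diagram).
+curl "https://gitlab.example.com/api/v4/internal/authorized_certs?key=<fingerprint-of-signing-key>&user_identity=<username-or-primary-email>"
+
+# Check access to a specific project, passing the namespace remembered for the connection (second call in the diagram).
+curl --request POST "https://gitlab.example.com/api/v4/internal/allowed" \
+  --data "namespace=<namespace-full-path>"
+```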
+
+#### Examples
+
+1. Access a project outside the group that configures `CA.pub`.
+
+   Given the following hierarchy of groups:
+
+ ```plaintext
+   a/b/c/d/e/f
+       |
+       └ g/h/i
+ ```
+
+ - A group admin adds `CA.pub` to `d` and a user is authenticated using a certificate signed by the `CA`.
+   - When a user clones `a/b/c/d/e/f/project`, we send the `a/b/c/d/e/f/project` project full path and the `a/b/c/d` namespace full path: the user is allowed to clone the project because `d` is an ancestor of the project's namespace.
+   - When a user clones `a/b/c/g/h/i/project`, the user is not allowed to clone the project because `d` is not in the list of its ancestors.
+
+1. A group that configures `CA.pub` is transferred to a different namespace.
+
+   The existing certificates are still valid because the namespace full path is stored per connection. When a user reconnects, another request to `/authorized_certs` is sent and the new full path of the namespace is returned.
+
+## Open questions
+
+### Multiple SSH certificates to different projects
+
+A user may have different SSH certificates to access different projects.
+When the user establishes an SSH connection, the SSH client will iterate over a number of potential
+options in order to find the one that authenticates successfully.
+With the current architecture, the first certificate that provides access to a namespace is accepted,
+even if the user intends to access a project in a different group.
+
+For example:
+
+1. A user has valid certificates for `a` and `b` groups.
+1. The user is successfully authenticated using `a`.
+1. The user tries to clone `b/project` and fails.
+
+A workaround for this scenario is to configure Git to use a particular key and certificate for the SSH
+connection. Assuming the file names from the earlier example, add the following to your `.gitconfig` file:
+
+```plaintext
+[core]
+  sshCommand = ssh -i user-key -o CertificateFile=user-key-cert.pub
+```
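+
+Alternatively, a host alias in `~/.ssh/config` can pin a specific key and certificate per remote; the alias and file paths below are illustrative:
+
+```plaintext
+# ~/.ssh/config: use a dedicated certificate when connecting through this alias.
+Host gitlab-group-b
+  HostName gitlab.com
+  User git
+  IdentityFile ~/.ssh/user-key
+  CertificateFile ~/.ssh/user-key-cert.pub
+```
+
+The project can then be cloned with `git clone gitlab-group-b:b/project.git`.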
+
+### A single certificate cannot be revoked
+
+Revocation of a single user certificate is out of scope for this MVC. It is possible to implement this feature, but its feasibility should be discussed.
+
+Supporting it will complicate the implementation and UI/UX. However, the risk of a compromised certificate can be significantly reduced by the following actions:
+
+- Setting an expiration date on each user certificate. This should be strongly recommended in the docs.
+- Rotating the `CA`, which revokes all current user certificates.
+- Implementing the `source-address` feature, which restricts the IP addresses from which a user certificate can be used (see the example after this list).
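+
+For illustration, a group admin could combine a short validity period with a `source-address` restriction when issuing a certificate; the CIDR range below is an example value:
+
+```shell
+# Issue a certificate that is valid for one day and usable only from the given network range.
+ssh-keygen -s CA -I user@example.com -V +1d -O source-address=203.0.113.0/24 user-key.pub
+```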
+
+### A certificate can be used across multiple GitLab instances
+
+The information about the GitLab instance is not embedded into a user certificate. This means that a certificate can be used across multiple GitLab instances as long as the `Key ID` value is recognized by those instances.
+
+Potential solution:
+
+- An optional field that restricts usage of a user certificate to a particular instance can be implemented using `Extensions` in a follow-up. Specifying `extension:login@gitlab.com=username` is a more secure and flexible option, but we can support both (see the example after this list).
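+
+As an illustration only (the extension name is a proposal from this blueprint, not an implemented feature), such a certificate could be issued with a custom extension:
+
+```shell
+# Issue a certificate that carries a custom extension naming the instance and the username.
+ssh-keygen -s CA -I user@example.com -V +1d -O extension:login@gitlab.com=username user-key.pub
+```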
+
+### CA cannot be reused by multiple groups
+
+The `CA.pub` fingerprint must be unique and cannot be reused by multiple groups. This one-to-one relationship is chosen by design so that the single group a user gets access to can always be determined.
+
+Another option is to embed the namespace into the user certificate using `Extensions` or `Critical Options`.
+
+Pros:
+
+- `CA` can be reused by multiple groups.
+- A user certificate _tells_ which group to get access to rather than _asks_ which group it can get access to.
+
+Cons:
+
+- Imposes a custom requirement on the user certificate format.
+- If some other group adds `CA.pub`, a user may unintentionally get access to that group.
+
+Potential solution:
+
+- An optional field that restricts usage of a user certificate to a particular group can be implemented using `Extensions` or `Critical Options` in a follow-up. The `CA` still cannot be reused, but a user certificate can no longer be used to access some other group that adds the same `CA.pub`.
+
+## Iteration plan
+
+| Component | Milestone | Group | Changes |
+|--------------|-----------------------|----------------------------------|---------|
+| GitLab Shell | `16.3` | Source Code | [Implement](https://gitlab.com/gitlab-org/gitlab-shell/-/merge_requests/812) authentication using SSH certificates in GitLab Shell |
+| GitLab Rails | `16.4` | Source Code | Implement the internal GitLab Rails API endpoint `authorized_certs` to find a group that configures the `CA.pub` |
+| GitLab Rails | `16.4` | Source Code | Implement a GitLab Rails API endpoint for groups to add/remove a `CA.pub` |
+| GitLab Rails | `Next 2-3 milestones` | Authentication and Authorization | Implement Group Settings UX to add/remove `CA.pub` |
+| GitLab Rails | `Next 2-3 milestones` | Authentication and Authorization | Implement an option to enforce using SSH certificates only for authentication and forbid personal SSH keys and access tokens |
+
+## Follow-ups
+
+Functionality that is related to this topic but out of scope for this blueprint:
+
+- Enforce using SSH certificates only for authentication by disabling Git over HTTPS for an instance or a Group: a must-have functionality.
+- Enforce using Group-level SSH keys only by forbidding the use of personal SSH keys for an instance or a Group: a must-have functionality.
+- Specifying `source-address` `Critical Option` that restricts usage of a user certificate to a set of IP addresses: a nice-to-have functionality.
+- Specifying `login@hostname=username` `Extensions` that restrict usage of a user certificate to a set of instances: a nice-to-have functionality.
+- Signing commits using SSH certificates: a nice-to-have functionality.
+- Revoking a single user certificate: it requires complex UI/UX, while the risk can be significantly mitigated by using other features.