commit    e4384360a16dd9a19d4d2d25d0ef1f2b862ed2a6 (patch)
author    GitLab Bot <gitlab-bot@gitlab.com> 2023-07-19 17:16:28 +0300
committer GitLab Bot <gitlab-bot@gitlab.com> 2023-07-19 17:16:28 +0300
tree      2fcdfa7dcdb9db8f5208b2562f4b4e803d671243 /doc/architecture
parent    ffda4e7bcac36987f936b4ba515995a6698698f0 (diff)

Add latest changes from gitlab-org/gitlab@16-2-stable-ee (v16.2.0-rc42)
Diffstat (limited to 'doc/architecture')
-rw-r--r-- doc/architecture/blueprints/ai_gateway/img/architecture.png | bin 0 -> 378194 bytes
-rw-r--r-- doc/architecture/blueprints/ai_gateway/index.md | 463
-rw-r--r-- doc/architecture/blueprints/cells/cells-feature-backups.md | 6
-rw-r--r-- doc/architecture/blueprints/cells/cells-feature-contributions-forks.md | 6
-rw-r--r-- doc/architecture/blueprints/cells/cells-feature-secrets.md | 4
-rw-r--r-- doc/architecture/blueprints/cells/index.md | 10
-rw-r--r-- doc/architecture/blueprints/cells/proposal-stateless-router-with-buffering-requests.md | 2
-rw-r--r-- doc/architecture/blueprints/cells/proposal-stateless-router-with-routes-learning.md | 2
-rw-r--r-- doc/architecture/blueprints/ci_data_decay/index.md | 2
-rw-r--r-- doc/architecture/blueprints/ci_pipeline_components/index.md | 6
-rw-r--r-- doc/architecture/blueprints/ci_scale/index.md | 2
-rw-r--r-- doc/architecture/blueprints/clickhouse_ingestion_pipeline/index.md | 4
-rw-r--r-- doc/architecture/blueprints/clickhouse_usage/index.md | 37
-rw-r--r-- doc/architecture/blueprints/code_search_with_zoekt/index.md | 2
-rw-r--r-- doc/architecture/blueprints/consolidating_groups_and_projects/index.md | 2
-rw-r--r-- doc/architecture/blueprints/container_registry_metadata_database/index.md | 2
-rw-r--r-- doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md | 238
-rw-r--r-- doc/architecture/blueprints/database/automated_query_analysis/index.md | 2
-rw-r--r-- doc/architecture/blueprints/gitlab_agent_deployments/index.md | 171
-rw-r--r-- doc/architecture/blueprints/gitlab_ci_events/index.md | 13
-rw-r--r-- doc/architecture/blueprints/gitlab_ci_events/proposal-1-using-the-gitlab-ci-file.md | 24
-rw-r--r-- doc/architecture/blueprints/gitlab_ci_events/proposal-2-using-the-rules-keyword.md | 23
-rw-r--r-- doc/architecture/blueprints/gitlab_ci_events/proposal-3-using-the-gitlab-ci-events-folder.md | 13
-rw-r--r-- doc/architecture/blueprints/gitlab_ci_events/proposal-4-creating-events-via-ci-files.md | 26
-rw-r--r-- doc/architecture/blueprints/gitlab_ci_events/proposal-5-combined-proposal.md | 99
-rw-r--r-- doc/architecture/blueprints/gitlab_observability_backend/index.md (renamed from doc/architecture/blueprints/gitlab_observability_backend/metrics/index.md) | 81
-rw-r--r-- doc/architecture/blueprints/gitlab_observability_backend/supported-deployments.png (renamed from doc/architecture/blueprints/gitlab_observability_backend/metrics/supported-deployments.png) | bin 74153 -> 74153 bytes
-rw-r--r-- doc/architecture/blueprints/modular_monolith/bounded_contexts.md | 119
-rw-r--r-- doc/architecture/blueprints/modular_monolith/hexagonal_monolith/hexagonal_architecture.png | bin 0 -> 33135 bytes
-rw-r--r-- doc/architecture/blueprints/modular_monolith/hexagonal_monolith/index.md | 132
-rw-r--r-- doc/architecture/blueprints/modular_monolith/index.md | 112
-rw-r--r-- doc/architecture/blueprints/modular_monolith/proof_of_concepts.md | 134
-rw-r--r-- doc/architecture/blueprints/modular_monolith/references.md | 70
-rw-r--r-- doc/architecture/blueprints/object_pools/index.md | 4
-rw-r--r-- doc/architecture/blueprints/observability_tracing/arch.png | bin 0 -> 53192 bytes
-rw-r--r-- doc/architecture/blueprints/observability_tracing/index.md | 171
-rw-r--r-- doc/architecture/blueprints/organization/index.md | 192
-rw-r--r-- doc/architecture/blueprints/rate_limiting/index.md | 2
-rw-r--r-- doc/architecture/blueprints/remote_development/index.md | 4
-rw-r--r-- doc/architecture/blueprints/repository_backups/index.md | 268
-rw-r--r-- doc/architecture/blueprints/runner_admission_controller/index.md | 241
-rw-r--r-- doc/architecture/blueprints/runner_scaling/index.md | 6
-rw-r--r-- doc/architecture/blueprints/runner_tokens/index.md | 4
-rw-r--r-- doc/architecture/blueprints/work_items/index.md | 22
44 files changed, 2392 insertions, 329 deletions
diff --git a/doc/architecture/blueprints/ai_gateway/img/architecture.png b/doc/architecture/blueprints/ai_gateway/img/architecture.png
new file mode 100644
index 00000000000..dea8b5ddb45
--- /dev/null
+++ b/doc/architecture/blueprints/ai_gateway/img/architecture.png
Binary files differ
diff --git a/doc/architecture/blueprints/ai_gateway/index.md b/doc/architecture/blueprints/ai_gateway/index.md
new file mode 100644
index 00000000000..c9947723739
--- /dev/null
+++ b/doc/architecture/blueprints/ai_gateway/index.md
@@ -0,0 +1,463 @@
+---
+status: ongoing
+creation-date: "2023-07-14"
+authors: [ "@reprazent" ]
+coach: [ "@andrewn", "@stanhu" ]
+approvers: [ "@m_gill", "@mksionek", "@marin" ]
+owning-stage: "~devops::modelops"
+participating-stages: []
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+# AI-gateway
+
+## Summary
+
+The AI-gateway is a standalone service that will give access to AI
+features to all users of GitLab, no matter which instance they are
+using: self-managed, dedicated or GitLab.com.
+
+Initially, all AI-gateway deployments will be managed by GitLab (the
+organization), and GitLab.com and all GitLab self-managed instances
+will use the same gateway. However, in the future we could also deploy
+regional gateways, or even customer-specific gateways if the need
+arises.
+
+The AI-gateway is an API gateway that takes traffic from clients, in
+this case GitLab installations, and directs it to different
+services, in this case AI providers and their models. This North/South
+traffic pattern allows us to control what requests go where and to
+translate the content of the redirected request where needed.
+
+![architecture diagram](img/architecture.png)
+
+[src of the architecture diagram](https://docs.google.com/drawings/d/1PYl5Q5oWHnQAuxM-Jcw0C3eYoGw8a9w8atFpoLhhEas/edit)
+
+By using a hosted service under the control of GitLab we can ensure
+that we provide all GitLab instances with AI features in a scalable
+way. It is easier to scale this small stateless service than to scale
+GitLab-rails with its dependencies (database, Redis).
+
+It allows users of self-managed installations to have access to
+features using AI without them having to host their own models or
+connect to 3rd party providers.
+
+## Language: Python
+
+The AI-Gateway was originally started as the "model-gateway" that
+handled requests from IDEs to provide code suggestions. It was written
+in Python.
+
+Python is an object-oriented language that is familiar enough for
+Rubyists to pick up while working in the younger codebase that is the
+AI-gateway. It also makes it easy for data and ML engineers who
+already have Python experience to contribute.
+
+## API
+
+### Basic stable API for the AI-gateway
+
+Because the API of the AI-gateway will be consumed by a wide variety
+of GitLab instances, it is important that we design a stable, yet
+flexible API.
+
+To do this, we can implement an API-endpoint per use-case we
+build. This means that the interface between GitLab and the AI-gateway
+is one that we build and own. This ensures future scalability,
+composability and security.
+
+The API is not versioned, but is backward compatible. See [cross version compatibility](#cross-version-compatibility)
+for details. The AI-gateway will support the last 2 major
+versions. For example, when working on GitLab 17.2, we would support
+both GitLab 17 and GitLab 16.
+
+We can add common functionality like rate-limiting, circuit-breakers and
+secret redaction at this level of the stack as well as in GitLab-rails.
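+
+As an illustration of what such shared functionality could look like in the
+AI-gateway's FastAPI app, here is a minimal in-process rate-limiting sketch.
+It is an assumption, not the gateway's actual middleware: the header name and
+limits are made up, and a production version would use a shared store such as
+Redis and a proper client identity.
+
+```python
+# Minimal sketch only: header name, limits, and in-memory counters are assumptions.
+import time
+from collections import defaultdict
+
+from fastapi import FastAPI, Request
+from fastapi.responses import JSONResponse
+
+app = FastAPI()
+
+WINDOW_SECONDS = 60
+MAX_REQUESTS_PER_WINDOW = 100
+_counters: dict[tuple[str, int], int] = defaultdict(int)
+
+
+@app.middleware("http")
+async def rate_limit(request: Request, call_next):
+    # Identify the calling GitLab instance; the header name is hypothetical.
+    client = request.headers.get("X-Gitlab-Instance-Id", "unknown")
+    window = int(time.time() // WINDOW_SECONDS)
+    _counters[(client, window)] += 1
+    if _counters[(client, window)] > MAX_REQUESTS_PER_WINDOW:
+        return JSONResponse(status_code=429, content={"error": "rate limit exceeded"})
+    return await call_next(request)
+```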
+
+#### Protocol
+
+We're choosing to use a simple JSON API for the AI-gateway
+service. This allows us to re-use a lot of what is already in place in
+the current model-gateway. It also allows us to make the endpoints
+version agnostic. We could have an API that expects only a rudimentary
+envelope that can contain dynamic information. We should make sure
+that we make these APIs compatible with multiple versions of GitLab,
+or other clients that use the gateway through GitLab. **This means
+that all client versions talk to the same API endpoint; the AI-gateway
+needs to support this, but we don't need to support different
+endpoints per version**.
+
+We also considered gRPC as the protocol for communication between
+GitLab instances and the AI-gateway. The two protocols differ on these items:
+
+| gRPC | REST + JSON |
+|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
+| + Strict protocol definition that is easier to evolve versionless | - No strict schema, so the implementation needs to take good care of supporting multiple versions |
+| + A new Ruby-gRPC server for vscode: likely faster because we can limit dependencies to load ([modular monolith](https://gitlab.com/gitlab-org/gitlab/-/issues/365293)) | - Existing Grape API for vscode: meaning slow boot time and unneeded resources loaded |
+| + Bi-directional streaming | - No straightforward way to stream requests and responses (could still be added) |
+| - A new Python-gRPC server: we don't have experience running gRPC-Python servers | + Existing Python fastapi server, already running for code suggestions to extend |
+| - Hard to pass on unknown messages from vscode through GitLab to the AI-gateway | + Easier support for newer vscode + newer AI-gateway, through an old GitLab instance |
+| - Unknown support for gRPC in other clients (vscode, jetbrains, other editors) | + Support in all external clients |
+| - Possible protocol mismatch (VSCode --REST--> Rails --gRPC--> AI gateway) | + Same protocol across the stack |
+
+**Discussion:** Choosing REST+JSON in this iteration to port
+features that already partially exist does not mean we need to exclude
+gRPC or WebSockets for new features. For example, chat features
+might be better served by streaming requests and responses. Since we
+are suggesting an endpoint per use-case, different features could also
+opt for different protocols, as long as we keep cross-version
+compatibility in mind.
+
+#### Single purpose endpoints
+
+For features using AI, we prefer building a single purpose endpoint
+with a stable API over the [provider API we expose](#exposing-ai-providers)
+as a direct proxy.
+
+Some features will have specific endpoints, while others can share
+endpoints. For example, code suggestions or chat could have their own
+endpoint, while several features that summarize issues or merge
+requests could use the same endpoint but make the distinction on what
+information is provided in the payload.
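+
+As an illustration of a shared endpoint (the endpoint name and fields below are
+hypothetical, not a defined part of the API), an issue summary and a merge
+request summary could both be requested like this, differing only in the
+payload:
+
+```plaintext
+POST /internal/summarization/summaries
+```
+
+```json
+{
+  "prompt_components": [
+    {
+      "type": "resource_content",
+      "metadata": {
+        "source": "GitLab EE",
+        "version": "16.3"
+      },
+      "payload": {
+        "resource_type": "merge_request",
+        "title": "Add rate limiting to the AI-gateway",
+        "description": "..."
+      }
+    }
+  ]
+}
+```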
+
+The end goal is to build an API that exposes AI for building
+features without having to touch the AI-gateway. This is analogous to
+how we built Gitaly, adding features to Gitaly where it was needed,
+and reusing existing endpoints when that was possible. We had some
+cost to pay up-front when we needed to implement a new
+endpoint (RPC), but this pays off in the long run once most of the required
+functionality is implemented.
+
+**This does not mean that prompts need to be built inside the
+AI-gateway.** But if prompts are part of the payload to a single
+purpose endpoint, the payload needs to specify which model they were
+built for along with other metadata about the prompts. By doing this,
+we can gracefully degrade or otherwise try to support the request if
+one of the prompt payloads is no longer supported by the AI
+gateway. It allows us to potentially avoid breaking features in older
+GitLab installations as the AI landscape changes.
+
+#### Cross version compatibility
+
+When building single purpose endpoints, we should be mindful that
+these endpoints will be used by different GitLab instances and, indirectly,
+by external clients. To achieve this, we use a very simple envelope
+to provide information. It contains a series of `prompt_components`
+that hold information the AI-gateway can use to build prompts and
+query the model it selects.
+
+Each prompt component contains 3 elements:
+
+- `type`: This is the kind of information that is being presented in
+ `payload`. The AI-gateway should ignore any types it does not know
+ about.
+- `payload`: The actual information that can be used by the AI-gateway
+ to build the payload that is going to go out to AI providers. The
+ payload will be different depending on the type, and the version of
+ the client providing the payload. This means that the AI-gateway
+ needs to consider all fields optional.
+- `metadata`: Information about the client that built this part of the
+  prompt. This may or may not be used by GitLab for
+  telemetry. Nothing inside this field should be required.
+
+The only component in there that is likely to change often is the
+`payload` one. There we need to make sure that all fields are
+optional, and avoid renaming, removing or repurposing fields.
+
+When such a change is unavoidable, we need to build support for the old versions of
+a field in the gateway, and keep them around for at least 2 major
+versions of GitLab. For example, we could consider adding 2 versions
+of a prompt to the `prompt_components` payload:
+
+```json
+{
+  "prompt_components": [
+    {
+      "type": "prompt",
+      "metadata": {
+        "source": "GitLab EE",
+        "version": "16.3"
+      },
+      "payload": {
+        "content": "You can fetch information about a resource called an issue...",
+        "params": {
+          "temperature": 0.2,
+          "maxOutputTokens": 1024
+        },
+        "model": "text-bison",
+        "provider": "vertex-ai"
+      }
+    },
+    {
+      "type": "prompt",
+      "metadata": {
+        "source": "GitLab EE",
+        "version": "16.3"
+      },
+      "payload": {
+        "content": "System: You can fetch information about a resource called an issue...\n\nHuman:",
+        "params": {
+          "temperature": 0.2
+        },
+        "model": "claude-2",
+        "provider": "anthropic"
+      }
+    }
+  ]
+}
+```
+
+This allows the API to direct the prompt to either provider, based on
+what is in the payload.
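+
+A sketch of how the AI-gateway could pick between such components is shown
+below. This is illustrative only: the function and the supported-model list
+are assumptions, not the gateway's actual code.
+
+```python
+# Illustrative sketch: names and the supported-model list are assumptions.
+SUPPORTED_MODELS = {("vertex-ai", "text-bison"), ("anthropic", "claude-2")}
+
+
+def select_prompt(prompt_components):
+    """Return the first prompt payload that targets a model we can still serve."""
+    for component in prompt_components:
+        if component.get("type") != "prompt":
+            # Unknown component types are ignored rather than rejected.
+            continue
+        payload = component.get("payload", {})
+        if (payload.get("provider"), payload.get("model")) in SUPPORTED_MODELS:
+            return payload
+    # Nothing usable: the caller can degrade gracefully instead of failing hard.
+    return None
+```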
+
+To document and validate the content of `payload`, we can specify its
+format using [JSON-schema](https://json-schema.org/).
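+
+A minimal sketch of such a schema for the `prompt` payload, mirroring the
+example above (the schema itself is illustrative and keeps every field
+optional, as required for cross-version compatibility):
+
+```json
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "title": "prompt payload",
+  "type": "object",
+  "properties": {
+    "content": { "type": "string" },
+    "params": { "type": "object" },
+    "model": { "type": "string" },
+    "provider": { "type": "string" }
+  },
+  "additionalProperties": true
+}
+```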
+
+#### Example feature: code suggestions
+
+For example, a rough code suggestions service could look like this:
+
+```plaintext
+POST /internal/code-suggestions/completions
+```
+
+```json
+{
+ "prompt_components": [
+ {
+ "type": "prompt",
+ "metadata": {
+ "source": "GitLab EE",
+        "version": "16.3"
+ },
+ "payload": {
+ "content": "...",
+ "params": {
+ "temperature": 0.2,
+ "maxOutputTokens": 1024
+ },
+ "model": "code-gecko",
+ "provider": "vertex-ai"
+ }
+ },
+ {
+ "type": "editor_content",
+ "metadata": {
+ "source": "vscode",
+ "version": "1.1.1"
+ },
+ "payload": {
+ "filename": "application.rb",
+ "before_cursor": "require 'active_record/railtie'",
+ "after_cursor": "\nrequire 'action_controller/railtie'",
+ "open_files": [
+ {
+ "filename": "app/controllers/application_controller.rb",
+ "content": "class ApplicationController < ActionController::Base..."
+ }
+ ]
+ }
+ }
+ ]
+}
+```
+
+A response could look like this:
+
+```json
+{
+ "response": "require 'something/else'",
+ "metadata": {
+ "identifier": "deadbeef",
+ "model": "code-gecko",
+ "timestamp": 1688118443
+ }
+}
+```
+
+The `metadata` field contains information that could be used in a
+telemetry endpoint on the AI-gateway where we could count
+suggestion-acceptance rates among other things.
+
+The way we will receive telemetry for code suggestions is being
+discussed in [#415745](https://gitlab.com/gitlab-org/gitlab/-/issues/415745).
+We will try to come up with an architecture for all AI-related features.
+
+#### Exposing AI providers
+
+A lot of AI functionality has already been built into GitLab-Rails
+that currently builds prompts and submits this directly to different
+AI providers. At the time of writing, GitLab has API-clients for the
+following providers:
+
+- [Anthropic](https://gitlab.com/gitlab-org/gitlab/blob/4344729240496a5018e19a82030d6d4b227e9c79/ee/lib/gitlab/llm/anthropic/client.rb#L6)
+- [Vertex](https://gitlab.com/gitlab-org/gitlab/blob/4344729240496a5018e19a82030d6d4b227e9c79/ee/lib/gitlab/llm/vertex_ai/client.rb#L6)
+- [OpenAI](https://gitlab.com/gitlab-org/gitlab/blob/4344729240496a5018e19a82030d6d4b227e9c79/ee/lib/gitlab/llm/open_ai/client.rb#L8)
+
+To make these features available to self-managed instances, we should
+provide endpoints for each of these providers that GitLab.com, self-managed, or
+dedicated installations can use to give their customers access to these
+features.
+
+In a first iteration we could build endpoints that proxy the request
+to the AI provider. This should make it easier to migrate to routing
+these requests through the AI-Gateway. As an example, the endpoint for
+Anthropic could look like this:
+
+```plaintext
+POST /internal/proxy/anthropic/(*endpoint)
+```
+
+The `*endpoint` means that the client specifies what is going to be
+called, for example `/v1/complete`. The request body is entirely
+forwarded to the AI provider. The AI-gateway makes sure the request is
+correctly authenticated.
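+
+For illustration, a proxied call could look like the following. The body fields
+mirror Anthropic's completion API at the time of writing and are shown only as
+an example of pass-through forwarding; treat the exact fields as an assumption.
+
+```plaintext
+POST /internal/proxy/anthropic/v1/complete
+```
+
+```json
+{
+  "model": "claude-2",
+  "prompt": "\n\nHuman: Summarize this issue...\n\nAssistant:",
+  "max_tokens_to_sample": 256
+}
+```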
+
+Having the proxy in between GitLab and the AI provider means that we
+still have control over what goes through to the AI provider and if
+the need arises, we can manipulate or reroute the request to a
+different provider. Doing this means that we could keep supporting
+the features of older GitLab installations even if the provider's API
+changes or we decide not to work with a certain provider anymore.
+
+I think there is value in moving features that use provider APIs
+directly to a feature-specific, purpose-built API. Doing this means
+that we can improve these features as AI providers evolve by changing
+the AI-gateway that is under our control. Customers using self-managed
+or dedicated installations could then start getting better
+AI-supported features without having to upgrade their GitLab instance.
+
+Features that are currently
+[experimental](../../../policy/experiment-beta-support.md#experiment)
+can use these generic APIs, but we should aim to convert to a single
+purpose API endpoint before we make the feature [generally available](../../../policy/experiment-beta-support.md#generally-available-ga)
+for self-managed installations. This makes it easier for us to support
+features long-term even if the landscape of AI providers changes.
+
+The [Experimental REST API](../../../development/ai_features.md#experimental-rest-api)
+available to GitLab team members should also use this proxy in the
+short term. In the longer term, we should provide developers access to
+a separate proxy that allows them to use GitLab owned authentication
+to several AI providers for experimentation. This will separate the
+traffic from developers trying out new things from the fleet that is
+serving paying customers.
+
+### API in GitLab instances
+
+This is the API that external clients can consume on their local
+GitLab instance. For example VSCode that talks to a self-managed
+instance.
+
+These versions could also differ widely: the VSCode
+extension might be kept up-to-date by developers, while the GitLab instance
+they use for work is kept a minor version behind. So the same
+requirements in terms of stability and flexibility apply for the
+clients as for the AI gateway.
+
+In a first iteration we could consider keeping the current REST
+payloads that the VSCode extension and the Web-IDE send, but direct them
+to the appropriate GitLab installation. GitLab-rails can wrap the
+payload in an envelope for the AI-gateway without having to interpret
+it.
+
+When we do this, the GitLab instance that receives the request
+from the extension doesn't need to understand it to enrich it and pass
+it on to the AI-Gateway. GitLab can add information to the
+`prompt_components` and pass everything that was already there
+straight through to the AI-gateway.
+
+If a request is initiated from another client (for example VSCode),
+GitLab-rails needs to forward the entire payload in addition to any
+other enhancements and prompts. This is required so we can potentially
+support changes from a newer version of the client, traveling through
+an outdated GitLab installation to a recent AI-gateway.
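+
+A sketch of what such forwarding could look like: the client's original
+component is passed through untouched and GitLab-rails only appends its own
+components (field values below are illustrative):
+
+```json
+{
+  "prompt_components": [
+    {
+      "type": "editor_content",
+      "metadata": { "source": "vscode", "version": "1.1.1" },
+      "payload": { "filename": "application.rb", "before_cursor": "..." }
+    },
+    {
+      "type": "prompt",
+      "metadata": { "source": "GitLab EE", "version": "16.3" },
+      "payload": { "content": "...", "model": "code-gecko", "provider": "vertex-ai" }
+    }
+  ]
+}
+```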
+
+**Discussion:** This first iteration is also using a REST+JSON
+approach. This is how the VSCode extension is currently communicating
+with the model gateway. This means that it's a smaller iteration to go
+from that to wrapping that existing payload into an envelope, with
+the added advantage of cross-version compatibility. But it does not
+mean that future iterations also need to use REST+JSON. As each
+feature would have its own endpoint, the protocol could also be
+different.
+
+## Authentication & Authorization
+
+GitLab will provide the first layer of authorization: it authenticates
+the user and checks if the license allows using the feature the user is
+trying to use. This can be done using the authentication and license
+checks that are already built into GitLab.
+
+Authenticating the GitLab-instance on the AI-gateway will be discussed
+in [#177](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/177).
+Because the AI-gateway exposes proxied endpoints to AI providers, it
+is important that the authentication tokens have limited validity.
+
+## Embeddings
+
+Embeddings can be requested for all features in a single endpoint, for
+example through a request like this:
+
+```plaintext
+POST /internal/embeddings
+```
+
+```json
+{
+ "content": "The lazy fox and the jumping dog",
+ "content_type": "issue_title",
+ "metadata": {
+ "source": "GitLab EE",
+ "version": "16.3"
+ }
+}
+```
+
+The `content_type` and `content` properties could in the future be
+used to create embeddings from different models based on what is
+appropriate.
+
+The response will include the embedding vector in addition to the
+provider and model used. For example:
+
+```json
+{
+ "response": [0.2, -1, ...],
+ "metadata": {
+ "identifier": "8badf00d",
+ "model": "text-embedding-ada-002",
+    "provider": "open_ai"
+ }
+}
+```
+
+When storing the embedding, we should make sure we include the model
+and provider data. When embeddings are used to generate a prompt, we
+could include that metadata in the payload so we can judge the quality
+of the embedding.
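+
+For example, a stored record could keep the vector together with the metadata
+from the response (the exact shape of the record is illustrative):
+
+```json
+{
+  "content_type": "issue_title",
+  "embedding": [0.2, -1],
+  "model": "text-embedding-ada-002",
+  "provider": "open_ai"
+}
+```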
+
+## Deployment
+
+Currently, the model-gateway that will become the AI-gateway is
+deployed from the project repository in
+[`gitlab-org/modelops/applied-ml/code-suggestions/ai-assist`](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist).
+
+It is deployed to a Kubernetes cluster in its own project. There is a
+staging environment that is currently used directly by engineers for
+testing.
+
+In the future, this will be deployed using
+[Runway](https://gitlab.com/gitlab-com/gl-infra/platform/runway/). At
+that time, there will be a production and staging deployment. The
+staging deployment can be used for automated QA-runs that will have
+the potential to stop a deployment from reaching production.
+
+Further testing strategy is being discussed in
+[&10563](https://gitlab.com/groups/gitlab-org/-/epics/10563).
+
+## Alternative solutions
+
+Alternative solutions were discussed in
+[applied-ml/code-suggestions/ai-assist#161](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/161#what-are-the-alternatives).
diff --git a/doc/architecture/blueprints/cells/cells-feature-backups.md b/doc/architecture/blueprints/cells/cells-feature-backups.md
index d596bdd2078..b5d5d7afdcf 100644
--- a/doc/architecture/blueprints/cells/cells-feature-backups.md
+++ b/doc/architecture/blueprints/cells/cells-feature-backups.md
@@ -25,10 +25,10 @@ and also Git repository data.
## 2. Data flow
-Each cell has a number of application databases to back up (e.g. `main`, and `ci`).
+Each cell has a number of application databases to back up (for example, `main`, and `ci`).
-Additionally, there may be cluster-wide metadata tables (e.g. `users` table)
-which is directly accesible via PostgreSQL.
+Additionally, there may be cluster-wide metadata tables (for example, `users` table)
+which is directly accessible via PostgreSQL.
## 3. Proposal
diff --git a/doc/architecture/blueprints/cells/cells-feature-contributions-forks.md b/doc/architecture/blueprints/cells/cells-feature-contributions-forks.md
index 3e498c24144..8a67383c5e4 100644
--- a/doc/architecture/blueprints/cells/cells-feature-contributions-forks.md
+++ b/doc/architecture/blueprints/cells/cells-feature-contributions-forks.md
@@ -30,13 +30,13 @@ with various usage patterns:
Forks allow users not having write access to parent project to make changes. The forking workflow
is especially important for the Open Source community which is able to contribute back
to public projects. However, it is equally important in some companies which prefer the strong split
-of responsibilites and tighter access control. The access to project is restricted
+of responsibilities and tighter access control. The access to project is restricted
to designated list of developers.
Forks enable:
-- tigther control of who can modify the upstream project
-- split of the responsibilites: parent project might use CI configuration connecting to production systems
+- tighter control of who can modify the upstream project
+- split of the responsibilities: parent project might use CI configuration connecting to production systems
- run CI pipelines in context of fork in much more restrictive environment
- consider all forks to be unveted which reduces risks of leaking secrets, or any other information
tied with the project
diff --git a/doc/architecture/blueprints/cells/cells-feature-secrets.md b/doc/architecture/blueprints/cells/cells-feature-secrets.md
index 20260c89ccd..50ccf926b4d 100644
--- a/doc/architecture/blueprints/cells/cells-feature-secrets.md
+++ b/doc/architecture/blueprints/cells/cells-feature-secrets.md
@@ -25,10 +25,10 @@ GitLab has a lot of
[secrets](https://docs.gitlab.com/charts/installation/secrets.html) that needs
to be configured.
-Some secrets are for inter-component communication, e.g. `GitLab Shell secret`,
+Some secrets are for inter-component communication, for example, `GitLab Shell secret`,
and used only within a cell.
-Some secrets are used for features, e.g. `ci_jwt_signing_key`.
+Some secrets are used for features, for example, `ci_jwt_signing_key`.
## 2. Data flow
diff --git a/doc/architecture/blueprints/cells/index.md b/doc/architecture/blueprints/cells/index.md
index 6da99e0aa6a..dcd28707890 100644
--- a/doc/architecture/blueprints/cells/index.md
+++ b/doc/architecture/blueprints/cells/index.md
@@ -190,7 +190,7 @@ information. For example:
by one of the Cells, and the results of that can be cached. We also need to implement
a mechanism for negative cache and cache eviction.
-1. **GraphQL and other ambigious endpoints.**
+1. **GraphQL and other ambiguous endpoints.**
Most endpoints have a unique sharding key: the organization, which directly
or indirectly (via a group or project) can be used to classify endpoints.
@@ -281,24 +281,24 @@ changes to prepare the codebase for data split.
One iteration describes one quarter's worth of work.
-1. Iteration 1 - FY24Q1
+1. [Iteration 1](https://gitlab.com/groups/gitlab-org/-/epics/9667) - FY24Q1
- Data access layer: Initial Admin Area settings are shared across cluster.
- Essential workflows: Allow to share cluster-wide data with database-level data access layer
-1. Iteration 2 - FY24Q2
+1. [Iteration 2](https://gitlab.com/groups/gitlab-org/-/epics/9813) - FY24Q2
- Essential workflows: User accounts are shared across cluster.
- Essential workflows: User can create group.
-1. Iteration 3 - FY24Q3
+1. [Iteration 3](https://gitlab.com/groups/gitlab-org/-/epics/10997) - FY24Q3
- Essential workflows: User can create project.
- Essential workflows: User can push to Git repository.
- Cell deployment: Extend GitLab Dedicated to support GCP
- Routing: Technology.
-1. Iteration 4 - FY24Q4
+1. [Iteration 4](https://gitlab.com/groups/gitlab-org/-/epics/10998) - FY24Q4
- Essential workflows: User can run CI pipeline.
- Essential workflows: User can create issue, merge request, and merge it after it is green.
diff --git a/doc/architecture/blueprints/cells/proposal-stateless-router-with-buffering-requests.md b/doc/architecture/blueprints/cells/proposal-stateless-router-with-buffering-requests.md
index f352fea84b1..c1ca0c60dcd 100644
--- a/doc/architecture/blueprints/cells/proposal-stateless-router-with-buffering-requests.md
+++ b/doc/architecture/blueprints/cells/proposal-stateless-router-with-buffering-requests.md
@@ -429,7 +429,7 @@ sequenceDiagram
```
In this case the user is not on their "default organization" so their TODO
-counter will not include their normal todos. We may choose to highlight this in
+counter will not include their typical todos. We may choose to highlight this in
the UI somewhere. A future iteration may be able to fetch that for them from
their default organization.
diff --git a/doc/architecture/blueprints/cells/proposal-stateless-router-with-routes-learning.md b/doc/architecture/blueprints/cells/proposal-stateless-router-with-routes-learning.md
index aadc08016e3..3b3d481914f 100644
--- a/doc/architecture/blueprints/cells/proposal-stateless-router-with-routes-learning.md
+++ b/doc/architecture/blueprints/cells/proposal-stateless-router-with-routes-learning.md
@@ -452,7 +452,7 @@ sequenceDiagram
```
In this case the user is not on their "default organization" so their TODO
-counter will not include their normal todos. We may choose to highlight this in
+counter will not include their typical todos. We may choose to highlight this in
the UI somewhere. A future iteration may be able to fetch that for them from
their default organization.
diff --git a/doc/architecture/blueprints/ci_data_decay/index.md b/doc/architecture/blueprints/ci_data_decay/index.md
index 2eac27def18..25ce71a3a6f 100644
--- a/doc/architecture/blueprints/ci_data_decay/index.md
+++ b/doc/architecture/blueprints/ci_data_decay/index.md
@@ -55,7 +55,7 @@ of the primary database, to a different storage, that is more performant and
cost effective.
It is already possible to prevent processing builds
-[that have been archived](../../../user/admin_area/settings/continuous_integration.md#archive-jobs).
+[that have been archived](../../../administration/settings/continuous_integration.md#archive-jobs).
When a build gets archived it will not be possible to retry it, but we still do
keep all the processing metadata in the database, and it consumes resources
that are scarce in the primary database.
diff --git a/doc/architecture/blueprints/ci_pipeline_components/index.md b/doc/architecture/blueprints/ci_pipeline_components/index.md
index 667c8d225db..9a8084f290b 100644
--- a/doc/architecture/blueprints/ci_pipeline_components/index.md
+++ b/doc/architecture/blueprints/ci_pipeline_components/index.md
@@ -12,6 +12,9 @@ participating-stages: []
# CI/CD Catalog
+NOTE:
+This document covers the future plans for the CI/CD Catalog feature. For information on the features already available for use in GitLab, see the [CI/CD Components documentation](../../../ci/components/index.md).
+
## Summary
## Goals
@@ -227,8 +230,7 @@ The version of the component can be (in order of highest priority first):
1. A commit SHA - For example: `gitlab.com/gitlab-org/dast@e3262fdd0914fa823210cdb79a8c421e2cef79d8`
1. A tag - For example: `gitlab.com/gitlab-org/dast@1.0`
-1. A special moving target version that points to the most recent released tag. The target project must be
-explicitly marked as a [catalog resource](#catalog-resource) - For example: `gitlab.com/gitlab-org/dast@~latest`
+1. A special moving target version that points to the most recent released tag - For example: `gitlab.com/gitlab-org/dast@~latest`
1. A branch name - For example: `gitlab.com/gitlab-org/dast@master`
If a tag and branch exist with the same name, the tag takes precedence over the branch.
diff --git a/doc/architecture/blueprints/ci_scale/index.md b/doc/architecture/blueprints/ci_scale/index.md
index 3a6ed4ae9b1..340e6557a72 100644
--- a/doc/architecture/blueprints/ci_scale/index.md
+++ b/doc/architecture/blueprints/ci_scale/index.md
@@ -137,7 +137,7 @@ stores more than 600 gigabytes of data, and `ci_builds.yaml_variables` more
than 300 gigabytes (as of February 2021).
It is a lot of data that needs to be reliably moved to a different place.
-Unfortunately, right now, our [background migrations](../../../development/database/background_migrations.md)
+Unfortunately, right now, our background migrations
are not reliable enough to migrate this amount of data at scale. We need to
build mechanisms that will give us confidence in moving this data between
columns, tables, partitions or database shards.
diff --git a/doc/architecture/blueprints/clickhouse_ingestion_pipeline/index.md b/doc/architecture/blueprints/clickhouse_ingestion_pipeline/index.md
index 6645f390fd1..66089085d0d 100644
--- a/doc/architecture/blueprints/clickhouse_ingestion_pipeline/index.md
+++ b/doc/architecture/blueprints/clickhouse_ingestion_pipeline/index.md
@@ -129,7 +129,7 @@ We're also not addressing any client-side specific details into the design at th
## General Considerations
-Having addressed the details of the two aformentioned problem-domains, we can model a proposed solution with the following logical structure:
+Having addressed the details of the two aforementioned problem-domains, we can model a proposed solution with the following logical structure:
- Ingestion
- APIs/SDKs
@@ -208,7 +208,7 @@ Gitlab::Database::Writer.config do |config|
# then backend-specific configurations hereafter
#
config.url = 'tcp://user:pwd@localhost:9000/database'
- # e.g. a serializer helps define how data travels over the wire
+ # for example, a serializer helps define how data travels over the wire
config.json_serializer = ClickHouse::Serializer::JsonSerializer
# ...
end
diff --git a/doc/architecture/blueprints/clickhouse_usage/index.md b/doc/architecture/blueprints/clickhouse_usage/index.md
index 8a5530313e5..3febb09f0bf 100644
--- a/doc/architecture/blueprints/clickhouse_usage/index.md
+++ b/doc/architecture/blueprints/clickhouse_usage/index.md
@@ -20,6 +20,12 @@ participating-stages: ["~section::ops", "~section::dev"]
In FY23-Q2, the Monitor:Observability team developed and shipped a [ClickHouse data platform](https://gitlab.com/groups/gitlab-org/-/epics/7772) to store and query data for Error Tracking and other observability features. Other teams have also begun to incorporate ClickHouse into their current or planned architectures. Given the growing interest in ClickHouse across product development teams, it is important to have a cohesive strategy for developing features using ClickHouse. This will allow teams to more efficiently leverage ClickHouse and ensure that we can maintain and support this functionality effectively for SaaS and self-managed customers.
+### Use Cases
+
+Many product teams at GitLab are considering ClickHouse when developing new features and improving the performance of existing features.
+
+During the start of the ClickHouse working group, we [documented existing and potential use cases](https://gitlab.com/groups/gitlab-com/-/epics/2075#use-cases) and found that there was interest in ClickHouse from teams across all DevSecOps stage groups.
+
### Goals
As ClickHouse has already been selected for use at GitLab, our main goal now is to ensure successful adoption of ClickHouse across GitLab. It is helpful to break down this goal according to the different phases of the product development workflow.
@@ -29,29 +35,36 @@ As ClickHouse has already been selected for use at GitLab, our main goal now is
1. Launch: Support ClickHouse-backed features for SaaS and self-managed.
1. Improve: Successfully scale our usage of ClickHouse.
-### Non-Goals
+### Non-goals
+
+ClickHouse will not be packaged by default with self-managed GitLab, due to uncertain need, complexity, and lack of operational experience. We will still work to find the best possible way to enable users to use ClickHouse themselves if they desire, but it will not be on by default. [ClickHouse maintenance and cost](self_managed_costs_and_requirements/index.md) investigations revealed an uncertain cost impact to smaller instances, and at this time unknown nuance to managing ClickHouse. This means features that depend only on ClickHouse will not be available out of the box for self-managed users (as of end of 2022, the majority of revenue comes from self-managed), so new features researching the use of ClickHouse should be aware of the potential impacts to user adoption in the near-term, until a solution is viable.
## Proposals
The following are links to proposals in the form of blueprints that address technical challenges to using ClickHouse across a wide variety of features.
-1. Scalable data ingestion pipeline.
+1. [Scalable data ingestion pipeline](../clickhouse_ingestion_pipeline/index.md).
- How do we ingest large volumes of data from GitLab into ClickHouse either directly or by replicating existing data?
-1. Supporting ClickHouse for self-managed installations.
- - For which use-cases and scales does it make sense to run ClickHouse for self-managed and what are the associated costs?
- - How can we best support self-managed installation of ClickHouse for different types/sizes of environments?
- - Consider using the [Opstrace ClickHouse operator](https://gitlab.com/gitlab-org/opstrace/opstrace/-/tree/main/clickhouse-operator) as the basis for a canonical distribution.
- - Consider exposing Clickhouse backend as [GitLab Plus](https://gitlab.com/groups/gitlab-org/-/epics/308) to combine benefits of using self-managed instance and GitLab-managed database.
- - Should we develop abstractions for querying and data ingestion to avoid requiring ClickHouse for small-scale installations?
-1. Abstraction layer for features to leverage both ClickHouse or PostreSQL.
+1. [Abstraction layer](../clickhouse_read_abstraction_layer/index.md) for features to leverage both ClickHouse and PostgreSQL.
- What are the benefits and tradeoffs? For example, how would this impact our automated migration and query testing?
-1. Security recommendations and secure defaults for ClickHouse usage.
-Note that we are still formulating proposals and will update the blueprint accordingly.
+### Product roadmap
+
+#### Near-term
+
+In the next 3 months (FY24 Q2), ClickHouse will be implemented by default only for SaaS on GitLab.com, or via manual enablement for self-managed instances. This is due to the uncertain costs and management requirements for self-managed instances. This near-term implementation will be used to develop best practices and strategy to direct self-managed users. This will also continually shape our recommendations for self-managed instances that want to onboard ClickHouse early.
+
+#### Mid-term
+
+After we have formulated best practices of managing ClickHouse ourselves for GitLab.com, the plan for 3-9 months (FY24 2H) will be to offer supported recommendations for self-managed instances that want to run ClickHouse themselves or potentially to a ClickHouse cluster/VM we would manage for users. One proposal for self-managed users is to [create a proxy or abstraction layer](https://gitlab.com/groups/gitlab-org/-/epics/308) that would allow users to connect their self-managed instance to SaaS without additional effort. Another option would be to allow users to "Bring your own ClickHouse" similar to our [approach for Elasticsearch](../../../integration/advanced_search/elasticsearch.md#install-elasticsearch). For the features that require ClickHouse for optimal usage (Value Streams Dashboard, [Product Analytics](https://gitlab.com/groups/gitlab-org/-/epics/8921) and Observability), this will be the initial go-to-market action.
+
+#### Long-term
+
+We will work towards a packaged reference version of ClickHouse capable of being easily managed with minimal cost increases for self-managed users. We should be able to reliably instruct users on the management of ClickHouse and provide accurate costs for usage. This will mean any feature could depend on ClickHouse without decreasing end-user exposure.
## Best Practices
-Best practices and guidelines for developing performant and scalable features using ClickHouse are located in the [ClickHouse developer documentation](../../../development/database/clickhouse/index.md).
+Best practices and guidelines for developing performant, secure, and scalable features using ClickHouse are located in the [ClickHouse developer documentation](../../../development/database/clickhouse/index.md).
## Cost and maintenance analysis
diff --git a/doc/architecture/blueprints/code_search_with_zoekt/index.md b/doc/architecture/blueprints/code_search_with_zoekt/index.md
index 681782609ba..273d8da482c 100644
--- a/doc/architecture/blueprints/code_search_with_zoekt/index.md
+++ b/doc/architecture/blueprints/code_search_with_zoekt/index.md
@@ -33,7 +33,7 @@ GitLab code search functionality today is backed by Elasticsearch.
Elasticsearch has proven useful for other types of search (issues, merge
requests, comments and so-on) but is by design not a good choice for code
search where users expect matches to be precise (ie. no false positives) and
-flexible (e.g. support
+flexible (for example, support
[substring matching](https://gitlab.com/gitlab-org/gitlab/-/issues/325234)
and
[regexes](https://gitlab.com/gitlab-org/gitlab/-/issues/4175)). We have
diff --git a/doc/architecture/blueprints/consolidating_groups_and_projects/index.md b/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
index f5bd53627cb..2e0b4d40e13 100644
--- a/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
+++ b/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
@@ -198,7 +198,7 @@ In this phase we are migrating basic, high-priority project functionality from `
- [Unify members/members actions](https://gitlab.com/groups/gitlab-org/-/epics/8010) - on UI and API level.
- Starring: Right now only projects can be starred. We want to bring this to the group level.
- Common actions: Destroying, transferring, restoring. This can be unified on the controller level and then propagated lower.
-- Archiving currently only works on the project level. This can be brought to the group level, similar to the mechanism for “pending deletion”.
+- Archiving currently only works on the project level. This can be brought to the group level, similar to the mechanism for "pending deletion".
- Avatar's serving and actions.
### Phase 4
diff --git a/doc/architecture/blueprints/container_registry_metadata_database/index.md b/doc/architecture/blueprints/container_registry_metadata_database/index.md
index b77aaf598e6..a538910f553 100644
--- a/doc/architecture/blueprints/container_registry_metadata_database/index.md
+++ b/doc/architecture/blueprints/container_registry_metadata_database/index.md
@@ -266,7 +266,7 @@ The expected registry behavior will be covered with integration tests by manipul
##### Latency
-Excessive latency on established connections is hard to detect and debug, as these might not raise an application error or network timeout in normal circumstances but usually precede them.
+Excessive latency on established connections is hard to detect and debug, as these might not raise an application error or network timeout in typical circumstances but usually precede them.
For this reason, the duration of database queries used to serve HTTP API requests should be instrumented using metrics, allowing the detection of unusual variations and trigger alarms accordingly before an excessive latency becomes a timeout or service unavailability.
diff --git a/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md b/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md
new file mode 100644
index 00000000000..a73f6335218
--- /dev/null
+++ b/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md
@@ -0,0 +1,238 @@
+---
+status: proposed
+creation-date: "2023-06-09"
+authors: [ "@hswimelar" ]
+coach: "@grzesiek"
+approvers: [ "@trizzi ", "@sgoldstein" ]
+owning-stage: "~devops::package"
+participating-stages: []
+---
+
+<!-- Blueprints often contain forward-looking statements -->
+<!-- vale gitlab.FutureTense = NO -->
+
+# Container Registry Self-Managed Database Rollout
+
+## Summary
+
+The latest iteration of the [Container Registry](https://gitlab.com/gitlab-org/container-registry)
+has been rearchitected to use a PostgreSQL database and deployed on GitLab.com.
+Now we must bring the advantages provided by the database to self-managed users.
+While the container registry retains the capacity to run without the new database,
+many new and highly desired features cannot be implemented without it.
+Additionally, unifying the registry used for GitLab.com and for self-managed
+allows us to provide a cohesive user experience and reduces the burden
+associated with maintaining the old registry implementation. To accomplish this,
+we plan to eventually require all self-managed instances to migrate to the new registry
+database, so that we may deprecate and remove support for the old object storage
+metadata subsystem.
+
+This document seeks to describe how we may use the proven core migration
+functionality, which was used to migrate millions of container images on GitLab.com,
+to enable self-managed users to enjoy the benefits of the metadata database.
+
+## Motivation
+
+Enabling self-managed users to migrate to the new metadata database allows these
+users to take advantage of the new features that require the database. Additionally,
+the greater adoption of the database allows the container registry team to focus
+our knowledge and capacity, and will eventually allow us to fully remove the old
+registry metadata subsystem, greatly improving the maintainability and stability
+of the container registry for both GitLab.com and for self-managed users.
+
+### Goals
+
+- Progressively roll out the new dependency of a PostgreSQL database instance for the registry for charts and omnibus deployments.
+- Progressively roll out automation for the registry PostgreSQL database instance for charts and omnibus deployments.
+- Develop processes and tools that self-managed admins can use to migrate existing registry deployments to the metadata database.
+- Develop processes and tools that self-managed admins can use to spin up fresh installs of the Container Registry which use the metadata database.
+- Create a plan which will eventually allow us to fully drop support for the original object storage metadata subsystem.
+
+### Non-Goals
+
+- Developing new Container Registry features outside the scope of enabling admins to migrate to the metadata database.
+- Determining lifecycle support decisions, such as when to default to the database, and when to end support for non-database registries.
+
+## Proposal
+
+There are two main components that must be further developed in order for
+self-managed admins to move to the registry database: the deployment environment and
+the registry migration tooling.
+
+For the deployment environments, we need to document what the user needs to do to set up their
+deployment such that the registry has access to a suitable database given the
+expected registry workload. We also need to develop tooling and automation to ease
+the setup and maintenance of the registry database for new and existing deploys.
+
+For the registry, we need to develop and validate import tooling which
+coordinates with the core import functionality which was used to migrate all
+container images on GitLab.com. Additionally, we must validate that each supported
+storage driver works as expected with the import process and provide estimated
+import times for admins.
+
+We can structure our work to meet the standards outlined in support for
+Experiment, Beta, and Alpha features. Doing so will help to prioritize core
+functionality and to allow users who wish to be early adopters to begin using
+the database and providing us with invaluable feedback.
+
+These levels of support could be advertised to self-managed users via a simple
+chart, allowing them to tell at a glance the status of this project as it relates
+to their situation.
+
+| Installation | GCS | AWS | Filesystem | Azure | OSS | Swift |
+| ------ | ------ | ------ | ------ | ------ | ------ | ------ |
+| Omnibus | GA | GA | Beta | Experimental | Experimental | Experimental |
+| Charts | GA | GA | Beta | Experimental | Experimental | Experimental |
+
+### Justification of Structuring Support by Driver
+
+It's possible that we could simplify the proposed support matrix by structuring
+it only by deployment environment and not differentiate by storage driver. The
+following two sections briefly summarize several points for and against.
+
+#### Arguments Opposed to Structuring Support by Driver
+
+Each storage driver is well abstracted in the code; specifically, the import process
+makes use of the following methods:
+
+- Walk
+- List
+- GetContent
+- Stat
+- Reader
+
+Each of these methods is a read method; we do not need to create or delete data via
+the object storage methods. Additionally, all of these methods are standard API
+methods.
+
+Given that we're not mutating data via object storage as part of the import
+process, we should not need to double-check these drivers or try to predict
+potential errors. Relying on user feedback during the beta to direct any efforts
+we should be making here could prevent us from scheduling unnecessary work.
+
+#### Arguments in Favor of Structuring Support by Driver
+
+Our experience with enhancing and supporting offline garbage collection has
+shown that while the storage driver implementation should not matter, it does.
+The drivers have proven to have important differences in performance and
+reliability. Many of the planned possible driver-related improvements are
+related to testing and stability, rather than outright new work for each driver.
+
+In particular, retries and error reporting across storage drivers are not as
+standardized as one would hope for, and therefore there is a potential that a
+long-running import process could be interrupted by an error that could have
+been retried.
+
+Creating import estimates based on combinations of the registry size and storage
+driver would also be of use to self-managed admins looking to schedule their
+migration. There will be a difference here between local filesystem storage and
+object storage and there could be a difference between the object storage
+providers as well.
+
+Also, we could work with the importer to smooth out the differences in the
+storage drivers. Even without unified retryable error reporting from the storage
+drivers, we could have the importer retry more times and for more errors. There's
+a risk we would retry several times on non-retryable errors, but since no writes
+are being made to object storage, this should not ultimately be harmful.
+
+Additionally, implementing [Validate Self-Managed Imports](https://gitlab.com/gitlab-org/container-registry/-/issues/938)
+would perform a consistency check against a sample of images before and after
+import which would lead to greater consistency across all storage driver implementations.
+
+## Design and Implementation Details
+
+### The Import Tool
+
+The [import tool](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs-gitlab/database-import-tool.md)
+is a well-validated component of the Container Registry project that we have used
+from the beginning as a way to perform local testing. This tool is a thin wrapper
+over the core import functionality — the code which handles the import logic has
+been extensively validated.
+
+While the core import functionality is solid, we must ensure that this tool and
+the surrounding process will enable non-expert users to import their registries
+with both minimal risk and with minimal support from GitLab team members.
+Therefore, the most important work remaining is crafting the UX of this tooling
+such that those goals are met. This
+[epic](https://gitlab.com/groups/gitlab-org/-/epics/8602) captures many of the
+proposed improvements.
+
+#### Design
+
+The tool is designed such that a single execution flow can support both users
+with large registries and strict uptime requirements, who can take advantage of
+a more involved process to reduce read-only time to the absolute minimum, as well
+as users with small registries, who benefit from a streamlined workflow. This is
+achieved via the same pre-import, then full-import cycle that was used on
+GitLab.com, along with an additional step to catalog all unreferenced blobs held
+in common storage.
+
+##### One-Shot Import
+
+In most cases, a user can simply choose to run the import tool while the registry
+is offline or in read-only mode. This will be similar to what admins must
+already do in order to run offline garbage collection. Each step completes in
+sequence, moving directly to the next. The command exits when the import process
+is complete and the registry is ready to make full use of the metadata database.
+
+##### Minimal Downtime Import
+
+For users with large registries and who are interested in the minimum possible
+downtime, each step can be ran independently when the tool is passed the appropriate
+flag. The user will first run the pre-import step while the registry is
+performing its usual workload. Once that has completed, and the user is ready
+to stop writes to the registry, the tag import step can be ran. As with the GitLab.com
+migration, importing tags requires that the registry be offline or in
+read-only mode. This step does the minimum possible work to achieve fast and
+efficient tag imports and will always be the fastest of the three steps, reducing
+the downtime component to a fraction of the total import time. The user can then
+bring up the registry configured to use the metadata database. After that, the
+user is free to run the third step during standard registry operations. This step
+makes any dangling blobs in common storage visible to the database and therefore
+the online garbage collection process.
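+
+The sequence could look roughly like the following. The commands are a sketch
+only: the step flags are placeholders for whatever interface the tool ends up
+exposing, not its actual flags.
+
+```plaintext
+# Step 1: pre-import while the registry keeps serving its usual traffic
+registry database import --step=pre-import /etc/docker/registry/config.yml
+
+# Step 2: put the registry in read-only or offline mode, then import tags
+registry database import --step=import-tags /etc/docker/registry/config.yml
+
+# Step 3: restart the registry against the database, then catalog dangling blobs online
+registry database import --step=import-blobs /etc/docker/registry/config.yml
+```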
+
+### Distribution Paths
+
+Tooling, process, and documentation will need to be developed in order to
+support users who wish to use the metadata database, especially with regard to
+providing a foundation for the new database instance required for the migration.
+
+For new deployments, we should wait until we've moved to general support, have
+automation in place for the registry database and migration, and have a major
+GitLab version bump before enabling the database by default for self-managed.
+
+#### Omnibus
+
+#### Charts
+
+## Alternative Solutions
+
+### Do Nothing
+
+#### Pros
+
+- The database and associated features are generally most useful for large-scale, high-availability deployments.
+- Eliminate the need to support an additional logical or physical database for self-managed deployments.
+
+#### Cons
+
+- The registry on GitLab.com and the registry used by self-managed will greatly diverge in supported features over time.
+- The maintenance burden of supporting two registry implementations will reduce the velocity at which new registry features can be released.
+- The registry on GitLab.com stops being an effective way to validate changes before they are released to self-managed.
+- Large self-managed users continue to not be able to scale the registry to suit their needs.
+
+### Gradual Migration
+
+This approach would be to exactly replicate the GitLab.com migration on
+self-managed.
+
+#### Pros
+
+- Replicate an already successful process.
+- Scope downtime by repository, rather than instance.
+
+#### Cons
+
+- Dramatically increased complexity in all aspects of the migration process.
+- Greatly increased possibility of data consistency issues.
+- Less clear demarcation of registry migration progress.
diff --git a/doc/architecture/blueprints/database/automated_query_analysis/index.md b/doc/architecture/blueprints/database/automated_query_analysis/index.md
index c08784dab48..40f6b2af412 100644
--- a/doc/architecture/blueprints/database/automated_query_analysis/index.md
+++ b/doc/architecture/blueprints/database/automated_query_analysis/index.md
@@ -12,7 +12,7 @@ participating-stages: []
## Problem Summary
-Our overarching goal is to improve the reliability and throughput of GitLab’s
+Our overarching goal is to improve the reliability and throughput of the GitLab
database review process. The current process requires merge request authors to
manually provide query plans and raw SQL when introducing new queries or
updating existing queries. This is both time consuming and error prone.
diff --git a/doc/architecture/blueprints/gitlab_agent_deployments/index.md b/doc/architecture/blueprints/gitlab_agent_deployments/index.md
index d8d26389d7d..798c8a3045d 100644
--- a/doc/architecture/blueprints/gitlab_agent_deployments/index.md
+++ b/doc/architecture/blueprints/gitlab_agent_deployments/index.md
@@ -1,11 +1,11 @@
---
-status: proposed
+status: implemented
creation-date: "2022-11-23"
authors: [ "@shinya.maeda" ]
coach: "@DylanGriffith"
approvers: [ "@nagyv-gitlab", "@cbalane", "@hustewart", "@hfyngvason" ]
-owning-stage: "~devops::release"
-participating-stages: [Configure, Release]
+owning-stage: "~devops::deploy"
+participating-stages: [Environments]
---
<!-- vale gitlab.FutureTense = NO -->
@@ -28,9 +28,7 @@ This blueprint describes how the association is established and how these domain
- The proposed architecture can be used in [Organization-level Environment dashboard](https://gitlab.com/gitlab-org/gitlab/-/issues/241506).
- The cluster resources and events can be visualized per [GitLab Environment](../../../ci/environments/index.md).
An environment-specific view scoped to the resources managed either directly or indirectly by a deployment commit.
-- Support both [GitOps mode](../../../user/clusters/agent/gitops.md#gitops-configuration-reference) and [CI Access mode](../../../user/clusters/agent/ci_cd_workflow.md#authorize-the-agent).
- - NOTE: At the moment, we focus on the solution for CI Access mode. GitOps mode will have significant architectural changes _outside of_ this blueprint,
- such as [Flux switching](https://gitlab.com/gitlab-org/gitlab/-/issues/357947) and [Manifest projects outside of the Agent configuration project](https://gitlab.com/groups/gitlab-org/-/epics/7704). In order to derisk potential rework, we'll revisit the GitOps mode after these upstream changes have been settled.
+- Support both [GitOps mode](../../../user/clusters/agent/gitops/agent.md#gitops-configuration-reference) and [CI Access mode](../../../user/clusters/agent/ci_cd_workflow.md#authorize-the-agent).
### Non-Goals
@@ -41,22 +39,22 @@ This blueprint describes how the association is established and how these domain
### Overview
-- GitLab Environment and Agent-managed Resource Group have 1-to-1 relationship.
-- Agent-managed Resource Group tracks all resources produced by the connected [agent](../../../user/clusters/agent/index.md). This includes not only resources written in manifest files but also subsequently generated resources (e.g. `Pod`s created by `Deployment` manifest file).
-- Agent-managed Resource Group renders dependency graph, such as `Deployment` => `ReplicaSet` => `Pod`. This is for providing ArgoCD-style resource view.
-- Agent-managed Resource Group has the Resource Health status that represents a summary of resource statuses, such as `Healthy`, `Progressing` or `Degraded`.
+- GitLab Environment and GitLab Agent For Kubernetes have a 1-to-1 relationship.
+- GitLab Environment tracks all resources produced by the connected [agent](../../../user/clusters/agent/index.md). This includes not only resources written in manifest files but also subsequently generated resources (for example, `Pod`s created by a `Deployment` manifest file).
+- GitLab Environment renders a dependency graph, such as `Deployment` => `ReplicaSet` => `Pod`. This provides an ArgoCD-style resource view.
+- GitLab Environment has a Resource Health status that represents a summary of resource statuses, such as `Healthy`, `Progressing` or `Degraded`.
```mermaid
flowchart LR
subgraph Kubernetes["Kubernetes"]
- subgraph ResourceGroupProduction["ResourceGroup"]
+ subgraph ResourceGroupProduction["Production"]
direction LR
ResourceGroupProductionService(["Service"])
ResourceGroupProductionDeployment(["Deployment"])
ResourceGroupProductionPod1(["Pod1"])
ResourceGroupProductionPod2(["Pod2"])
end
- subgraph ResourceGroupStaging["ResourceGroup"]
+ subgraph ResourceGroupStaging["Staging"]
direction LR
ResourceGroupStagingService(["Service"])
ResourceGroupStagingDeployment(["Deployment"])
@@ -88,28 +86,20 @@ flowchart LR
- [GitLab Project](../../../user/project/working_with_projects.md) and GitLab Environment have 1-to-many relationship.
- GitLab Project and Agent have 1-to-many _direct_ relationship. Only one project can own a specific agent.
-- [GitOps mode](../../../user/clusters/agent/gitops.md#gitops-configuration-reference)
+- [GitOps mode](../../../user/clusters/agent/gitops/agent.md#gitops-configuration-reference)
- GitLab Project and Agent do _NOT_ have many-to-many _indirect_ relationship yet. This will be supported in [Manifest projects outside of the Agent configuration project](https://gitlab.com/groups/gitlab-org/-/epics/7704).
- - Agent and Agent-managed Resource Group have 1-to-1 relationship. Inventory IDs are used to group Kubernetes resources. This might be changed in [Flux switching](https://gitlab.com/gitlab-org/gitlab/-/issues/357947).
- [CI Access mode](../../../user/clusters/agent/ci_cd_workflow.md#authorize-the-agent)
- GitLab Project and Agent have many-to-many _indirect_ relationship. The project owning the agent can [share the access with the other projects](../../../user/clusters/agent/ci_cd_workflow.md#authorize-the-agent-to-access-projects-in-your-groups). (NOTE: Technically, only running jobs inside the project are allowed to access the cluster due to job-token authentication.)
- - Agent and Agent-managed Resource Group do _NOT_ have relationships yet.
### Issues
-- Agent-managed Resource Group should have environment ID as the foreign key, which must be unique across resource groups.
-- Agent-managed Resource Group should have parameters how to group resources in the associated cluster, for example, `namespace`, `lable` and `inventory-id` (GitOps mode only) can passed as parameters.
-- Agent-managed Resource Group should be able to fetch all relevant resources, including both default resource kinds and other [Custom Resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/).
-- Agent-managed Resource Group should be aware of dependency graph.
-- Agent-managed Resource Group should be able to compute Resource Health status from the associated resources.
+- GitLab Environment should have the ID of the GitLab Agent For Kubernetes as a foreign key (see the sketch after this list).
+- GitLab Environment should have parameters for how to group resources in the associated cluster; for example, `namespace`, `label` and `inventory-id` (GitOps mode only) can be passed as parameters.
+- GitLab Environment should be able to fetch all relevant resources, including both default resource kinds and other [Custom Resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/).
+- GitLab Environment should be aware of the dependency graph.
+- GitLab Environment should be able to compute a Resource Health status from the associated resources.
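+
+For illustration only, this is a minimal migration sketch for the first two points above; the table and column names are assumptions rather than the final schema:
+
+```ruby
+# Hypothetical sketch: reference the agent and namespace directly from environments.
+class AddClusterAgentToEnvironments < ActiveRecord::Migration[7.0]
+  def change
+    add_reference :environments, :cluster_agent,
+                  foreign_key: { to_table: :cluster_agents, on_delete: :nullify }
+    add_column :environments, :kubernetes_namespace, :text
+  end
+end
+```
+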
-### Example: Pull-based deployment (GitOps mode)
-
-NOTE:
-At the moment, we focus on the solution for CI Access mode. GitOps mode will have significant architectural changes _outside of_ this blueprint,
-such as [Flux switching](https://gitlab.com/gitlab-org/gitlab/-/issues/357947) and [Manifest projects outside of the Agent configuration project](https://gitlab.com/groups/gitlab-org/-/epics/7704). In order to derisk potential rework, we'll revisit the GitOps mode after these upstream changes have been settled.
-
-### Example: Push-based deployment (CI access mode)
+### Example
This is an example of how the architecture works in push-based deployment.
The feature is documented [here](../../../user/clusters/agent/ci_cd_workflow.md) as CI access mode.
@@ -117,21 +107,21 @@ The feature is documented [here](../../../user/clusters/agent/ci_cd_workflow.md)
```mermaid
flowchart LR
subgraph ProductionKubernetes["Production Kubernetes"]
- subgraph ResourceGroupProductionFrontend["ResourceGroup"]
+ subgraph ResourceGroupProductionFrontend["Production"]
direction LR
ResourceGroupProductionFrontendService(["Service"])
ResourceGroupProductionFrontendDeployment(["Deployment"])
ResourceGroupProductionFrontendPod1(["Pod1"])
ResourceGroupProductionFrontendPod2(["Pod2"])
end
- subgraph ResourceGroupProductionBackend["ResourceGroup"]
+ subgraph ResourceGroupProductionBackend["Staging"]
direction LR
ResourceGroupProductionBackendService(["Service"])
ResourceGroupProductionBackendDeployment(["Deployment"])
ResourceGroupProductionBackendPod1(["Pod1"])
ResourceGroupProductionBackendPod2(["Pod2"])
end
- subgraph ResourceGroupProductionPrometheus["ResourceGroup"]
+ subgraph ResourceGroupProductionPrometheus["Monitoring"]
direction LR
ResourceGroupProductionPrometheusService(["Service"])
ResourceGroupProductionPrometheusDeployment(["Deployment"])
@@ -202,21 +192,21 @@ The microservice project setup can be improved by [Multi-Project Deployment Pipe
```mermaid
flowchart LR
subgraph ProductionKubernetes["Production Kubernetes"]
- subgraph ResourceGroupProductionFrontend["ResourceGroup"]
+ subgraph ResourceGroupProductionFrontend["Frontend"]
direction LR
ResourceGroupProductionFrontendService(["Service"])
ResourceGroupProductionFrontendDeployment(["Deployment"])
ResourceGroupProductionFrontendPod1(["Pod1"])
ResourceGroupProductionFrontendPod2(["Pod2"])
end
- subgraph ResourceGroupProductionBackend["ResourceGroup"]
+ subgraph ResourceGroupProductionBackend["Backend"]
direction LR
ResourceGroupProductionBackendService(["Service"])
ResourceGroupProductionBackendDeployment(["Deployment"])
ResourceGroupProductionBackendPod1(["Pod1"])
ResourceGroupProductionBackendPod2(["Pod2"])
end
- subgraph ResourceGroupProductionPrometheus["ResourceGroup"]
+ subgraph ResourceGroupProductionPrometheus["Monitoring"]
direction LR
ResourceGroupProductionPrometheusService(["Service"])
ResourceGroupProductionPrometheusDeployment(["Deployment"])
@@ -266,104 +256,18 @@ flowchart LR
DeploymentPipelines -- "Deploy" --> ResourceGroupProductionBackend
```
-#### View all Agent-managed Resource Groups on production environment
-
-At the group-level, we can accumulate all environments match a specific tier, for example,
-listing all environments with `production` tier from subsequent projects.
-This is useful to see the entire Agent-managed Resource Groups on production environment.
-The following diagram examplifies the relationship between GitLab group and Kubernetes resources:
-
-```mermaid
-flowchart LR
- subgraph Kubernetes["Kubernetes"]
- subgraph ResourceGroupProduction["ResourceGroup"]
- direction LR
- ResourceGroupProductionService(["Service"])
- ResourceGroupProductionDeployment(["Deployment"])
- ResourceGroupProductionPod1(["Pod1"])
- ResourceGroupProductionPod2(["Pod2"])
- end
- subgraph ResourceGroupStaging["ResourceGroup"]
- direction LR
- ResourceGroupStagingService(["Service"])
- ResourceGroupStagingDeployment(["Deployment"])
- ResourceGroupStagingPod1(["Pod1"])
- ResourceGroupStagingPod2(["Pod2"])
- end
- end
-
- subgraph GitLab
- subgraph Organization
- OrganizationProduction["All resources on production"]
- subgraph Frontend project
- FrontendEnvironmentProduction["production environment"]
- end
- subgraph Backend project
- BackendEnvironmentProduction["production environment"]
- end
- end
- end
-
- FrontendEnvironmentProduction --- ResourceGroupProduction
- BackendEnvironmentProduction --- ResourceGroupStaging
- ResourceGroupProductionService -.- ResourceGroupProductionDeployment
- ResourceGroupProductionDeployment -.- ResourceGroupProductionPod1
- ResourceGroupProductionDeployment -.- ResourceGroupProductionPod2
- ResourceGroupStagingService -.- ResourceGroupStagingDeployment
- ResourceGroupStagingDeployment -.- ResourceGroupStagingPod1
- ResourceGroupStagingDeployment -.- ResourceGroupStagingPod2
- OrganizationProduction --- FrontendEnvironmentProduction
- OrganizationProduction --- BackendEnvironmentProduction
-```
-
-A few notes:
-
-- In the future, we'd have more granular filters for resource search.
- For example, there are two environments `production/us-region` and `production/eu-region` in each project
- and show only resources in US region at the group-level.
- This could be achivable by query filtering in PostgreSQL or label/namespace filtering in Kubernetes.
-- Please see [Add dynamically populated organization-level environments page](https://gitlab.com/gitlab-org/gitlab/-/issues/241506) for more information.
-
## Design and implementation details
-NOTE:
-The following solution might be only applicable for CI Access mode. GitOps mode will have significant architectural changes _outside of_ this blueprint,
-such as [Flux switching](https://gitlab.com/gitlab-org/gitlab/-/issues/357947) and [Manifest projects outside of the Agent configuration project](https://gitlab.com/groups/gitlab-org/-/epics/7704). In order to derisk potential rework, we'll revisit the GitOps mode after these upstream changes have been settled.
-
### Associate Environment with Agent
-As a preliminary step, we allow users to explicitly define "which deployment job" uses "which agent" and deploy to "which namespace". The following keywords are supported in `.gitlab-ci.yml`.
-
-- `environment:kubernetes:agent` ... Define which agent the deployment job uses. It can select the appropriate context from the `KUBE_CONFIG`.
-- `environment:kubernetes:namespace` ... Define which namespace the deployment job deploys to. It injects `KUBE_NAMESPACE` predefined variable into the job. This keyword already [exists](../../../ci/yaml/index.md#environmentkubernetes).
-
-Here is an example of `.gitlab-ci.yml`.
-
-```yaml
-deploy-production:
- environment:
- name: production
- kubernetes:
- agent: path/to/agent/repository:agent-name
- namespace: default
- script:
- - helm --context="$KUBE_CONTEXT" --namespace="$KUBE_NAMESPACE" upgrade --install
-```
-
-When a deployment job is created, GitLab persists the relationship of specified agent, namespace and deployment job. If the CI job is NOT authorized to access the agent (Please refer [`Clusters::Agents::FilterAuthorizationsService`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/clusters/agents/filter_authorizations_service.rb) for more details), this relationship aren't recorded. This process happens in [`Deployments::CreateForBuildService`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/deployments/create_for_build_service.rb). The database table scheme is:
-
-```plaintext
-agent_deployments:
- - deployment_id (bigint/FK/NOT NULL/Unique)
- - agent_id (bigint/FK/NOT NULL)
- - kubernetes_namespace (character varying(255)/NOT NULL)
-```
+Users can explicitly associate a GitLab Agent For Kubernetes with a GitLab Environment in the settings UI.
+The frontend uses this associated agent for authenticating/authorizing the user's access, which is described in a later section.
-To idenfity an associated agent for a specific environment, `environment.last_deployment.agent` can be used in Rails.
+We need to adjust the `read_cluster_agent` permission in DeclarativePolicy to support agents shared by an external project (also known as the Agent management project).
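+
+As a rough illustration only, the adjustment could look something like the sketch below; the policy class, condition, and helper names are assumptions, not the actual implementation:
+
+```ruby
+# Hypothetical sketch of a DeclarativePolicy rule; real class and helper names may differ.
+module Clusters
+  class AgentPolicy < BasePolicy
+    # Assumed condition: the agent was shared with a project the user is a member of,
+    # for example through a ci_access/user_access project authorization.
+    condition(:agent_shared_with_user_project) do
+      @subject.shared_with_user_projects?(@user)
+    end
+
+    rule { agent_shared_with_user_project }.enable :read_cluster_agent
+  end
+end
+```
+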
### Fetch resources through `user_access`
-When user visits an environment page, GitLab frontend fetches an environment via GraphQL. Frontend additionally fetches the associated agent-ID and namespace through deployment relationship, which being tracked by the `agent_deployments` table.
+When a user visits an environment page, the GitLab frontend fetches the environment via GraphQL. The frontend additionally fetches the associated agent ID and namespace.
Here is an example of GraphQL query:
@@ -373,12 +277,12 @@ Here is an example of GraphQL query:
id
environment(name: "<environment-name>") {
slug
- lastDeployment(status: SUCCESS) {
- agent {
- id
+ kubernetesNamespace
+ clusterAgent {
+ id
+ name
+ project {
name
- project
- kubernetesNamespace
}
}
}
@@ -388,20 +292,17 @@ Here is an example of GraphQL query:
GitLab frontend authenticates/authorizes the user access with a [browser cookie](https://gitlab.com/gitlab-org/cluster-integration/gitlab-agent/-/blob/master/doc/kubernetes_user_access.md#browser-cookie-on-gitlab-frontend). If the access is forbidden, the frontend shows an error message: `You don't have access to an agent that deployed to this environment. Please contact agent administrator if you are allowed in "user_access" in agent config file. See <troubleshooting-doc-link>`.
-After the user gained access to the agent, GitLab frontend fetches available API Resource list in the Kubernetes and fetches the resources with the following parameters:
+After the user has gained access to the agent, the GitLab frontend fetches specific resource kinds (for example, `Deployment`, `Pod`) from Kubernetes with the following parameters:
-- `namespace` ... `#{environment.lastDeployment.agent.kubernetesNamespace}`
-- `labels`
- - `app.gitlab.com/project_id=#{project.id}` _AND_
- - `app.gitlab.com/environment_slug: #{environment.slug}`
+- `namespace` ... `#{environment.kubernetesNamespace}`
If no resources are found, it is likely that the users have not embedded these labels into their resources. In this case, the frontend shows a warning message: `There are no resources found for the environment. Do resources have GitLab preserved labels? See <troubleshooting-doc-link>`.
### Dependency graph
- GitLab frontend uses [Owner References](https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/) to idenfity the dependencies between resources. These are embedded in resources as `metadata.ownerReferences` field.
-- For the resoruces that don't have owner references, we can use [Well-Known Labels, Annotations and Taints](https://kubernetes.io/docs/reference/labels-annotations-taints/) as complement. e.g. `EndpointSlice` doesn't have `metadata.ownerReferences`, but has `kubernetes.io/service-name` as a reference to the parent `Service` resource.
+- For the resources that don't have owner references, we can use [Well-Known Labels, Annotations and Taints](https://kubernetes.io/docs/reference/labels-annotations-taints/) as a complement. For example, `EndpointSlice` doesn't have `metadata.ownerReferences`, but has `kubernetes.io/service-name` as a reference to the parent `Service` resource.
### Health status of resources
-- GitLab frontend computes the status summary from the fetched resources. Something similar to ArgoCD's [Resource Health](https://argo-cd.readthedocs.io/en/stable/operator-manual/health/) e.g. `Healthy`, `Progressing`, `Degraded` and `Suspended`. The formula is TBD.
+- GitLab frontend computes the status summary from the fetched resources, similar to ArgoCD's [Resource Health](https://argo-cd.readthedocs.io/en/stable/operator-manual/health/), for example `Healthy`, `Progressing`, `Degraded` and `Suspended`. The formula is TBD.
diff --git a/doc/architecture/blueprints/gitlab_ci_events/index.md b/doc/architecture/blueprints/gitlab_ci_events/index.md
index 7ce8fea9410..fb78c0f5d9d 100644
--- a/doc/architecture/blueprints/gitlab_ci_events/index.md
+++ b/doc/architecture/blueprints/gitlab_ci_events/index.md
@@ -2,6 +2,7 @@
status: proposed
creation-date: "2023-03-15"
authors: [ "@furkanayhan" ]
+owners: [ "@furkanayhan" ]
coach: "@grzesiek"
approvers: [ "@jreporter", "@cheryl.li" ]
owning-stage: "~devops::verify"
@@ -45,7 +46,7 @@ Events" blueprint is about making it possible to:
## Proposals
-For now, we have technical 4 proposals;
+For now, we have 5 technical proposals:
1. [Proposal 1: Using the `.gitlab-ci.yml` file](proposal-1-using-the-gitlab-ci-file.md)
Based on;
@@ -55,9 +56,7 @@ For now, we have technical 4 proposals;
Highly inefficient way.
1. [Proposal 3: Using the `.gitlab/ci/events` folder](proposal-3-using-the-gitlab-ci-events-folder.md)
Involves file reading for every event.
-1. [Proposal 4: Creating events via CI files](proposal-4-creating-events-via-ci-files.md)
- Combination of some proposals.
-
-Each of them has its pros and cons. There could be many more proposals and we
-would like to discuss them all. We can combine the best part of those proposals
-and create a new one.
+1. [Proposal 4: Creating events via a CI config file](proposal-4-creating-events-via-ci-files.md)
+   Separate configuration files for defining events.
+1. [Proposal 5: Combined proposal](proposal-5-combined-proposal.md)
+ Combination of all of the proposals listed above.
diff --git a/doc/architecture/blueprints/gitlab_ci_events/proposal-1-using-the-gitlab-ci-file.md b/doc/architecture/blueprints/gitlab_ci_events/proposal-1-using-the-gitlab-ci-file.md
index 7dfc3873ada..f4cde963224 100644
--- a/doc/architecture/blueprints/gitlab_ci_events/proposal-1-using-the-gitlab-ci-file.md
+++ b/doc/architecture/blueprints/gitlab_ci_events/proposal-1-using-the-gitlab-ci-file.md
@@ -12,7 +12,7 @@ Currently, we have two proof-of-concept (POC) implementations:
They both have similar ideas;
-1. Find a new CI Config syntax to define the pipeline events.
+1. Find a new CI Config syntax to define pipeline events.
Example 1:
@@ -42,19 +42,13 @@ They both have similar ideas;
script: echo "Hello World"
```
-1. Upsert an event to the database when creating a pipeline.
-1. Create [EventStore subscriptions](../../../development/event_store.md) to handle the events.
+1. Upsert a workflow definition to the database when new configuration gets
+ pushed.
+1. Match subscriptions and publishers whenever something happens in GitLab.
-## Problems & Questions
+## Discussion
-1. The CI config of a project can be anything;
- - `.gitlab-ci.yml` by default
- - another file in the project
- - another file in another project
- - completely a remote/external file
-
- How do we handle these cases?
-1. Since we have these problems above, should we keep the events in its own file? (`.gitlab-ci-events.yml`)
-1. Do we only accept the changes in the main branch?
-1. We try to create event subscriptions every time a pipeline is created.
-1. Can we move the existing workflows into the new CI events, for example, `merge_request_event`?
+1. How do we efficiently detect changes to the subscriptions?
+1. How do we handle differences between workflows / events / subscriptions on
+ different branches?
+1. Do we need to upsert subscriptions on every push?
diff --git a/doc/architecture/blueprints/gitlab_ci_events/proposal-2-using-the-rules-keyword.md b/doc/architecture/blueprints/gitlab_ci_events/proposal-2-using-the-rules-keyword.md
index 6f69a0f11f0..1f59a8ccf20 100644
--- a/doc/architecture/blueprints/gitlab_ci_events/proposal-2-using-the-rules-keyword.md
+++ b/doc/architecture/blueprints/gitlab_ci_events/proposal-2-using-the-rules-keyword.md
@@ -23,16 +23,13 @@ test_package_removed:
- events: ["package/removed"]
```
-1. We don't upsert anything to the database.
-1. We'll have a single worker which subcribes to events
-like `store.subscribe ::Ci::CreatePipelineFromEventWorker, to: ::Issues::CreatedEvent`.
-1. The worker just runs `Ci::CreatePipelineService` with the correct parameters, the rest
-will be handled by the `rules` system. Of course, we'll need modifications to the `rules` system to support `events`.
-
-## Problems & Questions
-
-1. For every defined event run, we need to enqueue a new `Ci::CreatePipelineFromEventWorker` job.
-1. The worker will need to run `Ci::CreatePipelineService` for every event run.
-This may be costly because we go through every cycle of `Ci::CreatePipelineService`.
-1. This would be highly inefficient.
-1. Can we move the existing workflows into the new CI events, for example, `merge_request_event`?
+1. We don't upsert subscriptions to the database.
+1. We'll have a single worker that runs when something happens in GitLab.
+1. The worker just tries to create a pipeline with the correct parameters.
+1. A pipeline runs when the `rules` subsystem finds a job to run.
+
+## Challenges
+
+1. For every defined event run, we need to enqueue a new pipeline creation worker.
+1. Creating pipelines and selecting builds to run is a relatively expensive operation.
+1. This will not work at GitLab.com scale.
diff --git a/doc/architecture/blueprints/gitlab_ci_events/proposal-3-using-the-gitlab-ci-events-folder.md b/doc/architecture/blueprints/gitlab_ci_events/proposal-3-using-the-gitlab-ci-events-folder.md
index ad76b7f8dd4..8a8efe2be08 100644
--- a/doc/architecture/blueprints/gitlab_ci_events/proposal-3-using-the-gitlab-ci-events-folder.md
+++ b/doc/architecture/blueprints/gitlab_ci_events/proposal-3-using-the-gitlab-ci-events-folder.md
@@ -5,11 +5,8 @@ description: 'GitLab CI Events Proposal 3: Using the .gitlab/ci/events folder'
# GitLab CI Events Proposal 3: Using the `.gitlab/ci/events` folder
-We can also approach this problem by creating separate files for events.
-
-Let's say we'll have the `.gitlab/ci/events` folder (or `.gitlab/workflows/ci`).
-
-We can define events in the following format:
+In this proposal we want to create separate files for each group of events. We
+can define events in the following format:
```yaml
# .gitlab/ci/events/package-published.yml
@@ -17,9 +14,7 @@ We can define events in the following format:
spec:
events:
- name: package/published
-
---
-
include:
- local: .gitlab-ci.yml
with:
@@ -35,9 +30,7 @@ spec:
inputs:
event:
default: push
-
---
-
job1:
script: echo "Hello World"
@@ -61,4 +54,4 @@ When an event happens;
1. For every defined event run, we need to enqueue a new job.
1. Every event-job will need to search for files.
1. This would be only for the project-scope events.
-1. This can be inefficient because of searching for files for the project for every event.
+1. This will not work at GitLab.com scale.
diff --git a/doc/architecture/blueprints/gitlab_ci_events/proposal-4-creating-events-via-ci-files.md b/doc/architecture/blueprints/gitlab_ci_events/proposal-4-creating-events-via-ci-files.md
index 5f10ba1fbb2..debca82d148 100644
--- a/doc/architecture/blueprints/gitlab_ci_events/proposal-4-creating-events-via-ci-files.md
+++ b/doc/architecture/blueprints/gitlab_ci_events/proposal-4-creating-events-via-ci-files.md
@@ -1,12 +1,13 @@
---
owning-stage: "~devops::verify"
-description: 'GitLab CI Events Proposal 4: Creating events via CI files'
+description: 'GitLab CI Events Proposal 4: Defining subscriptions in a dedicated configuration file'
---
-# GitLab CI Events Proposal 4: Creating events via CI files
+# GitLab CI Events Proposal 4: Defining subscriptions in a dedicated configuration file
-Each project can have its own event configuration file. Let's call it `.gitlab-ci-event.yml` for now.
-In this file, we can define events in the following format:
+Each project can have its own configuration file for defining subscriptions to
+events. For example, `.gitlab-ci-event.yml`. In this file, we can define events
+in the following format:
```yaml
events:
@@ -14,12 +15,13 @@ events:
- issue/created
```
-When this file is changed in the project repository, it is parsed and the events are created, updated, or deleted.
-This is highly similar to [Proposal 1](proposal-1-using-the-gitlab-ci-file.md) except that we don't need to
-track pipeline creations every time.
+When this file is changed in the project repository, it is parsed and the
+events are created, updated, or deleted. This is highly similar to
+[Proposal 1](proposal-1-using-the-gitlab-ci-file.md) except that we don't need
+to track pipeline creations every time.
-1. Upsert events to the database when `.gitlab-ci-event.yml` is updated.
-1. Create [EventStore subscriptions](../../../development/event_store.md) to handle the events.
+1. Upsert events to the database when `.gitlab-ci-event.yml` gets updated.
+1. Create inline reactions to events in code to trigger pipelines.
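+
+As a rough sketch of such an inline reaction (the service class, scope, and pipeline source below are assumptions, not existing code), the code path that handles an event could look up subscribed projects and trigger a pipeline for each:
+
+```ruby
+# Hypothetical sketch: react to an event by creating a pipeline for each
+# project subscribed to it. Class, association, and source names are assumed.
+module Ci
+  class TriggerPipelinesForEventService
+    def initialize(event_name)
+      @event_name = event_name
+    end
+
+    def execute
+      # Assumed lookup against the subscriptions upserted from `.gitlab-ci-event.yml`.
+      Project.with_ci_event_subscription(@event_name).find_each do |project|
+        Ci::CreatePipelineService
+          .new(project, project.first_owner, ref: project.default_branch)
+          .execute(:ci_event, # hypothetical new pipeline source
+                   variables_attributes: [{ key: 'CI_EVENT', value: @event_name }])
+      end
+    end
+  end
+end
+```
+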
## Filtering jobs
@@ -51,7 +53,7 @@ test_package_removed:
- if: $CI_EVENT == "package/removed"
```
-or an input like in the [Proposal 3](proposal-3-using-the-gitlab-ci-events-folder.md);
+or an input like in the [Proposal 3](proposal-3-using-the-gitlab-ci-events-folder.md):
```yaml
spec:
@@ -71,3 +73,7 @@ test_package_removed:
rules:
- if: $[[ inputs.event ]] == "package/removed"
```
+
+## Challenges
+
+1. This will not work at GitLab.com scale.
diff --git a/doc/architecture/blueprints/gitlab_ci_events/proposal-5-combined-proposal.md b/doc/architecture/blueprints/gitlab_ci_events/proposal-5-combined-proposal.md
new file mode 100644
index 00000000000..3a596b21526
--- /dev/null
+++ b/doc/architecture/blueprints/gitlab_ci_events/proposal-5-combined-proposal.md
@@ -0,0 +1,99 @@
+---
+owning-stage: "~devops::verify"
+description: 'GitLab CI Events Proposal 5: Combined proposal'
+---
+
+# GitLab CI Events Proposal 5: Combined proposal
+
+In this proposal we have separate files for cohesive groups of events. The
+files are included in the main `.gitlab-ci.yml` configuration file.
+
+```yaml
+# my/events/packages.yaml
+
+spec:
+ events:
+ - events/package/published
+ - events/audit/package/*
+ inputs:
+ env:
+---
+do_something:
+ script: ./run_for $[[ event.name ]] --env $[[ inputs.env ]]
+ rules:
+ - if: $[[ event.payload.package.name ]] == "my_package"
+```
+
+In the `.gitlab-ci.yml` file, we can enable the subscription:
+
+```yaml
+# .gitlab-ci.yml
+
+include:
+ - local: my/events/packages.yaml
+ inputs:
+ env: test
+
+```
+
+GitLab will detect changes in the included files and parse their specs. All
+the information required to define a subscription will be encapsulated in the
+spec, hence we will not need to read a whole file. We can easily read the `spec`
+header and calculate its checksum, which can become a workflow identifier.
+
+Once we see a new identifier, we can redefine subscriptions for a particular
+project and then upsert them into the database.
+
+We will use an efficient GIN index matching technique to match publishers with
+subscribers to run pipelines.
+
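+As an illustration only (the table, column, and index names below are assumptions), the subscriptions
+extracted from the spec headers could be stored in an array column with a GIN index, so a publisher can
+find subscribed projects with a single index-backed containment query:
+
+```ruby
+# Hypothetical sketch: persist subscriptions parsed from spec headers and
+# index the event names with GIN for efficient publisher/subscriber matching.
+class CreateCiEventSubscriptions < ActiveRecord::Migration[7.0]
+  def change
+    create_table :ci_event_subscriptions do |t|
+      t.references :project, null: false
+      t.text :workflow_checksum, null: false # checksum of the parsed `spec` header
+      t.text :events, array: true, null: false, default: []
+      t.timestamps
+    end
+
+    add_index :ci_event_subscriptions, :events, using: :gin
+  end
+end
+
+# Matching publishers with subscribers, for example when a package is published:
+#   Ci::EventSubscription.where("events @> ARRAY[?]::text[]", 'events/package/published')
+```
+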
+The syntax is also compatible with CI Components, and makes it easier to define
+components that are designed to run only for events happening inside
+GitLab.
+
+## No entrypoint file variant
+
+Another variant of this proposal is to move away from the single GitLab CI YAML
+configuration file. In that case, we would define another search **directory**,
+like `.gitlab/workflows/`, where we would store all YAML files.
+
+We wouldn't need to `include` workflow / events files anywhere, because these
+would be found by GitLab automatically. To implement this feature this
+way, we would need to extend features like "custom location for `.gitlab-ci.yml`
+file".
+
+Example, without using a main configuration file (the GitLab CI YAML file would
+still be supported):
+
+```yaml
+# .gitlab/workflows/push.yml
+
+spec:
+ events:
+ - events/repository/push
+---
+rspec-on-push:
+ script: bundle exec rspec
+```
+
+```yaml
+# .gitlab/workflows/merge_requests.yml
+
+spec:
+ events:
+ - events/merge_request/push
+---
+rspec-on-mr-push:
+ script: bundle exec rspec
+```
+
+```yaml
+# .gitlab/workflows/schedules.yml
+
+spec:
+ events:
+ - events/pipeline/schedule/run
+---
+smoke-test:
+ script: bundle exec rspec --smoke
+```
diff --git a/doc/architecture/blueprints/gitlab_observability_backend/metrics/index.md b/doc/architecture/blueprints/gitlab_observability_backend/index.md
index 3edb01d9140..5b99235e18c 100644
--- a/doc/architecture/blueprints/gitlab_observability_backend/metrics/index.md
+++ b/doc/architecture/blueprints/gitlab_observability_backend/index.md
@@ -37,7 +37,7 @@ With the development of the proposed system, we have the following goals:
- Support for long-term storage for Prometheus/OpenTelemetry formatted metrics, ingested via Prometheus remote_write API and queried via Prometheus remote_read API, PromQL or SQL with support for metadata and exemplars.
-The aformentioned goals can further be broken down into the following four sub-goals:
+The aforementioned goals can further be broken down into the following four sub-goals:
#### Ingesting data
@@ -52,7 +52,7 @@ The aformentioned goals can further be broken down into the following four sub-g
NOTE:
Although remote_write_sender does not test the correctness of a remote write receiver itself as is our case, it does bring some inspiration to implement/develop one within the scope of this project.
-- We aim to also ensure compatibility for special Prometheus data types, e.g. Prometheus histogram(s), summary(s).
+- We aim to also ensure compatibility for special Prometheus data types, for example, Prometheus histogram(s), summary(s).
#### Reading data
@@ -78,22 +78,22 @@ With the goals established above, we also want to establish what specific things
- We do not aim to support ingesting Prometheus exemplars in our first iteration, though we do aim to account for them in our design from the beginning.
NOTE:
-Worth noting that we intend to model exemplars the same way we’re modeling metric-labels, so building on top of the same data structure should help implementt support for metadata/exemplars rather easily.
+Worth noting that we intend to model exemplars the same way we're modeling metric-labels, so building on top of the same data structure should help implement support for metadata/exemplars rather easily.
## Proposal
-We intend to use GitLab Observability Backend as a framework for the Metrics implementation so that its lifecycle is also managed via already existing Kubernetes controllers e.g. scheduler, tenant-operator.
+We intend to use GitLab Observability Backend as a framework for the Metrics implementation so that its lifecycle is also managed via already existing Kubernetes controllers, for example, scheduler and tenant-operator.
![Architecture](supported-deployments.png)
-From a development perspective, what’s been marked as our “Application Server” above needs to be developed as a part of this proposal while the remaining peripheral components either already exist or can be provisioned via existing code in `scheduler`/`tenant-operator`.
+From a development perspective, what's been marked as our "Application Server" above needs to be developed as a part of this proposal while the remaining peripheral components either already exist or can be provisioned via existing code in `scheduler`/`tenant-operator`.
-**On the write path**, we expect to receive incoming data via `HTTP`/`gRPC` `Ingress` similar to what we do for our existing services, e.g. errortracking, tracing.
+**On the write path**, we expect to receive incoming data via `HTTP`/`gRPC` `Ingress` similar to what we do for our existing services, for example, errortracking and tracing.
NOTE:
Additionally, since we intend to ingest data via Prometheus `remote_write` API, the received data will be Protobuf-encoded, Snappy-compressed. All received data therefore needs to be decompressed & decoded to turn it into a set of `prompb.TimeSeries` objects, which the rest of our components interact with.
-We also need to make sure to avoid writing a lot of small writes into Clickhouse, therefore it’d be prudent to batch data before writing it into Clickhouse.
+We also need to avoid making a lot of small writes into Clickhouse, therefore it'd be prudent to batch data before writing it into Clickhouse.
We must also make sure ingestion remains decoupled with `Storage` so as to reduce undue dependence on a given storage implementation. While we do intend to use Clickhouse as our backing storage for any foreseeable future, this ensures we do not tie ourselves in into Clickhouse too much should future business requirements warrant the usage of a different backend/technology. A good way to implement this in Go would be our implementations adhering to a standard interface, the following for example:
@@ -177,7 +177,7 @@ Keeping inline with our current operational structure, we intend to deploy the m
```sql
CREATE TABLE IF NOT EXISTS samples ON CLUSTER '{cluster}' (
series_id UUID,
- timestamp DateTime64(3, ‘UTC’) CODEC(Delta(4), ZSTD),
+ timestamp DateTime64(3, 'UTC') CODEC(Delta(4), ZSTD),
value Float64 CODEC(Gorilla, ZSTD)
) ENGINE = ReplicatedMergeTree()
PARTITION BY toYYYYMMDD(timestamp)
@@ -189,7 +189,7 @@ ORDER BY (series_id, timestamp)
```sql
CREATE TABLE IF NOT EXISTS samples_metadata ON CLUSTER '{cluster}' (
series_id UUID,
- timestamp DateTime64(3, ‘UTC’) CODEC(Delta(4), ZSTD),
+ timestamp DateTime64(3, 'UTC') CODEC(Delta(4), ZSTD),
metadata Map(String, String) CODEC(ZSTD),
) ENGINE = ReplicatedMergeTree()
PARTITION BY toYYYYMMDD(timestamp)
@@ -207,7 +207,7 @@ PRIMARY KEY (labels, series_id)
```
```sql
-CREATE TABLE IF NOT EXISTS group_to_series ON CLUSTER ‘{cluster}’ (
+CREATE TABLE IF NOT EXISTS group_to_series ON CLUSTER '{cluster}' (
group_id Uint64,
series_id UUID,
) ORDER BY (group_id, series_id)
@@ -217,7 +217,7 @@ CREATE TABLE IF NOT EXISTS group_to_series ON CLUSTER ‘{cluster}’ (
- sharding considerations for a given tenant when ingesting/persisting data if we intend to co-locate data specific to multiple tenants within the same database tables. To simplify things, segregating tenant-specific data to their own dedicated set of tables would make a lot of sense.
-- structural considerations for “timestamps” when ingesting data across tenants.
+- structural considerations for "timestamps" when ingesting data across tenants.
- creation_time vs ingestion_time
@@ -228,7 +228,7 @@ Slightly non-trivial but we can potentially investigate the possibility of using
### Pros - multiple tables
-- Normalised data structuring allows for efficient storage of data, removing any redundancy across multiple samples for a given timeseries. Evidently, for the “samples” schema, we expect to store 32 bytes of data per metric point.
+- Normalised data structuring allows for efficient storage of data, removing any redundancy across multiple samples for a given timeseries. Evidently, for the "samples" schema, we expect to store 32 bytes of data per metric point.
- Better search complexity when filtering timeseries by labels/metadata, via the use of better indexed columns.
@@ -245,17 +245,18 @@ Slightly non-trivial but we can potentially investigate the possibility of using
### Storage - multiple tables
A major portion of our writes are made into the `samples` schema which contains a tuple containing three data points per metric point written:
-|column|data type|byte size|
-|---|---|---|
-|series_id|UUID|16 bytes|
-|timestamp|DateTime64|8 bytes|
-|value|Float64|8 bytes|
+
+| Column | Data type | Byte size |
+|:------------|:-----------|:----------|
+| `series_id` | UUID | 16 bytes |
+| `timestamp` | DateTime64 | 8 bytes |
+| `value` | Float64 | 8 bytes |
Therefore, we estimate to use 32 bytes per sample ingested.
### Compression - multiple tables
-Inspecting the amount of compression we’re able to get with the given design on our major schemas, we see it as a good starting point. Following measurements for both primary tables:
+Inspecting the amount of compression we're able to get with the given design on our major schemas, we see it as a good starting point. The following are measurements for both primary tables:
**Schema**: `labels_to_series` containing close to 12k unique `series_id`, each mapping to a set of 10-12 label string pairs
@@ -308,7 +309,7 @@ Query id: 04219cea-06ea-4c5f-9287-23cb23c023d2
### Performance - multiple tables
-From profiling our reference implementation, it can also be noted that most of our time right now is spent in the application writing data to Clickhouse and/or its related operations. A “top” pprof profile sampled from the implementation looked like:
+From profiling our reference implementation, it can also be noted that most of our time right now is spent in the application writing data to Clickhouse and/or its related operations. A "top" pprof profile sampled from the implementation looked like:
```shell
(pprof) top
@@ -329,26 +330,26 @@ Showing top 10 nodes out of 58
As is evident above from our preliminary analysis, writing data into Clickhouse can be a potential bottleneck. Therefore, on the write path, it'd be prudent to batch our writes into Clickhouse so as to reduce the amount of work the application server ends up doing making the ingestion path more efficient.
-On the read path, it’s also possible to parallelize reads for the samples table either by series_id(s) OR by blocks of time between the queried start and end timestamps.
+On the read path, it's also possible to parallelize reads for the samples table either by `series_id` OR by blocks of time between the queried start and end timestamps.
### Caveats
-- When dropping labels from already existing metrics, we treat their new counterparts as completely new series and hence attribute them to a new series_id. This avoids having to merge series data and/or values. The old series, if not actively written into, should eventually fall off their retention and get deleted.
+- When dropping labels from already existing metrics, we treat their new counterparts as completely new series and hence attribute them to a new `series_id`. This avoids having to merge series data and/or values. The old series, if not actively written into, should eventually fall off their retention and get deleted.
-- We have not yet accounted for any data aggregation. Our assumption is that the backing store (in Clickhouse) should allow us to keep a “sufficient” amount of data in its raw form and that we should be able to query against it within our query latency SLOs.
+- We have not yet accounted for any data aggregation. Our assumption is that the backing store (in Clickhouse) should allow us to keep a "sufficient" amount of data in its raw form and that we should be able to query against it within our query latency SLOs.
### **Rejected alternative**: Single, centralized table
### single, centralized data table
```sql
-CREATE TABLE IF NOT EXISTS metrics ON CLUSTER ‘{cluster}’ (
+CREATE TABLE IF NOT EXISTS metrics ON CLUSTER '{cluster}' (
group_id UInt64,
name LowCardinality(String) CODEC(ZSTD),
labels Map(String, String) CODEC(ZSTD),
metadata Map(String, String) CODEC(ZSTD),
value Float64 CODEC (Gorilla, ZSTD),
- timestamp DateTime64(3, ‘UTC’) CODEC(Delta(4),ZSTD)
+ timestamp DateTime64(3, 'UTC') CODEC(Delta(4),ZSTD)
) ENGINE = ReplicatedMergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (group_id, name, timestamp);
@@ -364,7 +365,7 @@ ORDER BY (group_id, name, timestamp);
- Huge redundancy built into the data structure since attributes such as name, labels, metadata are stored repeatedly for each sample collected.
-- Non-trivial complexity to search timeseries with values for labels/metadata given how they’re stored when backed by Maps/Arrays.
+- Non-trivial complexity to search timeseries with values for labels/metadata given how they're stored when backed by Maps/Arrays.
- High query latencies by virtue of having to scan large amounts of data per query made.
@@ -372,14 +373,14 @@ ORDER BY (group_id, name, timestamp);
### Storage - single table
-|column|data type|byte size|
-|---|---|---|
-|group_id|UUID|16 bytes|
-|name|String|-|
-|labels|Map(String, String)|-|
-|metadata|Map(String, String)|-|
-|value|Float64|8 bytes|
-|timestamp|DateTime64|8 bytes|
+| Column | Data type | Byte size |
+|:------------|:--------------------|:----------|
+| `group_id` | UUID | 16 bytes |
+| `name` | String | - |
+| `labels` | Map(String, String) | - |
+| `metadata` | Map(String, String) | - |
+| `value` | Float64 | 8 bytes |
+| `timestamp` | DateTime64 | 8 bytes |
NOTE:
Strings are of an arbitrary length, the length is not limited. Their value can contain an arbitrary set of bytes, including null bytes. We will need to regulate what we write into these columns application side.
@@ -486,7 +487,9 @@ We should only store data for a predetermined period of time, post which we eith
### Data access via SQL
-While our corpus of data is PromQL-queryable, it would be prudent to make sure we make the SQL interface “generally available” as well. This capability opens up multiple possibilities to query resident data and allows our users to slice and dice their datasets whichever way they prefer to and/or need to.
+While our corpus of data is PromQL-queryable, it would be prudent to make sure we make the SQL interface
+"generally available" as well. This capability opens up multiple possibilities to query resident data and
+allows our users to slice and dice their datasets whichever way they prefer to and/or need to.
#### Challenges
@@ -545,7 +548,7 @@ value: 0
On the read path, we first query all timeseries identifiers by searching for the labels under consideration. Once we have all the `series_id`(s), we then look up all corresponding samples between the query start timestamp and end timestamp.
-For e.g.
+For example:
```plaintext
kernel{service_environment=~"prod.*", measurement="boot_time"}
@@ -572,7 +575,7 @@ To account for newer writes when maintaining this cache:
- Have TTLs on the keys, jittered per key so as to rebuild them frequently enough to account for new writes.
-Once we know which timeseries we’re querying for, from there, we can easily look up all samples via the following query:
+Once we know which timeseries we're querying for, from there, we can easily look up all samples via the following query:
```sql
SELECT *
@@ -592,13 +595,13 @@ yielding all timeseries samples we were interested in.
We then render these into an array of `prometheus.QueryResult` object(s) and return back to the caller as a `prometheus.ReadResponse` object.
NOTE:
-The queries have been broken down into multiple queries only during our early experimentation/iteration, it’d be prudent to use subqueries within the same roundtrip to the database going forward into production/benchmarking.
+The queries have been broken down into multiple queries only during our early experimentation/iteration; it'd be prudent to use subqueries within the same roundtrip to the database going forward into production/benchmarking.
## Production Readiness
### Batching
-Considering we’ll need to batch data before ingesting large volumes of small writes into Clickhouse, the design must account for app-local persistence to allow it to locally batch incoming data before landing it into Clickhouse in batches of a predetermined size in order to increase performance and allow the table engine to continue to persist data successfully.
+Considering we'll need to batch data before ingesting large volumes of small writes into Clickhouse, the design must account for app-local persistence to allow it to locally batch incoming data before landing it into Clickhouse in batches of a predetermined size in order to increase performance and allow the table engine to continue to persist data successfully.
We have considered the following alternatives to implement app-local batching:
@@ -623,7 +626,7 @@ We propose the following three dimensions be tested while benchmarking the propo
- On-disk storage requirements (accounting for replication if applicable)
- Mean query response times
-For understanding performance, we’ll need to first compile a list of such queries given the data we ingest for our tests. Clickhouse query logging is super helpful while doing this.
+For understanding performance, we'll need to first compile a list of such queries given the data we ingest for our tests. Clickhouse query logging is super helpful while doing this.
NOTE:
Ideally, we aim to benchmark the system to be able to ingest >1M metric points/sec while consistently serving most queries under <1 sec.
diff --git a/doc/architecture/blueprints/gitlab_observability_backend/metrics/supported-deployments.png b/doc/architecture/blueprints/gitlab_observability_backend/supported-deployments.png
index 9dccc515129..9dccc515129 100644
--- a/doc/architecture/blueprints/gitlab_observability_backend/metrics/supported-deployments.png
+++ b/doc/architecture/blueprints/gitlab_observability_backend/supported-deployments.png
Binary files differ
diff --git a/doc/architecture/blueprints/modular_monolith/bounded_contexts.md b/doc/architecture/blueprints/modular_monolith/bounded_contexts.md
new file mode 100644
index 00000000000..0f71e24864e
--- /dev/null
+++ b/doc/architecture/blueprints/modular_monolith/bounded_contexts.md
@@ -0,0 +1,119 @@
+---
+status: proposed
+creation-date: "2023-06-21"
+authors: [ "@fabiopitino" ]
+coach: [ ]
+approvers: [ ]
+owning-stage: ""
+---
+
+# Defining bounded contexts
+
+## Status quo
+
+Today the GitLab codebase doesn't have a clear domain structure.
+We have [forced the creation of some modules](https://gitlab.com/gitlab-org/gitlab/-/issues/212156)
+as a first step but we don't have a well defined strategy for doing it consistently.
+
+The majority of the code is not properly namespaced and organized:
+
+- Ruby namespaces used don't always represent the SSoT. We have overlapping concepts spread across multiple
+ namespaces. For example: `Abuse::` and `Spam::` or `Security::Orchestration::` and `Security::SecurityOrchestration`.
+- Domain code related to the same bounded context is scattered across multiple directories.
+- Domain code is present in `lib/` directory under namespaces that differ from the same domain under `app/`.
+- Some namespaces are very shallow, containing a few classes while other namespaces are very deep and large.
+- A lot of the old code is not namespaced, making it difficult to understand the context where it's used.
+
+## Goal
+
+1. Define a list of characteristics that bounded contexts should have. For example: must relate to at least 1 product category.
+1. Have a list of top-level bounded contexts where all domain code is broken down into.
+1. Engineers can clearly see the list of available bounded contexts and can make an easy decision where to add
+ new classes and modules.
+1. Define a process for adding a new bounded context to the application. This should occur quite infrequently
+ and new bounded contexts need to adhere to the characteristics defined previously.
+1. Enforce the list of bounded contexts so that no new top-level namespaces can be used aside from the authorized ones.
+
+## Iterations
+
+### 0. Extract libraries out of the codebase
+
+In June 2023 we started extracting gems out of the main codebase, into the
+[`gems/` directory inside the monorepo](https://gitlab.com/gitlab-org/gitlab/-/blob/4c6e120069abe751d3128c05ade45ea749a033df/doc/development/gems.md).
+
+This is our first step towards modularization: externalize code that can be
+extracted to prevent coupling from being introduced into modules that have been
+designed as separate components.
+
+These gems are still part of the monorepo.
+
+### 1. What makes a bounded context?
+
+From the research in [Proposal: split GitLab monolith into components](https://gitlab.com/gitlab-org/gitlab/-/issues/365293)
+it seems that following [product categories](https://about.gitlab.com/handbook/product/categories/#hierarchy), as a guideline,
+would be much better than translating organization structure into folder structure (for example, `app/modules/verify/pipeline-execution/...`).
+
+However, this guideline alone is not sufficient and we need a more specific strategy:
+
+- Product categories can change ownership and we have seen some pretty frequent changes, even back and forth.
+ Moving code every time a product category changes ownership adds too much maintenance overhead.
+- Teams and organization changes should just mean relabelling the ownership of specific modules.
+- Bounded contexts (top level modules) should be [sufficiently deep](../../../development/software_design.md#use-namespaces-to-define-bounded-contexts)
+ to encapsulate implementation details and provide a smaller interface.
+- Some product categories, such as Browser Performance Testing, are just too small to represent a bounded context on their own.
+  We should have a strategy for grouping product categories together when it makes sense.
+- Product categories don't necessarily translate into clean boundaries.
+ `Category:Pipeline Composition` and `Category:Continuous Integration` are some examples where Pipeline Authoring team
+ and Pipeline Execution team share a lot of code.
+- Some parts of the code might not have a clear product category associated with them.
+
+Despite the above, product categories provide a rough view of the bounded contexts at play in the application.
+
+One idea could be to use product categories to sketch the initial set of bounded contexts.
+Then, group related or strongly coupled categories under the same bounded context and create new bounded contexts if missing.
+
+### 2. Identify existing bounded contexts
+
+Start by listing all the Ruby files in a spreadsheet and categorizing them into components following the guidelines above.
+Some of them are already pretty explicit, like `Ci::`, `Packages::`, etc. Components should follow our
+[existing naming guide](../../../development/software_design.md#use-namespaces-to-define-bounded-contexts).
+
+This could be a short-lived Working Group with representative members of each DevOps stage (for example, Senior+ engineers).
+The WG would help define high-level components and would be the DRIs for driving the changes in their respective DevOps stage.
+
+### 3. Publish the list of bounded contexts
+
+The list of bounded contexts (top-level namespaces) extracted from the codebase should be defined statically so it can be
+used programmatically.
+
+```yaml
+# file: config/bounded_contexts.yml
+bounded_contexts:
+ continuous_integration:
+ dir: modules/ci
+ namespace: 'Ci::'
+ packages: ...
+ merge_requests: ...
+ git: ...
+```
+
+With this static list we could:
+
+- Document the existing bounded contexts for engineers to see the big picture.
+- Understand where to place new classes and modules.
+- Enforce that no top-level namespaces are used other than those in the list of bounded contexts (see the sketch after this list).
+- Autoload non-standard Rails directories based on the given list.
+
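+For illustration, an enforcement check along these lines (the file layout follows the example above, with each entry defining a `namespace` key; everything else is an assumption) could run in CI or back a RuboCop cop:
+
+```ruby
+# Hypothetical sketch: flag top-level namespaces that are not declared in
+# config/bounded_contexts.yml. The heuristic only inspects the first constant
+# defined in each file.
+require 'yaml'
+
+allowed = YAML.load_file('config/bounded_contexts.yml')
+  .fetch('bounded_contexts')
+  .values
+  .map { |context| context['namespace'].delete_suffix('::') }
+
+offenders = Dir.glob('{app,lib}/**/*.rb').filter_map do |path|
+  namespace = File.read(path)[/^\s*(?:module|class)\s+([A-Z]\w*)/, 1]
+  [path, namespace] if namespace && !allowed.include?(namespace)
+end
+
+offenders.each { |path, namespace| warn "#{path}: `#{namespace}` is not an allowed bounded context" }
+```
+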
+## Glossary
+
+- `modules` are Ruby modules and can be used to nest code hierarchically.
+- `namespaces` are unique hierarchies of Ruby constants. For example, `Ci::` but also `Ci::JobArtifacts::` or `Ci::Pipeline::Chain::`.
+- `packages` are Packwerk packages to group together related functionalities. These packages can be big or small depending on the design and architecture. Inside a package all constants (classes and modules) have the same namespace. For example:
+ - In a package `ci`, all the classes would be nested under `Ci::` namespace. There can be also nested namespaces like `Ci::PipelineProcessing::`.
+ - In a package `ci-pipeline_creation` all classes are nested under `Ci::PipelineCreation`, like `Ci::PipelineCreation::Chain::Command`.
+ - In a package `ci` a class named `MergeRequests::UpdateHeadPipelineService` would not be allowed because it would not match the package's namespace.
+ - This can be enforced easily with [Packwerk's based Rubocop Cops](https://github.com/rubyatscale/rubocop-packs/blob/main/lib/rubocop/cop/packs/root_namespace_is_pack_name.rb).
+- `bounded context` is a top-level Packwerk package that represents a macro aspect of the domain. For example: `Ci::`, `MergeRequests::`, `Packages::`, etc.
+ - A bounded context is represented by a single Ruby module/namespace. For example, `Ci::` and not `Ci::JobArtifacts::`.
+ - A bounded context can be made of 1 or multiple Packwerk packages. Nested packages would be recommended if the domain is quite complex and we want to enforce privacy among all the implementation details. For example: `Ci::PipelineProcessing::` and `Ci::PipelineCreation::` could be separate packages of the same bounded context and expose their public API while keeping implementation details private.
+ - A new bounded context like `RemoteDevelopment::` can be represented as a single package while large and complex bounded contexts like `Ci::` would need to be organized into smaller/nested packages.
diff --git a/doc/architecture/blueprints/modular_monolith/hexagonal_monolith/hexagonal_architecture.png b/doc/architecture/blueprints/modular_monolith/hexagonal_monolith/hexagonal_architecture.png
new file mode 100644
index 00000000000..a8d79e276a2
--- /dev/null
+++ b/doc/architecture/blueprints/modular_monolith/hexagonal_monolith/hexagonal_architecture.png
Binary files differ
diff --git a/doc/architecture/blueprints/modular_monolith/hexagonal_monolith/index.md b/doc/architecture/blueprints/modular_monolith/hexagonal_monolith/index.md
new file mode 100644
index 00000000000..eb4b428cf52
--- /dev/null
+++ b/doc/architecture/blueprints/modular_monolith/hexagonal_monolith/index.md
@@ -0,0 +1,132 @@
+---
+status: proposed
+creation-date: "2023-05-22"
+authors: [ "@fabiopitino" ]
+coach: [ ]
+approvers: [ ]
+owning-stage: ""
+---
+
+# Hexagonal Rails Monolith
+
+## Summary
+
+**TL;DR:** Change the Rails monolith from a [big ball of mud](https://en.wikipedia.org/wiki/Big_ball_of_mud) state to
+a [modular monolith](https://www.thereformedprogrammer.net/my-experience-of-using-modular-monolith-and-ddd-architectures)
+that uses a [Hexagonal architecture](https://en.wikipedia.org/wiki/Hexagonal_architecture_(software)) (also known as ports and adapters architecture).
+Extract cohesive functional domains into a separate directory structure using Domain-Driven Design practices.
+Extract infrastructure code (logging, database tools, instrumentation, etc.) into gems, essentially removing the need for the `lib/` directory.
+Define what parts of the functional domains (for example application services) are of public use for integration (the ports)
+and what parts are instead private encapsulated details.
+Define Web, Sidekiq, REST, GraphQL, and Action Cable as the adapters in the external layer of the architecture.
+Use [Packwerk](https://github.com/Shopify/packwerk) to enforce privacy and dependencies between modules of the monolith.
+
+![Hexagonal Architecture for GitLab monolith](hexagonal_architecture.png)
+
+## Details
+
+### Application domain
+
+The application core (functional domains) is divided into separate top-level bounded contexts named after the
+[feature category](https://gitlab.com/gitlab-com/www-gitlab-com/blob/master/data/categories.yml) they represent.
+A bounded context is represented in the form of a Ruby module.
+This follows the existing [guideline on naming namespaces](../../../../development/software_design.md#use-namespaces-to-define-bounded-contexts) but adds more structure to it.
+
+Modules should:
+
+- Be deep enough to encapsulate a lot of the internal logic, state and data.
+- Have a public interface that is as small as possible, safe to use by other bounded contexts and well documented.
+- Be cohesive and represent the SSoT (single source of truth) of the feature it describes.
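+
+As a purely illustrative sketch (the names below are hypothetical and not taken from the codebase), a deep module with a small public interface could look like this:
+
+```ruby
+# Hypothetical sketch of a deep bounded context with a minimal public surface.
+module MergeTrains
+  # Public entry point: the only method other bounded contexts should call.
+  def self.add_to_train(merge_request, user)
+    AddService.new(merge_request, user).execute
+  end
+
+  # Implementation detail, kept private to the bounded context.
+  class AddService
+    def initialize(merge_request, user)
+      @merge_request = merge_request
+      @user = user
+    end
+
+    def execute
+      # ...domain logic, state, and data access live here...
+    end
+  end
+  private_constant :AddService
+end
+```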
+
+Feature categories represent a product area that is large enough for the module to be deep, so we don't have a proliferation
+of small top-level modules. It also helps the codebase to follow the
+[ubiquitous language](../../../../development/software_design.md#use-ubiquitous-language-instead-of-crud-terminology).
+A team can be responsible for multiple feature categories, hence owning the vision for multiple bounded contexts.
+While feature categories can sometimes change ownership, remapping a bounded context to its new owners
+is very cheap.
+Using feature categories also helps new contributors, whether GitLab team members or members of the wider community,
+to navigate the codebase.
+
+If multiple feature categories are strongly related, they may be grouped under a single bounded context.
+If a feature category is only relevant in the context of a parent feature category, it may be included in the
+parent's bounded context. For example: build artifacts exist only in the context of the Continuous Integration feature category,
+so they may be merged under a single bounded context.
+
+### Application adapters
+
+>>>
+_Adapters are the glue between components and the outside world._
+_They tailor the exchanges between the external world and the ports that represent the requirements of the inside_
+_of the application component. There can be several adapters for one port, for example, data can be provided by_
+_a user through a GUI or a command-line interface, by an automated data source, or by test scripts._ -
+[Wikipedia](https://en.wikipedia.org/wiki/Hexagonal_architecture_(software)#Principle)
+>>>
+
+Application adapters would be:
+
+- Web UI (Rails controllers, views, JS and Vue client)
+- REST API endpoints
+- GraphQL Endpoints
+- Action Cable
+
+TODO: continue describing how adapters are organized and why they are separate from the domain code.
+
+### Platform code
+
+We consider platform code to be any classes and modules that are required by the application domain and/or application
+adapters to work.
+
+Today the Rails `lib/` directory contains multiple categories of code that could live somewhere else,
+most of which is platform code:
+
+- REST API endpoints could be part of the [application adapters](#application-adapters).
+- Domain code (both large domain code such as `Gitlab::Ci` and small such as `Gitlab::JiraImport`) should be
+ moved inside the [application domain](#application-domain).
+- The rest could be extracted as separate single-purpose gems under the `gems/` directory inside the monolith.
+ This can include utilities such as logging, error reporting and metrics, rate limiters,
+ infrastructure code like `Gitlab::ApplicationRateLimiter`, `Gitlab::Redis`, `Gitlab::Database`
+ and generic subdomains like `Banzai`.
+
+Base classes to extend Rails framework such as `ApplicationRecord` or `ApplicationWorker` as well as GitLab base classes
+such as `BaseService` could be implemented as gem extensions.
+
+This means that aside from the Rails framework code, the rest of the platform code resides in `gems/`.
+
+Eventually all code inside `gems/` could potentially be extracted into a separate repository or open sourced.
+Placing platform code inside `gems/` makes it clear that its purpose is to serve the application code.
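+
+For illustration, a platform gem living under `gems/` could be referenced from the monolith with a path source in the `Gemfile` (the gem names here are hypothetical):
+
+```ruby
+# Gemfile (sketch): platform code packaged as local gems under gems/.
+gem 'gitlab-logging', path: 'gems/gitlab-logging'
+gem 'gitlab-base_service', path: 'gems/gitlab-base_service'
+```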
+
+### Why Packwerk?
+
+TODO:
+
+- Boundaries are not enforced at runtime. Ruby code still works as if it were all loaded in the same memory space.
+- It can be introduced incrementally. Not everything needs to be moved to packs for the Rails autoloader to work.
+
+Companies like Gusto have been developing and maintaining a list of [development and engineering tools](https://github.com/rubyatscale)
+for organizations that want to move to a Rails modular monolith built around Packwerk.
+
+### EE and JH extensions
+
+TODO:
+
+## Challenges
+
+- Such changes require a shift in the development mindset to understand the benefits of the modular
+  architecture and not fall back into legacy practices.
+- Changing the application architecture is a challenging task. It takes time, resources and commitment
+ but most importantly it requires buy-in from engineers.
+- This may require us to have a medium- to long-term team of engineers or a Working Group that makes progress
+  on the architecture evolution plan, fosters discussions in various engineering channels, and resolves adoption challenges.
+- We need to ensure we build standards and guidelines and not silos.
+- We need to ensure we have clear guidelines on where new code should be placed. We must not recreate junk drawer folders like `lib/`.
+
+## Opportunities
+
+The move to a modular monolith architecture enables a lot of opportunities that we could explore in the future:
+
+- We could align the concept of domain expert with explicitly owning specific modules of the monolith.
+- The use of static analysis tools (such as Packwerk and RuboCop) can catch design violations in development and CI, ensuring
+ that best practices are honored.
+- By defining dependencies between modules explicitly we could speed up CI by testing only the parts that are affected by
+ the changes.
+- Such modular architecture could help to further decompose modules into separate services if needed.
diff --git a/doc/architecture/blueprints/modular_monolith/index.md b/doc/architecture/blueprints/modular_monolith/index.md
new file mode 100644
index 00000000000..ef50be643a6
--- /dev/null
+++ b/doc/architecture/blueprints/modular_monolith/index.md
@@ -0,0 +1,112 @@
+---
+status: proposed
+creation-date: "2023-05-22"
+authors: [ "@grzesiek", "@fabiopitino" ]
+coach: [ ]
+approvers: [ ]
+owning-stage: ""
+participating-stages: []
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+# GitLab Modular Monolith
+
+## Summary
+
+The main [GitLab Rails](https://gitlab.com/gitlab-org/gitlab)
+project has been implemented as a large monolithic application, using
+the [Ruby on Rails](https://rubyonrails.org/) framework. It has over 2.2 million
+lines of Ruby code and hundreds of engineers contributing to it every day.
+
+The application has been growing in complexity for more than a decade. The
+monolithic architecture has served us well during this time, making it possible
+to keep high development velocity and great engineering productivity.
+
+Even though we strive to have [an approachable open-core architecture](https://about.gitlab.com/blog/2022/07/14/open-core-is-worse-than-plugins/),
+we need to strengthen the boundaries between domains to retain velocity and
+increase development predictability.
+
+As we grow as an engineering organization, we want to explore a slightly
+different, but related, architectural paradigm:
+[a modular monolith design](https://en.wikipedia.org/wiki/Modular_programming),
+while still using a [monolithic architecture](https://en.wikipedia.org/wiki/Monolithic_application)
+with satellite services.
+
+This should allow us to increase engineering efficiency, reduce cognitive
+load, and eventually decouple internal components to the extent that we can
+deploy and run them separately if needed.
+
+## Motivation
+
+Working with a large and tightly coupled monolithic application is challenging:
+
+Engineering:
+
+- Onboarding engineers takes time. It takes a while before engineers feel
+ productive due to the size of the context and the amount of coupling.
+- We need to use `CODEOWNERS` file feature for several domains but
+ [these rules are complex](https://gitlab.com/gitlab-org/gitlab/-/blob/409228f064a950af8ff2cecdd138fc9da41c8e63/.gitlab/CODEOWNERS#L1396-1457).
+- It is difficult for engineers to build a mental map of the application due to its size.
+ Even apparently isolated changes can have [far-reaching repercussions](https://about.gitlab.com/handbook/engineering/development/#reducing-the-impact-of-far-reaching-work)
+ on other parts of the monolith.
+- Attrition/retention of engineering talent. It is fatiguing and demoralizing for
+ engineers to constantly deal with the obstacles to productivity.
+
+Architecture:
+
+- There is little structure inside the monolith. We have attempted to enforce
+ the creation [of some modules](https://gitlab.com/gitlab-org/gitlab/-/issues/212156)
+ but have no company-wide strategy on what the functional parts of the
+ monolith should be, and how code should be organized.
+- There is no isolation between existing modules. Ruby does not provide
+ out-of-the-box tools to effectively enforce boundaries. Everything lives
+ under the same memory space.
+- We rarely build abstractions that can boost our efficiency.
+- Moving stable parts of the application into separate services is impossible
+ due to high coupling.
+- We are unable to deploy changes to specific domains separately and isolate
+ failures that are happening inside them.
+
+Productivity:
+
+- High median-time-to-production for complex changes.
+- It can be overwhelming for the wider-community members to contribute.
+- Reducing testing times requires diligent and persistent efforts.
+
+## Goals
+
+- Increase the development velocity and predictability through separation of concerns.
+- Improve code quality by reducing coupling and introducing useful abstractions.
+- Build abstractions required to deploy and run GitLab components separately.
+
+## How do we get there?
+
+While we do recognize that modularization is a significant technical endeavor,
+we believe that the main challenge is organizational rather than technical. We
+need to design the separation so that modules are decoupled in a pragmatic way
+that works well on GitLab.com as well as on self-managed instances, and we need
+to align modularization with the way in which we want to work at GitLab.
+
+There are many aspects and details required to make modularization of our
+monolith successful. We will work on the aspects listed below, refine them, and
+add more important details as we move forward towards the goal:
+
+1. [Deliver modularization proof-of-concepts that will deliver key insights](proof_of_concepts.md)
+1. [Align modularization plans to the organizational structure](bounded_contexts.md)
+1. Start a training program for team members on how to work with decoupled domains (TODO)
+1. Build tools that will make it easier to build decoupled domains through inversion of control (TODO)
+1. Separate domains into modules that will reflect organizational structure (TODO)
+1. Build necessary services to align frontend and backend modularization (TODO)
+1. [Introduce hexagonal architecture within the monolith](hexagonal_monolith/index.md)
+1. Introduce clean architecture with one-way-dependencies and host application (TODO)
+1. Build abstractions that will make it possible to run and deploy domains separately (TODO)
+
+## Status
+
+In progress.
+
+## References
+
+[List of references](references.md)
diff --git a/doc/architecture/blueprints/modular_monolith/proof_of_concepts.md b/doc/architecture/blueprints/modular_monolith/proof_of_concepts.md
new file mode 100644
index 00000000000..c215ffafbe4
--- /dev/null
+++ b/doc/architecture/blueprints/modular_monolith/proof_of_concepts.md
@@ -0,0 +1,134 @@
+---
+status: proposed
+creation-date: "2023-07-05"
+authors: [ "@grzesiek", "@fabiopitino" ]
+coach: [ ]
+owners: [ ]
+---
+
+# Modular Monolith: PoCs
+
+Modularization of our monolith is a complex project. There will be many
+unknowns. One thing that can help us mitigate the risks and deliver key
+insights is a set of Proof-of-Concepts that we could deliver early on, to better
+understand what will need to be done.
+
+## Inter-module communication
+
+One PoC that we plan to deliver explores inter-module communication. We
+recognize the need to separate modules, but still allow them to communicate
+with each other using a well-defined interface. Modules can communicate through
+facade classes (like libraries usually do) or through an eventing system. Both
+ways are important.
+
+The main question is: how do we want to define the interface and how to design
+the communication channels?
+
+It is one of our goals to make it possible to plug modules out, and operate
+some of them as separate services. This will make it easier to deploy GitLab.com
+in the future and scale key domains. One possible way to achieve this goal
+would be to design the inter-module communication using protobuf as an
+interface and gRPC as a communication channel. When modules are plugged-in, we
+would bypass gRPC and serialization and use in-process communication primitives
+(while still using protobuf as an interface). When a module gets plugged-out,
+gRPC would carry messages between modules.
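+
+As a minimal Ruby sketch of that idea (all class, service, and method names below are hypothetical; `Pipelines::Service::Stub` stands in for a stub generated from the protobuf definition):
+
+```ruby
+# Hypothetical sketch: the same protobuf-defined request can be served either
+# in-process (module plugged in) or over gRPC (module plugged out).
+module Ci
+  module PipelinesPort
+    # Adapter used when the CI module runs inside the monolith.
+    class InProcessAdapter
+      def create_pipeline(request)
+        # Plain method call; no serialization, the protobuf message is passed as-is.
+        Ci::CreatePipelineService.new.execute(request)
+      end
+    end
+
+    # Adapter used when the CI module runs as a separate service.
+    class GrpcAdapter
+      def initialize(host)
+        # Stub generated from the protobuf service definition.
+        @stub = Pipelines::Service::Stub.new(host, :this_channel_is_insecure)
+      end
+
+      def create_pipeline(request)
+        @stub.create_pipeline(request) # same message, carried over gRPC
+      end
+    end
+  end
+end
+```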
+
+## Use Packwerk to enforce module boundaries
+
+Packwerk is a static analyzer that helps define and enforce module boundaries
+in Ruby.
+
+[In this PoC merge request](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/98801)
+we demonstrate a possible directory structure of the monolith broken down into separate
+modules.
+
+The PoC also aims to solve the problem of EE extensions (and JH too) allowing the
+Rails autoloader to be tweaked depending on whether to load only the Core codebase or
+any extensions.
+
+The PoC also attempted to only move a small part of the `Ci::` namespace into a
+`components/ci` Packwerk package. This seems to be the most iterative approach
+explored so far.
+
+There are different approaches we could use to adopt Packwerk. Other PoCs we
+also explored are the [large extraction of a CI package](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/88899)
+and [moving the two main CI classes into a package](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/90595).
+
+All three PoCs have a lot in common, from the introduction of Packwerk packages and configuration
+to setting paths for the autoloader to work with the packages. What changes between the
+various merge requests is the approach to choosing which files to move first.
+
+The main goals of the PoC were:
+
+- Understand whether Packwerk can be used on the GitLab codebase.
+- Understand the learning curve for developers.
+- Verify support for EE and JH extensions.
+- Allow gradual modularization.
+
+### Positive results
+
+- Using Packwerk would be pretty simple on GitLab since it's designed primarily to work
+ on Rails codebases.
+- We can change the organization of the domain code to be module-oriented instead of following
+  the MVC pattern. It requires small initial changes to allow the Rails autoloading
+  to support the new directory structure, which, incidentally, is not imposed by Packwerk (see the sketch after this list).
+  After that, registering a new top-level package/bounded context would be a one-line change.
+- Using the correct directory structure indicated in the [PoC](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/98801)
+ allows packages to contain all the code, including EE and JH extensions.
+- Gradual modularization is possible and we can have any degree of modularization we want,
+  from no enforcement initially down to complete isolation simulating an in-memory micro-service environment.
+- Moving files into a Packwerk package doesn't necessarily mean renaming constants.
+  While this is not advisable long term, it's extra flexibility that the tool provides.
+ - For example: If we are extracting the `Ci::` module into a Packwerk package there can be
+ constants that belong to the CI domain but are not namespaced, like `CommitStatus` or
+ that have a different namespace, like `Gitlab::Ci::`.
+ Packwerk allows such constants to be moved inside the `ci` package and correctly flags
+ boundary violations.
+  - Packwerk enhancements from the RubyAtScale tooling make it possible to enforce that all constants inside
+    a package share the same Ruby namespace. We would eventually want to leverage that.
+- RubyAtScale also provides tools to track metrics about modularization and adoption, which we
+  would need to monitor and drive as an engineering organization.
+- Packwerk has IDE extensions (for example, for VS Code) that provide real-time feedback on violations
+  (like RuboCop). It can also be run via the CLI during the development workflow against a single
+  package. It could be integrated into pre-push Git hooks or Danger during code reviews.
+
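+For illustration, the autoloader change mentioned in the list above could be as small as registering the package directories (the directory layout and paths are illustrative):
+
+```ruby
+# config/application.rb (sketch, inside the Application class body): register
+# each package's app/ subdirectories with the Rails autoloader.
+Dir.glob(Rails.root.join('components/*')).each do |package_root|
+  Dir.glob(File.join(package_root, 'app/*')).each do |app_dir|
+    config.eager_load_paths << app_dir
+  end
+end
+```
+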
+### Challenges
+
+Some of these challenges are not specific to Packwerk as a tool or approach. They were observed
+during the PoC and are more generally related to the process of modularization:
+
+- There is no right or wrong approach when introducing Packwerk packages. We need to define
+ clear guidelines to give developers the tools to make the best decision:
+  - Sometimes it could be creating an empty package and moving files into it gradually.
+ - Sometimes it could be wrapping an already well designed and isolated part of the codebase.
+ - Sometimes it could be creating a new package from scratch.
+- As we move code to a different directory structure we need to involve JiHu as they manage
+ extensions following the current directory structure.
+  We may have modules that are partially migrated and we need to ensure JiHu is kept up to date
+  with the current progress.
+- After privacy/dependency checks are enabled, Packwerk will log a lot of violations
+ (like Rubocop TODOs) since constant references in a Rails codebase are very entangled.
+ - The team owning the package needs to define a vision for the package.
+ What would the package look like once all violations have been fixed?
+ This may mean specifying where the package fits in the
+ [context map](https://www.oreilly.com/library/view/what-is-domain-driven/9781492057802/ch04.html)
+    of the system: how the current package should be used by another package `A`, and how
+    it should use other packages.
+ - The vision above should tell developers how they should fix these violations over time.
+ Should they make a specific constant public? Should the package list another package as its
+ dependencies? Should events be used in some scenarios?
+ - Teams will likely need guidance in doing that. We may need to have a team of engineers, like
+ maintainers with a very broad understanding of the domains, that will support engineering
+ teams in this effort.
+- Changes to CI configuration for tuning Knapsack and selective testing were ignored during the
+  PoC.
+
+## Frontend sorting hat
+
+Frontend sorting-hat is a PoC for combining multiple domains to render a full
+page of GitLab (with menus, and items that come from multiple separate
+domains).
+
+## Frontend assets aggregation
+
+Frontend assets aggregation is a PoC for a possible separation of micro-frontends.
diff --git a/doc/architecture/blueprints/modular_monolith/references.md b/doc/architecture/blueprints/modular_monolith/references.md
new file mode 100644
index 00000000000..2c7d3dc972d
--- /dev/null
+++ b/doc/architecture/blueprints/modular_monolith/references.md
@@ -0,0 +1,70 @@
+---
+status: proposed
+creation-date: "2023-06-21"
+authors: [ "@fabiopitino" ]
+coach: [ ]
+approvers: [ ]
+owning-stage: ""
+---
+
+# References
+
+## Related design docs
+
+- [Composable codebase design doc](../composable_codebase_using_rails_engines/index.md)
+
+## Related Issues
+
+- [Split GitLab monolith into components](https://gitlab.com/gitlab-org/gitlab/-/issues/365293)
+- [Make it simple to build and use "Decoupled Services"](https://gitlab.com/gitlab-org/gitlab/-/issues/31121)
+- [Use nested structure to organize CI classes](https://gitlab.com/gitlab-org/gitlab/-/issues/209745)
+- [Create new models / classes within a module / namespace](https://gitlab.com/gitlab-org/gitlab/-/issues/212156)
+- [Make teams to be maintainers of their code](https://gitlab.com/gitlab-org/gitlab/-/issues/25872)
+- [Add backend guide for Dependency Injection](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/73644)
+
+## Internal Slack Channels
+
+- [`#modular_monolith`](https://gitlab.slack.com/archives/C03NTK6HZBM)
+- [`#architecture`](https://gitlab.slack.com/archives/CJ4DB7517)
+
+## Reference Implementations / Guides
+
+Gusto / RubyAtScale:
+
+- [RubyAtScale toolchain for modularization](https://github.com/rubyatscale)
+- [Gusto's engineering blog](https://engineering.gusto.com/laying-the-cultural-and-technical-foundation-for-big-rails/)
+- [Gradual modularization](https://gradualmodularization.com/) (successor to CBRA)
+- [Component-Based Rails Applications](https://cbra.info) ("deprecated")
+
+Shopify:
+
+- [Packwerk](https://github.com/Shopify/packwerk)
+- [Shopify's journey to modularization](https://shopify.engineering/shopify-monolith)
+- [Internal GitLab doc transcript of an AMA with a Shopify engineer](https://docs.google.com/document/d/1uZbcaK8Aqs-D_n7_uQ5XE295r5UWDJEBwA6g5bTjcwc/edit#heading=h.d1tml5rlzrpa)
+
+Domain-Driven Rails / Rails Event Store:
+
+Rails Event Store is relevant because it is a mechanism to achieve many
+of the goals discussed here, and is based upon patterns used by Arkency
+to build production applications.
+
+This doesn't mean we need to use this specific framework or approach.
+
+However, the general concepts of DDD/ES/CQRS are important and in some
+cases may be necessary to achieve the goals of this blueprint, so it's
+useful to have concrete production-proven implementations of those
+concepts to look at as an example.
+
+- [Arkency's domain-driven Rails](https://products.arkency.com/domain-driven-rails/)
+- [Arkency's Rails Event Store](https://railseventstore.org)
+
+App Continuum:
+
+An illustration of how an application can evolve from a small, unstructured app, through various
+stages including a modular well-structured monolith, all the way to a microservices architecture.
+
+Includes discussion of why you might want to stop at various stages, and specifically the
+challenges/concerns with making the jump to microservices, and why sticking with a
+well-structured monolith may be preferable in many cases.
+
+- [App Continuum](https://www.appcontinuum.io)
diff --git a/doc/architecture/blueprints/object_pools/index.md b/doc/architecture/blueprints/object_pools/index.md
index d14e11b8d36..7b7f8d7d180 100644
--- a/doc/architecture/blueprints/object_pools/index.md
+++ b/doc/architecture/blueprints/object_pools/index.md
@@ -805,10 +805,10 @@ pools as it will always match the contents of the upstream repository.
It has a number of downsides though:
-- Normal repositories can now have different states, where some of the
+- Repositories can now have different states, where some of the
repositories are allowed to prune objects and others aren't. This introduces a
source of uncertainty and makes it easy to accidentally delete objects in a
- normal repository and thus corrupt its forks.
+ repository and thus corrupt its forks.
- When upstream repositories go private we must stop updating objects which are
supposed to be deduplicated across members of the fork network. This means
diff --git a/doc/architecture/blueprints/observability_tracing/arch.png b/doc/architecture/blueprints/observability_tracing/arch.png
new file mode 100644
index 00000000000..36ff23dc8a5
--- /dev/null
+++ b/doc/architecture/blueprints/observability_tracing/arch.png
Binary files differ
diff --git a/doc/architecture/blueprints/observability_tracing/index.md b/doc/architecture/blueprints/observability_tracing/index.md
new file mode 100644
index 00000000000..4291683f83f
--- /dev/null
+++ b/doc/architecture/blueprints/observability_tracing/index.md
@@ -0,0 +1,171 @@
+---
+status: proposed
+creation-date: "2023-06-20"
+authors: [ "@mappelman" ]
+approvers: [ "@hbenson", "@nicholasklick" ]
+owning-stage: "~devops::monitor"
+participating-stages: []
+---
+
+# Distributed Tracing Feature
+
+## Summary
+
+GitLab already has distributed tracing as a feature, so this proposal focuses on the changes required to make the feature generally available (GA). Given the strategic direction update, which is covered in more detail in the motivation section, we are deprecating the GitLab Observability UI (GOUI) in favor of building a native UI for tracing in the GitLab UI.
+
+This proposal covers the scope and technical approach to what will be released in GA, including the new UI, API changes and any backend changes to support the new direction.
+
+Distributed Tracing will GA as a premium feature, initially available only to Premium and Ultimate users.
+
+## Motivation
+
+In December 2021 GitLab acquired OpsTrace and kicked off work integrating Observability functionality into the DevOps platform. At that point the stated goal was to create an observability distribution that could be run independently of GitLab and which integrated well into the DevSecOps platform. See [Internal Only- Argus FAQ](https://docs.google.com/document/d/1eWZhbRdgQx74udzZjpSMgWnHfpYWETD7AWqnPVD5Sm8/edit) for more background on previous strategy.
+
+Since December 2021 there have been a lot of changes in the world and at GitLab. It is GitLab's belief that Observability should be built natively within the GitLab UI to avoid fracturing capabilities and to ensure a singular UX. As such, we are deprecating GitLab Observability UI, which began life as a fork of Grafana in December 2021.
+
+Much of the GitLab Observability architecture and features were built around the fork of Grafana. As such, this proposal is part of a series of proposals that align us toward achieving the following high level objectives.
+
+## Observability Group Objectives
+
+The following group-level objectives are included for context. **The Objectives below are not completely covered by this design. This design focuses on distributed tracing. More design documents will be created for logging, metrics, auto-monitor, etc.**
+
+**Timeline**: Completion of the following before December 2024
+
+**Objectives**:
+
+- GA of a Complete (Metrics, Logs, Traces) Observability Platform - Add on-by-default setup for tracing, metrics and logging including a GA for the service available on GitLab.com and for self-managed users. A user is able to trace micro-services or distributed systems using open-source tracers. Furthermore, users should be able to set sane defaults for sampling or use advanced techniques such as tail-based sampling.
+
+- Tailored Triage Workflow - Users need to connect the dots between Metrics, Logs, and Spans/Traces. Designing for the discovery, querying and the connection of all telemetry data, regardless of type, will aid users to resolve critical alerts and incidents more quickly.
+
+- Auto Monitor - When a developer starts a new project their application is automatically instrumented, alerts are set up and linked to GitLab alerts management, schedules are created and incidents are created for critical alerts.
+
+### Goals
+
+To release a generally available distributed tracing feature as part of GitLab.com SaaS with a minimum feature set such that it is valuable but can be iterated upon.
+
+Specific goals:
+
+- An HTTPS write API implemented in the [GitLab Observability Backend](https://GitLab.com/GitLab-org/opstrace/opstrace) project, which receives spans sent to GitLab using [OTLP (OpenTelemetry Protocol)](https://opentelemetry.io/docs/specs/otel/protocol/). Users can collect and send distributed traces using either the [OpenTelemetry SDK](https://opentelemetry.io/docs/collector/deployment/no-collector/) or the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) (see the sketch after this list).
+- UI to list and filter/search for traces by ID, service, attributes or time
+- UI to show a detail view of a trace and its corresponding spans
+- Apply sensible ingestion and storage limits per top-level namespace for all GitLab tiers
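+
+For illustration, an application instrumented with the OpenTelemetry SDK could export spans to the write API by pointing the OTLP exporter at it; the endpoint and service name below are placeholders, not the final API contract:
+
+```ruby
+# Sketch: send traces via OTLP over HTTPS using the OpenTelemetry Ruby SDK.
+require 'opentelemetry/sdk'
+require 'opentelemetry/exporter/otlp'
+
+OpenTelemetry::SDK.configure do |c|
+  c.service_name = 'my-service' # placeholder
+  c.add_span_processor(
+    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
+      OpenTelemetry::Exporter::OTLP::Exporter.new(
+        endpoint: 'https://observe.gitlab.com/v1/traces' # placeholder endpoint
+      )
+    )
+  )
+end
+```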
+
+## Timeline
+
+In order to achieve the group objectives, the following timelines must be met for [GitLab phased rollout](https://about.GitLab.com/handbook/product/GitLab-the-product/#experiment-beta-ga) of Tracing.
+
+- **Tracing Experiment Release**: 16.2
+- **Tracing Beta Release**: 16.3
+- **Tracing GA Release**: 16.4
+
+## Proposal
+
+Much of the proposed architecture already exists and is in operation for GitLab.com. Distributed tracing has already been in an internal **Beta** for quite some time and has internal users, with graduation to GA being blocked by UX requirements. These UX requirements resulted in the new UI strategy.
+
+ The following diagram outlines the architecture for GitLab Observability Backend and how clients, including the GitLab UI, will interact with it.
+
+<img src="./arch.png">
+
+### Key Components
+
+- Gatekeeper: Responsible for authentication, authorization and rate limit enforcement on all incoming requests. NGINX-Ingress interacts directly with Gatekeeper.
+- Ingress: NGINX-Ingress is used to handle all incoming requests
+- ClickHouse: ClickHouse is the backing store for all observability data
+- Query Service: A horizontally scalable service that retrieves data from ClickHouse in response to a query
+- GitLab UI: The UI hosted at GitLab.com
+- Redis: An HA Redis cluster for caching GitLab API responses
+
+### Data Ingest
+
+One data ingestion pipeline will be deployed for each top-level GitLab namespace. Currently, we deploy one pipeline _per GitLab Group that enables observability_, and this architecture is now unnecessarily expensive and complex without the presence of the multi-tenant Grafana instances. This multi-tenant ingestion system has the following benefits:
+
+- Beyond rate limits, resource limits can be enforced per user such that no user can consume more system resources (memory, CPU) than allocated.
+- Fine-grained control of horizontal scaling for each user pipeline by adding more OTEL Collector instances
+- Manage the user's tenant in accordance with the GitLab subscription tier, for example, quota, throughput, cold storage, shard to different databases
+- Reduced complexity and enhanced security in the pipeline by leveraging off the shelf components like the [OpenTelemetry Collector](https://opentelemetry.io/docs/concepts/components/#collector) where data within that collector belongs to no more than a single user/customer.
+
+A pipeline is only deployed for the user upon enabling observability in the project settings, in the same way a user can enable error tracking for their project. When observability is enabled for any project in the user's namespace, a pipeline will be deployed. This deployment is automated by our Kubernetes scheduler-operator and tenant-operator. Provisioning is currently managed through the iframe, but a preferred method would be to provision using a RESTful API. The GitLab UI would have a section in project settings that allows a user to "enable observability", much like they can for error tracking today.
+
+The OpenTelemetry Collector is used as the core pipeline implementation for its excellent community development of receivers, processors and exporters. [An exporter for ClickHouse has emerged in the community](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter), which we intend to leverage; it currently has support for OpenTelemetry traces, metrics and logs. This will help accelerate the effort toward ingesting not just traces but also metrics and logs.
+
+### Limits
+
+In addition to the existing CPU and memory limits for each ingest pipeline, the following limits and quotas will also be enforced:
+
+- 100 KB (possibly increased to 1 MB) total ingest rate of traces per second per top-level namespace
+- 30 day data retention
+- TBD GB total storage
+
+All the above limits are subject to change and will be driven by top-level namespace configuration, so scripts and future features can be built to make these more dynamic for each user or subscription tier. This configuration will be part of the tenant-operator custom resource.
+
+The ingest rate limit will utilize the internal Redis cluster to perform a simple, performant [sliding window rate limit similar to Cloudflare's](https://blog.cloudflare.com/counting-things-a-lot-of-different-things/). The code for this will live in Gatekeeper, where a connection to Redis is already managed.
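+
+The sliding window approximation could look roughly like the following Ruby sketch (the actual Gatekeeper implementation, key naming, window, and limits may differ):
+
+```ruby
+require 'redis'
+
+# Sketch of a Cloudflare-style sliding window: estimate the rate by weighting
+# the previous fixed window by how much of the current window has not elapsed.
+class SlidingWindowLimiter
+  WINDOW = 60 # seconds
+
+  def initialize(redis, limit:)
+    @redis = redis
+    @limit = limit
+  end
+
+  def allowed?(namespace_id)
+    now = Time.now.to_i
+    current_window = now / WINDOW
+    elapsed_ratio  = (now % WINDOW) / WINDOW.to_f
+
+    current_key  = "rate:#{namespace_id}:#{current_window}"
+    previous_key = "rate:#{namespace_id}:#{current_window - 1}"
+
+    current = @redis.incr(current_key)
+    @redis.expire(current_key, WINDOW * 2)
+    previous = @redis.get(previous_key).to_i
+
+    previous * (1 - elapsed_ratio) + current <= @limit
+  end
+end
+```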
+
+The data retention and total storage limits will be enforced by a control loop in the tenant-operator that will periodically query ClickHouse and continuously delete the oldest whole day of data until the quota is no longer exceeded. To do this efficiently, it's important that ClickHouse tables are partitioned using `toDate(timestamp)` to partition by day.
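+
+A rough sketch of that control loop, assuming daily partitions and a hypothetical ClickHouse client (the `query`/`execute` methods and the table name are placeholders):
+
+```ruby
+# Hypothetical sketch: drop the oldest daily partition until the tenant is
+# back under its storage quota. Client and table names are placeholders.
+def enforce_storage_quota(clickhouse, quota_bytes:)
+  loop do
+    used_bytes = clickhouse.query(
+      "SELECT sum(bytes_on_disk) FROM system.parts WHERE table = 'traces' AND active"
+    )
+    break if used_bytes <= quota_bytes
+
+    oldest_partition = clickhouse.query(
+      "SELECT min(partition) FROM system.parts WHERE table = 'traces' AND active"
+    )
+
+    # Partitions are whole days because the table is partitioned by toDate(timestamp).
+    clickhouse.execute("ALTER TABLE traces DROP PARTITION '#{oldest_partition}'")
+  end
+end
+```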
+
+### Query API
+
+The query API, backed by the query service, will be a centralized, horizontally scalable component responsible for returning traces/spans back to the UI. A good starting point for this query service may be to leverage the Jaeger query service code and the [Jaeger query service swagger](https://github.com/Jaegertracing/Jaeger-idl/blob/main/swagger/api_v3/query_service.swagger.json). This query service will be extended to include support for metrics and logs in the future and will be queried directly by Vue.js code in the GitLab UI.
+
+The scope of effort for GA would include two APIs:
+
+- `/v1/traces` adhering to [this spec](https://github.com/Jaegertracing/Jaeger-idl/blob/main/swagger/api_v3/query_service.swagger.json#L64)
+- `/v1/traces/{trace_ID}` adhering to [this spec](https://github.com/Jaegertracing/Jaeger-idl/blob/main/swagger/api_v3/query_service.swagger.json#L142)
+
+### Authentication and Authorization
+
+<!-- markdownlint-disable-next-line MD044 -->
+GitLab Observability Backend utilizes an [instance-wide trusted GitLab OAuth](https://docs.GitLab.com/ee/integration/OAuth_provider.html#create-an-instance-wide-application) token to perform a seamless OAuth flow that authenticates the GitLab user against the GitLab Observability Backend (GOB). GOB creates an auth session and stores the session identifier in an HTTP-only, secure cookie. This mechanism has already been examined and approved by AppSec. Now that the Observability UI will be native within the UI hosted at GitLab.com, a few small adjustments must be made for authentication to work against the new UI domain instead of the embedded iframe that we previously relied upon (GitLab.com instead of observe.gitlab.com).
+
+A hidden iframe will be embedded in the GitLab UI only on pages where GOB-authenticated APIs must be consumed. This allows the GitLab.com UI to directly communicate with GOB APIs without the need for an intermediate proxy layer in Rails and without relying on the less secure shared token between proxy and GOB. This iframe will be hidden and its sole purpose is to perform the OAuth flow and assign the HTTP-only secure cookie containing the GOB user session. This flow is seamless and can be fully hidden from the user since it's a **trusted** GitLab OAuth flow. Sessions currently expire after 30 days, which is configurable in the GOB deployment Terraform.
+
+<!-- markdownlint-disable-next-line MD044 -->
+To allow requests from GitLab.com directly to observe.gitlab.com using this method, all requests will have to include `{withCredentials: true}` so that cookies are included. For these read-only APIs that GitLab.com will query for tracing data, we must:
+
+- Configure Ingress with `"nginx.ingress.kubernetes.io/cors-allow-credentials": "true"` and `"nginx.ingress.kubernetes.io/cors-allow-origin": "GitLab.com"`
+- Ensure we have effective CSRF protection enabled in our Gatekeeper component (Gatekeeper is responsible for request authorization)
+
+<!-- markdownlint-disable-next-line MD044 -->
+All requests from GitLab.com will then include the GOB session cookie for observe.gitlab.com to validate. Authorization is handled by the Gatekeeper component, which checks group/project membership against GitLab and handles access appropriately. Anyone with inherited Developer or above membership will have access to the tracing UI for that project.
+
+### Database Schema
+
+[The community-developed OTel exporter for ClickHouse](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter) has already implemented database schemas for storing traces and spans. [This blog post from ClickHouse](https://clickhouse.com/blog/storing-traces-and-spans-open-telemetry-in-clickhouse) further delves into the details of the community-developed exporter, and we intend to use the suggested schema design as a starting point for us to test during the experiment and beta phases. It's recommended to read the blog post to learn more about the schemas and corresponding SQL queries we intend to try.
+
+### UI Design
+
+The new UI will be built using the Pajamas Design System in accordance with GitLab UX design standards. The UI will interact with the GOB query service directly from Vue.js (see the architecture diagram above) by sending a fetch request to the subdomain `observe.gitlab.com/v1/query` with `{withCredentials: true}`. See the Authentication and Authorization section above for more details on how this is enabled.
+
+[**TODO Figma UI designs and commentary**]
+
+## Iterations
+
+16.2
+
+- migrate all resources attached to the Group CR to the Tenant CR
+- [fork and build Clickhouse exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter)
+- add project_ID to all traces/spans
+- gatekeeper: check membership at project level instead of group level
+- basic query service for listing traces (no filtering/searching)
+- implement hidden iframe-based OAuth mechanism (reuse/adapt what was already done for GOUI)
+- UI for traces list
+
+16.3
+
+- filtering/searching query service (by traceID, service, status, duration min/max, start/end time, span attributes)
+- add `read_observability` and `write_observability` scopes to Project access token and support Project access token for writing to project level ingest API
+- provision API
+- remove existing iframe provisioning
+- UI for trace detail
+- UI for filtering/searching traces
+- basic e2e test for provision, send data, query in UI
+- metrics, dashboards, alerts
+
+16.4
+
+- UI settings page to "enable observability" (this would interact with provisioning API)
+- production readiness review
+- documentation complete
+- alter GitLabNamespace CR to only represent a tenant (i.e. top level namespace)
+- delete Group CR and corresponding controller
+- e2e tests that haven't been added yet
+- in cluster smoke test
diff --git a/doc/architecture/blueprints/organization/index.md b/doc/architecture/blueprints/organization/index.md
index be99211754d..2cfaf33ff50 100644
--- a/doc/architecture/blueprints/organization/index.md
+++ b/doc/architecture/blueprints/organization/index.md
@@ -16,23 +16,23 @@ This document is a work in progress and represents the current state of the Orga
## Glossary
-- Organization: An Organization is the umbrella for one or multiple top-level groups. Organizations are isolated from each other by default meaning that cross-namespace features will only work for namespaces that exist in a single Organization.
-- Top-level group: Top-level group is the name given to the topmost group of all other groups. Groups and projects are nested underneath the top-level group.
+- Organization: An Organization is the umbrella for one or multiple top-level Groups. Organizations are isolated from each other by default meaning that cross-Namespace features will only work for Namespaces that exist in a single Organization.
+- Top-level Group: Top-level Group is the name given to the topmost Group of all other Groups. Groups and Projects are nested underneath the top-level Group.
- Cell: A Cell is a set of infrastructure components that contains multiple Organizations. The infrastructure components provided in a Cell are shared among Organizations, but not shared with other Cells. This isolation of infrastructure components means that Cells are independent from each other.
-- User: An Organization has many users. Joining an Organization makes someone a user of that Organization.
-- Member: Adding a user to a group or project within an Organization makes them a member. Members are always users, but users are not necessarily members of a group or project within an Organization. For instance, a user could just have accepted the invitation to join an Organization, but not be a member of any group or project it contains.
-- Non-user: A non-user of an Organization means a user is not part of that specific Organization.
+- User: An Organization has many Users. Joining an Organization makes someone a User of that Organization.
+- Member: Adding a User to a Group or Project within an Organization makes them a Member. Members are always Users, but Users are not necessarily Members of a Group or Project within an Organization. For instance, a User could just have accepted the invitation to join an Organization, but not be a Member of any Group or Project it contains.
+- Non-User: A Non-User of an Organization means a User is not part of that specific Organization.
## Summary
Organizations solve the following problems:
-1. Enables grouping of top-level groups. For example, the following top-level groups would belong to the Organization `GitLab`:
+1. Enables grouping of top-level Groups. For example, the following top-level Groups would belong to the Organization `GitLab`:
1. `https://gitlab.com/gitlab-org/`
1. `https://gitlab.com/gitlab-com/`
-1. Allows different Organizations to be isolated. Top-level groups of the same Organization can interact with each other but not with groups in other Organizations, providing clear boundaries for an Organization, similar to a self-managed instance. Isolation should have a positive impact on performance and availability as things like user dashboards can be scoped to Organizations.
+1. Allows different Organizations to be isolated. Top-level Groups of the same Organization can interact with each other but not with Groups in other Organizations, providing clear boundaries for an Organization, similar to a self-managed instance. Isolation should have a positive impact on performance and availability as things like User dashboards can be scoped to Organizations.
1. Allows integration with Cells. Isolating Organizations makes it possible to allocate and distribute them across different Cells.
-1. Removes the need to define hierarchies. An Organization is a container that could be filled with whatever hierarchy/entity set makes sense (Organization, top-level groups, etc.)
+1. Removes the need to define hierarchies. An Organization is a container that could be filled with whatever hierarchy/entity set makes sense (Organization, top-level Groups, etc.)
1. Enables centralized control of user profiles. With an Organization-specific user profile, administrators can control the user's role in a company, enforce user emails, or show a graphical indicator that a user as part of the Organization. An example could be adding a "GitLab employee" stamp on comments.
1. Organizations bring an on-premise-like experience to SaaS (GitLab.com). The Organization admin will have access to instance-equivalent Admin Area settings with most of the configuration controlled on Organization level.
@@ -43,19 +43,19 @@ Organizations solve the following problems:
The Organization focuses on creating a better experience for Organizations to manage their GitLab experience. By introducing Organizations and [Cells](../cells/index.md) we can improve the reliability, performance and availability of our SaaS Platforms.
- Wider audience: Many instance-level features are admin only. We do not want to lock out users of GitLab.com in that way. We want to make administrative capabilities that previously only existed for self-managed users available to our SaaS users as well. This also means we would give users of GitLab.com more independence from GitLab.com admins in the long run. Today, there are actions that self-managed admins can perform that GitLab.com users have to request from GitLab.com admins.
-- Improved UX: Inconsistencies between the features available at the project and group levels create navigation and usability issues. Moreover, there isn't a dedicated place for Organization-level features.
-- Aggregation: Data from all groups and projects in an Organization can be aggregated.
-- An Organization includes settings, data, and features from all groups and projects under the same owner (including personal namespaces).
-- Cascading behavior: Organization cascades behavior to all the projects and groups that are owned by the same Organization. It can be decided at the Organization level whether a setting can be overridden or not on the levels beneath.
-- Minimal burden on customers: The addition of Organizations should not change existing group and project paths to minimize the impact of URL changes.
+- Improved UX: Inconsistencies between the features available at the Project and Group levels create navigation and usability issues. Moreover, there isn't a dedicated place for Organization-level features.
+- Aggregation: Data from all Groups and Projects in an Organization can be aggregated.
+- An Organization includes settings, data, and features from all Groups and Projects under the same owner (including personal Namespaces).
+- Cascading behavior: Organization cascades behavior to all the Projects and Groups that are owned by the same Organization. It can be decided at the Organization level whether a setting can be overridden or not on the levels beneath.
+- Minimal burden on customers: The addition of Organizations should not change existing Group and Project paths to minimize the impact of URL changes.
### Non-Goals
-Due to urgency of delivering Organizations as a prerequisite for Cells, it is currently not a goal to build Organization functionality on the namespace framework.
+Due to urgency of delivering Organizations as a prerequisite for Cells, it is currently not a goal to build Organization functionality on the Namespace framework.
## Proposal
-We create Organizations as a new lightweight entity, with just the features and workflows which it requires. We already have much of the functionality present in groups and projects, and groups themselves are essentially already the top-level entity. It is unlikely that we need to add significant features to Organizations outside of some key settings, as top-level groups can continue to serve this purpose at least on SaaS.
+We create Organizations as a new lightweight entity, with just the features and workflows which it requires. We already have much of the functionality present in Groups and Projects, and Groups themselves are essentially already the top-level entity. It is unlikely that we need to add significant features to Organizations outside of some key settings, as top-level Groups can continue to serve this purpose at least on SaaS. From an infrastructure perspective, cluster-wide shared data must be both minimal (small in volume) and infrequently written.
```mermaid
graph TD
@@ -72,14 +72,30 @@ Self-managed instances would set a default Organization.
### Benefits
-- No changes to URL's for groups moving under an Organization, which makes moving around top-level groups very easy.
-- Low risk rollout strategy, as there is no conversion process for existing top-level groups.
+- No changes to URL's for Groups moving under an Organization, which makes moving around top-level Groups very easy.
+- Low risk rollout strategy, as there is no conversion process for existing top-level Groups.
- Organization becomes the key for identifying what is part of an Organization, which is likely on its own table for performance and clarity.
### Drawbacks
-- It is unclear right now how we would avoid continuing to spend effort to build instance (or not Organization) features, in particular much of the reporting. This is not an issue on SaaS as top-level groups already have this capability, however it is a challenge on self-managed. If we introduce a built-in Organization (or just none at all) for self-managed, it seems like we would need to continue to build instance/Organization level reporting features as we would not get that for free along with the work to add to groups.
-- Billing may need to be moved from top-level groups to Organization level.
+- It is unclear right now how we would avoid continuing to spend effort to build instance (or not Organization) features, in particular much of the reporting. This is not an issue on SaaS as top-level Groups already have this capability, however it is a challenge on self-managed. If we introduce a built-in Organization (or just none at all) for self-managed, it seems like we would need to continue to build instance/Organization level reporting features as we would not get that for free along with the work to add to Groups.
+- Billing may need to be moved from top-level Groups to Organization level.
+
+## Data Exploration
+
+From an initial [data exploration](https://gitlab.com/gitlab-data/analytics/-/issues/16166#note_1353332877), we retrieved the following information about Users and Organizations:
+
+- For the users that are connected to an organization, the vast majority (98%) are associated with only a single organization. This means we expect about 2% of Users to navigate across multiple Organizations.
+- The majority of Users (78%) are only Members of a single top-level Group.
+- 25% of current top-level Groups can be matched to an organization.
+ - Most of these top-level Groups (83%) are associated with an organization that has more than one top-level Group.
+  - Of the organizations with more than one top-level Group, the median number of top-level Groups is 3.
+ - Most top-level Groups that are matched to organizations with more than one top-level Group are assumed to be intended to be combined into a single organization (82%).
+ - Most top-level Groups that are matched to organizations with more than one top-level Group are using only a single pricing tier (59%).
+- Most of the current top-level Groups are set to public visibility (85%).
+- Less than 0.5% of top-level Groups share Groups with another top-level Group. However, this means we could potentially break 76,000 links between top-level Groups by introducing the Organization.
+
+Based on this analysis we expect to see similar behavior when rolling out Organizations.
## Design and Implementation Details
@@ -88,40 +104,116 @@ Self-managed instances would set a default Organization.
The Organization MVC will contain the following functionality:
- Instance setting to allow the creation of multiple Organizations. This will be enabled by default on GitLab.com, and disabled for self-managed GitLab.
-- Every instance will have a default organization. Initially, all users will be managed by this default Organization.
-- Organization Owner. The creation of an Organization appoints that user as the Organization Owner. Once established, the Organization Owner can appoint other Organization Owners.
-- Organization users. A user is managed by one Organization, but can be part of multiple Organizations. Users are able to navigate between the different Organizations they are part of.
-- Setup settings. Containing the Organization name, ID, description, README, and avatar. Settings are editable by the Organization Owner.
-- Setup flow. Users are able to build an Organization on top of an existing top-level group. New users are able to create an Organization from scratch and to start building top-level groups from there.
-- Visibility. Options will be `public` and `private`. A nonuser of a specific Organization will not see private Organizations in the explore section. Visibility is editable by the Organization Owner.
+- Every instance will have a default Organization. Initially, all Users will be managed by this default Organization.
+- Organization Owner. The creation of an Organization appoints that User as the Organization Owner. Once established, the Organization Owner can appoint other Organization Owners.
+- Organization Users. A User is managed by one Organization, but can be part of multiple Organizations. Users are able to navigate between the different Organizations they are part of.
+- Setup settings. Containing the Organization name, ID, description, and avatar. Settings are editable by the Organization Owner.
+- Setup flow. Users are able to build an Organization on top of an existing top-level Group. New Users are able to create an Organization from scratch and to start building top-level Groups from there.
+- Visibility. Options will be `public` and `private`. A Non-User of a specific Organization will not see private Organizations in the explore section. Visibility is editable by the Organization Owner.
- Organization settings page with the added ability to remove an Organization. Deletion of the default Organization is prevented.
-- Groups. This includes the ability to create, edit, and delete groups, as well as a Groups overview that can be accessed by the Organization Owner.
-- Projects. This includes the ability to create, edit, and delete projects, as well as a Projects overview that can be accessed by the Organization Owner.
+- Groups. This includes the ability to create, edit, and delete Groups, as well as a Groups overview that can be accessed by the Organization Owner.
+- Projects. This includes the ability to create, edit, and delete Projects, as well as a Projects overview that can be accessed by the Organization Owner.
### Organization Access
#### Organization Users
-Organization Users can get access to groups and projects as:
+Organization Users can get access to Groups and Projects as:
-- A group member: this grants access to the group and all its projects, regardless of their visibility.
-- A project member: this grants access to the project, and limited access to parent groups, regardless of their visibility.
-- A non-member: this grants access to public and internal groups and projects of that Organization. To access a private group or project in an Organization, a user must become a member.
+- A Group Member: this grants access to the Group and all its Projects, regardless of their visibility.
+- A Project Member: this grants access to the Project, and limited access to parent Groups, regardless of their visibility.
+- A Non-Member: this grants access to public and internal Groups and Projects of that Organization. To access a private Group or Project in an Organization, a User must become a Member.
Organization Users can be managed in the following ways:
-- As [Enterprise Users](../../../user/enterprise_user/index.md), managed by the Organization. This includes control over their user account and the ability to block the user.
-- As Non-Enterprise Users, managed by the Default Organization. Non-Enterprise Users can be removed from an Organization, but the user keeps ownership of their user account.
+- As [Enterprise Users](../../../user/enterprise_user/index.md), managed by the Organization. This includes control over their User account and the ability to block the User.
+- As Non-Enterprise Users, managed by the Default Organization. Non-Enterprise Users can be removed from an Organization, but the User keeps ownership of their User account.
Enterprise Users are only available to Organizations with a Premium or Ultimate subscription. Organizations on the free tier will only be able to host Non-Enterprise Users.
+##### How do Users join an Organization?
+
+Users are visible across all Organizations. This allows Users to move between Organizations. Users can join an Organization by:
+
+1. Becoming a Member of a Namespace (Group, Subgroup, or Project) contained within an Organization. A User can become a Member of a Namespace by:
+
+ - Being invited by username
+ - Being invited by email address
+ - Requesting access. This requires visibility of the Organization and Namespace and must be accepted by the owner of the Namespace. Access cannot be requested to private Groups or Projects.
+
+1. Becoming an Enterprise User of an Organization. Bringing Enterprise Users to the Organization level is planned post-MVC.
+
+##### When can Users see an Organization?
+
+##### What can Users see in an Organization?
+
+Users can see the things that they have access to in an Organization. For instance, an Organization User would be able to access only the private Groups and Projects that they are a Member of, but could see all public Groups and Projects. Actionable items such as issues, merge requests and the to-do list are seen in the context of the Organization. This means that a User might see 10 merge requests they created in `Organization A`, and 7 in `Organization B`, when in total they have created 17 merge requests across both Organizations.
+
+##### What is a Billable Member?
+
+How Billable Members are defined differs between GitLab's two main offerings:
+
+- Self-managed (SM): [Billable Members are Users who consume seats against the SM License](../../../subscriptions/self_managed/index.md#subscription-seats). Users with custom roles elevated above the Guest role consume seats.
+- GitLab.com (SaaS): [Billable Members are Users who are Members of a Namespace (Group or Project) that consume a seat against the SaaS subscription for the top-level Group](../../../subscriptions/gitlab_com/index.md#how-seat-usage-is-determined). Currently, [Users with Minimal Access](../../../user/permissions.md#users-with-minimal-access) and Users without a Group count towards a licensed seat, but [that's changing](https://gitlab.com/gitlab-org/gitlab/-/issues/330663#note_1133361094).
+
+These differences and how they are calculated and displayed often cause confusion. For both SM and SaaS, we evaluate whether a User consumes a seat against the same core rule set:
+
+1. They are active Users
+1. They are not bot Users
+1. For the Ultimate tier, they are not a Guest
+
+For (1), this is determined differently per offering, both in terms of what classifies as active and in terms of the underlying model that we refer to (User vs Member). A simplified sketch of this shared rule set follows the diagrams below.
+To help demonstrate the various associations used in GitLab relating to Billable Members, here is a relationship diagram:
+
+```mermaid
+graph TD
+ A[Group] <-.type of.- B[Namespace]
+ C[Project] -.belongs to.-> A
+
+ E[GroupMember] <-.type of.- D[Member]
+ G[User] -.has many.-> F
+ F -.belongs to.-> C
+ F[ProjectMember] <-.type of.- D
+ G -.has many.-> E -.belongs to.-> A
+
+ GGL[GroupGroupLink<br/> See note 1] -.belongs_to.->A
+ PGL[ProjectGroupLink<br/> See note 2] -.belongs_to.->A
+ PGL -.belongs_to.->C
+```
+
+GroupGroupLink is the join table between two Group records, indicating that one Group has invited the other.
+ProjectGroupLink is the join table between a Group and a Project, indicating the Group has been invited to the Project.
+
+SaaS has some additional complexity in the relationships that determine whether a User is considered a Billable Member, particularly relating to Group/Project membership, which can often lead to confusion. For example, Members of a Group that have been invited into another Group or Project thereby become billable.
+There are two charts as the flow is different for each: [SaaS](https://mermaid.live/view#pako:eNqNVl1v2jAU_StXeS5M-3hCU6N2aB3SqKbSPkyAhkkuxFsSs9hpVUX899mxYxsnlOWFcH1877nnfkATJSzFaBLtcvaSZKQS8DhdlWCeijGxXBCygCeOFdzSPCfbHOGrRK9Ho2tlvUkEfcZmo97HXBCBG6AcSGuOj86ZA8No_BP5eHQNMz7HYovV8kuGyR-gOx1I3Qd9Ap-31btrtgORITxIPnBXsfoAGcWKVEn2uj4T4Z6pAPdMdKyX8t2mIG-5ex0LkCnBdO4OOrOhO-O3TDQzrkkSkN9izW-BCCUTCB-8hGU866Bl45FxKJ-GdGiDDYI7SOtOp7o0GW90rA20NYjXQxE6cWSaGr1Q2BnX9hCnIbZWc1reJAly3pisMsJ19vKEFiQHfQw5PmMenwqhPQ5Uxa-DjeAa5IJk_g3t-hvdZ8jFA8vxrpYvccfWHIA6aVmrLtMQj2rvuqPynSZYcnx8PWDzlAuZsay3MfouPJxl1c9hKFCIPedzSBuH5fV2X5FDBrT8Zadk2bbszJur_xsp9UznzZRWmIizV-Njx346X9TbPpwoVqO9xobebUZmF3gse0yk9wA-jDBkflTst2TS-EyMTcrTZmGz7hPrkG8HdChdv1n5TAWmGuxHLmXI9qgTza9aO93-TVfnobAh1M6V0VDtuk7E0w313tMUy3Swc_Tyll9VLUwMPcFxUJGBNdKYTTTwY-ByesC_qusx1Yk0bXtao9kk8Snzj8eLsX0lwqV2ujnUE5Bw7FT4g7QbQGM-4YWoXPRZ2C7BnT4TXZPSiAHFUIP3nVhGbiN3G9-OyKWsTvpSS60yMYZA5U_HtyQzdy7p7GCBon65OyXNWJwT9DSNMwF7YB3Xly1o--gqKrAqCE3l359GHa4iuQ8KXEUT-ZrijtS5WEWr8iihpBZs8Vom0WRHco5XUX1IZd9NKZETUxjr8R82ROYl) and [SM](https://mermaid.live/view#pako:eNqFk1FvwiAQx7_KhefVD-CDZo2JNdmcWe3DYpeI7alsLRgKLob0u48qtqxRx9Plz4-7-3NgSCZyJEOyLcRPtqdSwXKScnBLVyhXswrUHiGxMYSsKOimwPHnXwiCYNQAsaIKzXOm2BFh3ShrOGvjujvQghAMPrAaBCOITKRLyu9Rc9FAc6Gu9VPegVELLEKzkOILMwWhUH6yRdhCcWJilEeWXSz5VJzcqrWycWvc830rOmdwnmZ8KoU-vEnXU6-bf6noPmResdzYWxdboHDeAiHBbfqOuqifonX6Ym-CV7g8HfAhfZ0U2-2xUu-iwKm2wdg4BRoJWAUXufZH5JnqH-8ye42YpFCsbGbvRN-Tx7UmunfxqFCfvZfTNeS9AfJESpQlZbn9K6Y5lxL7KUpMydCGOZXfKUl5bTmqlYhPPCNDJTU-EX3IrZEJoztJy4tY_wJJwxFj).
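+
+To make the shared rule set concrete, here is a simplified, hypothetical Ruby sketch; the method and parameter names are illustrative and not taken from the actual GitLab code base:
+
+```ruby
+# Simplified sketch of the shared seat-consumption rules. How "active" is
+# determined, and whether we start from a User or a Member, differs between
+# SM and SaaS; that detail is deliberately omitted here.
+def consumes_seat?(user, ultimate_tier: false)
+  return false unless user.active?             # rule 1: active Users only
+  return false if user.bot?                     # rule 2: bot Users never consume a seat
+  return false if ultimate_tier && user.guest?  # rule 3: Guests are free on the Ultimate tier
+
+  true
+end
+```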
+
+##### How can Users switch between different Organizations?
+
+Users can use a [context switcher](https://gitlab.com/gitlab-org/gitlab/-/issues/411637). This feature allows easy navigation and access to different Organizations' content and settings. By clicking on the context switcher and selecting a specific Organization from the provided list, Users can seamlessly transition their view and permissions, enabling them to interact with the resources and functionalities of the chosen Organization.
+
+##### What happens when a User is deleted?
+
+We've identified three different scenarios where a User can be removed from an Organization:
+
+1. Removal: The User is removed from the `organization_users` table. This is similar to the User leaving a company, but the User can join the Organization again after access approval.
+1. Banning: The User is banned. This can happen in case of misconduct, but the User cannot be added again to the Organization until they are unbanned. In this case, we keep the `organization_users` entry and change the permission to none.
+1. Deleting: The User is deleted. We assign everything the User has authored to the Ghost User and delete the entry from the `organization_users` table.
+
+As part of the Organization MVC, Organization Owners can remove Organization Users. This means that the User's membership entries are deleted from all Groups and Projects that are contained within the Organization. In addition, the User entry is removed from the `organization_users` table.
+
+Actions such as banning and deleting a User will be added to the Organization at a later point.
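+
+As a hypothetical sketch, the three scenarios above could map onto the `organization_users` table as follows; the model, column, and helper names are illustrative and not the actual schema:
+
+```ruby
+# Illustrative only; not the real GitLab models or columns.
+def handle_user_exit(organization, user, scenario)
+  membership = OrganizationUser.find_by!(organization: organization, user: user)
+
+  case scenario
+  when :removal  # entry deleted; the User can join again after access approval
+    membership.destroy
+  when :banning  # entry kept; permission changed to none
+    membership.update!(access_level: 'none')
+  when :deleting # authored records go to the Ghost User, then the entry is deleted
+    reassign_authored_records_to_ghost_user(user) # hypothetical helper
+    membership.destroy
+  end
+end
+```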
+
#### Organization Non-Users
-Non-users are external to the Organization and can only access the public resources of an Organization, such as public projects.
+Non-Users are external to the Organization and can only access the public resources of an Organization, such as public Projects.
### Routing
-Today only users, projects, namespaces and container images are considered routable entities which require global uniqueness on `https://gitlab.com/<path>/-/`. Initially, Organization routes will be [unscoped](../../../development/routing.md). Organizations will follow the path `https://gitlab.com/-/organizations/org-name/` as one of the design goals is that the addition of Organizations should not change existing group and project paths.
+Today, only Users, Projects, Namespaces, and container images are considered routable entities that require global uniqueness on `https://gitlab.com/<path>/-/`. Initially, Organization routes will be [unscoped](../../../development/routing.md). Organizations will follow the path `https://gitlab.com/-/organizations/org-name/`, as one of the design goals is that the addition of Organizations should not change existing Group and Project paths.
+
+### Impact of the Organization on Other Features
+
+We want a minimal number of infrequently written tables in the shared database. If we have a high write volume or large amounts of data in the shared database, it can become a single bottleneck for scaling, and we lose the horizontal scalability objective of Cells.
## Iteration Plan
@@ -129,48 +221,58 @@ The following iteration plan outlines how we intend to arrive at the Organizatio
### Iteration 1: Organization Prototype (FY24Q2)
-In iteration 1, we introduce the concept of an Organization as a way to group top-level groups together. Support for Organizations does not require any [Cells](../cells/index.md) work, but having them will make all subsequent iterations of Cells simpler. The goal of iteration 1 will be to generate a prototype that can be used by GitLab teams to test moving functionality to the Organization. It contains everything that is necessary to move an Organization to a Cell:
+In iteration 1, we introduce the concept of an Organization as a way to group top-level Groups together. Support for Organizations does not require any [Cells](../cells/index.md) work, but having them will make all subsequent iterations of Cells simpler. The goal of iteration 1 will be to generate a prototype that can be used by GitLab teams to test moving functionality to the Organization. It contains everything that is necessary to move an Organization to a Cell:
- The Organization can be named, has an ID and an avatar.
-- Only non-enterprise user can be part of an Organization.
-- A user can be part of multiple Organizations.
+- Only Non-Enterprise Users can be part of an Organization.
+- A User can be part of multiple Organizations.
- A single Organization Owner can be assigned.
- Groups can be created in an Organization. Groups are listed in the Groups overview.
- Projects can be created in a Group. Projects are listed in the Projects overview.
### Iteration 2: Organization MVC Experiment (FY24Q3)
-In iteration 2, an Organization MVC Experiment will be released. We will test the functionality with a select set of customers and improve the MVC based on these learnings. Users will be able to build an Organization on top of their existing top-level group.
+In iteration 2, an Organization MVC Experiment will be released. We will test the functionality with a select set of customers and improve the MVC based on these learnings. Users will be able to build an Organization on top of their existing top-level Group.
-- The Organization has a description and a README.
+- The Organization has a description.
+- Organizations can be deleted.
+- Users can navigate between the different Organizations they are part of.
### Iteration 3: Organization MVC Beta (FY24Q4)
In iteration 3, the Organization MVC Beta will be released.
- Multiple Organization Owners can be assigned.
-- Enterprise users can be added to an Organization.
+- Organization Owners can change the visibility of an Organization between `public` and `private`. A Non-User of a specific Organization will not see private Organizations in the explore section.
### Iteration 4: Organization MVC GA (FY25Q1)
+In iteration 4, the Organization MVC will be rolled out.
+
### Post-MVC Iterations
After the initial rollout of Organizations, the following functionality will be added to address customer needs relating to their implementation of GitLab:
1. Internal visibility will be made available on Organizations that are part of GitLab.com.
-1. Move billing from top-level group to Organization.
+1. Enterprise Users will be made available at the Organization level.
+1. Organizations are able to ban and delete Users.
+1. Projects can be created from the Organization-level Projects overview.
+1. Groups can be created from the Organization-level Groups overview.
+1. Move billing from top-level Group to Organization.
1. Audit events at the Organization level.
-1. Set merge request approval rules at the Organization level and cascade to all groups and projects.
+1. Set merge request approval rules at the Organization level and cascade to all Groups and Projects.
1. Security policies at the Organization level.
-1. Vulnerability reports at the Organization level.
+1. Vulnerability Report and Dependency List at the Organization level.
1. Cascading Organization setting to enforce security scans.
1. Scan result policies at the Organization level.
1. Compliance frameworks.
1. [Support the agent for Kubernetes sharing at the Organization level](https://gitlab.com/gitlab-org/gitlab/-/issues/382731).
+## Organization Rollout
+
## Alternative Solutions
-An alternative approach to building Organizations is to convert top-level groups into Organizations. The main advantage of this approach is that features could be built on top of the namespace framework and therewith leverage functionality that is already available at the group level. We would avoid building the same feature multiple times. However, Organizations have been identified as a critical driver of Cells. Due to the urgency of delivering Cells, we decided to opt for the quickest and most straightforward solution to deliver an Organization, which is the lightweight design described above. More details on comparing the two Organization proposals can be found [here](https://gitlab.com/gitlab-org/tenant-scale-group/group-tasks/-/issues/56).
+An alternative approach to building Organizations is to convert top-level Groups into Organizations. The main advantage of this approach is that features could be built on top of the Namespace framework and thereby leverage functionality that is already available at the Group level. We would avoid building the same feature multiple times. However, Organizations have been identified as a critical driver of Cells. Due to the urgency of delivering Cells, we decided to opt for the quickest and most straightforward solution to deliver an Organization, which is the lightweight design described above. More details on comparing the two Organization proposals can be found [here](https://gitlab.com/gitlab-org/tenant-scale-group/group-tasks/-/issues/56).
## Decision Log
diff --git a/doc/architecture/blueprints/rate_limiting/index.md b/doc/architecture/blueprints/rate_limiting/index.md
index db6bb85d839..f42a70aa97a 100644
--- a/doc/architecture/blueprints/rate_limiting/index.md
+++ b/doc/architecture/blueprints/rate_limiting/index.md
@@ -108,7 +108,7 @@ quota and by a policy.
- _Example:_ maximum artifact upload size of 1 GB
- **Quota:** A global constraint in application usage that is aggregated across an
entire namespace over the duration of their billing cycle.
- - _Example:_ 400 units of compute per namespace per month
+ - _Example:_ 400 compute minutes per namespace per month
- _Example:_ 10 GB transfer per namespace per month
- **Policy:** A representation of business logic that is decoupled from application
code. Decoupled policy definitions allow logic to be shared across multiple services
diff --git a/doc/architecture/blueprints/remote_development/index.md b/doc/architecture/blueprints/remote_development/index.md
index 16b71840f9e..ce55f23f828 100644
--- a/doc/architecture/blueprints/remote_development/index.md
+++ b/doc/architecture/blueprints/remote_development/index.md
@@ -215,7 +215,7 @@ end note
note top of "Load Balancer IP"
For local development,
it includes all local loopback interfaces
- e.g. 127.0.0.1, 172.16.123.1, 192.168.0.1, etc.
+ for example, 127.0.0.1, 172.16.123.1, 192.168.0.1, etc.
end note
@enduml
@@ -439,7 +439,7 @@ Stopped -up-> Starting : status=Starting
Terminated: Workspace has been deleted
-Failed: Workspace is not ready due to\nvarious reasons(e.g. crashing container)
+Failed: Workspace is not ready due to\nvarious reasons (for example, crashing container)
Failed -up-> Starting : status=Starting\n(container\nnot crashing)
Failed -right-> Stopped : status=Stopped
Failed -down-> Terminated : status=Terminated
diff --git a/doc/architecture/blueprints/repository_backups/index.md b/doc/architecture/blueprints/repository_backups/index.md
new file mode 100644
index 00000000000..afd86e4979c
--- /dev/null
+++ b/doc/architecture/blueprints/repository_backups/index.md
@@ -0,0 +1,268 @@
+---
+status: proposed
+creation-date: "2023-04-26"
+authors: [ "@proglottis" ]
+coach: "@DylanGriffith"
+approvers: []
+owning-stage: "~devops::systems"
+participating-stages: []
+---
+
+<!-- Blueprints often contain forward-looking statements -->
+<!-- vale gitlab.FutureTense = NO -->
+
+# Repository Backups
+
+<!-- For long pages, consider creating a table of contents. The `[_TOC_]`
+function is not supported on docs.gitlab.com. -->
+
+## Summary
+
+This proposal seeks to provide an out-of-the-box repository backup solution to
+GitLab that gives more opportunities to apply Gitaly specific optimisations. It
+will do this by moving repository backups out of `backup.rake` into a
+coordination worker that enumerates repositories and makes per-repository
+decisions to trigger repository backups that are streamed directly from Gitaly
+to object-storage.
+
+The advantages of this approach are:
+
+- The backups are only transferred once, from the Gitaly hosting the physical
+ repository to object-storage.
+- Smarter decisions can be made by leveraging specific repository access
+ patterns.
+- Distributes backup and restore load.
+- Since the entire process is run within Gitaly existing monitoring can be
+ used.
+- Provides architecture for future WAL archiving and other optimisations.
+
+This should relieve the major pain points of the existing two strategies:
+
+- `backup.rake` - Repository backups are streamed from outside of Gitaly using
+ RPCs and stored in a single large tar file. Due to the amount of data
+ transferred these backups are limited to small installations.
+- Snapshots - Cloud providers allow taking physical storage snapshots. These
+  are not an out-of-the-box solution as they are specific to the cloud provider.
+
+## Motivation
+
+### Goals
+
+- Improve time to create and restore repository backups.
+- Improve monitoring of repository backups.
+
+### Non-Goals
+
+- Improving filesystem based snapshots.
+
+### Filesystem based Snapshots
+
+Snapshots rely on cloud platforms to be able to take physical snapshots of the
+disks that Gitaly and Praefect use to store data. While never officially
+recommended, this strategy tends to be used once creating or restoring backups
+using `backup.rake` takes too long.
+
+Gitaly and Git use lock files and fsync in order to prevent repository
+corruption from concurrent processes and partial writes from a crash. This
+generally means that if a file is written, then it will be valid. However,
+because Git repositories are composed of many files and many write operations
+may be taking place, it would be impossible to schedule a snapshot while no
+file operations are ongoing. This means the consistency of a snapshot cannot be
+guaranteed and restoring from a snapshot backup may require manual
+intervention.
+
+[WAL](https://gitlab.com/groups/gitlab-org/-/epics/8911) may improve crash
+resistance and so improve automatic recovery from snapshots, but each
+repository will likely still require a majority of voting replicas in sync.
+
+Because each node in a Gitaly Cluster is not homogeneous, depending on the
+replication factor, creating a complete snapshot backup requires taking
+snapshots of all nodes. This means that snapshot backups have a lot
+of repository data duplication.
+
+Snapshots are heavily dependent on the cloud provider and so they would not
+provide an out-of-the-box experience.
+
+### Downtime
+
+An ideal repository backup solution would allow both backup and restore
+operations to be done online. Specifically, we would not want to shut down or
+pause writes to ensure that each node/repository is consistent.
+
+### Consistency
+
+Consistency in repository backups means:
+
+- That the Git repositories are valid after restore. There are no partially
+ applied operations.
+- That all repositories in a cluster are healthy after restore, or are made
+ healthy automatically.
+
+Backups without consistency may result in data-loss or require manual
+intervention on restore.
+
+Both types of consistency are difficult to achieve using snapshots as this
+requires that snapshots of the filesystems on multiple hosts are taken
+synchronously and without repositories on any of those hosts currently being
+mutated.
+
+### Distribute Work
+
+We want to distribute the backup/restore work such that it isn't bottlenecked
+on the machine running `backup.rake`, a single Gitaly node, or a single network
+connection.
+
+On backup, `backup.rake` aggregates all repository backups onto its local
+filesystem. This means that all repository data needs to be streamed from
+Gitaly (possibly via Praefect) to where the Rake task is being run. If this is
+CNG then it also requires a large volume on Kubernetes. The resulting backup
+tar file then gets transferred to object storage. A similar process happens on
+restore: the entire tar file needs to be downloaded and extracted on the local
+filesystem, even for a partial restore when restoring a subset of repositories.
+Effectively all repository data gets transferred, in full, multiple times
+between multiple hosts.
+
+If each Gitaly could directly upload backups, it would mean transferring
+repository data only a single time, reducing the number of hosts and so the
+amount of data transferred overall.
+
+### Gitaly Controlled
+
+Gitaly is looking to become self-contained and so should own its backups.
+
+`backup.rake` currently determines which repositories to back up and where those
+backups are stored. This restricts the kind of optimisations that Gitaly could
+apply and adds development/testing complexity.
+
+### Monitoring
+
+`backup.rake` is run in a variety of different environments. Historically
+backups from Gitaly's perspective are a series of disconnected RPC calls. This
+has resulted in backups having almost zero monitoring. Ideally the process
+would run within Gitaly such that the process could be monitored using existing
+metrics and log scraping.
+
+### Automatic Backups
+
+When `backup.rake` is set up on cron, it can be difficult to tell if it has been
+running successfully, if it is still running, how long it took, and how much
+space it has taken. It is difficult to ensure that cron always has access to
+the previous backup to allow for incremental backups or to determine if
+updating the backup is required at all.
+
+Having a coordination process running continuously will allow moving from a
+single-shot backup strategy to one where each repository determines its own
+backup schedule based on usage patterns and priority. This way each repository
+should be able to have a reasonably up-to-date backup without adding excess
+load to any Gitaly node.
+
+### Updated Repositories Only
+
+`backup.rake` packages all repository backups into a tar file and generally has
+no access to the previous backup. This makes it difficult to determine if the
+repository has changed since last backup.
+
+Having access to previous backups on object-storage would mean that Gitaly
+could more easily determine if a backup needs to be taken at all. This allows
+us to waste less time backing up repositories that are no longer being
+modified.
+
+### Point-in-time Restores
+
+There should be a mechanism by which a set of repositories can be restored to a
+specific point in time. The identifier (backup ID) used should be able to be
+determined by an admin and apply to all repositories.
+
+### WAL (write ahead log)
+
+We want to be able to provide infrastructure to allow continuous archiving of
+the WAL. This means providing a central place to stream the archives to and
+being able to match any full backup to a place in the log such that
+repositories can be restored from the full backup, and the WAL applied up to a
+specific point in time.
+
+### WORM
+
+Any Gitaly-accessible storage should be WORM (write once, read many) to
+prevent existing backups from being modified if an attacker gains access
+to a node's object-storage credentials.
+
+[The pointer layout](https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/gitaly-backup.md#pointer-layout)
+currently used by repository backups relies on being able to overwrite the
+pointer files, and as such would not be suitable for use on a WORM file store.
+
+WORM is likely object-storage provider specific:
+
+- [AWS object lock](https://aws.amazon.com/blogs/storage/protecting-data-with-amazon-s3-object-lock/)
+- [Google Cloud WORM retention policy](https://cloud.google.com/blog/products/storage-data-transfer/protecting-cloud-storage-with-worm-key-management-and-more-updates).
+- [MinIO object lock](https://min.io/docs/minio/linux/administration/object-management/object-retention.html)
+
+### `bundle-uri`
+
+Having direct access to backup data may open the door for clone/fetch transfer
+optimisations using bundle-uri. This allows us to point Git clients directly to
+a bundle file instead of transferring packs from the repository itself. The
+bulk repository transfer can then be faster and is offloaded to a plain HTTP
+server, rather than the Gitaly servers.
+
+## Proposal
+
+The proposal is broken down into an initial MVP and per-repository coordinator.
+
+### MVP
+
+The goal of the MVP is to validate that moving backup processing server-side
+will improve the worst-case, total-loss scenario. That is, reduce the total
+time to create and restore a full backup.
+
+The MVP will introduce backup and restore repository RPCs. There will be no
+coordination worker. The RPCs will stream a backup directly from the
+called Gitaly node to object storage. These RPCs will be called from
+`backup.rake` via the `gitaly-backup` tool. `backup.rake` will no longer
+package repository backups into the backup archive.
+
+This work is already underway, tracked by the [Server-side Backups MVP epic](https://gitlab.com/groups/gitlab-org/-/epics/10077).
+
+### Per-Repository Coordinator
+
+Instead of taking a backup of all repositories at once via `backup.rake`, a
+backup coordination worker will be created. This worker will periodically
+enumerate all repositories to decide if a backup needs to be taken. These
+decisions could be determined by usage patterns or priority of the repository.
+
+When restoring, since each repository will have a different backup state, a
+timestamp will be provided by the user. This timestamp will be used to
+determine which backup to restore for each repository. Once WAL archiving is
+implemented, the WAL could then be replayed up to the given timestamp.
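+
+As a minimal sketch, assuming each repository exposes a list of backups with
+their creation times (the names below are illustrative), the per-repository
+restore decision could look like:
+
+```ruby
+# Pick the most recent backup taken at or before the user-supplied timestamp,
+# independently for each repository.
+def backup_to_restore(backups, requested_time)
+  backups
+    .select { |backup| backup.created_at <= requested_time }
+    .max_by(&:created_at)
+end
+```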
+
+This wider effort is tracked in the [Server-side Backups epic](https://gitlab.com/groups/gitlab-org/-/epics/10826).
+
+## Design and implementation details
+
+### MVP
+
+There will be a pair of RPCs `BackupRepository` and `RestoreRepository`. These
+RPCs will synchronously create/restore backups directly onto object storage.
+`backup.rake` will continue to use `gitaly-backup` with a new `--server-side`
+flag. Each Gitaly will need a backup configuration to specify the
+object-storage service to use.
+
+Initially the structure of the backups in object-storage will be the same as
+the existing [pointer layout](https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/gitaly-backup.md#pointer-layout).
+
+For the MVP, the backup ID must match an exact backup ID on object-storage.
+
+The configuration of object-storage will be controlled by a new config
+`config.backup.go_cloud_url`. The [Go Cloud Development Kit](https://gocloud.dev)
+tries to use a provider-specific way to configure authentication. This can be
+inferred from the VM or from environment variables.
+See [Supported Storage Services](https://gocloud.dev/howto/blob/#services).
+
+## Alternative Solutions
+
+<!--
+It might be a good idea to include a list of alternative solutions or paths considered, although it is not required. Include pros and cons for
+each alternative solution/path.
+
+"Do nothing" and its pros and cons could be included in the list too.
+-->
diff --git a/doc/architecture/blueprints/runner_admission_controller/index.md b/doc/architecture/blueprints/runner_admission_controller/index.md
new file mode 100644
index 00000000000..d73ffb21ef3
--- /dev/null
+++ b/doc/architecture/blueprints/runner_admission_controller/index.md
@@ -0,0 +1,241 @@
+---
+status: proposed
+creation-date: "2023-03-07"
+authors: [ "@ajwalker" ]
+coach: [ "@ayufan" ]
+approvers: [ "@DarrenEastman", "@engineering-manager" ]
+owning-stage: "~devops::<stage>"
+participating-stages: []
+---
+
+# GitLab Runner Admissions Controller
+
+The GitLab `admission controller` (inspired by the [Kubernetes admission controller concept](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)) is a proposed technical solution to intercept jobs before they're persisted or added to the build queue for execution.
+
+An admission controller can be registered to the GitLab instance and receive a payload containing jobs to be created. Admission controllers can be _mutating_, _validating_, or both.
+
+- When _mutating_, mutable job information can be modified and sent back to the GitLab instance. Jobs can be modified to conform to organizational policy or security requirements, or can have, for example, their tag list modified so that they're routed to specific runners.
+- When _validating_, a job can be denied execution.
+
+## Motivation
+
+To comply with the segregation of duties, organizational policy, or security requirements, customers in financial services, the US federal government market segment, or other highly regulated industries must ensure that only authorized users can use runners associated with particular CI job environments.
+
+In this context, the term environment is not equivalent to the definition used in the GitLab CI environments and deployments documentation. Using the definition from the [SLSA guide](https://slsa.dev/spec/v0.1/terminology), an environment is the "machine, container, VM, or similar in which the job runs."
+
+An additional requirement comes from the Remote Computing Enablement (RCE) group at [Lawrence Livermore National Laboratory](https://hpc.llnl.gov/). In this example, users must have a user ID on the target Runner CI build environment for the CI job to run. To simplify administration across the entire user base, RCE needs to be able to associate a Runner with a GitLab user entity.
+
+### Current GitLab CI job handling mechanism
+
+Before going further, it is helpful to level-set the current job handling mechanism in GitLab CI and GitLab Runners.
+
+- First, a runner associated with a GitLab instance continuously queries the GitLab instance API to check if there is a new job that it could run.
+- With every push to a project repository on GitLab with a `.gitlab-ci.yml` file present, the CI service present on the GitLab instance catches the event and triggers a new CI job.
+- The CI job enters a pending state in the queue until a Runner requests a job from the instance.
+- On the request from a runner to the API for a job, the database is queried to verify that the job parameters match those of the runner. In other words, when runners poll a GitLab instance for a job to execute, they're assigned a job if it matches the specified criteria.
+- If the job matches the runner in question, then the GitLab instance connects the job to the runner and changes the job state to running. In other words, GitLab connects the `job` object with the `Runner` object.
+- A runner can be configured to run un-tagged jobs. Tags are the primary mechanism used today to enable customers to have some control of which Runners run certain types of jobs.
+- So while runners are scoped to the instance, group, or project, there are no additional access control mechanisms today that can easily be expanded on to deny access to a runner based on a user or group identifier.
+
+The current CI jobs queue logic is as follows. **Note:** In the code we still use the very old `build` naming construct, but we've migrated from `build` to `job` in the product and documentation.
+
+```ruby
+jobs =
+ if runner.instance_type?
+ jobs_for_shared_runner
+ elsif runner.group_type?
+ jobs_for_group_runner
+ else
+ jobs_for_project_runner
+ end
+
+# select only jobs that have tags known to the runner
+jobs = jobs.matches_tag_ids(runner.tags.ids)
+
+# select builds that have at least one tag if required
+unless runner.run_untagged?
+ jobs = jobs.with_any_tags
+end
+
+```
+
+## Goals
+
+- Implement an initial solution that provides an easy-to-configure and use mechanism to `allow`, `deny` or `redirect` CI job execution on a specific runner entity based on some basic job details (like user, group or project membership).
+
+## Non-Goals
+
+- A re-design of the CI job queueing mechanism is not in the scope of this blueprint.
+
+## Proposal
+
+Implement a mechanism, `admission controllers`, to intercept CI jobs, allowing them to either mutate jobs, validate them or do both. An admission controller is a mutating webhook that can modify the CI job or reject the job according to a policy. The webhook is called before the job is inserted into the CI jobs queue.
+
+### Guiding principles
+
+- The webhook payload schema will be part of our public facing APIs.
+- We must maintain backwards compatibility when extending the webhook payload.
+- Controllers should be idempotent.
+
+### How will the admissions controller work?
+
+**Scenario 1**: I want to deny access to a certain runner.
+
+1. Configure an admissions controller to only accept jobs from specific projects.
+1. When a job is created the `project information` (`project_id`, `job_id`, `api_token`) will be used to query GitLab for specific details.
+1. If the `project information` matches the allow list, then the job payload is not modified and the job is able to run on the target runner.
+1. If the `project information` does not match the allow list, then the job payload is not modified and the job is dropped.
+1. The job tags are not changed.
+1. Admission controller may optionally send back an arbitrary text description of why a decline decision was made.
+
+**Scenario 2**: Large runner fleet using a common configuration and tags.
+
+Each runner has a tag such as `zone_a` or `zone_b`. In this scenario, the customer does not know where a specific job can run, as some users have access to `zone_a` and some to `zone_b`. The customer does not want to fail a job that should run in `zone_a`, but instead redirect the job if it is not correctly tagged to run in `zone_a`. A sketch covering both scenarios follows the list below.
+
+1. Configure an admissions controller to mutate jobs based on `user_id`.
+1. When a job is created the `project information` (`project_id`, `job_id`, `api_token`) will be used to query GitLab for specific details.
+1. If the `user_id` matches, then the admissions controller modifies the job tag list. `zone_a` is added to the tag list because the controller has detected that the user triggering the job should have their jobs run in `zone_a`.
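+
+A minimal, hypothetical sketch of a controller covering both scenarios as a small Sinatra webhook; the endpoint path, allow list, and zone mapping are illustrative assumptions, not part of this proposal:
+
+```ruby
+# Illustrative sketch only. The payload fields mirror the example request
+# shown below; the allow list and zone mapping are made-up data.
+require 'sinatra'
+require 'json'
+
+ALLOWED_PROJECT_IDS = [123, 245].freeze
+ZONE_BY_USER_ID = { 98123 => 'zone_a' }.freeze
+
+post '/admission' do
+  jobs = JSON.parse(request.body.read)
+
+  responses = jobs.map do |job|
+    project_id = job['variables']['CI_PROJECT_ID']
+    user_id = job['variables']['GITLAB_USER_ID']
+
+    if !ALLOWED_PROJECT_IDS.include?(project_id)
+      # Scenario 1: validating - deny jobs from projects outside the allow list.
+      { id: job['id'], admission: 'rejected', reason: 'project not on allow list' }
+    elsif (zone = ZONE_BY_USER_ID[user_id])
+      # Scenario 2: mutating - retag the job so it is routed to the user's zone.
+      { id: job['id'], admission: 'accepted',
+        mutations: { tags: job['tags'] | [zone] },
+        reason: "retagged for #{zone}" }
+    else
+      { id: job['id'], admission: 'accepted' }
+    end
+  end
+
+  content_type :json
+  responses.to_json
+end
+```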
+
+### MVC
+
+#### Admission controller
+
+1. A single admission controller can be registered at the instance level only.
+1. The admission controller must respond within 30 seconds.
+1. The admission controller will receive an array of individual jobs. These jobs may or may not be related to each other. The response must contain only responses to the jobs made as part of the request.
+
+#### Job Lifecycle
+
+1. The lifecycle of a job will be updated to include a new `validating` state.
+
+ ```mermaid
+ stateDiagram-v2
+ created --> validating
+ state validating {
+ [*] --> accept
+ [*] --> reject
+ }
+ reject --> failed
+ accept --> pending
+ pending --> running: picked by runner
+ running --> executed
+ state executed {
+ [*] --> failed
+ [*] --> success
+ [*] --> canceled
+ }
+ executed --> created: retry
+ ```
+
+1. When the state is `validating`, the mutating webhook payload is sent to the admission controller.
+1. If the webhook times out (30 seconds), the status of the affected jobs should be set as though the admission was denied. This should be rare in typical circumstances.
+1. Jobs with denied admission can be retried. Retried jobs will be resent to the admission controller along with any mutations that they received previously.
+1. [`allow_failure`](../../../ci/yaml/index.md#allow_failure) should be updated to support jobs that fail on denied admissions, for example:
+
+ ```yaml
+ job:
+ script:
+ - echo "I will fail admission"
+ allow_failure:
+ on_denied_admission: true
+ ```
+
+1. The UI should be updated to display the reason for any job mutations (if provided).
+1. A table in the database should be created to store the mutations. Any changes that were made, like tags, should be persisted and attached to `ci_builds` with `acts_as_taggable :admission_tags`.
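+
+   A hypothetical migration sketch for such a table follows; the table and column names are placeholders rather than an agreed-upon schema:
+
+   ```ruby
+   # Placeholder schema only; names and columns are illustrative.
+   class CreateCiJobAdmissionMutations < ActiveRecord::Migration[7.0]
+     def change
+       create_table :ci_job_admission_mutations do |t|
+         t.references :build, null: false       # the mutated ci_builds record
+         t.string :admission_state, null: false # accepted or denied
+         t.jsonb :mutated_tags, default: []     # tag list applied by the controller
+         t.text :reason                         # optional reason from the controller
+         t.timestamps
+       end
+     end
+   end
+   ```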
+
+#### Payload
+
+1. The payload is comprised of individual job entries consisting of:
+ - Job ID.
+ - [Predefined variables](../../../ci/variables/predefined_variables.md)
+ - Job tag list.
+1. The response payload is comprised of individual job entries consisting of:
+ - Job ID.
+ - Admission state: `accepted` or `denied`.
+ - Mutations: Only `tags` is supported for now. The tags provided replaces the original tag list.
+ - Reason: A controller can provide a reason for admission and mutation.
+
+##### Example request
+
+```json
+[
+ {
+ "id": 123,
+ "variables": {
+ # predefined variables: https://docs.gitlab.com/ee/ci/variables/predefined_variables.html
+ "CI_PROJECT_ID": 123,
+ "CI_PROJECT_NAME": "something",
+ "GITLAB_USER_ID": 98123,
+ ...
+ },
+ "tags": [ "docker", "windows" ]
+ },
+ {
+ "id": 245,
+ "variables": {
+ "CI_PROJECT_ID": 245,
+ "CI_PROJECT_NAME": "foobar",
+ "GITLAB_USER_ID": 98123,
+ ...
+ },
+ "tags": [ "linux", "eu-west" ]
+ },
+ {
+ "id": 666,
+ "variables": {
+ "CI_PROJECT_ID": 666,
+ "CI_PROJECT_NAME": "do-bad-things",
+ "GITLAB_USER_ID": 98123,
+ ...
+ },
+ "tags": [ "secure-runner" ]
+ },
+]
+```
+
+##### Example response
+
+```json
+[
+ {
+ "id": 123,
+ "admission": "accepted",
+ "reason": "it's always-allow-day-wednesday"
+ },
+ {
+ "id": 245,
+ "admission": "accepted",
+ "mutations": {
+ "tags": [ "linux", "us-west" ]
+ },
+ "reason": "user is US employee: retagged region"
+ },
+ {
+ "id": 666,
+ "admission": "rejected",
+ "reason": "you have no power here"
+ },
+]
+```
+
+### MVC +
+
+1. Multiple admissions controllers on groups and project levels.
+1. Passing the job definition through a chain of controllers (starting at the project, through all defined group controllers, up to the instance controller).
+1. Each level gets the definition modified by the previous controller in the chain and makes decisions based on the current state.
+1. Modification reasons, if reported by multiple controllers, are concatenated.
+1. Usage of the admission controller is optional, so we can have a chain containing project+instance, project+group+parent group+instance, project+group, group+instance, and so on.
+
+### Implementation Details
+
+1. [placeholder for steps required to code the admissions controller MVC]
+
+## Technical issues to resolve
+
+| Issue | Resolution |
+| ------ | ------ |
+| We may have conflicting tag-sets, as a mutating controller makes it possible to define AND, OR, and NONE logical definitions of tags. This can get quite complex quickly. | |
+| Rule definition for the queue webhook. | |
+| What data to send to the admissions controller? Is it a subset or all of the [predefined variables](../../../ci/variables/predefined_variables.md)? | |
+| Is the `queueing web hook` able to run at GitLab.com scale? On GitLab.com we would trigger millions of webhooks per second and the concern is that this would overload Sidekiq or be used to abuse the system. | |
diff --git a/doc/architecture/blueprints/runner_scaling/index.md b/doc/architecture/blueprints/runner_scaling/index.md
index de1203843aa..a6df58aa405 100644
--- a/doc/architecture/blueprints/runner_scaling/index.md
+++ b/doc/architecture/blueprints/runner_scaling/index.md
@@ -232,12 +232,12 @@ coupled in the current implementation so we will break them out here to consider
them each separately.
- **Virtual Machine (VM) shape**. The underlying provider of a VM requires configuration to
- know what kind of machine to create. E.g. Cores, memory, failure domain,
+ know what kind of machine to create. For example, Cores, memory, failure domain,
etc... This information is very provider specific.
- **VM lifecycle management**. Multiple machines will be created and a
system must keep track of which machines belong to this executor. Typically
a cloud provider will have a way to manage a set of homogeneous machines.
- E.g. GCE Instance Group. The basic operations are increase, decrease and
+ For example, GCE Instance Group. The basic operations are increase, decrease and
usually delete a specific machine.
- **VM autoscaling**. In addition to low-level lifecycle management,
job-aware capacity decisions must be made to the set of machines to provide
@@ -255,7 +255,7 @@ See also Glossary below.
#### Current state
The current architecture has several points of coupling between concerns.
-Coupling reduces opportunities for abstraction (e.g. community supported
+Coupling reduces opportunities for abstraction (for example, community supported
plugins) and increases complexity, making the code harder to understand,
test, maintain and extend.
diff --git a/doc/architecture/blueprints/runner_tokens/index.md b/doc/architecture/blueprints/runner_tokens/index.md
index c83586f851a..29d7c05c553 100644
--- a/doc/architecture/blueprints/runner_tokens/index.md
+++ b/doc/architecture/blueprints/runner_tokens/index.md
@@ -238,7 +238,7 @@ The new workflow looks as follows:
1. Creates a new runner in the `ci_runners` table (and corresponding `glrt-` prefixed authentication token);
1. Presents the user with instructions on how to configure this new runner on a machine,
- with possibilities for different supported deployment scenarios (e.g. shell, `docker-compose`, Helm chart, etc.)
+ with possibilities for different supported deployment scenarios (for example, shell, `docker-compose`, Helm chart, etc.)
This information contains a token which is available to the user only once, and the UI
makes it clear to the user that the value shall not be shown again, as registering the same runner multiple times
is discouraged (though not impossible).
@@ -319,7 +319,7 @@ The respective `CiRunner` fields must return the values for the `ci_runner_machi
#### Stale runner cleanup
The functionality to
-[clean up stale runners](../../../ci/runners/configure_runners.md#clean-up-stale-runners) needs
+[clean up stale runners](../../../ci/runners/runners_scope.md#clean-up-stale-group-runners) needs
to be adapted to clean up `ci_runner_machines` records instead of `ci_runners` records.
At some point after the removal of the registration token support, we'll want to create a background
diff --git a/doc/architecture/blueprints/work_items/index.md b/doc/architecture/blueprints/work_items/index.md
index f067d9fab52..9924b0db9f4 100644
--- a/doc/architecture/blueprints/work_items/index.md
+++ b/doc/architecture/blueprints/work_items/index.md
@@ -30,11 +30,17 @@ Base type for issue, requirement, test case, incident and task (this list is pla
A set of predefined types for different categories of work items. Currently, the available types are:
-- Issue
-- Incident
-- Test case
-- Requirement
-- Task
+- [Incident](/ee/operations/incident_management/incidents.md)
+- [Test case](/ee/ci/test_cases/index.md)
+- [Requirement](/ee/user/project/requirements/index.md)
+- [Task](/ee/user/tasks.md)
+- [OKRs](/ee/user/okrs.md)
+
+Work is underway to convert existing objects to Work Item Types or add new ones:
+
+- [Issue](https://gitlab.com/groups/gitlab-org/-/epics/9584)
+- [Epic](https://gitlab.com/groups/gitlab-org/-/epics/9290)
+- [Ticket](https://gitlab.com/groups/gitlab-org/-/epics/10419)
#### Work Item properties
@@ -58,7 +64,7 @@ All Work Item types share the same pool of predefined widgets and are customized
### Work Item widget types (updating)
| widget type | feature flag |
-|---|---|---|
+|---|---|
| assignees | |
| description | |
| hierarchy | |
@@ -68,7 +74,7 @@ All Work Item types share the same pool of predefined widgets and are customized
| start and due date | |
| status\* | |
| weight | |
-| [notes](https://gitlab.com/gitlab-org/gitlab/-/issues/378949) | work_items_mvc |
+| [notes](https://gitlab.com/gitlab-org/gitlab/-/issues/378949) | |
\* status is not currently a widget, but a part of the root work item, similar to title
@@ -119,7 +125,7 @@ Work Items main goal is to enhance the planning toolset to become the most popul
### Scalability
-Currently, different entities like issues, epics, merge requests etc share many similar features but these features are implemented separately for every entity type. This makes implementing new features or refactoring existing ones problematic: for example, if we plan to add new feature to issues and incidents, we would need to implement it separately on issue and incident types respectively. With work items, any new feature is implemented via widgets for all existing types which makes the architecture more scalable.
+Currently, different entities like issues, epics, and merge requests share many similar features, but these features are implemented separately for every entity type. This makes implementing new features or refactoring existing ones problematic: for example, if we plan to add a new feature to issues and incidents, we would need to implement it separately on issue and incident types. With work items, any new feature is implemented via widgets for all existing types, which makes the architecture more scalable.
### Flexibility