Diffstat (limited to 'doc/architecture/blueprints')
-rw-r--r--  doc/architecture/blueprints/_template.md | 10
-rw-r--r--  doc/architecture/blueprints/activity_pub/index.md | 451
-rw-r--r--  doc/architecture/blueprints/ai_gateway/img/architecture.png | Bin 142929 -> 90644 bytes
-rw-r--r--  doc/architecture/blueprints/ai_gateway/index.md | 20
-rw-r--r--  doc/architecture/blueprints/bundle_uri/index.md | 216
-rw-r--r--  doc/architecture/blueprints/capacity_planning/images/image-20230911144743188.png | Bin 0 -> 24672 bytes
-rw-r--r--  doc/architecture/blueprints/capacity_planning/images/tamland-as-a-service.png | Bin 0 -> 46896 bytes
-rw-r--r--  doc/architecture/blueprints/capacity_planning/images/tamland-as-part-of-stack.png | Bin 0 -> 38591 bytes
-rw-r--r--  doc/architecture/blueprints/capacity_planning/index.md | 139
-rw-r--r--  doc/architecture/blueprints/cells/deployment-architecture.md | 155
-rw-r--r--  doc/architecture/blueprints/cells/diagrams/deployment-before-cells.drawio.png | Bin 0 -> 93906 bytes
-rw-r--r--  doc/architecture/blueprints/cells/diagrams/deployment-development-cells.drawio.png | Bin 0 -> 66384 bytes
-rw-r--r--  doc/architecture/blueprints/cells/diagrams/deployment-hybrid-cells.drawio.png | Bin 0 -> 119817 bytes
-rw-r--r--  doc/architecture/blueprints/cells/diagrams/deployment-initial-cells.drawio.png | Bin 0 -> 134402 bytes
-rw-r--r--  doc/architecture/blueprints/cells/diagrams/deployment-target-cells.drawio.png | Bin 0 -> 122070 bytes
-rw-r--r--  doc/architecture/blueprints/cells/diagrams/index.md | 4
-rw-r--r--  doc/architecture/blueprints/cells/impacted_features/contributions-forks.md | 55
-rw-r--r--  doc/architecture/blueprints/cells/impacted_features/group-transfer.md | 28
-rw-r--r--  doc/architecture/blueprints/cells/impacted_features/issues.md | 28
-rw-r--r--  doc/architecture/blueprints/cells/impacted_features/merge-requests.md | 28
-rw-r--r--  doc/architecture/blueprints/cells/impacted_features/personal-namespaces.md | 102
-rw-r--r--  doc/architecture/blueprints/cells/impacted_features/project-transfer.md | 28
-rw-r--r--  doc/architecture/blueprints/cells/index.md | 42
-rw-r--r--  doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md | 14
-rw-r--r--  doc/architecture/blueprints/ci_pipeline_components/index.md | 15
-rw-r--r--  doc/architecture/blueprints/clickhouse_ingestion_pipeline/index.md | 2
-rw-r--r--  doc/architecture/blueprints/cloud_connector/index.md | 274
-rw-r--r--  doc/architecture/blueprints/email_ingestion/index.md | 169
-rw-r--r--  doc/architecture/blueprints/feature_flags_development/index.md | 2
-rw-r--r--  doc/architecture/blueprints/gitaly_transaction_management/index.md | 427
-rw-r--r--  doc/architecture/blueprints/gitlab_ci_events/decisions/001_hierarchical_events.md | 62
-rw-r--r--  doc/architecture/blueprints/gitlab_ci_events/index.md | 8
-rw-r--r--  doc/architecture/blueprints/gitlab_ml_experiments/index.md | 2
-rw-r--r--  doc/architecture/blueprints/gitlab_observability_backend/index.md | 693
-rw-r--r--  doc/architecture/blueprints/gitlab_observability_backend/supported-deployments.png | Bin 74153 -> 0 bytes
-rw-r--r--  doc/architecture/blueprints/gitlab_services/img/architecture.png | Bin 0 -> 64365 bytes
-rw-r--r--  doc/architecture/blueprints/gitlab_services/index.md | 129
-rw-r--r--  doc/architecture/blueprints/gitlab_steps/index.md | 6
-rw-r--r--  doc/architecture/blueprints/google_artifact_registry_integration/backend.md | 131
-rw-r--r--  doc/architecture/blueprints/google_artifact_registry_integration/index.md | 42
-rw-r--r--  doc/architecture/blueprints/modular_monolith/hexagonal_monolith/index.md | 2
-rw-r--r--  doc/architecture/blueprints/modular_monolith/index.md | 2
-rw-r--r--  doc/architecture/blueprints/modular_monolith/packages_extraction.md | 52
-rw-r--r--  doc/architecture/blueprints/new_diffs.md | 103
-rw-r--r--  doc/architecture/blueprints/observability_metrics/index.md | 286
-rw-r--r--  doc/architecture/blueprints/observability_metrics/metrics-read-path.png | Bin 0 -> 24109 bytes
-rw-r--r--  doc/architecture/blueprints/observability_metrics/metrics_indexing_at_ingestion.png | Bin 0 -> 49164 bytes
-rw-r--r--  doc/architecture/blueprints/observability_metrics/query-service-internals.png | Bin 0 -> 38793 bytes
-rw-r--r--  doc/architecture/blueprints/observability_tracing/index.md | 6
-rw-r--r--  doc/architecture/blueprints/organization/index.md | 8
-rw-r--r--  doc/architecture/blueprints/permissions/index.md | 2
-rw-r--r--  doc/architecture/blueprints/remote_development/index.md | 3
-rw-r--r--  doc/architecture/blueprints/runway/img/runway-architecture.png | Bin 426450 -> 115461 bytes
-rw-r--r--  doc/architecture/blueprints/runway/img/runway_vault_4_.drawio.png | Bin 134342 -> 69309 bytes
-rw-r--r--  doc/architecture/blueprints/secret_manager/decisions/001_envelop_encryption.md | 69
-rw-r--r--  doc/architecture/blueprints/secret_manager/index.md | 139
-rw-r--r--  doc/architecture/blueprints/secret_manager/secrets-manager-overview.png | Bin 0 -> 419952 bytes
-rw-r--r--  doc/architecture/blueprints/secret_manager/secrets_manager.md | 14
-rw-r--r--  doc/architecture/blueprints/work_items/index.md | 38
59 files changed, 3166 insertions(+), 840 deletions(-)
diff --git a/doc/architecture/blueprints/_template.md b/doc/architecture/blueprints/_template.md
index e22cc2e6857..18f88322906 100644
--- a/doc/architecture/blueprints/_template.md
+++ b/doc/architecture/blueprints/_template.md
@@ -9,9 +9,13 @@ participating-stages: []
---
<!--
-**Note:** Please remove comment blocks for sections you've filled in.
-When your blueprint ready for review, all of these comment blocks should be
-removed.
+Before you start:
+
+- Copy this file to a sub-directory and call it `index.md` for it to appear in
+ the blueprint directory.
+- Please remove comment blocks for sections you've filled in.
+  When your blueprint is ready for review, all of these comment blocks should be
+ removed.
To get started with a blueprint you can use this template to inform you about
what you may want to document in it at the beginning. This content will change
diff --git a/doc/architecture/blueprints/activity_pub/index.md b/doc/architecture/blueprints/activity_pub/index.md
new file mode 100644
index 00000000000..1612d0916e3
--- /dev/null
+++ b/doc/architecture/blueprints/activity_pub/index.md
@@ -0,0 +1,451 @@
+---
+status: proposed
+creation-date: "2023-09-12"
+authors: [ "@oelmekki", "@jpcyiza" ]
+coach: "@tkuah"
+approvers: [ "@derekferguson" ]
+owning-stage: ""
+participating-stages: [ "~section::dev" ]
+---
+
+<!-- Blueprints often contain forward-looking statements -->
+<!-- vale gitlab.FutureTense = NO -->
+
+# ActivityPub support
+
+## Summary
+
+The end goal of this proposal is to build interoperability features into
+GitLab so that it's possible on one instance of GitLab to open a merge
+request to a project hosted on another instance, joining all willing
+instances into a global network.
+
+To achieve that, we propose to use ActivityPub, the W3C standard used by
+the Fediverse. This will allow us to build upon a robust and battle-tested
+protocol, and it will open GitLab to a wider community.
+
+Before starting to implement cross-instance merge requests, we want to
+start with smaller steps, helping us to build up domain knowledge about
+ActivityPub and creating the underlying architecture that will support the
+more advanced features. For that reason, we propose to start with
+implementing social features, allowing people on the Fediverse to subscribe
+to activities on GitLab, for example to be notified on their social network
+of choice when their favorite project hosted on GitLab makes a new release.
+As a bonus, this is an opportunity to make GitLab more social and grow its
+audience.
+
+## Description of the related tech and terms
+
+Feel free to jump to [Motivation](#motivation) if you already know what
+ActivityPub and the Fediverse are.
+
+Among the push for [decentralization of the web](https://en.wikipedia.org/wiki/Decentralized_web),
+several projects tried different protocols with different ideals behind their reasoning.
+Some examples:
+
+- [Secure Scuttlebutt](https://en.wikipedia.org/wiki/Secure_Scuttlebutt) (or SSB for short)
+- [Dat](https://en.wikipedia.org/wiki/Dat_%28software%29)
+- [IPFS](https://en.wikipedia.org/wiki/InterPlanetary_File_System)
+- [Solid](https://en.wikipedia.org/wiki/Solid_%28web_decentralization_project%29)
+
+One gained traction recently: [ActivityPub](https://en.wikipedia.org/wiki/ActivityPub),
+better known for the colloquial [Fediverse](https://en.wikipedia.org/wiki/Fediverse) built
+on top of it, through applications like
+[Mastodon](https://en.wikipedia.org/wiki/Mastodon_%28social_network%29)
+(which could be described as some sort of decentralized Facebook) or
+[Lemmy](https://en.wikipedia.org/wiki/Lemmy_%28software%29) (which could be
+described as some sort of decentralized Reddit).
+
+ActivityPub has several advantages that make it attractive
+to implementers and could explain its current success:
+
+- **It's built on top of HTTP**. You don't need to install new software or
+  to tinker with TCP/UDP to implement ActivityPub. If you have a webserver
+  or an application that provides an HTTP API (like a Rails application),
+ you already have everything you need.
+- **It's built on top of JSON**. All communications are basically JSON
+ objects, which web developers are already used to, which simplifies adoption.
+- **It's a W3C standard and already has multiple implementations**. Being
+ piloted by the W3C is a guarantee of stability and quality work. They
+ have profusely demonstrated in the past through their work on HTML, CSS
+ or other web standards that we can build on top of their work without
+ the fear of it becoming deprecated or irrelevant after a few years.
+
+### The Fediverse
+
+The core idea behind Mastodon and Lemmy is called the Fediverse. Rather
+than full decentralization, those applications rely on federation, in the
+sense that there still are servers and clients. It's not P2P like SSB,
+Dat and IPFS, but instead a galaxy of servers chatting with each other
+instead of having central servers controlled by a single entity.
+
+The user signs up to one of those servers (called **instances**), and they
+can then interact with users either on this instance, or on other ones.
+From the perspective of the user, they access a global network, and not
+only their instance. They see the articles posted on other instances, they
+can comment on them, upvote them, etc.
+
+What happens behind the scenes:
+their instance knows where the user they reply to is hosted. It
+contacts that other instance to let them know there is a message for them -
+somewhat similar to SMTP. Similarly, when a user subscribes
+to a feed, their instance informs the instance where the feed is
+hosted of this subscription. That target instance then posts back
+messages when new activities are created. This allows for a push model, rather
+than a constant poll model like RSS. Of course, what was just described is
+the happy path; there is moderation, validation and fault tolerance
+happening all the way.
+
+### ActivityPub
+
+Behind the Fediverse is the ActivityPub protocol. It's an HTTP API
+attempting to be as general a social network implementation as possible,
+while remaining extensible.
+
+The basic idea is that an `actor` sends and receives `activities`. Activities
+are structured JSON messages with well-defined properties, but are extensible
+to cover any need. An actor is defined by four endpoints, which are
+contacted with the
+`application/ld+json; profile="https://www.w3.org/ns/activitystreams"` HTTP Accept header:
+
+- `GET /inbox`: used by the actor to find new activities intended for them.
+- `POST /inbox`: used by instances to push new activities intended for the actor.
+- `GET /outbox`: used by anyone to read the activities created by the actor.
+- `POST /outbox`: used by the actor to publish new activities.
+
+Among those, Mastodon and Lemmy only use `POST /inbox` and `GET /outbox`, which
+are the minimum needed to implement federation:
+
+- Instances push new activities for the actor on the inbox.
+- Reading the outbox allows reading the feed of an actor.
+
+Additionally, Mastodon and Lemmy implement a `GET /` endpoint (with the
+mentioned Accept header). This endpoint responds with general information about the
+actor, like name and URL of the inbox and outbox. While not required by the
+standard, it makes discovery easier.
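+
+For illustration, a minimal actor document could look like the following
+sketch. The URLs and the `releases` actor path are hypothetical, not the
+final GitLab routes:
+
+```json
+{
+  "@context": "https://www.w3.org/ns/activitystreams",
+  "id": "https://gitlab.example.org/user/project/-/releases",
+  "type": "Application",
+  "name": "Releases of user/project",
+  "inbox": "https://gitlab.example.org/user/project/-/releases/inbox",
+  "outbox": "https://gitlab.example.org/user/project/-/releases/outbox"
+}
+```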
+
+While a person is the main use case for an actor, an actor does not
+necessarily map to a person. Anything can be an actor: a topic, a
+subreddit, a group, an event. For GitLab, anything with activities (in the sense
+of what GitLab means by "activity") can be an ActivityPub actor. This includes
+items like projects, groups, and releases. In those more abstract examples,
+an actor can be thought of as an actionable feed.
+
+ActivityPub by itself does not cover everything that is needed to implement
+the Fediverse. Most notably, these are left for the implementers to figure out:
+
+- Finding a way to deal with spam. Spam is handled by authorizing or
+ blocking ("defederating") other instances.
+- Discovering new instances.
+- Performing network-wide searches.
+
+## Motivation
+
+Why would a social media protocol be useful for GitLab? People want a single,
+global GitLab network to interact between various projects, without having to
+register on each of their hosts.
+
+Several very popular discussions around this have already happened:
+
+- [Share events externally via ActivityPub](https://gitlab.com/gitlab-org/gitlab/-/issues/21582)
+- [Implement cross-server (federated) merge requests](https://gitlab.com/gitlab-org/gitlab/-/issues/14116)
+- [Distributed merge requests](https://gitlab.com/groups/gitlab-org/-/epics/260).
+
+The ideal workflow would be:
+
+1. Alice registers to her favorite GitLab instance, like `gitlab.example.org`.
+1. She looks for a project on a given topic, and sees Bob's project, even though
+ Bob is on `gitlab.com`.
+1. Alice selects **Fork**, and the `gitlab.com/Bob/project.git` is
+ forked to `gitlab.example.org/Alice/project.git`.
+1. She makes her edits, and opens a merge request, which appears in Bob's
+ project on `gitlab.com`.
+1. Alice and Bob discuss the merge request, each one from their own GitLab
+ instance.
+1. Bob can send additional commits, which are picked up by Alice's instance.
+1. When Bob accepts the merge request, his instance picks up the code from
+ Alice's instance.
+
+In this process, ActivityPub would help in:
+
+- Letting Bob know a fork happened.
+- Sending the merge request to Bob.
+- Enabling Alice and Bob to discuss the merge request.
+- Letting Alice know the code was merged.
+
+It does _not_ help in these cases, which need specific implementations:
+
+- Implementing a network-wide search.
+- Implementing cross-instance forks. (Not needed, thanks to Git.)
+
+Why use ActivityPub here rather than implementing cross-instance merge requests
+in a custom way? Two reasons:
+
+1. **Building on top of a standard helps reach beyond GitLab**.
+ While the workflow presented above only mentions GitLab, building on top
+ of a W3C standard means other forges can follow GitLab
+ there, and build a massive Fediverse of code sharing.
+1. **An opportunity to make GitLab more social**. To prepare the
+ architecture for the workflow above, smaller steps can be taken, allowing
+ people to subscribe to activity feeds from their Fediverse social
+   network. Anything that has an RSS feed could become an ActivityPub feed.
+ People on Mastodon could follow their favorite developer, project, or topic
+ from GitLab and see the news in their feed on Mastodon, hopefully raising
+ engagement with GitLab.
+
+### Goals
+
+- Allow sharing interesting events on ActivityPub-based social media.
+- Allow opening an issue and discussing it from one instance to another.
+- Allow forking a project from one instance to another.
+- Allow opening a merge request, discussing it, and merging it from one instance to another.
+- Allow performing a network-wide search?
+
+### Non-Goals
+
+- Federation of private resources.
+- Performing a network-wide search?
+
+## Proposal
+
+The idea of this implementation path is not to take the fastest route to
+the feature with the most value added (cross-instance merge requests), but
+to proceed with the smallest useful step at each iteration, making sure each step
+brings something immediately useful.
+
+1. **Implement ActivityPub for social following**.
+ After this, the Fediverse can follow activities on GitLab instances.
+ 1. ActivityPub to subscribe to project releases.
+ 1. ActivityPub to subscribe to project creation in topics.
+ 1. ActivityPub to subscribe to project activities.
+ 1. ActivityPub to subscribe to group activities.
+ 1. ActivityPub to subscribe to user activities.
+1. **Implement cross-instance forks** to enable forking a project from another instance.
+1. **Implement ActivityPub for cross-instance discussions** to enable discussing
+ issues and merge requests from another instance:
+ 1. In issues.
+ 1. In merge requests.
+1. **Implement ActivityPub to submit cross-instance merge requests** to enable
+ submitting merge requests to other instances.
+1. **Implement cross-instance search** to enable discovering projects on other instances.
+
+It's open to discussion whether this last step should be included at all.
+Currently, in most Fediverse apps, when you want to display a resource from
+an instance that your instance does not know about (typically a user you
+want to follow), you paste the URL of the resource in the search box of
+your instance, and it fetches and displays the remote resource, now
+actionable from your instance. We plan to do that at first.
+
+The question is: do we keep it at that? This UX has severe friction,
+especially for users not used to Fediverse UX patterns (which is probably
+most GitLab users). On the other hand, distributed search is a subject
+complicated enough to deserve its own blueprint (although it's not as
+complicated as it used to be, now that decentralization protocols and
+applications have worked on it for a while).
+
+## Design and implementation details
+
+First, it's a good idea to get familiar with the specifications of the
+three standards we're going to use:
+
+- [ActivityPub](https://www.w3.org/TR/activitypub/) defines the HTTP
+ requests happening to implement federation.
+- [ActivityStreams](https://www.w3.org/TR/activitystreams-core/) defines the
+ format of the JSON messages exchanged by the users of the protocol.
+- [Activity Vocabulary](https://www.w3.org/TR/activitystreams-vocabulary/)
+ defines the various messages recognized by default.
+
+Feel free to ping @oelmekki if you have questions or find the documents too
+dense to follow.
+
+### Production readiness
+
+TBC
+
+### The social following part
+
+This part lays the groundwork for
+[adding new ActivityPub actors](../../../development/activitypub/actors/index.md) to
+GitLab.
+
+There are five actors we want to implement:
+
+- the `releases` actor, to be notified when a given project makes a new
+  release
+- the `topic` actor, to be notified when a new project is added to a topic
+- the `project` actor, regarding all activities from a project
+- the `group` actor, regarding all activities from a group
+- the `user` actor, regarding all activities from a user
+
+We're only dealing with public resources for now. Allowing federation of
+private resources is a tricky subject that will be solved later, if it's
+possible at all.
+
+#### Endpoints
+
+Each actor needs three endpoints:
+
+- the profile endpoint, containing basic info like name and description, but
+  also including links to the inbox and outbox
+- the outbox endpoint, which lists the previous activities of an actor
+- the inbox endpoint, to which follow and unfollow requests are posted
+  (among other things we won't use for now).
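+
+For example, the outbox of the `releases` actor could be fetched like this
+(the path is illustrative, not the final route):
+
+```shell
+curl --header 'Accept: application/ld+json; profile="https://www.w3.org/ns/activitystreams"' \
+  "https://gitlab.example.org/user/project/-/releases/outbox"
+```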
+
+The controllers providing those endpoints are in
+`app/controllers/activity_pub/`. It's been decided to use this namespace to
+avoid mixing the ActivityPub JSON responses with the ones meant for the
+frontend, and also because we may need further namespacing later, as the
+way we format activities may differ from one Fediverse app to another, and
+for our later cross-instance features. Also, this namespace
+allows us to easily toggle what we need on all endpoints, like making sure
+no private project can be accessed.
+
+#### Serializers
+
+The serializers in `app/serializers/activity_pub/` are the meat of our
+implementation, as they provide the ActivityStreams objects. The abstract
+class `ActivityPub::ActivityStreamsSerializer` does all the heavy lifting
+of validating developer-provided data, setting up the common fields and
+providing pagination.
+
+That pagination part is done through `Gitlab::Serializer::Pagination`, which
+uses offset pagination.
+[We need to allow it to do keyset pagination](https://gitlab.com/gitlab-org/gitlab/-/issues/424148).
+
+#### Subscription
+
+Subscription to a resource is done by posting a
+[Follow activity](https://www.w3.org/TR/activitystreams-vocabulary/#dfn-follow)
+to the actor inbox. When receiving a Follow activity,
+[we should generate an Accept or Reject activity in return](https://www.w3.org/TR/activitypub/#follow-activity-inbox),
+sent to the subscriber's inbox.
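+
+As a sketch, such an exchange could look like this (all URLs are
+hypothetical). The subscriber posts a Follow activity to the actor's inbox:
+
+```json
+{
+  "@context": "https://www.w3.org/ns/activitystreams",
+  "id": "https://mastodon.example/activities/1",
+  "type": "Follow",
+  "actor": "https://mastodon.example/users/alice",
+  "object": "https://gitlab.example.org/user/project/-/releases"
+}
+```
+
+We would then post an Accept activity referencing that Follow to the
+subscriber's inbox:
+
+```json
+{
+  "@context": "https://www.w3.org/ns/activitystreams",
+  "id": "https://gitlab.example.org/user/project/-/releases#accepts/follows/1",
+  "type": "Accept",
+  "actor": "https://gitlab.example.org/user/project/-/releases",
+  "object": "https://mastodon.example/activities/1"
+}
+```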
+
+The general workflow of the implementation is as follows:
+
+- A POST request is made to the inbox endpoint, with the Follow activity
+ encoded as JSON
+- if the activity received is not of a supported type (e.g. someone tries to
+  comment on the activity), we ignore it; otherwise:
+- we create an `ActivityPub::Subscription` with the profile URL of the
+ subscriber
+- we queue a job to resolve the subscriber's inbox URL
+  - in which we perform an HTTP request to the subscriber profile to find
+ their inbox URL (and the shared inbox URL if any)
+ - we store that URL in the subscription record
+- we queue a job to accept the subscription
+  - in which we perform an HTTP request to the subscriber inbox to post an
+ Accept activity
+ - we update the state of the subscription to `:accepted`
+
+`ActivityPub::Subscription` is a new abstract model, from which the models
+related to our actors inherit, each with its own table:
+
+- `ActivityPub::ReleasesSubscription`, table `activity_pub_releases_subscriptions`
+- `ActivityPub::TopicSubscription`, table `activity_pub_topic_subscriptions`
+- `ActivityPub::ProjectSubscription`, table `activity_pub_project_subscriptions`
+- `ActivityPub::GroupSubscription`, table `activity_pub_group_subscriptions`
+- `ActivityPub::UserSubscription`, table `activity_pub_user_subscriptions`
+
+The reason to go with multiple models rather than, say, a simpler `actor`
+enum in the Subscription model with a single table is that we need
+specific associations and validations for each (an
+`ActivityPub::ProjectSubscription` belongs to a Project, an
+`ActivityPub::UserSubscription` does not). It also gives us more room for
+extensibility in the future.
+
+#### Unfollow
+
+When receiving
+[an Undo activity](https://www.w3.org/TR/activitypub/#undo-activity-inbox)
+mentioning previous Follow, we remove the subscription from our database.
+
+We are not required to send back any activity, so we don't need any worker
+here; we can directly remove the record from the database.
+
+#### Sending activities out
+
+When specific events (which ones?) happen related to our actors, we should
+queue events to issue activities on the subscribers' inboxes (the activities
+are the same as those we display in the actor's outbox).
+
+We're supposed to deduplicate the subscriber list to make sure we don't
+send an activity twice to the same person - although it's probably better
+handled by a uniqueness validation from the model when receiving the Follow
+activity.
+
+More importantly, we should group requests for the same host: if ten users
+are all on `https://mastodon.social/`, we should issue a single request on
+the shared inbox provided, adding all the users as recipients, rather than
+sending one request per user.
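+
+For example, a release announcement delivered once to the shared inbox at
+`https://mastodon.social/inbox` could address all local subscribers in its
+recipient list (a sketch; the activity shape and URLs are illustrative):
+
+```json
+{
+  "@context": "https://www.w3.org/ns/activitystreams",
+  "type": "Create",
+  "actor": "https://gitlab.example.org/user/project/-/releases",
+  "to": [
+    "https://mastodon.social/users/alice",
+    "https://mastodon.social/users/bob"
+  ],
+  "object": {
+    "type": "Note",
+    "content": "user/project 1.2.3 has been released"
+  }
+}
+```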
+
+#### [Webfinger](https://gitlab.com/gitlab-org/gitlab/-/issues/423079)
+
+Mastodon
+[requires instances to implement the Webfinger protocol](https://docs.joinmastodon.org/spec/webfinger/).
+This protocol is about adding an endpoint at a well-known location that
+allows querying for a resource name and mapping it to whatever URL we
+want (so basically, it's used for discovery). Mastodon uses this to query
+other Fediverse apps for actor names, in order to find their profile URLs.
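+
+For instance, a query like
+`GET /.well-known/webfinger?resource=acct:alice@gitlab.example.org` would be
+expected to return a JSON document pointing to the actor's profile URL (a
+sketch; the account name and URLs are hypothetical):
+
+```json
+{
+  "subject": "acct:alice@gitlab.example.org",
+  "links": [
+    {
+      "rel": "self",
+      "type": "application/activity+json",
+      "href": "https://gitlab.example.org/alice"
+    }
+  ]
+}
+```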
+
+Actually, GitLab already implements the Webfinger protocol endpoint through
+Doorkeeper
+([this is the action that maps to its route](https://github.com/doorkeeper-gem/doorkeeper-openid_connect/blob/5987683ccc22262beb6e44c76ca4b65288d6067a/app/controllers/doorkeeper/openid_connect/discovery_controller.rb#L14-L16)),
+implemented in GitLab
+[in JwksController](https://gitlab.com/gitlab-org/gitlab/-/blob/efa76816bd0603ba3acdb8a0f92f54abfbf5cc02/app/controllers/jwks_controller.rb).
+
+There is no incompatibility here; we can just extend this controller.
+We'll probably have to rename it though, as it won't be related to JWKS
+alone anymore.
+
+One difficulty we may have is that, contrary to Mastodon, we don't only deal
+with users. So we need to figure out a way to differentiate asking for a
+user from asking for a project, for example. One obvious way would be to
+use a prefix, like `user-<username>`, `project-<project_name>`, etc. I'm
+pondering that from afar; since we haven't implemented much code in the
+epic and I haven't dug deep into Webfinger's specs, this remark may be
+obsolete when we reach actual implementation.
+
+#### [HTTP signatures](https://gitlab.com/gitlab-org/gitlab/-/issues/423083)
+
+Mastodon
+[requires HTTP signatures](https://docs.joinmastodon.org/spec/security/#http),
+which is yet another standard, in order to make sure no spammer tries to
+impersonate a given server.
+
+This is asymmetric cryptography, with a private key and a public key,
+like SSH or PGP. We will need to implement both signing requests and
+verifying them. This will be of considerable help when we want to have
+various GitLab instances communicate later in the epic.
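+
+Concretely, outgoing POST requests would carry a `Signature` HTTP header
+along the lines of the following sketch, following the HTTP signatures
+draft that Mastodon uses (the key ID and signature value are made up):
+
+```plaintext
+Signature: keyId="https://gitlab.example.org/user/project/-/releases#main-key",
+  algorithm="rsa-sha256",
+  headers="(request-target) host date digest",
+  signature="dGhpcyBpcyBub3QgYSByZWFsIHNpZ25hdHVyZQ=="
+```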
+
+### Host allowlist and denylist
+
+To give GitLab instance owners control over potential spam, we need to
+let them maintain two mutually exclusive lists of hosts:
+
+- the allowlist: only hosts mentioned in this list can be federated with.
+- the denylist: all hosts can be federated with except the ones mentioned in
+  that list.
+
+A setting should allow the owner to switch between the allowlist and the denylist.
+In the beginning, this can be managed in the Rails console, but it will
+ultimately need a section in the admin interface.
+
+### Limits and rollout
+
+In order to control the load when releasing the feature in the first
+months, we're going to set `gitlab.com` to use the allowlist and roll out
+federation to a few Fediverse servers at a time, so that we can see how it
+takes the load progressively, before ultimately switching to the denylist
+(note: there are
+[some ongoing discussions](https://gitlab.com/gitlab-org/gitlab/-/issues/426373#note_1584232842)
+regarding whether federation should be activated on `gitlab.com` or not).
+
+We also need to implement limits to make sure the federation is not abused:
+
+- limit to the number of subscriptions a resource can receive.
+- limit to the number of subscriptions a third party server can generate.
+
+### The cross-instance issues and merge requests part
+
+We'll wait until we're done with the social following part before designing this
+part, to have hands-on experience with ActivityPub.
diff --git a/doc/architecture/blueprints/ai_gateway/img/architecture.png b/doc/architecture/blueprints/ai_gateway/img/architecture.png
index e63b4ba45d1..60b298376ac 100644
--- a/doc/architecture/blueprints/ai_gateway/img/architecture.png
+++ b/doc/architecture/blueprints/ai_gateway/img/architecture.png
Binary files differ
diff --git a/doc/architecture/blueprints/ai_gateway/index.md b/doc/architecture/blueprints/ai_gateway/index.md
index 08cd8b691d4..8c5a13d2e76 100644
--- a/doc/architecture/blueprints/ai_gateway/index.md
+++ b/doc/architecture/blueprints/ai_gateway/index.md
@@ -32,7 +32,7 @@ translate the content of the redirected request where needed.
![architecture diagram](img/architecture.png)
-[src of the architecture diagram](https://docs.google.com/drawings/d/1PYl5Q5oWHnQAuxM-Jcw0C3eYoGw8a9w8atFpoLhhEas/edit)
+[Diagram source](https://docs.google.com/drawings/d/1PYl5Q5oWHnQAuxM-Jcw0C3eYoGw8a9w8atFpoLhhEas/edit)
By using a hosted service under the control of GitLab we can ensure
that we provide all GitLab instances with AI features in a scalable
@@ -385,15 +385,19 @@ different.
## Authentication & Authorization
-GitLab will provide the first layer of authorization: It authenticate
-the user and check if the license allows using the feature the user is
-trying to use. This can be done using the authentication and license
+GitLab provides the first layer of authorization: It authenticates
+the user and checks if the license allows using the feature the user is
+trying to use. This can be done using the authentication, policy and license
checks that are already built into GitLab.
-Authenticating the GitLab-instance on the AI-gateway will be discussed
-in[#177](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/177).
-Because the AI-gateway exposes proxied endpoints to AI providers, it
-is important that the authentication tokens have limited validity.
+Authenticating the GitLab-instance on the AI-gateway was discussed
+in:
+
+- [Issue 177](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/177)
+- [Epic 10808](https://gitlab.com/groups/gitlab-org/-/epics/10808)
+
+The specific mechanism by which trust is delegated between end-users, GitLab instances,
+and the AI-gateway is covered in the [AI gateway access token validation documentation](../../../development/cloud_connector/code_suggestions_for_sm.md#ai-gateway-access-token-validation).
## Embeddings
diff --git a/doc/architecture/blueprints/bundle_uri/index.md b/doc/architecture/blueprints/bundle_uri/index.md
new file mode 100644
index 00000000000..a056649a798
--- /dev/null
+++ b/doc/architecture/blueprints/bundle_uri/index.md
@@ -0,0 +1,216 @@
+---
+status: proposed
+creation-date: "2023-08-04"
+authors: [ "@toon" ]
+coach: ""
+approvers: [ "@mjwood", "@jcaigitlab" ]
+owning-stage: "~devops::systems"
+participating-stages: []
+---
+
+<!-- Blueprints often contain forward-looking statements -->
+<!-- vale gitlab.FutureTense = NO -->
+
+# Utilize bundle-uri to reduce Gitaly CPU load
+
+## Summary
+
+[bundle-URI](https://git-scm.com/docs/bundle-uri) is a fairly new concept
+in Git that allows the client to download one or more bundles in order to
+bootstrap the object database in advance of fetching the remaining objects from
+a remote. By having the client download static files from a simple HTTP(S)
+server in advance, the work that needs to be done on the remote side is reduced.
+
+Git bundles are files that store a packfile along with some extra metadata,
+including a set of refs and a (possibly empty) set of necessary commits. When a
+user clones a repository, the server can advertise one or more URIs that serve
+these bundles. The client can download these to populate the Git object
+database. After it has done this, the negotiation process between server and
+client starts to determine which objects need to be fetched. When the client has
+pre-populated the database with some data from the bundles, the negotiation and transfer of
+objects from the server are reduced, putting less load on the server's CPU.
+
+## Motivation
+
+When a user pushes changes, it usually kicks off a CI pipeline with
+a bunch of jobs. When the CI runners all clone the repository from scratch,
+if they use [`git clone`](/ee/ci/pipelines/settings.md#choose-the-default-git-strategy),
+they all start negotiating with the server about what they need to clone. This is
+really CPU-intensive for the server.
+
+Some time ago we introduced the
+[pack-objects cache](/ee/administration/gitaly/configure_gitaly.md#pack-objects-cache),
+but it has some pitfalls. When the tip of a branch changes, a new packfile needs
+to be calculated, and the cache needs to be refreshed.
+
+Git bundles are more flexible. It's not a big issue if the bundle doesn't have
+all the most recent objects. When it contains a fairly recent state, but is
+missing the latest refs, the client (that is, the CI runner) will do a "catch up" and
+fetch additional objects after applying the bundle. The set of objects it has to
+fetch from Gitaly will be a lot smaller.
+
+### Goals
+
+Reduce the work that needs to be done on the Gitaly servers when a client clones
+a repository. This is particularly useful for CI build farms, which generate a
+lot of traffic on each commit that's pushed to the server.
+
+With the use of bundles, the server has to craft smaller delta packfiles
+compared to the packfiles that contain all the objects when no bundles are
+used. This reduces the load on the server's CPU. This benefits the
+packfile cache as well, because now the packfiles are smaller and faster to
+generate, reducing the chance of cache misses.
+
+### Non-Goals
+
+Using bundle-URIs will **not** reduce the size of repositories stored on disk.
+This feature will not be used to offload repositories, either fully or
+partially, from the Gitaly node to some cloud storage. On the contrary, because
+bundles are stored elsewhere, some data is duplicated, which will cause increased
+storage costs.
+
+In this phase it's not the goal to boost performance for incremental
+fetches. When the client has already cloned the repository, bundles won't be
+used to optimize fetching new data.
+
+Currently bundle-URI is not fully compatible with shallow clones, therefore
+we'll leave that out of scope. More info about that in
+[Git issue #170](https://gitlab.com/gitlab-org/git/-/issues/170).
+
+## Proposal
+
+When a client clones a repository, Gitaly advertises a bundle URI. This URI
+points to a bundle that's refreshed at regular intervals, for example during
+housekeeping. For each repository only one bundle will exist, so when a new one
+is created, the old one is invalidated.
+
+The bundles will be stored on a cloud Object Storage. To use bundles, the
+administrator should configure this in Gitaly.
+
+## Design and implementation details
+
+When a client initiates a `git clone`, on the server-side Gitaly spawns a
+`git upload-pack` process. Gitaly can pass along additional Git
+configuration. To make `git upload-pack` advertise bundle URIs, it should pass
+the following configuration:
+
+- `uploadpack.advertiseBundleURIs` :: This should be set to `true` to enable the
+  use of advertised bundles.
+- `bundle.version` :: At the moment only `1` is accepted.
+- `bundle.mode` :: This can be either `any` or `all`. Since we only want to use
+ bundles for the initial clone, `any` is advised.
+- `bundle.<id>.uri` :: This is the actual URI of the bundle identified with
+ `<id>`. Initially we will only have one bundle per repository.
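+
+In Git configuration-file syntax, that could look like the following sketch
+(the bundle ID `main` and the storage URL are made up for illustration):
+
+```ini
+[uploadpack]
+    advertiseBundleURIs = true
+[bundle]
+    version = 1
+    mode = any
+[bundle "main"]
+    uri = https://storage.googleapis.com/example-bucket/my-project.bundle
+```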
+
+### Enable the use of advertised bundles on the client-side
+
+The current version of Git does not use the advertised bundles by default when
+cloning or fetching from a remote.
+Luckily, we control most of the CI runners ourselves. So to use bundle URI, we can
+modify the Git configuration used by the runners and set
+`transfer.bundleURI=true`.
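+
+For example, a runner could opt in for a single clone like this (the
+repository URL is illustrative):
+
+```shell
+git -c transfer.bundleURI=true clone https://gitlab.example.com/group/project.git
+```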
+
+### Access control
+
+We don't want to leak data from private repositories through public HTTP(S)
+hosts. There are a few options for how we can overcome this:
+
+- Only activate the use of bundle-URI on public repositories.
+- Use a solution like [signed-URLs](https://cloud.google.com/cdn/docs/using-signed-urls).
+
+#### Public repositories only
+
+Gitaly itself does not know if a project, and its repository, is public, so to
+determine whether bundles can be used, GitLab Rails has to tell Gitaly. It's
+complex to pass this information to Gitaly, and using this approach will make
+the feature only available for public projects, so we will not proceed with this
+solution.
+
+#### Signed URLs
+
+The use of [signed-URLs](https://cloud.google.com/cdn/docs/using-signed-urls) is
+another option to control access to the bundles. This feature, provided by
+Google Cloud, allows Gitaly to create a URI that has a short lifetime.
+
+The downside to this approach is that it depends on a feature that is
+cloud-specific, so each cloud provider might provide such a feature slightly
+differently, or not have it at all. But we want to roll this feature out on GitLab.com
+first, which is hosted on Google Cloud, so for a first iteration we will use
+this.
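+
+For reference, a Google Cloud Storage signed URL roughly takes the following
+shape, with an expiry and a signature encoded in the query string (the
+bucket, object, and values below are illustrative and truncated):
+
+```plaintext
+https://storage.googleapis.com/example-bucket/my-project.bundle?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=...&X-Goog-Date=20230804T000000Z&X-Goog-Expires=600&X-Goog-SignedHeaders=host&X-Goog-Signature=...
+```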
+
+### Bundle creation
+
+#### Use server-side backups
+
+At the moment Gitaly knows how to back up repositories into bundles onto cloud
+storage. The [documentation](https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/gitaly-backup.md#user-content-server-side-backups)
+describes how to use it.
+
+For the initial implementation of bundle-URI we can piggy-back onto this
+feature. An admin should create backups for the repositories for which they want
+to use bundle-URI. With the existing configuration for backups, Gitaly can access cloud
+storage.
+
+#### As part of housekeeping
+
+Gitaly has a housekeeping worker that looks daily for repositories to optimize.
+Ideally we create a bundle right after the housekeeping (that is, garbage collection
+and repacking) is done. This ensures the most optimal bundle file.
+
+There are a few things to keep in mind when automatically creating bundles:
+
+- **Does the bundle need to be recreated?** When there wasn't much activity on
+  the repository, it's probably not needed to create a new bundle file, as the
+  client can fetch missing objects directly from Gitaly anyway. The housekeeping
+  task uses various heuristics to determine which strategy is taken for the
+  housekeeping job; we can reuse parts of this logic in the creation of bundles.
+- **Is it even needed to create a bundle?** Some repositories might be very
+ small, or see very little activity. Creating a bundle for these, and
+  duplicating its data to object storage doesn't provide much value and only
+ generates cost and maintenance.
+
+#### Controlled by GitLab Rails
+
+Because bundles increase storage costs, we eventually want to give the
+GitLab administrator full control over the creation of bundles. To achieve this,
+bundle-URI settings will be available in the GitLab admin interface. Here the
+admin can configure per project whether bundle-URI is enabled.
+
+### Configuration
+
+To use this feature, Gitaly needs to be configured. For this we'll add the
+following settings to Gitaly's configuration file:
+
+- `bundle_uri.strategy` :: This indicates which strategy should be used to
+ create and serve bundle-URIs. At the moment the only supported value is
+  "backups". When this setting has that value, Gitaly checks if a server-side
+  backup is available and uses that.
+- `bundle_uri.sign_urls` :: When set to true, the cloud storage URLs are not
+ passed to the client as-is, but are transformed into a signed URL. This
+  setting is optional and only supports Google Cloud Storage (for now).
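+
+A sketch of how this could look in Gitaly's TOML configuration file (the
+exact section and key layout are not final):
+
+```toml
+[bundle_uri]
+strategy = "backups"
+sign_urls = true
+```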
+
+The credentials to access cloud storage are reused as described in the Gitaly
+Backups documentation.
+
+### Storing metadata
+
+For now all metadata needed to store bundles on the cloud is managed by Gitaly
+server-side backups.
+
+### Bundle cleanup
+
+At some point the admin might decide to clean up bundles for one or more
+repositories; an admin command should be added for this. Because we're currently only
+using bundles created by `gitaly-backup`, we leave this out of scope.
+
+### Gitaly Cluster compatibility
+
+Creating server-side backups doesn't happen through Praefect at the moment. It's
+up to the admin to address the nodes they want to create backups from. If
+they make sure the node is up-to-date, all nodes will have access to up-to-date
+bundles and can pass proper bundle-URI parameters to the client. So no extra
+work is needed to reuse server-side backup bundles with bundle-URI.
+
+## Alternative Solutions
+
+No alternative solutions are suggested at the moment.
diff --git a/doc/architecture/blueprints/capacity_planning/images/image-20230911144743188.png b/doc/architecture/blueprints/capacity_planning/images/image-20230911144743188.png
new file mode 100644
index 00000000000..f56f5b391fc
--- /dev/null
+++ b/doc/architecture/blueprints/capacity_planning/images/image-20230911144743188.png
Binary files differ
diff --git a/doc/architecture/blueprints/capacity_planning/images/tamland-as-a-service.png b/doc/architecture/blueprints/capacity_planning/images/tamland-as-a-service.png
new file mode 100644
index 00000000000..fa8f1223917
--- /dev/null
+++ b/doc/architecture/blueprints/capacity_planning/images/tamland-as-a-service.png
Binary files differ
diff --git a/doc/architecture/blueprints/capacity_planning/images/tamland-as-part-of-stack.png b/doc/architecture/blueprints/capacity_planning/images/tamland-as-part-of-stack.png
new file mode 100644
index 00000000000..0b47d91e133
--- /dev/null
+++ b/doc/architecture/blueprints/capacity_planning/images/tamland-as-part-of-stack.png
Binary files differ
diff --git a/doc/architecture/blueprints/capacity_planning/index.md b/doc/architecture/blueprints/capacity_planning/index.md
new file mode 100644
index 00000000000..ed014f545f9
--- /dev/null
+++ b/doc/architecture/blueprints/capacity_planning/index.md
@@ -0,0 +1,139 @@
+---
+status: proposed
+creation-date: "2023-09-11"
+authors: [ "@abrandl" ]
+coach: "@andrewn"
+approvers: [ "@swiskow", "@rnienaber", "@o-lluch" ]
+---
+
+<!-- Blueprints often contain forward-looking statements -->
+<!-- vale gitlab.FutureTense = NO -->
+
+# Capacity planning for GitLab Dedicated
+
+## Summary
+
+This document outlines how we plan to set up infrastructure capacity planning for GitLab Dedicated tenant environments, which is a [FY24-Q3 OKR](https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/3507).
+
+We make use of Tamland, a tool we built to provide saturation forecasting insights for GitLab.com infrastructure resources. We propose to include Tamland as a part of the GitLab Dedicated stack and execute forecasting from within the tenant environments.
+
+Tamland predicts SLO violations and their respective dates, which need to be reviewed and acted upon. In terms of team organisation, the Dedicated team is proposed to own the tenant-side setup for Tamland and to own the predicted SLO violations, with the help and guidance of the Scalability::Projections team, which drives further development, documentation and overall guidance for capacity planning, including for Dedicated.
+
+With this setup, we aim to turn Tamland into a more generic tool, which can be used in various environments including but not limited to Dedicated tenants. Long-term, we think of including Tamland in self-managed installations and consider it a candidate for open source release.
+
+## Motivation
+
+### Background: Capacity planning for GitLab.com
+
+[Tamland](https://gitlab.com/gitlab-com/gl-infra/tamland) is an infrastructure resource forecasting project owned by the [Scalability::Projections](https://about.gitlab.com/handbook/engineering/infrastructure/team/scalability/projections.html) group. It implements [capacity planning](https://about.gitlab.com/handbook/engineering/infrastructure/capacity-planning/) for GitLab.com, which is a [controlled activity covered by SOC 2](https://gitlab.com/gitlab-com/gl-security/security-assurance/security-compliance-commercial-and-dedicated/observation-management/-/issues/604). As of today, it is used exclusively for GitLab.com to predict upcoming SLO violations across hundreds of monitored infrastructure components.
+
+Tamland produces a [report](https://gitlab-com.gitlab.io/gl-infra/tamland/intro.html) (hosted on GitLab Pages) containing forecast plots, information about predicted violations, and other details about the monitored components. Any predicted SLO violation results in a capacity warning issue being created in the [issue tracker for capacity planning](https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/boards/2816983) on GitLab.com.
+
+At present, Tamland is quite tailor-made and specific to GitLab.com:
+
+1. GitLab.com specific parameters and assumptions are built into Tamland
+1. We execute Tamland from a single CI project, exclusively for GitLab.com
+
+[Turning Tamland into a tool](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1106) we can use more generically and making it independent of GitLab.com specifics is subject of ongoing work.
+
+For illustration, we can see a saturation forecast plot below for the `disk_space` resource for a PostgreSQL service called `patroni-ci`. Within the 90 days forecast horizon, we predict a violation of the `soft` SLO (set at 85% saturation) and this resulted in the creation of a [capacity planning issue](https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/1219) for further review and potential actions. At present, the Scalability::Projections group reviews those issues and engages with the respective DRI for the service in question to remedy a saturation concern.
+
+<img src="images/image-20230911144743188.png" alt="image-20230911144743188" style="zoom:67%;" />
+
+For GitLab.com capacity planning, we operate Tamland from a scheduled CI pipeline with access to the central Thanos, which provides saturation and utilization metrics for GitLab.com. The CI pipeline produces the desired report, exposes it on GitLab Pages and also creates capacity planning issues. Scalability::Projections runs a capacity planning triage rotation which entails reviewing and prioritizing any open issues and their respective saturation concerns.
+
+### Problem Statement
+
+With the number of [GitLab Dedicated](https://about.gitlab.com/dedicated/) deployments increasing, we need to establish capacity planning processes for Dedicated tenants. This is going to help us notice any pending resource constraints soon enough to be able to upgrade the infrastructure for a given tenant before the resource saturates and causes an incident.
+
+Each Dedicated tenant is an isolated GitLab environment, with a full set of metrics monitored. These metrics are standardized in the [metrics catalog](https://gitlab.com/gitlab-com/runbooks/-/blob/master/reference-architectures/get-hybrid/src/gitlab-metrics-config.libsonnet?ref_type=heads) and on top of these, we have defined saturation metrics along with respective SLOs.
+
+In order to provide capacity planning and forecasts for saturation metrics for each tenant, we'd like to get Tamland set up for GitLab Dedicated.
+
+While Tamland is developed by the Scalability::Projections group, and this team also owns the capacity planning process for GitLab.com, they don't have access to any of the Dedicated infrastructure, as we have strong isolation implemented for Dedicated environments. As such, the technical design choices are going to affect how those teams interact and vice versa. We include this consideration in this documentation as we think the organisational aspect is a crucial part of it.
+
+### Key questions
+
+1. How does Tamland access Prometheus data for each tenant?
+1. Where does Tamland execute and how do we scale that?
+1. Where do we store resulting forecasting data?
+1. How do we consume the forecasts?
+
+### Goals: Iteration 0
+
+1. Tamland is flexible enough to forecast saturation events for a Dedicated tenant and for GitLab.com separately
+1. Forecasting is executed at least weekly, for each Dedicated tenant
+1. Tamland's output is forecasting data only (plots, SLO violation dates, etc. - no report, no issue management - see below)
+1. Tamland stores the output data in an S3 bucket for further inspection
+
+#### Non-goals
+
+##### Reporting
+
+As of today, it's not quite clear yet how we'd like to consume forecasting data across tenants. In contrast to GitLab.com, we generate forecasts across a potentially large number of tenants. At this point, we suspect that we're more interested in an aggregate report across tenants rather than individual, very detailed saturation forecasts. As such, this is subject to refinement in a further iteration once we have the underlying data available and gathered practical insight in how we consume this information.
+
+##### Issue management
+
+While each predicted SLO violation results in the creation of a GitLab issue, this may not be the right mode of raising awareness for Dedicated. Similar to the reporting side, this is subject to further discussion once we have data to look at.
+
+##### Customizing forecasting models
+
+Forecasting models can and should be tuned and informed with domain knowledge to produce accurate forecasts. This information is a part of the Tamland manifest. In the first iteration, we don't support per-tenant customization, but this can be added later.
+
+## Proposed Design for Dedicated: A part of the Dedicated stack
+
+Dedicated environments are fully isolated and run their own Prometheus instance to capture metrics, including saturation metrics. Tamland will run from each individual Dedicated tenant environment, consume metrics from Prometheus and store the resulting data in S3. From there, we consume forecast data and act on it.
+
+![tamland-as-part-of-stack](images/tamland-as-part-of-stack.png)
+
+### Storage for output and cache
+
+Any data Tamland relies on is stored in an S3 bucket. We use one bucket per tenant to clearly separate data between tenants.
+
+1. Resulting forecast data and other outputs
+1. Tamland's internal cache for Prometheus metrics data
+
+There is no need for a persistent state across Tamland runs aside from the S3 bucket.
+
+### Benefits of executing inside tenant environments
+
+Each Tamland run for a single environment (tenant) can take a few hours to execute. With the number of tenants expected to increase significantly, we need to consider scaling the execution environment for Tamland.
+
+In this design, Tamland becomes a part of the Dedicated stack and a component of the individual tenant environment. As such, scaling the execution environment for Tamland is solved by design, because tenant forecasts inherently execute in parallel in their respective environments.
+
+### Distribution model: Docker
+
+Tamland is released as a Docker image, see [Tamland's README](https://gitlab.com/gitlab-com/gl-infra/tamland/-/blob/main/README.md) for further details.
+
+### Tamland manifest
+
+The manifest contains information about which saturation metrics to forecast on (see this [manifest example](https://gitlab.com/gitlab-com/gl-infra/tamland/-/blob/62854e1afbc2ed3160a55a738ea587e0cf7f994f/saturation.json) for GitLab.com). This will be generated from the metrics catalog and will be the same for all tenants for starters.
+
+In order to generate the manifest from the metrics catalog, we set up a dedicated GitLab project, `tamland-dedicated`. On a regular basis, a scheduled pipeline grabs the metrics catalog, generates the JSON manifest from it, and commits this to the project.
+
+On the Dedicated tenants, we download the latest version of the committed JSON manifest from `tamland-dedicated` and use this as input to execute Tamland.
+
+### Acting on forecast insights
+
+When Tamland forecast data is available for a tenant, the Dedicated teams consume this data and act on it accordingly. The Scalability::Projections group is going to support and guide this process to get started and help interpret data, along with implementing Tamland features required to streamline this process for Dedicated in further iterations.
+
+## Alternative Solution
+
+### Tamland as a Service (not chosen)
+
+An alternative design, which we don't consider an option at this point, is to set up Tamland as a service and run it fully **outside** of tenant environments.
+
+![tamland-as-a-service](images/tamland-as-a-service.png)
+
+In this design, a central Prometheus/Thanos instance is needed to provide the metrics data for Tamland. Dedicated tenants use remote-write to push their Prometheus data to the central Thanos instance.
+
+Tamland is set up to run on a regular basis and consume metrics data from the single Thanos instance. It stores its results and cache in S3, similar to the other design.
+
+In order to execute forecasts regularly, we need to provide an execution environment to run Tamland in. With an increasing number of tenants, we'd need to scale up resources for this cluster.
+
+This design **has not been chosen** because of both technical and organisational concerns:
+
+1. Our central Thanos instance currently doesn't have metrics data for Dedicated tenants as of the start of FY24Q3.
+1. Extra work is required to set up a scalable execution environment.
+1. Thanos is considered a bottleneck as it provides data for all tenants and this poses a risk of overloading it when we execute the forecasting for a high number of tenants.
+1. We strive to build out Tamland into a tool of more general use. We expect a better outcome in terms of design, documentation and process efficiency by building it as a tool for other teams to use and not offering it as a service. In the long run, we might be able to integrate Tamland (as a tool) inside self-managed environments or publish Tamland as an open source forecasting tool. This would not be feasible if we were hosting it as a service.
diff --git a/doc/architecture/blueprints/cells/deployment-architecture.md b/doc/architecture/blueprints/cells/deployment-architecture.md
new file mode 100644
index 00000000000..57dabd447b4
--- /dev/null
+++ b/doc/architecture/blueprints/cells/deployment-architecture.md
@@ -0,0 +1,155 @@
+---
+stage: enablement
+group: Tenant Scale
+description: 'Cells: Deployment Architecture'
+---
+
+# Cells: Deployment Architecture
+
+This section describes the existing deployment architecture
+of GitLab.com and contrasts it with the expected Cells architecture.
+
+## 1. Before Cells - Monolithic architecture
+
+<img src="diagrams/deployment-before-cells.drawio.png" width="800">
+
+The diagram represents simplified GitLab.com deployment components before the introduction of a Cells architecture.
+This diagram intentionally omits some services that are not relevant to the architecture overview (Cloudflare, Consul, PgBouncers, ...).
+Those services are considered to be Cell-local, with the exception of Cloudflare.
+
+The component blocks are:
+
+- Separate components that can be deployed independently.
+- Components that are independent from other components and offer a wide range of version compatibility.
+
+The application layer services are:
+
+- Strongly interconnected and require running the same version of the application.
+ Read more in [!131657](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/131657#note_1563513431).
+- Each service is run across many nodes and scaled horizontally to provide sufficient throughput.
+- Services that interact with other services using an API (REST, gRPC), Redis or DB.
+
+The dependent services are:
+
+- Updated infrequently and selectively.
+- Might use cloud managed services.
+- Each service is clustered and might be run across different availability zones to provide high availability.
+- Object storage is also accessible directly to users if a pre-signed URL is provided.
+
+## 2. Development Cells - Adapting application to Cellular architecture
+
+<img src="diagrams/deployment-development-cells.drawio.png" width="800">
+
+The purpose of **Development Cells** is to model a production-like architecture for testing and validating the changes introduced.
+This could be achieved by testing Cells on top of the [Reference Architectures](../../../administration/reference_architectures/index.md).
+Read more in [#425197](https://gitlab.com/gitlab-org/gitlab/-/issues/425197).
+
+The differences compared to [Before Cells](#1-before-cells---monolithic-architecture) are:
+
+- A Routing Service is developed by Cells.
+- Development Cells are meant to be run in a development environment only, to allow prototyping of Cells without the overhead of managing all auxiliary services.
+- Development Cells represent a simplified GitLab.com architecture by focusing only on essential services required to be split.
+- Development Cells are not meant to be used in production.
+- Cluster-wide data sharing is done with a read-write connection to the main database of Cell 1: PostgreSQL main database, and Redis user-sessions database.
+
+## 3. Initial Cells deployment - Transforming monolithic architecture to Cells architecture
+
+<img src="diagrams/deployment-initial-cells.drawio.png" width="800">
+
+The differences compared to [Development Cells](#2-development-cells---adapting-application-to-cellular-architecture) are:
+
+- A Cluster-wide Data Provider is introduced by Cells.
+- The Cluster-wide Data Provider is deployed with Cell 1 to be able to access cluster-wide data directly.
+- The cluster-wide database is isolated from the main PostgreSQL database.
+- A Cluster-wide Data Provider is responsible for storing and sharing user data,
+ user sessions (currently stored in Redis sessions cluster), routing information
+ and cluster-wide settings across all Cells.
+- Access to the cluster-wide database is done asynchronously (see the sketch after this list):
+ - Read access always uses a database replica.
+ - A database replica might be deployed with the Cell.
+ - Write access uses the dedicated Cluster-wide Data Provider service.
+- Additional Cells are deployed, upgraded and maintained via a [GitLab Dedicated-like](../../../subscriptions/gitlab_dedicated/index.md) control plane.
+- Each Cell aims to run as many services as possible in isolation.
+- A Cell can run its own Gitaly cluster, or can use a shared Gitaly cluster, or both.
+ Read more in [!131657](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/131657#note_1569151454).
+- Shared Runners provided by GitLab are expected to be run locally on the Cell.
+- Infrastructure components might be shared across the cluster and be used by different Cells.
+- It is undecided whether the Elasticsearch service is better run cluster-wide or Cell-local.
+- The decision on how to scale the **GitLab Pages - `gitlab.io`** component is delayed.
+- The decision on how to scale the **Registry - `registry.gitlab.com`** component is delayed.
+
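+A minimal sketch of the asynchronous access pattern in the list above, written in Go purely for illustration.
+The type and method names and the HTTP write API of the Cluster-wide Data Provider are assumptions, not a committed interface:
+
+```go
+package clusterdata
+
+import (
+	"context"
+	"database/sql"
+	"fmt"
+	"net/http"
+	"strings"
+)
+
+// Client reads cluster-wide data from a local replica and sends writes to the
+// Cluster-wide Data Provider service. All names are illustrative assumptions.
+type Client struct {
+	replica         *sql.DB // read-only replica, possibly deployed with the Cell
+	dataProviderURL string  // internal URL of the Cluster-wide Data Provider
+	httpClient      *http.Client
+}
+
+// LookupUsername resolves a username using the replica only
+// (the asynchronous, eventually consistent read path).
+func (c *Client) LookupUsername(ctx context.Context, userID int64) (string, error) {
+	var username string
+	err := c.replica.QueryRowContext(ctx,
+		"SELECT username FROM users WHERE id = $1", userID).Scan(&username)
+	return username, err
+}
+
+// UpdateUsername goes through the write API of the Cluster-wide Data Provider;
+// a Cell never writes to the cluster-wide database directly.
+func (c *Client) UpdateUsername(ctx context.Context, userID int64, username string) error {
+	body := strings.NewReader(fmt.Sprintf(`{"username":%q}`, username))
+	req, err := http.NewRequestWithContext(ctx, http.MethodPut,
+		fmt.Sprintf("%s/users/%d", c.dataProviderURL, userID), body)
+	if err != nil {
+		return err
+	}
+	req.Header.Set("Content-Type", "application/json")
+	resp, err := c.httpClient.Do(req)
+	if err != nil {
+		return err
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode >= 300 {
+		return fmt.Errorf("data provider returned %s", resp.Status)
+	}
+	return nil
+}
+```
+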
+## 4. Hybrid Cells deployment - Initial complete Cells architecture
+
+<img src="diagrams/deployment-hybrid-cells.drawio.png" width="800">
+
+The differences compared to [Initial Cells deployment](#3-initial-cells-deployment---transforming-monolithic-architecture-to-cells-architecture) are:
+
+- Removes coupling of Cell N to Cell 1.
+- The Cluster-wide Data Provider is isolated from Cell 1.
+- The cluster-wide databases (PostgreSQL, Redis) are moved to be run with the Cluster-wide Data Provider.
+- All application data access paths to cluster-wide data use the Cluster-wide Data Provider.
+- Some services are shared across Cells.
+
+## 5. Target Cells - Fully isolated Cells architecture
+
+<img src="diagrams/deployment-target-cells.drawio.png" width="800">
+
+The differences compared to [Hybrid Cells deployment](#4-hybrid-cells-deployment---initial-complete-cells-architecture) are:
+
+- The Routing Service is expanded to support [GitLab Pages](../../../user/project/pages/index.md) and [GitLab Container Registry](../../../user/packages/container_registry/index.md).
+- Each Cell has all services isolated.
+- It is allowed that some Cells will follow a [hybrid architecture](#4-hybrid-cells-deployment---initial-complete-cells-architecture).
+
+## Isolation of Services
+
+Each service can be considered individually regarding its requirements, the risks associated
+with scaling the service, its location (cluster-wide or Cell-local), and impact on our ability to migrate data between Cells.
+
+### Cluster-wide services
+
+| Service | Type | Uses | Description |
+| ------------------------------ | ------------ | ------------------------------- | --------------------------------------------------------------------------------------------------- |
+| **Routing Service**            | GitLab-built | Cluster-wide Data Provider      | A general-purpose routing service that redirects requests from all GitLab SaaS domains to the correct Cell (see the sketch below) |
+| **Cluster-wide Data Provider** | GitLab-built | PostgreSQL, Redis, Event Queue? | Provides user profile and routing information to all clustered services |
+
+As per the architecture, the above services are required to be run cluster-wide:
+
+- Those are additional services that are introduced by the Cells architecture.
+
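+The following sketch, in Go, illustrates one way the Routing Service could use routing information cached from the
+Cluster-wide Data Provider to resolve a request to a Cell. It is an illustration of the concept only; the names and
+the namespace-based lookup are assumptions, not the actual Routing Service design:
+
+```go
+package routing
+
+import (
+	"errors"
+	"strings"
+	"sync"
+)
+
+// CellAddress is the backend address of a Cell, for example an internal
+// load balancer URL. Illustrative only.
+type CellAddress string
+
+// RoutingTable caches routing information fetched from the Cluster-wide
+// Data Provider and maps top-level namespaces to Cells.
+type RoutingTable struct {
+	mu    sync.RWMutex
+	cells map[string]CellAddress // top-level namespace -> Cell
+}
+
+// Resolve picks the Cell that owns the top-level namespace of a request
+// path such as "/gitlab-org/gitlab/-/issues".
+func (t *RoutingTable) Resolve(path string) (CellAddress, error) {
+	segments := strings.Split(strings.TrimPrefix(path, "/"), "/")
+	if len(segments) == 0 || segments[0] == "" {
+		return "", errors.New("cannot classify request without a namespace")
+	}
+
+	t.mu.RLock()
+	defer t.mu.RUnlock()
+	if cell, ok := t.cells[segments[0]]; ok {
+		return cell, nil
+	}
+	return "", errors.New("unknown namespace: " + segments[0])
+}
+```
+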
+### Cell-local services
+
+| Service                    | Type            | Uses                                | Migrate between Cells | Description |
+| -------------------------- | --------------- | ----------------------------------- | --------------------- | ----------- |
+| **Redis Cluster**          | Managed service | Disk storage                        | No problem            | Redis is used to hold user sessions, application caches, or Sidekiq queues. Most of that data is only applicable to Cells. |
+| **GitLab Runners Manager** | Managed service | API, uses Google Cloud VM Instances | No problem            | Significant changes required to API and execution of CI jobs |
+
+As per the architecture, the above services are required to be run Cell-local:
+
+- The consumer data held by the Cell-local services needs to be migratable to another Cell.
+- The compute generated by the service is substantial, and keeping it Cell-local is strongly desired to reduce the impact of a [single Cell failure](goals.md#high-resilience-to-a-single-cell-failure).
+- It is complex to run the service cluster-wide from the Cells architecture perspective.
+
+### Hybrid Services
+
+| Service             | Type            | Uses                            | Migrate from cluster-wide to Cell                                        | Description |
+| ------------------- | --------------- | ------------------------------- | ------------------------------------------------------------------------ | ----------- |
+| **GitLab Pages**    | GitLab-built    | Routing Service, Rails API      | No problem                                                               | Serving CI-generated pages under `.gitlab.io` or custom domains |
+| **GitLab Registry** | GitLab-built    | Object Storage, PostgreSQL      | Non-trivial data migration in case of a split                            | Service to provide the GitLab Container Registry |
+| **Gitaly Cluster**  | GitLab-built    | Disk storage, PostgreSQL        | No problem: Built-in migration routines to balance Gitaly nodes          | Gitaly holds Git repository data. Many Gitaly clusters can be configured in the application. |
+| **Elasticsearch**   | Managed service | Many nodes required by sharding | Time consuming: Rebuild cluster from scratch                             | Search across all projects |
+| **Object Storage**  | Managed service |                                 | Not straightforward: Rather hard to selectively migrate between buckets  | Holds all user- and CI-uploaded files that are served by GitLab |
+
+As per the architecture, the above services are allowed to be run either cluster-wide or Cell-local:
+
+- The ability to run hybrid services cluster-wide might reduce the amount of work to migrate data between Cells due to some services being shared.
+- The hybrid services that are run cluster-wide might negatively impact Cell availability and resiliency due to increased impact caused by [single Cell failure](goals.md#high-resilience-to-a-single-cell-failure).
diff --git a/doc/architecture/blueprints/cells/diagrams/deployment-before-cells.drawio.png b/doc/architecture/blueprints/cells/diagrams/deployment-before-cells.drawio.png
new file mode 100644
index 00000000000..5e9a2227781
--- /dev/null
+++ b/doc/architecture/blueprints/cells/diagrams/deployment-before-cells.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/cells/diagrams/deployment-development-cells.drawio.png b/doc/architecture/blueprints/cells/diagrams/deployment-development-cells.drawio.png
new file mode 100644
index 00000000000..07e2d91adad
--- /dev/null
+++ b/doc/architecture/blueprints/cells/diagrams/deployment-development-cells.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/cells/diagrams/deployment-hybrid-cells.drawio.png b/doc/architecture/blueprints/cells/diagrams/deployment-hybrid-cells.drawio.png
new file mode 100644
index 00000000000..248842c4e8f
--- /dev/null
+++ b/doc/architecture/blueprints/cells/diagrams/deployment-hybrid-cells.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/cells/diagrams/deployment-initial-cells.drawio.png b/doc/architecture/blueprints/cells/diagrams/deployment-initial-cells.drawio.png
new file mode 100644
index 00000000000..0650f948ce4
--- /dev/null
+++ b/doc/architecture/blueprints/cells/diagrams/deployment-initial-cells.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/cells/diagrams/deployment-target-cells.drawio.png b/doc/architecture/blueprints/cells/diagrams/deployment-target-cells.drawio.png
new file mode 100644
index 00000000000..86360e5eecb
--- /dev/null
+++ b/doc/architecture/blueprints/cells/diagrams/deployment-target-cells.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/cells/diagrams/index.md b/doc/architecture/blueprints/cells/diagrams/index.md
index 14db888382e..6d5d54acdb9 100644
--- a/doc/architecture/blueprints/cells/diagrams/index.md
+++ b/doc/architecture/blueprints/cells/diagrams/index.md
@@ -24,12 +24,14 @@ To create a diagram from a file:
1. Copy existing file and rename it. Ensure that the extension is `.drawio.png` or `.drawio.svg`.
1. Edit the diagram.
1. Save the file.
+1. Optimize images with `pngquant -f --ext .png *.drawio.png` to reduce their size by 2-3x.
To create a diagram from scratch using [draw.io desktop](https://github.com/jgraph/drawio-desktop/releases):
1. In **File > New > Create new diagram**, select **Blank diagram**.
1. In **File > Save As**, select **Editable Bitmap .png**, and save with `.drawio.png` extension.
-1. To improve image quality, in **File > Properties**, set **Zoom** to **400%**.
+1. To improve image quality, in **File > Properties**, set **Zoom** to **200%**.
1. To save the file with the new zoom setting, select **File > Save**.
+1. Optimize images with `pngquant -f --ext .png *.drawio.png` to reduce their size by 2-3x.
DO NOT use the **File > Export** function. The diagram should be embedded into `.png` for easy editing.
diff --git a/doc/architecture/blueprints/cells/impacted_features/contributions-forks.md b/doc/architecture/blueprints/cells/impacted_features/contributions-forks.md
index 2053b87b125..ccac5a24718 100644
--- a/doc/architecture/blueprints/cells/impacted_features/contributions-forks.md
+++ b/doc/architecture/blueprints/cells/impacted_features/contributions-forks.md
@@ -53,7 +53,9 @@ From a [data exploration](https://gitlab.com/gitlab-data/product-analytics/-/iss
- The remaining 14% are forked from a source Project within a different company.
- 9% of top-level Groups (95k) with activity in the last 12 months have a project with a fork relationship, compared to 5% of top-level Groups (91k) with no activity in the last 12 months. We expect these top-level Groups to be impacted by Cells.
-## 3. Proposal - Forks are created in a dedicated contribution space of the current Organization
+## 3. Proposals
+
+### 3.1. Forks are created in a dedicated contribution space of the current Organization
Instead of creating Projects across Organizations, forks are created in a contribution space tied to the Organization.
A contribution space is similar to a personal namespace but rather than existing in the default Organization, it exists within the Organization someone is trying to contribute to.
@@ -74,11 +76,9 @@ Example:
- Data in contribution spaces do not contribute to customer usage from a billing perspective.
- Today we do not have organization-scoped runners but if we do implement that they will likely need special settings for how or if they can be used by contribution space projects.
-## 4. Alternative proposals considered
-
-### 4.1. Intra-cluster forks
+### 3.2. Intra-cluster forks
-This proposal implements forks as intra-Cluster forks where communication is done via API between all trusted Cells of a cluster:
+This proposal implements forks as intra-cluster forks where communication is done via API between all trusted Cells of a cluster:
- Forks are created always in the context of a user's choice of Group.
- Forks are isolated from the Organization.
@@ -98,7 +98,7 @@ Cons:
- However, this is no different to the ability of users today to clone a repository to a local computer and push it to any repository of choice.
- Access control of the source Project can be lower than that of the target Project. Today, the system requires that in order to contribute back, the access level needs to be the same for fork and upstream Project.
-### 4.2. Forks are created as internal Projects under current Projects
+### 3.3. Forks are created as internal Projects under current Projects
Instead of creating Projects across Organizations, forks are attachments to existing Projects.
Each user forking a Project receives their unique Project.
@@ -114,13 +114,37 @@ Cons:
- Does not answer how to handle and migrate all existing forks.
- Might share current Group/Project settings, which could be breaking some security boundaries.
-## 5. Evaluation
+### 3.4. Forks are created in personal namespaces of the current Organization
+
+Every User can potentially have a personal namespace in each public Organization.
+On the first visit to an Organization the User will receive a personal namespace scoped to that Organization.
+A User can fork into a personal namespace provided the upstream repository is in the same Organization as the personal namespace.
+Removal of an Organization will remove any personal namespaces in the Organization.
+
+Pros:
+
+- We re-use most existing code paths.
+- We re-use most existing product design rules.
+- Organization boundaries are naturally isolated.
+- Multiple personal namespaces will mean Users can divide personal Projects across Organizations instead of having them mixed together.
+- We expect most Users to work in one Organization, which means that the majority of them would not need to remember in which Organization they stored each of their personal Projects.
+
+Cons:
+
+- Redundant personal namespaces will be created. We expect to improve this in future iterations.
+- Multiple personal namespaces could be difficult to navigate, especially when working across a large number of Organizations. We expect this to be an edge case.
+- The life cycle of personal namespaces will depend on the Organization, as is already the case for privately owned user accounts (such as Enterprise Users) and for self-managed installations that are not public.
+- Organization personal namespaces will need new URL paths.
+- The legacy personal namespace path will need to be adapted.
+
+URL path changes are under [discussion](https://gitlab.com/gitlab-org/gitlab/-/issues/427367).
-### 5.1. Pros
+## 4. Evaluation
-### 5.2. Cons
+We will follow [3.4. Forks are created in personal namespaces of the current Organization](#34-forks-are-created-in-personal-namespaces-of-the-current-organization) because it has already solved a lot of the hard problems.
+The shortfalls of this solution, like reworking URL paths or handling multiple personal namespaces, are manageable and less critical than the problems created by the alternative proposals.
-## 6. Example
+## 5. Example
As an example, we will demonstrate the impact of this proposal for the case that we move `gitlab-org/gitlab` to a different Organization.
`gitlab-org/gitlab` has [over 8K forks](https://gitlab.com/gitlab-org/gitlab/-/forks).
@@ -128,9 +152,9 @@ As an example, we will demonstrate the impact of this proposal for the case that
### Does this direction impact the canonical URLs of those forks?
Yes canonical URLs will change for forks.
-Existing users that have forks in personal namespaces and want to continue contributing merge requests, will be required to migrate their fork to a new fork in a contribution space.
-For example, a personal namespace fork at `https://gitlab.com/DylanGriffith/gitlab` will
-need to be migrated to `https://gitlab.com/-/contributions/gitlab-inc/@DylanGriffith/gitlab`.
+Specific path changes are under [discussion](https://gitlab.com/gitlab-org/gitlab/-/issues/427367).
+Existing Users that have forks in legacy personal namespaces and want to continue contributing merge requests, will be required to migrate their fork to their personal namespace in the source project Organization.
+For example, a personal namespace fork at `https://gitlab.com/DylanGriffith/gitlab` will need to be migrated to `https://gitlab.com/-/organizations/gitlab-inc/@DylanGriffith/gitlab`.
We may offer automated ways to move this, but manually the process would involve:
1. Create the contribution space fork
@@ -140,12 +164,11 @@ We may offer automated ways to move this, but manually the process would involve
### Does it impact the Git URL of the repositories themselves?
Yes.
-In the above the example the Git URL would change from
-`gitlab.com:DylanGriffith/gitlab.git` to `gitlab.com:/-/contributions/gitlab-inc/@DylanGriffith/gitlab.git`.
+Specific path changes are under [discussion](https://gitlab.com/gitlab-org/gitlab/-/issues/427367).
### Would there be any user action required to accept their fork being moved within an Organization or towards a contribution space?
-If we offer an automated process we'd present this as an option for the user as they will become the new owner of the contribution space.
+No. If the Organization is public, then a user will have a personal namespace.
### Can we make promises that we will not break the existing forks of public Projects hosted on GitLab.com?
diff --git a/doc/architecture/blueprints/cells/impacted_features/group-transfer.md b/doc/architecture/blueprints/cells/impacted_features/group-transfer.md
new file mode 100644
index 00000000000..3b361a1459f
--- /dev/null
+++ b/doc/architecture/blueprints/cells/impacted_features/group-transfer.md
@@ -0,0 +1,28 @@
+---
+stage: enablement
+group: Tenant Scale
+description: 'Cells: Group Transfer'
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+This document is a work-in-progress and represents a very early state of the Cells design.
+Significant aspects are not documented, though we expect to add them in the future.
+This is one possible architecture for Cells, and we intend to contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that we can document the reasons for not choosing this approach.
+
+# Cells: Group Transfer
+
+> TL;DR
+
+## 1. Definition
+
+## 2. Data flow
+
+## 3. Proposal
+
+## 4. Evaluation
+
+## 4.1. Pros
+
+## 4.2. Cons
diff --git a/doc/architecture/blueprints/cells/impacted_features/issues.md b/doc/architecture/blueprints/cells/impacted_features/issues.md
new file mode 100644
index 00000000000..3ae5056240f
--- /dev/null
+++ b/doc/architecture/blueprints/cells/impacted_features/issues.md
@@ -0,0 +1,28 @@
+---
+stage: enablement
+group: Tenant Scale
+description: 'Cells: Issues'
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+This document is a work-in-progress and represents a very early state of the Cells design.
+Significant aspects are not documented, though we expect to add them in the future.
+This is one possible architecture for Cells, and we intend to contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that we can document the reasons for not choosing this approach.
+
+# Cells: Issues
+
+> TL;DR
+
+## 1. Definition
+
+## 2. Data flow
+
+## 3. Proposal
+
+## 4. Evaluation
+
+## 4.1. Pros
+
+## 4.2. Cons
diff --git a/doc/architecture/blueprints/cells/impacted_features/merge-requests.md b/doc/architecture/blueprints/cells/impacted_features/merge-requests.md
new file mode 100644
index 00000000000..4cbc1134feb
--- /dev/null
+++ b/doc/architecture/blueprints/cells/impacted_features/merge-requests.md
@@ -0,0 +1,28 @@
+---
+stage: enablement
+group: Tenant Scale
+description: 'Cells: Merge Requests'
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+This document is a work-in-progress and represents a very early state of the Cells design.
+Significant aspects are not documented, though we expect to add them in the future.
+This is one possible architecture for Cells, and we intend to contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that we can document the reasons for not choosing this approach.
+
+# Cells: Merge Requests
+
+> TL;DR
+
+## 1. Definition
+
+## 2. Data flow
+
+## 3. Proposal
+
+## 4. Evaluation
+
+## 4.1. Pros
+
+## 4.2. Cons
diff --git a/doc/architecture/blueprints/cells/impacted_features/personal-namespaces.md b/doc/architecture/blueprints/cells/impacted_features/personal-namespaces.md
index 55d974bb351..757f83c32d3 100644
--- a/doc/architecture/blueprints/cells/impacted_features/personal-namespaces.md
+++ b/doc/architecture/blueprints/cells/impacted_features/personal-namespaces.md
@@ -13,14 +13,14 @@ This documentation will be kept even if we decide not to implement this so that
# Cells: Personal Namespaces
-Personal Namespaces do not easily fit with our overall architecture in Cells because the Cells architecture depends on all data belonging to a single Organization.
+Personal Namespaces do not easily fit with our overall architecture in Cells, because the Cells architecture depends on all data belonging to a single Organization.
When Users are allowed to work across multiple Organizations there is no natural fit for picking a single Organization to store personal Namespaces and their Projects.
-One important engineering constraint in Cells will be that data belonging to some Organization should not be linked to data belonging to another Organization.
-And specifically that functionality in GitLab can be scoped to a single Organization at a time.
+One important engineering constraint in Cells will be that data belonging to one Organization should not be linked to data belonging to another Organization.
+Specifically, functionality in GitLab should be scoped to a single Organization at a time.
This presents a challenge for personal Namespaces as forking is one of the important workloads for personal Namespaces.
Functionality related to forking and the UI that presents forked MRs to users will often require data from both the downstream and upstream Projects at the same time.
-Implementing such functionality would be very difficult if that data belonged in different Organizations stored on different
+Implementing such functionality would be very difficult if that data belonged to different Organizations stored on different
Cells.
This is especially the case with the merge request, as it is one of the most complicated and performance critical features in GitLab.
@@ -46,26 +46,98 @@ As described above, personal Namespaces serve two purposes today:
1. A place for users to store forks when they want to contribute to a Project where they don't have permission to push a branch.
In this proposal we will only focus on (1) and assume that (2) will be replaced by suitable workflows described in [Cells: Contributions: Forks](../impacted_features/contributions-forks.md).
-
Since we plan to move away from using personal Namespaces as a home for storing forks, we can assume that the main remaining use case does not need to support cross-Organization linking.
-In this case the easiest thing to do is to keep all personal Namespaces in the default Organization.
-Depending on the amount of workloads happening in personal Namespaces we may be required in the future to migrate them to different Cells.
-This may necessitate that they all get moved to some Organization created just for the user.
+
+### 3.1. One personal Namespace that can move between Organizations
+
+For existing Users personal Namespaces will exist within the default Organization in the short term.
+This implies that all Users will, at first, have an association to the default Organization via their personal Namespace.
+When a new Organization is created, new Users can be created in that Organization as well.
+A new User's personal Namespace will be associated with that new Organization, rather than the default.
+Also, Users can become members of Organizations other than the default Organization.
+In this case, they will have to switch to the default Organization to access their personal Namespace until we have defined a way for them to move their personal Namespace into a different Home Organization.
+Doing so may necessitate that personal Namespaces are converted to Groups before being moved.
+When an Organization is deleted, we will need to decide what should happen with the personal Namespaces associated with it.
If we go this route, there may be breakage similar to what will happen to when we move Groups or Projects into their own Organization, though the full impact may need further investigation.
This decision, however, means that existing personal Namespaces that were used as forks to contribute to some upstream Project will become disconnected from the upstream as soon as the upstream moves into an Organization.
On GitLab.com 10% of all projects in personal Namespaces are forks.
This may be a slightly disruptive workflow but as long as the forks are mainly just storing branches used in merge requests then it may be reasonable to ask the affected users to recreate the fork in the context of the Organization.
-For existing Users, we suggest to keep their existing personal Namespaces in the default Organization.
-New Users joining an Organization other than the default Organization will also have their personal Namespace hosted on the default Organization. Having all personal Namespaces in the default Organization means we don't need to worry about deletion of the parent organization and the impact of that on personal Namespaces, which would be the case if they existed in other organizations.
-This implies that all Users will have an association to the default Organization via their personal Namespace, requiring them to switch to the default Organization to access their personal Namespace.
-
We will further explore the idea of a `contribution space` to give Users a place to store forks when they want to contribute to a Project where they don't have permission to push a branch.
That discussion will be handled as part of the larger discussion of the [Cells impact on forks](../impacted_features/contributions-forks.md).
-## 4. Evaluation
+Pros:
+
+- Easy access to personal Namespace via a User's Home Organization. We expect most Users to work in only a single Organization.
+- Contribution graph would remain intact for Users that only work in one Organization, because their personal and organizational activity would be aggregated as part of the same Organization.
+
+Cons:
+
+- A transfer mechanism to move personal Namespaces between Organizations would need to be built, which is extremely complex. This would be in violation of the current Cells architecture, because Organizations can be located on different Cells. To make this possible, we would need to break Organization isolation.
+- High risk that transfer between Organizations would lead to breaking connections and data loss.
+- [Converting personal Namespaces to Groups](../../../../tutorials/convert_personal_namespace_to_group/index.md) before transfer is not a straightforward process.
+
+### 3.2. One personal Namespace that remains in the default Organization
+
+For existing Users personal Namespaces will exist within the default Organization in the short term.
+This implies that all Users will, at first, have an association to the default Organization via their personal Namespace.
+New Users joining GitLab as part of an Organization other than the default Organization would also receive a personal Namespace in the default Organization.
+Organizations other than the default Organization would not contain personal Namespaces.
+
+Pros:
+
+- No transfer mechanism necessary.
+
+Cons:
+
+- Users that are part of multiple Organizations need to remember that their personal content is stored in the default Organization. To access it, they would have to switch back to the default Organization.
+- New Users might not understand why they are part of the default Organization.
+- Some impact on the User Profile page. No personal Projects would be shown in Organizations other than the default Organization. This would result in a lot of whitespace on the page. The `Personal projects` list would need to be reworked as well.
-## 4.1. Pros
+### 3.3. One personal Namespace in each Organization
+
+For existing Users personal Namespaces will exist within the default Organization in the short term.
+As new Organizations are created, Users receive additional personal Namespaces for each Organization they interact with.
+For instance, when a User views a Group or Project in an Organization, a personal Namespace is created.
+This is necessary to ensure that community contributors will be able to continue contributing to Organizations without becoming a member.
+
+Pros:
+
+- Content of personal Projects is owned by the Organization. Low risk for enterprises to leak content outside of their organizational boundaries.
+- No transfer mechanism necessary.
+- No changes to the User Profile page are necessary.
+- Users can keep personal Projects in each Organization they work in.
+- No contribution space for [forking](../impacted_features/contributions-forks.md) necessary.
+- No need to make the default Organization function differently than other Organizations.
+
+Cons:
+
+- Users have to remember which personal content they store in each Organization.
+- Personal content would be owned by the Organization. However, this would be similar to how self-managed operates today and might be desired by enterprises.
+
+### 3.4. Discontinue personal Namespaces
+
+All existing personal Namespaces are converted into Groups.
+The Group path is identical to the current username.
+Upon Organization release, these Groups would be part of the default Organization.
+We disconnect Users from the requirement of having personal Namespaces, making the User a truly global entity.
+
+Pros:
+
+- Users would receive the ability to organize personal Projects into Groups, which is a highly requested feature.
+- No need to create personal Namespaces upon User creation.
+- No path changes necessary for existing personal Projects.
+
+Cons:
+
+- A concept of personal Groups would need to be established.
+- It is unclear how @-mentions would work. Currently it is possible to tag individual Users and Groups. Following the existing logic, all group members belonging to a personal Group would be tagged.
+- Significant impact on the User Profile page. Personal Projects would be disconnected from the User Profile page and possibly replaced by new functionality to highlight specific Projects selected by the User (via starring or pinning).
+- It is unclear whether Groups could be migrated between Organizations using the same mechanism as needed to migrate top-level Groups. We expect this functionality to be highly limited at least in the mid-term. Similar transfer limitations as described in [section 3.1.](#31-one-personal-namespace-that-can-move-between-organizations) are expected.
+
+## 4. Evaluation
-## 4.2. Cons
+The most straightforward solution requiring the least engineering effort is to create [one personal Namespace in each Organization](#33-one-personal-namespace-in-each-organization).
+We recognize that this solution is not ideal for users working across multiple Organizations, but find this acceptable due to our expectation that most users will mainly work in one Organization.
+At a later point, this concept will be reviewed and possibly replaced with a better solution.
diff --git a/doc/architecture/blueprints/cells/impacted_features/project-transfer.md b/doc/architecture/blueprints/cells/impacted_features/project-transfer.md
new file mode 100644
index 00000000000..e5fb11c21a9
--- /dev/null
+++ b/doc/architecture/blueprints/cells/impacted_features/project-transfer.md
@@ -0,0 +1,28 @@
+---
+stage: enablement
+group: Tenant Scale
+description: 'Cells: Project Transfer'
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+This document is a work-in-progress and represents a very early state of the Cells design.
+Significant aspects are not documented, though we expect to add them in the future.
+This is one possible architecture for Cells, and we intend to contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that we can document the reasons for not choosing this approach.
+
+# Cells: Project Transfer
+
+> TL;DR
+
+## 1. Definition
+
+## 2. Data flow
+
+## 3. Proposal
+
+## 4. Evaluation
+
+## 4.1. Pros
+
+## 4.2. Cons
diff --git a/doc/architecture/blueprints/cells/index.md b/doc/architecture/blueprints/cells/index.md
index 28414f9b68c..1366d308487 100644
--- a/doc/architecture/blueprints/cells/index.md
+++ b/doc/architecture/blueprints/cells/index.md
@@ -18,7 +18,13 @@ Cells is a new architecture for our software as a service platform. This archite
For more information about Cells, see also:
-- [Goals, Glossary and Requirements](goals.md)
+## Goals
+
+See [Goals, Glossary and Requirements](goals.md).
+
+## Deployment Architecture
+
+See [Deployment Architecture](deployment-architecture.md).
## Work streams
@@ -87,11 +93,11 @@ The first 2-3 quarters are required to define a general split of data and build
The Admin Area section for the most part is shared across a cluster.
-1. **User accounts are shared across cluster.**
+1. **User accounts are shared across cluster.** ✓
The purpose is to make `users` cluster-wide.
-1. **User can create Group.**
+1. **User can create Group.** ✓ ([demo](https://www.youtube.com/watch?v=LUyV0ncfdRs))
The purpose is to perform a targeted decomposition of `users` and `namespaces`, because `namespaces` will be stored locally in the Cell.
@@ -115,9 +121,13 @@ The first 2-3 quarters are required to define a general split of data and build
The purpose is that `ci_pipelines` (like `ci_stages`, `ci_builds`, `ci_job_artifacts`) and adjacent tables are properly attributed to be Cell-local.
-1. **User can create issue, merge request, and merge it after it is green.**
+1. **User can create issue.**
- The purpose is to ensure that `issues` and `merge requests` are properly attributed to be `Cell-local`.
+ The purpose is to ensure that `issues` are properly attributed to be `Cell-local`.
+
+1. **User can create merge request, and merge it after it is green.**
+
+ The purpose is to ensure `merge requests` are properly attributed to be `Cell-local`.
1. **User can manage Group and Project members.**
@@ -265,34 +275,34 @@ One iteration describes one quarter's worth of work.
- Data access layer: Initial Admin Area settings are shared across cluster.
- Essential workflows: Allow to share cluster-wide data with database-level data access layer
-1. [Iteration 2](https://gitlab.com/groups/gitlab-org/-/epics/9813) - Expected delivery: 16.2 FY24Q2 | Actual delivery: 16.4 FY24Q3 - In progress
+1. [Iteration 2](https://gitlab.com/groups/gitlab-org/-/epics/9813) - Expected delivery: 16.2 FY24Q2, Actual delivery: 16.4 FY24Q3 - Complete
- Essential workflows: User accounts are shared across cluster.
- Essential workflows: User can create Group.
-1. [Iteration 3](https://gitlab.com/groups/gitlab-org/-/epics/10997) - Expected delivery: 16.7 FY24Q4 - Planned
+1. [Iteration 3](https://gitlab.com/groups/gitlab-org/-/epics/10997) - Expected delivery: 16.7 FY24Q4 - In Progress
- Essential workflows: User can create Project.
- Routing: Technology.
- Routing: Cell discovery.
- - Data access layer: Evaluate the efficiency of database-level access vs. API-oriented access layer.
- - Data access layer: Data access layer.
1. [Iteration 4](https://gitlab.com/groups/gitlab-org/-/epics/10998) - Expected delivery: 16.10 FY25Q1 - Planned
- - Essential workflows: User can create organization on Cell 2.
+ - Essential workflows: User can create Organization on Cell 2.
- Data access layer: Cluster-unique identifiers.
+ - Data access layer: Evaluate the efficiency of database-level access vs. API-oriented access layer.
+ - Data access layer: Data access layer.
- Routing: User can use single domain to interact with many Cells.
- Cell deployment: Extend GitLab Dedicated to support GCP.
1. Iteration 5..N - starting FY25Q1
- Essential workflows: User can push to Git repository.
- - Essential workflows: User can create issue, merge request, and merge it after it is green.
- Essential workflows: User can run CI pipeline.
- Essential workflows: Instance-wide settings are shared across cluster.
- Essential workflows: User can change profile avatar that is shared in cluster.
- - Essential workflows: User can create issue, merge request, and merge it after it is green.
+ - Essential workflows: User can create issue.
+ - Essential workflows: User can create merge request, and merge it after it is green.
- Essential workflows: User can manage Group and Project members.
- Essential workflows: User can manage instance-wide runners.
- Essential workflows: User is part of Organization and can only see information from the Organization.
@@ -317,6 +327,7 @@ Below is a list of known affected features with preliminary proposed solutions.
- [Cells: Admin Area](impacted_features/admin-area.md)
- [Cells: Backups](impacted_features/backups.md)
+- [Cells: CI/CD Catalog](impacted_features/ci-cd-catalog.md)
- [Cells: CI Runners](impacted_features/ci-runners.md)
- [Cells: Container Registry](impacted_features/container-registry.md)
- [Cells: Contributions: Forks](impacted_features/contributions-forks.md)
@@ -338,10 +349,13 @@ Below is a list of known affected features with preliminary proposed solutions.
The following list of impacted features only represents placeholders that still require work to estimate the impact of Cells and develop solution proposals.
- [Cells: Agent for Kubernetes](impacted_features/agent-for-kubernetes.md)
-- [Cells: CI/CD Catalog](impacted_features/ci-cd-catalog.md)
- [Cells: Data pipeline ingestion](impacted_features/data-pipeline-ingestion.md)
- [Cells: GitLab Pages](impacted_features/gitlab-pages.md)
+- [Cells: Group Transfer](impacted_features/group-transfer.md)
+- [Cells: Issues](impacted_features/issues.md)
+- [Cells: Merge Requests](impacted_features/merge-requests.md)
- [Cells: Personal Access Tokens](impacted_features/personal-access-tokens.md)
+- [Cells: Project Transfer](impacted_features/project-transfer.md)
- [Cells: Router Endpoints Classification](impacted_features/router-endpoints-classification.md)
- [Cells: Schema changes (Postgres and Elasticsearch migrations)](impacted_features/schema-changes.md)
- [Cells: Uploads](impacted_features/uploads.md)
@@ -407,7 +421,7 @@ The design goals of the Cells architecture describe that [all Cells are under a
- Cell-local features should be limited to those related to managing the Cell, but never be a feature where the Cell semantic is exposed to the customer.
- The Cells architecture wants to freely control the distribution of Organization and customer data across Cells without impacting users when data is migrated.
-
+
Cluster-wide features are strongly discouraged because:
- They might require storing a substantial amount of data cluster-wide which decreases [scalability headroom](goals.md#provides-100x-headroom).
diff --git a/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md b/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md
index 29b2bd0fd28..104a6ee2136 100644
--- a/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md
+++ b/doc/architecture/blueprints/ci_builds_runner_fleet_metrics/index.md
@@ -84,16 +84,20 @@ so we only use `finished` builds.
### Developing behind feature flags
It's hard to fully test data ingestion and query performance in development/staging environments.
-That's why we plan to deliver those features to production behing feature flags and test the performance on real data.
-Feature flags for data ingestion and API's will be separate.
+That's why we plan to deliver those features to production behind feature flags and test the performance on real data.
+Feature flags for data ingestion and APIs will be separate.
### Data ingestion
-A background worker will push `ci_builds` sorted by `(finished_at, id)` from Posgres to ClickHouse.
-Every time the worker starts, it will find the most recently inserted build and continue from there.
+Every time a job finishes, a record will be created in a new `p_ci_finished_build_ch_sync_events` table, which includes
+the `build_id` and a `processed` value.
+A background worker loops through unprocessed `p_ci_finished_build_ch_sync_events` records and pushes the denormalized
+`ci_builds` information from Postgres to ClickHouse.
At some point we most likely will need to
[parallelize this worker because of the number of processed builds](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/126863#note_1494922639).
+This will be achieved by having the cron worker accept an argument determining the number of workers. The cron worker
+will use that argument to enqueue the corresponding number of workers, which perform the actual syncing to ClickHouse.
We will start with most recent builds and will not upload all historical data.
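+
+A minimal sketch of the sync loop described above, in Go, purely for illustration. The real worker is a background
+cron worker in the GitLab application, and the ClickHouse table and column names used here (other than
+`p_ci_finished_build_ch_sync_events`, `build_id`, and `processed`) are assumptions:
+
+```go
+package chsync
+
+import (
+	"context"
+	"database/sql"
+)
+
+// SyncFinishedBuilds drains a batch of unprocessed rows from
+// p_ci_finished_build_ch_sync_events, copies the denormalized build data to
+// ClickHouse, and marks the events as processed.
+func SyncFinishedBuilds(ctx context.Context, pg, ch *sql.DB, batchSize int) error {
+	rows, err := pg.QueryContext(ctx,
+		`SELECT build_id FROM p_ci_finished_build_ch_sync_events
+		 WHERE processed = FALSE ORDER BY build_id LIMIT $1`, batchSize)
+	if err != nil {
+		return err
+	}
+	defer rows.Close()
+
+	var ids []int64
+	for rows.Next() {
+		var id int64
+		if err := rows.Scan(&id); err != nil {
+			return err
+		}
+		ids = append(ids, id)
+	}
+	if err := rows.Err(); err != nil {
+		return err
+	}
+
+	for _, id := range ids {
+		// Load the finished build from Postgres and denormalize it (omitted),
+		// then insert it into the ClickHouse table backing the metrics (name assumed).
+		if _, err := ch.ExecContext(ctx,
+			"INSERT INTO ci_finished_builds (id) VALUES (?)", id); err != nil {
+			return err
+		}
+		if _, err := pg.ExecContext(ctx,
+			"UPDATE p_ci_finished_build_ch_sync_events SET processed = TRUE WHERE build_id = $1", id); err != nil {
+			return err
+		}
+	}
+	return nil
+}
+```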
@@ -129,4 +133,4 @@ continue developing mechanisms for migrations.
#### Re-uploading data after changing the schema
If we need to modify database schema, old data maybe incomplete.
-In that case we can simply truncate the ClickHouse tables and reupload (part of) the data.
+In that case we can simply truncate the ClickHouse tables and re-upload (part of) the data.
diff --git a/doc/architecture/blueprints/ci_pipeline_components/index.md b/doc/architecture/blueprints/ci_pipeline_components/index.md
index 9a8084f290b..46b8f361949 100644
--- a/doc/architecture/blueprints/ci_pipeline_components/index.md
+++ b/doc/architecture/blueprints/ci_pipeline_components/index.md
@@ -81,7 +81,7 @@ direction for iterations and improvements to the solution.
the bar for new users.
- Customers are already trying to rollout their ad-hoc catalog of shared configurations. We could provide a
standardized way to write, package and share pipeline constructs directly in the product.
-- As we implement new pipeline constructs (for example, reusable job steps) they could be items of the
+- As we implement new pipeline constructs (for example, [reusable job steps](../gitlab_steps/index.md)) they could be items of the
catalog. The catalog can boost the adoption of new constructs.
- The catalog can be a place where we strengthen our relationship with partners, having components offered
and maintained by our partners.
@@ -96,14 +96,15 @@ direction for iterations and improvements to the solution.
This section defines some terms that are used throughout this document. With these terms we are only
identifying abstract concepts and are subject to changes as we refine the design by discovering new insights.
-- **Component** Is the reusable unit of pipeline configuration.
-- **Components repository** represents a collection of CI components stored in the same project.
+- **Component** is the generic term for a reusable unit of pipeline configuration. The component can be a template (usable via the `include` syntax) or a [step](../gitlab_steps/index.md).
+- **Components repository** is a GitLab repository that contains one or more components.
- **Project** is the GitLab project attached to a single components repository.
+- **Catalog resource** is the generic term for a single item displayed in the catalog. A components repository is a catalog resource.
- **Catalog** is a collection of resources like components repositories.
-- **Catalog resource** is the single item displayed in the catalog. A components repository is a catalog resource.
-- **Version** is a specific revision of catalog resource. It maps to the released tag in the project,
- which allows components to be pinned to a specific revision.
-- **Steps** is a collection of instructions for how jobs can be executed.
+- **Version** is a specific revision of the catalog resource. It maps to a project release and
+ allows components to be pinned to a specific revision.
+- **Step** is a type of component that contains a collection of instructions for job execution.
+- **Template** is a type of component that contains a snippet of CI/CD configuration that can be [included](../../../ci/yaml/includes.md) in a project's pipeline configuration.
## Definition of pipeline component
diff --git a/doc/architecture/blueprints/clickhouse_ingestion_pipeline/index.md b/doc/architecture/blueprints/clickhouse_ingestion_pipeline/index.md
index 66089085d0d..9ce41b51b0c 100644
--- a/doc/architecture/blueprints/clickhouse_ingestion_pipeline/index.md
+++ b/doc/architecture/blueprints/clickhouse_ingestion_pipeline/index.md
@@ -45,7 +45,7 @@ ClickHouse is an online, analytical processing (OLAP) database that powers use-c
At GitLab, [our current and future ClickHouse uses/capabilities](https://gitlab.com/groups/gitlab-com/-/epics/2075) reference & describe multiple use-cases that could be facilitated by using ClickHouse as a backing datastore. A majority of these talk about the following two major areas of concern:
-1. Being able to leverage [ClickHouse's OLAP capabilities](https://clickhouse.com/docs/en/faq/general/olap/) enabling underlying systems to perform an aggregated analysis of data, both over short and long periods of time.
+1. Being able to leverage [ClickHouse's OLAP capabilities](https://clickhouse.com/docs/en/faq/general/olap) enabling underlying systems to perform an aggregated analysis of data, both over short and long periods of time.
1. The fact that executing these operations with our currently existing datasets primarily in Postgres, is starting to become challenging and non-performant.
Looking forward, assuming a larger volume of data being produced by our application(s) and the rate at which it gets produced, the ability to ingest it into a *more* capable system, both effectively and efficiently helps us scale our applications and prepare for business growth.
diff --git a/doc/architecture/blueprints/cloud_connector/index.md b/doc/architecture/blueprints/cloud_connector/index.md
new file mode 100644
index 00000000000..840e17a438a
--- /dev/null
+++ b/doc/architecture/blueprints/cloud_connector/index.md
@@ -0,0 +1,274 @@
+---
+status: proposed
+creation-date: "2023-09-28"
+authors: [ "@mkaeppler" ]
+coach: "@ayufan"
+approvers: [ "@rogerwoo", "@pjphillips" ]
+owning-stage: "~devops::data stores"
+participating-stages: ["~devops::fulfillment", "~devops::ai-powered"]
+---
+
+# Cloud Connector gateway service
+
+## Summary
+
+This design doc proposes a new GitLab-hosted edge service for our
+[Cloud Connector product offering](https://gitlab.com/groups/gitlab-org/-/epics/308), which would act as a public
+gateway into all features offered under the Cloud Connector umbrella.
+
+## Motivation
+
+We currently serve only AI related features to Cloud Connector customers, and our
+[current architecture](../../../development/cloud_connector/code_suggestions_for_sm.md)
+is a direct reflection of that.
+Both SaaS and Self-managed/Dedicated GitLab instances (SM hereafter) talk to the [AI gateway](../ai_gateway/index.md)
+directly, which also implements an `Access Layer` to verify that a given request is allowed
+to access the respective AI feature endpoint. The mechanism through which this verification happens
+for SM instances is detailed in [the CustomersDot architecture documentation](https://gitlab.com/gitlab-org/customers-gitlab-com/-/blob/main/doc/architecture/add_ons/code_suggestions/authorization_for_self_managed.md).
+
+This approach has served us well because it:
+
+- Required minimal changes from an architectural standpoint to allow SM users to consume AI features hosted by us.
+- Caused minimal friction with ongoing development on SaaS.
+- Reduced time to market.
+
+It is clear that the AI gateway alone does not sufficiently abstract over a wider variety of features, as by definition it is designed to serve AI features only.
+Adding non-AI features to the Cloud Connector offering would leave us with
+three choices:
+
+1. Evolve the AI gateway into something that also hosts non-AI features.
+1. Expose new Cloud Connector offerings by creating new publicly available services next to the AI-gateway.
+1. Introduce a new Cloud Connector public gateway service (CC gateway hereafter) that fronts all feature gateways.
+ Feature gateways would become privately routed instead. This approach follows the North/South traffic pattern established
+ by the AI gateway.
+
+Option 3 is the primary focus of this blueprint. We briefly explore options 1 and 2 in [Alternative solutions](#alternative-solutions).
+
+### Goals
+
+Introducing a dedicated edge service for Cloud Connector serves the following goals:
+
+- **Provide single access point for customers.** We found that customers are not keen on configuring their web proxies and firewalls
+  to allow outbound traffic to an ever-growing list of GitLab-hosted services. While we investigated ways to
+ [minimize the amount of configuration](https://gitlab.com/gitlab-org/gitlab/-/issues/424780) required,
+ a satisfying solution has yet to be found. Ideally, we would have _one host only_ that is configured and contacted by a GitLab instance.
+- **Reduce risk surface.** With a single entry point facing the public internet, we reduce the attack surface exposed to
+  malicious users and the need to guard each internal service individually from potential abuse. In the face of security issues
+  with a particular milestone release, we could guard against them in the single CC gateway service rather than in each
+  feature gateway individually, improving the pace at which we can respond to security incidents.
+- **Provide CC-specific telemetry.** User telemetry was added hastily for the current Cloud Connector features and was originally
+  designed for SaaS, which is directly hooked up to Snowplow; that is not true for SM instances.
+  To track usage telemetry specific to CC use cases, it could be valuable to have a dedicated place to collect it that can be connected
+  to GitLab-internal data pipelines.
+- **Reduce duplication of efforts.** Certain tasks such as instance authorization and "clearing requests" against CustomersDot
+ that currently live in the AI gateway would have to be duplicated to other services without a central gateway.
+- **Improve control over rate limits.** With all requests going to a single AI gateway currently, be it from SM or SaaS, rate
+ limiting gets more complicated because we need to inspect request metadata to understand where a request originated from.
+ Moreover, having a dedicated service would allow us, if desired, to implement application-level request budgets, something
+ we do not currently support.
+- **Independently scalable.** For reasons of fault tolerance and scalability, it is beneficial to have all SM traffic go
+ through a separate service. For example, if an excess of unexpected requests arrive from SM instances due to a bug
+ in a milestone release, this traffic could be absorbed at the CC gateway level without cascading downstream, thus leaving
+ SaaS users unaffected.
+
+### Non-goals
+
+- **We are not proposing to build a new feature service.** We consider Cloud Connector to run orthogonally to the
+  various stage groups' efforts that build end-user features. We would not want actual end-user feature development
+ to happen in this service, but rather provide a vehicle through which these features can be delivered in a consistent manner
+ across all deployments (SaaS, SM and Dedicated).
+- **Changing the existing mechanism by which we authenticate instances and verify permissions.** We intend to keep
+ the current mechanism in place that emits access tokens from CustomersDot that are subsequently verified in
+ other systems using public key cryptographic checks. We may move some of the code around that currently implements this,
+ however.
+
+## Proposal
+
+We propose to make two major changes to the current architecture:
+
+1. Build and deploy a new Cloud Connector edge service that acts as a gateway into all features included
+ in our Cloud Connector product offering.
+1. Make the AI gateway a GitLab-internal service so it does not face the public internet anymore. The new
+ edge service will front the AI gateway instead.
+
+At a high level, the new architecture would look as follows:
+
+```plantuml
+@startuml
+node "GitLab Inc. infrastructure" {
+ package "Private services" {
+ [AI gateway] as AI
+ [Other feature gateway] as OF
+ }
+
+ package "Public services" {
+ [GitLab (SaaS)] as SAAS
+ [Cloud Connector gateway] as CC #yellow
+ [Customers Portal] as CDOT
+ }
+}
+
+node "Customer/Dedicated infrastructure" {
+ [GitLab] as SM
+ [Sidekiq] as SK
+}
+
+SAAS --> CC : " access CC feature"
+CC --> AI: " access AI feature"
+CC --> OF: " access non-AI feature"
+CC -> SAAS : "fetch JWKS"
+
+SM --> CC : "access CC feature"
+SK -> CDOT : "sync CC access token"
+CC -> CDOT : "fetch JWKS"
+
+@enduml
+```
+
+## Design and implementation details
+
+### CC gateway roles & responsibilities
+
+The new service would be made available at `cloud.gitlab.com` and act as a "smart router".
+It will have the following responsibilities:
+
+1. **Request handling.** The service will make decisions about whether a particular request is handled
+ in the service itself or forwarded to a downstream service. For example, a request to `/ai/code_suggestions/completions`
+ could be handled by forwarding this request to an appropriate endpoint in the AI gateway unchanged, while a request
+ to `/-/metrics` could be handled by the service itself. As mentioned in [non-goals](#non-goals), the latter would not
+ include domain logic as it pertains to an end user feature, but rather cross-cutting logic such as telemetry, or
+ code that is necessary to make an existing feature implementation accessible to end users.
+
+ When handling requests, the service should be unopinionated about which protocol is used, to the extent possible.
+ Reasons for injecting custom logic could be setting additional HTTP header fields. A design principle should be
+ to not require CC service deployments if a downstream service merely changes request payload or endpoint definitions. However,
+ supporting more protocols on top of HTTP may require adding support in the CC service itself.
+1. **Authentication/authorization.** The service will be the first point of contact for authenticating clients and verifying
+ they are authorized to use a particular CC feature. This will include fetching and caching public keys served from GitLab SaaS
+ and CustomersDot to decode JWT access tokens sent by GitLab instances, including matching token scopes to feature endpoints
+ to ensure an instance is eligible to consume this feature. This functionality will largely be lifted out of the AI gateway
+ where it currently lives. To maintain a ZeroTrust environment, the service will implement a more lightweight auth/z protocol
+    with internal services downstream that merely performs general authenticity checks but forgoes billing and
+    permission-related scoping checks. What this protocol will look like is to be decided, and might be further explored in
+    [Discussion: Standardized Authentication and Authorization between internal services and GitLab Rails](https://gitlab.com/gitlab-org/gitlab/-/issues/421983).
+    A minimal sketch of the request handling and scope check described in this item follows this list.
+1. **Organization-level rate limits.** It is to be decided if this is needed, but there could be value in having application-level rate limits
+   and/or "pressure relief valves" that operate at the customer/organization level rather than the network level, the latter of which
+   Cloudflare already affords us. These controls would most likely be managed by the Cloud Connector team, not SREs or
+   infra engineers. We should also be careful not to simply extend the existing rate limiting configuration, which is mainly concerned with GitLab SaaS.
+1. **Recording telemetry.** In cases where telemetry is specific to Cloud Connector feature usage or would result in
+ duplication of efforts when tracked further down the stack (for example, counting unique users), it should be recorded here instead.
+ To record usage/business telemetry, the service will talk directly to Snowplow. For operational telemetry, it will provide
+ a Prometheus metrics endpoint. We may decide to also route Service Ping telemetry through the CC service because this
+ currently goes to [`version-gitlab-com`](https://gitlab.com/gitlab-services/version-gitlab-com/).
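+
+The following Go sketch (Go being one of the stacks suggested below) illustrates how the request-handling and auth/z responsibilities could fit together: a routing table maps public path prefixes to required token scopes and private downstream gateways. The path prefix, scope name, backend host, and the `verifyToken` helper are illustrative assumptions, not decided interfaces.
+
+```go
+package main
+
+import (
+    "net/http"
+    "net/http/httputil"
+    "net/url"
+    "strings"
+)
+
+// route maps a public path prefix to the token scope required to call it
+// and to the private downstream gateway that implements the feature.
+type route struct {
+    scope   string
+    backend *url.URL
+}
+
+// Hypothetical routing table; the prefix, scope name, and backend host are placeholders.
+var routes = map[string]route{
+    "/ai/": {scope: "code_suggestions", backend: mustParse("http://ai-gateway.internal")},
+}
+
+func mustParse(raw string) *url.URL {
+    u, err := url.Parse(raw)
+    if err != nil {
+        panic(err)
+    }
+    return u
+}
+
+// verifyToken stands in for validating the JWT against the JWKS cached from
+// GitLab SaaS/CustomersDot and returning the scopes the token carries.
+func verifyToken(r *http.Request) ([]string, error) {
+    // Placeholder: a real implementation would parse and verify the Authorization header.
+    return []string{"code_suggestions"}, nil
+}
+
+func hasScope(scopes []string, want string) bool {
+    for _, s := range scopes {
+        if s == want {
+            return true
+        }
+    }
+    return false
+}
+
+func main() {
+    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
+        for prefix, rt := range routes {
+            if !strings.HasPrefix(r.URL.Path, prefix) {
+                continue
+            }
+            scopes, err := verifyToken(r)
+            if err != nil || !hasScope(scopes, rt.scope) {
+                http.Error(w, "token missing required scope", http.StatusForbidden)
+                return
+            }
+            // Forward the request unchanged to the private feature gateway.
+            httputil.NewSingleHostReverseProxy(rt.backend).ServeHTTP(w, r)
+            return
+        }
+        http.NotFound(w, r)
+    })
+    _ = http.ListenAndServe(":8080", nil)
+}
+```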
+
+### Implementation choices
+
+We suggest using one of the following language stacks:
+
+1. **Go.** There is substantial organizational knowledge in writing and running
+Go systems at GitLab, and it is a great systems language that gives us efficient ways to handle requests where
+they merely need to be forwarded (request proxying) and a powerful concurrency mechanism through goroutines. This makes the
+service easier to scale and cheaper to run than Ruby or Python, which scale largely at the process level due to their use
+of Global Interpreter Locks, and use inefficient memory models, especially with regard to byte stream handling and manipulation.
+A drawback of Go is that resource requirements such as memory use are less predictable because Go is a garbage-collected language.
+1. **Rust.** We are starting to build up knowledge in Rust at GitLab. Like Go, it is a great systems language that is
+also starting to see wider adoption in the Ruby ecosystem for writing CRuby extensions. A major benefit is more predictable
+resource consumption because it is not garbage collected and allows for finer control over memory use.
+It is also very fast; we found that the Rust implementation for `prometheus-client-mmap` outperformed the original
+extension written in C.
+
+## Alternative solutions
+
+### Cloudflare Worker
+
+One promising alternative to writing and deploying a service from scratch is to use
+[Cloudflare Workers](https://developers.cloudflare.com/workers/),
+a serverless solution to deploying application code that:
+
+- Is auto-scaled through Cloudflare's service infrastructure.
+- Supports any language that compiles to WebAssembly, including Rust.
+- Supports various options for [cloud storage](https://developers.cloudflare.com/workers/learning/storage-options/)
+ including a [key-value store](https://developers.cloudflare.com/kv/) we can use to cache data.
+- Supports a wide range of [network protocols](https://developers.cloudflare.com/workers/learning/protocols/)
+ including WebSockets.
+
+We are exploring this option in issue [#427726](https://gitlab.com/gitlab-org/gitlab/-/issues/427726).
+
+### Per-feature public gateways
+
+This approach would be a direct extrapolation of what we're doing now. Because we only host AI features for
+Cloud Connector at the moment, we have a single publicly routed gateway that acts as the entry point for
+Cloud Connector features and implements all the necessary auth/z and telemetry logic.
+
+Were we to introduce any non-AI features, each of these would receive its own gateway service, all publicly routed
+and accessed by GitLab instances through individual host names. For example:
+
+- `ai.gitlab.com`: Services AI features for GitLab instances
+- `cicd.gitlab.com`: Services CI/CD features for GitLab instances
+- `foo.gitlab.com`: Services foo features for GitLab instances
+
+A benefit of this approach is that in the absence of an additional layer of indirection, latency
+may be improved.
+
+A major question is how shared concerns are handled because duplicating auth/z, telemetry, rate limits
+etc. across all such services may mean re-inventing the wheel for different language stacks (the AI gateway was
+written in Python; a non-AI feature gateway will most likely be written in Ruby or Go, which are far more popular
+at GitLab).
+
+One solution to this could be to extract shared concerns into libraries, although these, too, would have to be
+made available in different languages. This is what we do with `labkit` (we already have three versions, for Go, Ruby, and Python),
+which creates organizational challenges because we already struggle to properly allocate
+people to maintaining foundational libraries, work that is often handled on a best-effort, crowd-sourced basis.
+
+Another solution could be to extract services that handle some of these concerns. One pattern I have seen used
+with multiple edge services is for them to contact a single auth/z service that maps user identity and clears permissions
+before handling the actual request, thus reducing code duplication between feature services.
+
+Other drawbacks of this approach:
+
+- Increases the risk surface with every feature domain we pull into Cloud Connector, because we need to deploy
+ and secure each of these services on the public internet.
+- Higher coupling of GitLab to feature services. Where and how a particular feature is made
+ available is an implementation detail. By coupling GitLab to specific network endpoints like `ai.gitlab.com`
+ we reduce our flexibility to shuffle around both our service architecture and how we map technology to features
+ and customer plans/tiers, because some customers stay on older GitLab
+ versions for a very long time. This would necessitate putting special routing/DNS rules in place to address any
+ larger changes we make to this topology.
+- Higher config overhead for customers. Because they may have to configure web proxies and firewalls, they need to
+ permit-list every single host/IP-range we expose this way.
+
+### Envoy
+
+[Envoy](https://www.envoyproxy.io/docs/envoy/v1.27.0/) is a Layer 7 proxy and communication bus that allows
+us to overlay a service mesh to solve cross-cutting
+problems with multi-service access such as service discovery and rate limiting. Envoy runs as a process sidecar
+to the actual application service it manages traffic for.
+A single LB could be deployed as Ingress to this service mesh so we can reach it at `cloud.gitlab.com`.
+
+A benefit of this approach would be that we can use an off-the-shelf solution to solve common networking
+and scaling problems.
+
+A major drawback of this approach is that it leaves no room to run custom application code, which would be necessary
+to validate access tokens or implement request budgets at the customer or organization level. In this solution,
+these functions would have to be factored out into libraries or other shared services instead, so it shares
+other drawbacks with the [per-feature public gateways alternative](#per-feature-public-gateways).
+
+### Evolving the AI gateway into a CC gateway
+
+This was the original idea behind the first iteration of the [AI gateway](../ai_gateway/index.md) architecture,
+which defined the AI gateway as a "prospective GitLab Plus" service (GitLab Plus was the WIP name for
+Cloud Connector.)
+
+This is our least favorite option for several reasons:
+
+- Low code cohesion. This would lead us to build another mini-monolith with wildly unrelated responsibilities
+ spanning various feature domains (AI, CI/CD, secrets management, observability, and so on), with teams
+ having to coordinate when contributing to this service, which introduces friction.
+- Written in Python. We chose Python for the AI gateway because it seemed a sensible choice, considering the AI
+ landscape has a Python bias. However, Python is almost non-existent at GitLab outside of this space, and most
+ of our engineers are Ruby or Go developers, with years of expertise built up in these stacks. We would either
+ have to rewrite the AI gateway in Ruby or Go to make it more broadly accessible, or invest heavily into Python
+ training and hiring as an organization.
+ Furthermore, Python has poor scaling characteristics because like CRuby it uses a Global Interpreter Lock and
+ therefore primarily scales through processes, not threads.
+- Ownership. The AI gateway is currently owned by the AI framework team. This would not make sense if we evolved this into a CC gateway, which should be owned by the Cloud Connector group instead.
diff --git a/doc/architecture/blueprints/email_ingestion/index.md b/doc/architecture/blueprints/email_ingestion/index.md
new file mode 100644
index 00000000000..9579a903133
--- /dev/null
+++ b/doc/architecture/blueprints/email_ingestion/index.md
@@ -0,0 +1,169 @@
+---
+status: proposed
+creation-date: "2023-06-05"
+authors: [ "@msaleiko" ]
+coach: "@stanhu"
+approvers: [ ]
+owning-stage: ""
+participating-stages: [ "~group::incubation" ]
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+<!-- vale gitlab.CurrentStatus = NO -->
+
+# Replace `mail_room` email ingestion with scheduled Sidekiq jobs
+
+## Summary
+
+GitLab users can submit new issues and comments via email. Administrators configure special mailboxes that GitLab polls on a regular basis to fetch new unread emails. Based on the slug and a hash in the sub-addressing part of the email address, we determine whether this email creates an issue, creates a Service Desk issue, or adds a comment to an existing issue.
+
+Right now emails are ingested by a separate process called `mail_room`. We would like to stop ingesting emails via `mail_room` and instead use scheduled Sidekiq jobs to do this directly inside GitLab.
+
+This lays the foundation for [custom email address ingestion for Service Desk](https://gitlab.com/gitlab-org/gitlab/-/issues/329990) and detailed health logging, and makes it easier to integrate other service provider adapters (for example, Gmail via API). We will also reduce the infrastructure setup and maintenance costs for customers on self-managed instances and make it easier for team members to work with email ingestion in GDK.
+
+## Glossary
+
+- Email ingestion: Reading emails from a mailbox via IMAP or an API and forwarding them for processing (for example, to create an issue or add a comment)
+- Sub-addressing: An email address consists of a local part (everything before `@`) and a domain part. With email sub-addressing you can create unique variations of an email address by adding a `+` symbol followed by any text to the local part. You can use these sub-addresses to filter, categorize, or distinguish between emails, as they are all delivered to the same mailbox. For example, `user+subaddress@example.com` and `user+1@example.com` are sub-addresses of `user@example.com`.
+- `mail_room`: [An executable script](https://gitlab.com/gitlab-org/ruby/gems/gitlab-mail_room) that spawns a new process for each configured mailbox, reads new emails on a regular basis and forwards the emails to a processing unit.
+- [`incoming_email`](../../../administration/incoming_email.md): An email address that is used for adding comments and issues via email. When you reply to a GitLab notification of an issue comment, the response email goes to the configured `incoming_email` mailbox, is read via `mail_room`, and is processed by GitLab. You can also use this address as a Service Desk email address. The configuration is per instance and needs full IMAP or Microsoft Graph API credentials to access the mailbox.
+- [`service_desk_email`](../../../user/project/service_desk/configure.md#use-an-additional-service-desk-alias-email): Additional alias email address that is only used for Service Desk. You can also use an address generated from `incoming_email` to create Service Desk issues.
+- `delivery_method`: Administrators can define how `mail_room` forwards fetched emails to GitLab. The legacy and now deprecated approach is called `sidekiq`, which directly adds a new job to the Redis queue. The current and recommended way is called `webhook`, which sends a POST request to an internal GitLab API endpoint. This endpoint then adds a new job using the full framework for compressing job data and so on. The downside is that `mail_room` and GitLab need a shared key file, which might be challenging to distribute in large setups.
+
+## Motivation
+
+The current implementation lacks scalability and requires significant infrastructure maintenance. Additionally, there is a lack of [proper observability for configuration errors](https://gitlab.com/gitlab-org/gitlab/-/issues/384530) and [overall system health](https://gitlab.com/groups/gitlab-org/-/epics/9407). Furthermore, [setting up and providing support for multi-node Linux package (Omnibus) installations](https://gitlab.com/gitlab-org/gitlab/-/issues/391859) is challenging, and periodic email ingestion issues necessitate reactive support.
+
+Because we are using a fork of the `mail_room` gem ([`gitlab-mail_room`](https://gitlab.com/gitlab-org/ruby/gems/gitlab-mail_room)), which contains some GitLab-specific features that won't be ported upstream, we have a notable maintenance overhead.
+
+The [Service Desk Single-Engineer-Group (SEG)](https://about.gitlab.com/handbook/engineering/incubation/service-desk/) started work on [customizable email addresses for Service Desk](https://gitlab.com/gitlab-org/gitlab/-/issues/329990) and [released the first iteration in beta in `16.4`](https://about.gitlab.com/releases/2023/09/22/gitlab-16-4-released/#custom-email-address-for-service-desk). As an [MVC we introduced a `Forwarding & SMTP` mode](https://gitlab.com/gitlab-org/gitlab/-/issues/329990#note_1201344150) where administrators set up email forwarding from their custom email address to the project's `incoming_email` address. They also provide SMTP credentials so GitLab can send emails from the custom email address on their behalf. We don't need any additional email ingestion other than the existing mechanics for this approach to work.
+
+As a second iteration we'd like to add Microsoft Graph support for custom email addresses for Service Desk as well. Therefore, we need a way to ingest more than the two system-defined addresses. We will explore a solution path for Microsoft Graph support where privileged users can connect a custom email account and we can [receive messages via a Microsoft Graph webhook (`Outlook message`)](https://learn.microsoft.com/en-us/graph/webhooks#supported-resources). GitLab would need a public endpoint to receive updates on emails. That might not work for self-managed instances, so we'll need direct email ingestion for Microsoft customers as well. However, using the webhook approach could improve performance and efficiency for GitLab SaaS where we potentially have thousands of mailboxes to poll.
+
+### Goals
+
+Our goals for this initiative are to enhance the scalability of email ingestion and slim down the infrastructure significantly.
+
+1. This consolidation will eliminate the need to set up the separate process and pave the way for future initiatives, including direct custom email address ingestion (IMAP & Microsoft Graph), [improved health monitoring](https://gitlab.com/groups/gitlab-org/-/epics/9407), [data retention (preserving originals)](https://gitlab.com/groups/gitlab-org/-/epics/10521), and [enhanced processing of attachments within email size limits](https://gitlab.com/gitlab-org/gitlab/-/issues/406668).
+1. Make it easier for team members to develop features with email ingestion. [Right now it needs several manual steps.](https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/service_desk_mail_room.md)
+
+### Non-Goals
+
+This blueprint does not aim to lay out implementation details for all the listed future initiatives. However, it will be the foundation for upcoming features (customizable Service Desk email addresses via IMAP/Microsoft Graph, health checks, and so on).
+
+We don't include other ingestion methods. We focus on delivering the current set: IMAP and Microsoft Graph API for `incoming_email` and `service_desk_email`.
+
+## Current setup
+
+Administrators configure settings (credentials and delivery method) for email mailboxes (for [`incoming_email`](../../../administration/incoming_email.md) and [`service_desk_email`](../../../user/project/service_desk/configure.md#use-an-additional-service-desk-alias-email)) in the `gitlab.rb` configuration file. After each change, GitLab needs to be reconfigured and restarted to apply the new settings.
+
+We use the separate process `mail_room` to ingest emails from those mailboxes. `mail_room` spawns a thread for each configured mailbox and polls those mailboxes every minute. In the meantime the threads are idle. `mail_room` reads a configuration file that is generated from the settings in `gitlab.rb`.
+
+`mail_room` can connect via IMAP and Microsoft Graph, fetch unread emails, and mark them as read or deleted (based on settings). It takes an email and distributes it to its destination via one of the two delivery methods.
+
+### `webhook` delivery method (recommended)
+
+The `webhook` delivery method is the recommended way to move ingested emails from `mail_room` to GitLab. `mail_room` posts the email body and metadata to the internal API endpoint `/api/v4/internal/mail_room`, which selects the correct handler worker and schedules it for execution.
+
+```mermaid
+flowchart TB
+ User --Sends email--> provider[(Email provider mailbox)]
+ mail_room --Fetch unread emails via IMAP or Microsoft Graph API--> provider
+ mail_room --HTTP POST--> api
+ api --adds job for email--> create
+
+ subgraph mail_room_process[mail_room]
+ mail_room[mail_room thread]
+ end
+
+ subgraph GitLab
+ api[Internal API endpoint]
+ create["Sidekiq email handler job
+ that creates an issue/note based
+ on the email address"]
+ end
+```
+
+### `sidekiq` delivery method (deprecated since 16.0)
+
+The `sidekiq` delivery method adds the email body and metadata directly to the Redis queue that Sidekiq uses to manage jobs. It has been [deprecated in 16.0](../../../update/deprecations.md#sidekiq-delivery-method-for-incoming_email-and-service_desk_email-is-deprecated) because there is a hard coupling between the delivery method and the Redis configuration. Moreover, we cannot use Sidekiq framework optimizations such as job payload compression.
+
+```mermaid
+flowchart TB
+ User --Sends email--> provider[(Email provider mailbox)]
+ mail_room --Fetch unread emails via IMAP or Microsoft Graph API--> provider
+
+ mail_room --directly writes to Redis queue, which schedules a handler job--> redis[Redis queue]
+ redis --Sidekiq takes job from the queue and executes it--> create
+
+ subgraph mail_room_process[mail_room]
+ mail_room[mail_room thread]
+ end
+
+ subgraph GitLab
+ create["Sidekiq email handler job
+ that creates an issue/note based
+ on the email address"]
+ end
+```
+
+## Proposal
+
+**Use Sidekiq jobs to poll mailboxes on a regular basis (every minute, maybe configurable in the future).
+Remove all other legacy email ingestion infrastructure.**
+
+```mermaid
+flowchart TB
+ User --Sends email--> provider[(Email provider mailbox)]
+ ingestion --Fetch unread emails via IMAP or Microsoft Graph API--> provider
+ controller --Triggers a job for each mailbox--> ingestion
+ ingestion --Adds a job for each fetched email--> create
+
+ subgraph GitLab
+ controller[Scheduled Sidekiq ingestion controller job]
+ ingestion[Sidekiq mailbox ingestion job]
+ create["Existing Sidekiq email handler jobs
+ that create an issue/note based
+ on the email address"]
+ end
+```
+
+1. Use a `controller` job that is scheduled every minute or every two minutes. This job adds one job for each configured mailbox (`incoming_email` and `service_desk_email`).
+1. The concrete `ingestion` job polls a mailbox (IMAP or Microsoft Graph), downloads unread emails, and adds one job per email to process it. Based on the `To` email address, we decide which email handler should be used.
+1. The `existing email handler` jobs try to create an issue, a Service Desk issue or a note on an existing issue/merge request. These handlers are also used by the legacy email ingestion via `mail_room`.
+
+### Sidekiq jobs and job payload size optimizations
+
+We implemented a size limit for Sidekiq jobs, and email job payloads (especially emails with attachments) are likely to exceed that limit. We should experiment with the idea of handling email processing directly in the Sidekiq mailbox ingestion job. We could use an `ops` feature flag to switch between this mode and a Sidekiq job for each email.
+
+We'd also like to explore a solution path where we only fetch the message IDs and then download the complete messages in child jobs (filtered by `UID` range, for example). For example, we poll a mailbox and fetch a list of message IDs. We then create a new job for every 25 (or n) emails that takes the message IDs or the range as an argument. These jobs then download the entire messages and synchronously add issues or replies. If the number of emails is below 25, we could even handle the emails directly in the current job to save resources. This allows us to eliminate the job payload size as the limiting factor for the size of emails. The disadvantage is that we need to make additional calls to the IMAP server: one to list the message IDs plus one per batch, instead of a single call.
+
+## Execution plan
+
+1. Add deprecation for `mail_room` email ingestion.
+1. Strip connection-specific logic out of the [`gitlab-mail_room` gem](https://gitlab.com/gitlab-org/ruby/gems/gitlab-mail_room) into a new, separate gem. `mail_room` and other clients could reuse this work. Right now we support IMAP and Microsoft Graph API connections.
+1. Add new jobs (set idempotency and de-duplication flags to avoid a huge backlog of jobs if Sidekiq isn't running).
+1. Add a setting (`gitlab.rb`) that enables email ingestion with Sidekiq jobs inside GitLab. We need to set `mailroom['enabled'] = false` in `gitlab.rb` to disable `mail_room` email ingestion. Maybe additionally add a feature flag.
+1. Use on `gitlab.com` before general availability, but allow self-managed to try it out in `beta`.
+1. Once rolled out in general availability and when removal has been scheduled, remove the dependency on `gitlab-mail_room` entirely, remove the internal API endpoint `api/internal/mail_room`, and remove the dynamically generated `mail_room.yml` configuration file as well as other `mail_room` configuration and binaries.
+
+## Change management
+
+We decided to [deprecate the `sidekiq` delivery method for `mail_room` in GitLab 16.0](../../../update/deprecations.md#sidekiq-delivery-method-for-incoming_email-and-service_desk_email-is-deprecated) and scheduled it for removal in GitLab 17.0.
+We can only remove the `sidekiq` delivery method after this blueprint has been implemented and our customers can use the new email ingestion in general availability.
+
+We should then schedule `mail_room` for removal (GitLab 17.0 or later). This will be a breaking change. We could make the new email ingestion the default beforehand, so self-managed customers wouldn't need to take action.
+
+## Alternative Solutions
+
+### Do nothing
+
+The current setup limits us and only allows fetching from two email addresses. To publish Service Desk custom email addresses with IMAP or API integration, we would need to deliver the same architecture as described above. Because of that, we should act now, include general email ingestion for `incoming_email` and `service_desk_email` first, and remove the infrastructure overhead.
+
+## Additional resources
+
+- [Meta issue for this design document](https://gitlab.com/gitlab-org/gitlab/-/issues/393157)
+
+## Timeline
+
+- 2023-09-26: The initial version of the blueprint has been merged.
diff --git a/doc/architecture/blueprints/feature_flags_development/index.md b/doc/architecture/blueprints/feature_flags_development/index.md
index b2e6fd1e82c..36fbd9395d7 100644
--- a/doc/architecture/blueprints/feature_flags_development/index.md
+++ b/doc/architecture/blueprints/feature_flags_development/index.md
@@ -93,7 +93,7 @@ allow us to have:
name: ci_disallow_to_create_merge_request_pipelines_in_target_project
introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/40724
rollout_issue_url: https://gitlab.com/gitlab-org/gitlab/-/issues/235119
-group: group::release
+group: group::environments
type: development
default_enabled: false
```
diff --git a/doc/architecture/blueprints/gitaly_transaction_management/index.md b/doc/architecture/blueprints/gitaly_transaction_management/index.md
new file mode 100644
index 00000000000..38d28691c37
--- /dev/null
+++ b/doc/architecture/blueprints/gitaly_transaction_management/index.md
@@ -0,0 +1,427 @@
+---
+status: ongoing
+creation-date: "2023-05-30"
+authors: [ "@samihiltunen" ]
+owning-stage: "~devops::enablement"
+---
+
+# Transaction management in Gitaly
+
+## Summary
+
+Gitaly is a database system for storing Git repositories. This blueprint covers implementing transaction management in Gitaly that guarantees
+ACID properties by introducing:
+
+- Write-ahead logging. Work on this is already underway and tracked in [Implement write-ahead logging in Gitaly](https://gitlab.com/groups/gitlab-org/-/epics/8911).
+- Serializable snapshot isolation through multiversion concurrency control.
+
+The goal is to improve reliability when dealing with concurrent access and interrupted writes. Transaction management makes it easier to contribute to Gitaly because transactions
+deal with concurrency- and failure-related anomalies.
+
+This is the first stage of implementing a [decentralized Raft-based architecture for Gitaly Cluster](https://gitlab.com/groups/gitlab-org/-/epics/8903).
+
+## Motivation
+
+Transaction management in Gitaly is lacking. Gitaly doesn't provide the guarantees typically expected from database-like software. Databases typically guarantee the ACID
+properties:
+
+- Atomicity: all changes in a transaction happen completely or not at all.
+- Consistency: all changes leave the data in a consistent state.
+- Isolation: concurrent transactions execute as if they were the only transaction running in the system.
+- Durability: changes in a transaction persist and survive system crashes once acknowledged.
+
+Gitaly does not access storage transactionally and violates these properties in countless ways. To give some examples:
+
+- Atomicity:
+ - References are updated one by one with Git. If the operation is interrupted, some references
+ may be updated and some not.
+ - Objects may be written into a repository but fail to be referenced.
+ - Custom hooks are updated by moving their old directory out of the way and moving the new one in place. If this operation fails halfway, the repository's
+ existing hooks are removed but new ones are not written.
+- Consistency:
+ - Gitaly migrates objects from a quarantine directory to the main repository. It doesn't consider the dependencies between objects while doing so. If this process is interrupted, and an object missing its dependencies is later referenced, the repository ends up corrupted.
+ - Crashes might leave stale locks on the disk that prevent further writes.
+- Isolation:
+ - Any operation can fail due to the repository being deleted concurrently.
+ - References and object database contents can be modified while another operation is reading them.
+ - Backups can be inconsistent due to concurrent write operations modifying the data. Backups can even end up containing state that never existed on the
+ server, which can happen if custom hooks are updated while they are being backed up.
+ - Modifying and executing custom hooks concurrently can lead to custom hooks not being executed. This can happen if the execution happens between the old
+ hooks being removed and new ones being put in place.
+- Durability: multiple missing fsyncs were recently discovered in Gitaly.
+
+Not adhering to ACID properties can lead to:
+
+- Inconsistent reads.
+- Inconsistent backups that contain state that never existed on the server.
+- Repository corruption.
+- Writes missing after crashes.
+- Stale locks that lead to unavailability.
+
+Lack of isolation makes some features infeasible. These are generally long running read operations, such as online checksums for verifying data and online backups. The data being modified concurrently can cause these to yield incorrect results.
+
+The list is not exhaustive. Compiling an exhaustive list is not fruitful due to the large number of various scenarios that can happen due to concurrent interactions and
+write interruptions. However, there is a clear need to solve these problems in a systematic manner.
+
+## Solution
+
+The solution is to implement a transaction manager in Gitaly that guarantees ACID properties. This centralizes the transactional logic into a single component.
+
+All operations accessing user data will run in a transaction with the transaction manager upholding transactional guarantees. This eases developing Gitaly as the RPC handlers can be developed as if they were the only ones running in the system, with durability and atomicity of changes guaranteed on commit.
+
+### Goals
+
+- Transaction management that guarantees ACID properties.
+- Transactional guarantees cover access to all user data:
+ - References
+ - Objects
+ - Custom hooks
+- Write-ahead log for durability and atomicity.
+- Serializable snapshot isolation (SSI) and multiversion concurrency control (MVCC) for non-blocking concurrency.
+- Minimal changes to existing code in Gitaly.
+- Make it easier to contribute to Gitaly.
+- Enable future use cases:
+ - [Backups with WAL archiving](#continuous-backups-with-wal-archiving).
+ - [Replication with Raft](#raft-replication).
+ - [Expose transactional interface to Gitaly clients](#expose-transactions-to-clients).
+
+## Proposal
+
+The design below is the end state we want to reach. The in-progress implementation in Gitaly deviates in some aspects. We'll gradually get closer to the end state as the work progresses.
+
+### Partitioning
+
+The user data in Gitaly is stored in repositories. These repositories are accessed independently from each other.
+
+Each repository lives on a single storage. Gitaly identifies repositories with a composite key of `(storage_name, relative_path)`. Storage names are unique. Two storages may contain a repository with the same relative path. Gitaly considers these two distinct repositories.
+
+The synchronization required for guaranteeing transactional properties has a performance impact. To reduce the impact, a transaction only spans a subset of the data stored on a Gitaly node.
+
+The first boundary is the storage. The storages are independent of each other and host distinct repositories. Transactions never span across storages.
+
+Storages are further divided into partitions:
+
+- Transactional properties are maintained within a partition. Transactions never span across partitions.
+- A partition stores some data and provides access to that data with transactional guarantees. The data will generally be repositories. Partitions may also
+ store key-value data, which will be used in future with [the new cluster architecture](#raft-replication) to store cluster metadata.
+- Partitions will be the unit of replication with [Raft](#raft-replication).
+
+Repositories:
+
+- Within a storage might depend on each other. This is the case with object pools and the repositories that borrow from them. Their operations must be
+ synchronized because changes in the pool would affect the object database content of the borrowing repository.
+- That are not borrowing from an object pool are independent of each other. They are also accessed independently.
+- That depend on each other go in the same partition. This generally means object pools and their borrowers. Most repositories will have their own partition.
+
+The logical data hierarchy looks as follows:
+
+``` mermaid
+graph
+ subgraph "Gitaly Node"
+ G[Process] --> S1[Storage 1]
+ G[Process] --> S2[Storage 2]
+ S1 --> P1[Partition 1]
+ S1 --> P2[Partition 2]
+ S2 --> P3[Partition 3]
+ S2 --> P4[Partition 4]
+ P1 --> R1[Object Pool]
+ P1 --> R2[Member Repo 1]
+ P1 --> R3[Member Repo 2]
+ R2 --> R1
+ R3 --> R1
+ P2 --> R4[Repository 3]
+ P3 --> R5[Repository 4]
+ P4 --> R6[Repository 5]
+ P4 --> R7[Repository 6]
+end
+```
+
+### Transaction management
+
+Transactional properties are guaranteed within a partition. Everything described here is within the scope of a single partition.
+
+Each partition will have a transaction manager that manages the transactions operating on data in the partition. Higher-level concepts used in the
+transaction management are covered below.
+
+#### Serializable snapshot isolation
+
+Prior to transactions, Gitaly didn't isolate concurrent operations from each other. Reads could read an in-between state due to writes running concurrently. Reading the same data multiple times could lead to different results if a concurrent operation modified the data in-between the two reads. Other anomalies were also possible.
+
+The transaction manager provides serializable snapshot isolation (SSI) for transactions. Each transaction is assigned a read snapshot when it begins. The read snapshot contains the latest committed data for a repository. The data remains the same despite any concurrent changes being committed.
+
+Multiversion concurrency control (MVCC) is used for non-blocking concurrency. MVCC works by always writing updates into a new location, leaving the old
+versions intact. With multiple versions maintained, the reads are isolated from the updates as they can keep reading the old versions. The old versions are
+garbage collected after there are no transactions reading them anymore.
+
+The snapshot covers all user data:
+
+- References
+- Objects
+- Custom hooks
+
+Git doesn't natively provide tools to implement snapshot isolation. Therefore, repository snapshots are implemented on the file system by copying the directory
+structure of the repository into a temporary directory and hard linking the contents of the repository in place. Git never updates references or objects in
+place but always writes new files so the hard-linked files remain unchanged in the snapshots. The correct version of custom hooks for the read snapshot is
+also linked into place. For information on performance concerns, see [Performance Considerations](#performance-considerations).
+
+The snapshot works for both reading and writing because it is a normal Git repository. The Git writes performed in the snapshot are captured through the
+reference transaction hook. After the transaction commits, the performed changes are write-ahead logged and ultimately applied to the repository from the log.
+After the transaction commits or aborts, the transaction's temporary state, including the snapshot, is removed. Old files are automatically removed by the
+file system after they are not linked to by the repository nor any transaction's snapshot.
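+
+As a rough illustration of the snapshotting described above, the following Go sketch recreates a repository's directory structure under a snapshot directory and hard-links the files in place. It is a simplification under assumed paths; the actual transaction manager also deals with locking, custom hooks, and cleanup.
+
+```go
+package storage
+
+import (
+    "io/fs"
+    "os"
+    "path/filepath"
+)
+
+// snapshotRepository creates a file-system-level snapshot of repoPath under
+// snapshotPath by recreating the directory tree and hard-linking every file.
+// Because Git writes new files rather than mutating existing ones, the linked
+// files stay valid even when the source repository is updated later.
+func snapshotRepository(repoPath, snapshotPath string) error {
+    return filepath.WalkDir(repoPath, func(path string, d fs.DirEntry, err error) error {
+        if err != nil {
+            return err
+        }
+        rel, err := filepath.Rel(repoPath, path)
+        if err != nil {
+            return err
+        }
+        target := filepath.Join(snapshotPath, rel)
+        if d.IsDir() {
+            // Directory entries are created, not linked.
+            return os.MkdirAll(target, 0o755)
+        }
+        // Hard-link the file into the snapshot; no file data is copied.
+        return os.Link(path, target)
+    })
+}
+```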
+
+To maintain consistency, writes into the actual repository are blocked while the snapshot is taken. The transaction manager is the single-writer to the
+repository, which means that only the log application is blocked while a snapshot is taken.
+
+#### Serializability
+
+Serializability is a strong correctness guarantee. It ensures that the outcome of concurrent transactions is equal to some serial execution of them. Guaranteeing serializability makes life easy for users of the transactions. They can perform their changes as if they were the only user of the system and trust that the result is correct regardless of any concurrent activity.
+
+The transaction manager provides serializability through optimistic locking.
+
+Each read and write operates on a snapshot of the repository. The locks acquired by Git target different snapshot repositories, which allows all of
+the transactions to proceed concurrently while staging their changes, because they are not operating on shared resources.
+
+When committing a transaction, the transaction manager checks whether any resources being updated or read were changed by an overlapping transaction that committed. If so, the later transaction is rejected due to a serialization violation. If there are no conflicts, the transaction is appended to the log. Once the transaction is logged, it is successfully committed. The transaction gets ultimately applied to the repository from the log. This locking mechanism allows all transactions to proceed unblocked until commit. It is general enough for identifying write conflicts of any resource.
+
+For true serializability, we would also have to track reads performed. This is to prevent write skew, where a transaction bases its update on a stale read of
+another value that was updated by a concurrent transaction. Git does not provide a way to track which references were read as part of a command. Because we
+don't have a general way to track references a transaction read, write skew is permitted.
+
+Predicate locks can be explicitly acquired in a transaction. These provide hints to the transaction manager that allow it to prevent write skew to the extent
+they are used.
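+
+A minimal sketch of the optimistic conflict check in Go follows. The transaction fields and the log sequence number (LSN) bookkeeping are assumptions made for illustration; the real manager tracks more state, and reads are only covered to the extent predicate locks are used.
+
+```go
+package storage
+
+import "fmt"
+
+// Transaction records the log position its snapshot was taken at and the
+// references it stages for update.
+type Transaction struct {
+    snapshotLSN      uint64
+    referenceUpdates []string
+}
+
+// committedLSN tracks, per reference, the log position of the last committed
+// update. A real implementation would prune this map once no open transaction
+// holds a snapshot older than an entry.
+var committedLSN = map[string]uint64{}
+
+// verify rejects the transaction if any reference it updates was committed by
+// another transaction after this transaction took its snapshot.
+func verify(tx *Transaction) error {
+    for _, ref := range tx.referenceUpdates {
+        if lsn, ok := committedLSN[ref]; ok && lsn > tx.snapshotLSN {
+            return fmt.Errorf("serialization violation: %s changed concurrently", ref)
+        }
+    }
+    return nil
+}
+
+// commit runs the conflict check and, on success, records the new log
+// positions so later overlapping transactions can detect conflicts.
+// Appending to the write-ahead log itself is elided here.
+func commit(tx *Transaction, newLSN uint64) error {
+    if err := verify(tx); err != nil {
+        return err
+    }
+    for _, ref := range tx.referenceUpdates {
+        committedLSN[ref] = newLSN
+    }
+    return nil
+}
+```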
+
+#### Write-ahead log
+
+Prior to transactions, the writes updated the target data on the disk directly. This creates a problem if the writes are interrupted while they are being performed.
+
+For example, given a write:
+
+- `ref-a new-oid old-oid`
+- `ref-b new-oid old-oid`
+
+If the process crashes after updating `ref-a` but not yet updating `ref-b`, the state now contains a partially-applied transaction. This violates atomicity.
+
+The transaction manager uses a write-ahead log to provide atomicity and durability. A transaction's changes are written into the write-ahead log on commit, prior to applying them to the log's projections. If a crash occurs, the transaction is recovered from the log and performed to completion.
+
+All writes into a partition go through the write-ahead log. Once a transaction is logged, it's applied from the log to:
+
+- The Git repository. The repository's current state is constructed from the logged transactions.
+- An embedded database shared between all partitions on a storage. Write-ahead logging-related bookkeeping state is kept here.
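+
+To make the ordering concrete, here is a simplified Go sketch of the append step: the entry is serialized, appended, and flushed to disk before the commit is acknowledged and the changes are applied to the repository. The entry layout and file naming are assumptions for illustration only.
+
+```go
+package storage
+
+import (
+    "encoding/json"
+    "os"
+    "path/filepath"
+)
+
+// LogEntry describes a committed transaction's reference updates. The real
+// entry format also covers packfiles, custom hooks, and other writes.
+type LogEntry struct {
+    LSN              uint64            `json:"lsn"`
+    ReferenceUpdates map[string]string `json:"reference_updates"` // reference name -> new object ID
+}
+
+// appendLogEntry durably writes the entry before it is applied. If the process
+// crashes after this point, recovery replays the entry from the log to
+// completion, preserving atomicity.
+func appendLogEntry(logDir string, entry LogEntry) error {
+    data, err := json.Marshal(entry)
+    if err != nil {
+        return err
+    }
+    f, err := os.OpenFile(filepath.Join(logDir, "wal"), os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
+    if err != nil {
+        return err
+    }
+    defer f.Close()
+    if _, err := f.Write(append(data, '\n')); err != nil {
+        return err
+    }
+    // fsync before acknowledging the commit; applying the entry to the
+    // repository happens asynchronously from the log afterwards.
+    return f.Sync()
+}
+```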
+
+Most writes are fully self-contained in the log entry. Reference updates that include new objects are not. The new objects are logged in a packfile. The objects in a packfile may
+depend on existing objects in the repository. This is problematic for two reasons:
+
+- The dependencies might be garbage collected while the packfile is in the log waiting for application.
+- The dependencies in the actual repository's object database might be garbage collected while a transaction is verifying connectivity of new objects against
+ its snapshot.
+
+Both of these issues can be solved by writing internal references to the packfile's dependencies before committing the log entry. These internal references
+can be cleared when the log entry is pruned. For more information, see [issue 154](https://gitlab.com/gitlab-org/git/-/issues/154) on the GitLab fork of Git.
+
+### Integration
+
+Gitaly contains over 150 RPCs. We want to plug in the transaction management without having to modify all of them. This can be achieved by plugging in a
+gRPC interceptor that handles opening and committing transactions before each handler. The interceptor:
+
+1. Begins the transaction.
+1. Rewrites the repository in the request to point to the transaction's snapshot repository.
+1. Invokes the RPC handler with the rewritten repository.
+1. Commits or rolls back the transaction depending on whether the handler returns successfully or not.
+
+The existing code in the handlers already knows how to access the repositories from the request. Because we rewrite the repository to point to the snapshot,
+they'll be automatically snapshot-isolated because their operations will target the snapshot.
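+
+A sketch of such an interceptor in Go is shown below. The `Manager`, `Txn`, and `repoRequest` interfaces and the way the relative path is rewritten are assumptions made for illustration; the actual integration in Gitaly may look different.
+
+```go
+package storage
+
+import (
+    "context"
+
+    "google.golang.org/grpc"
+)
+
+// repoRequest is an assumed helper interface implemented by requests that
+// carry a repository whose relative path can be rewritten to point at the
+// transaction's snapshot.
+type repoRequest interface {
+    GetRelativePath() string
+    SetRelativePath(string)
+}
+
+// Txn is a minimal view of a transaction for this sketch.
+type Txn interface {
+    SnapshotRelativePath() string
+    Commit(context.Context) error
+    Rollback(context.Context) error
+}
+
+// Manager begins transactions for a given repository.
+type Manager interface {
+    Begin(ctx context.Context, relativePath string) (Txn, error)
+}
+
+// NewTransactionInterceptor wraps every unary RPC in a transaction and points
+// the handler at the snapshot repository instead of the real one.
+func NewTransactionInterceptor(mgr Manager) grpc.UnaryServerInterceptor {
+    return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
+        repoReq, ok := req.(repoRequest)
+        if !ok {
+            // Not a repository-scoped RPC; run it outside a transaction.
+            return handler(ctx, req)
+        }
+
+        tx, err := mgr.Begin(ctx, repoReq.GetRelativePath())
+        if err != nil {
+            return nil, err
+        }
+
+        // The handler only ever sees the snapshot repository.
+        repoReq.SetRelativePath(tx.SnapshotRelativePath())
+
+        resp, err := handler(ctx, req)
+        if err != nil {
+            _ = tx.Rollback(ctx)
+            return nil, err
+        }
+        return resp, tx.Commit(ctx)
+    }
+}
+```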
+
+RPCs that perform non-Git writes, such as `SetCustomHooks`, will need to be adapted because we don't have a way to hook into their writes like we do with
+the reference transaction hook. However, these are a small minority, namely:
+
+- Custom hook updates.
+- Repository creations.
+- Repository deletions.
+
+To support integrating these, we'll provide a helper function to include the data in the transaction. We'll pipe the transaction through the request context.
+
+The biggest concern with integrating the transaction management is missing some locations that write to the repository without respecting the transaction logic. Because
+we are rewriting the request's repository to the snapshot repository, this is not an issue. The RPC handlers do not know the real location of the repository so they can't
+accidentally write there. Any writes they perform to the snapshot repository that are not included in the transaction will be discarded. This should fail tests and alert
+us to the problem.
+
+There may be some locations in Gitaly that would benefit from having the real repository's relative path. An example could be a cache, such as the pack objects cache, that uses the relative path as the cache key. It would be problematic if each transaction had its own snapshot repository and thus a different relative path. If needed, the real relative path could be piped through the request context. The snapshots can be shared between multiple read-only transactions, which would keep the relative path stable. This should work for at least some of the cases, where the cache should expire anyway when the data changes.
+
+The pre-receive hook would send the rewritten repositories to the authorization endpoint at `internal/allowed`. The follow-up requests from the endpoint to Gitaly would already contain the relative path pointing to the snapshot repository with a quarantine configured. The transaction middleware can detect this and not start another transaction.
+
+To retain backwards compatibility with Praefect, the transaction manager will cast votes to Praefect when committing a transaction. Reference transaction hooks won't cast votes because
+the changes there are only captured in the transaction, not actually committed yet.
+
+Housekeeping must be integrated with the transaction processing. Most of the cleanup-related housekeeping tasks, such as removing temporary files or stale locks, are no longer needed. All of the trash left by Git on failures is contained in the snapshots and removed with them when the transaction finishes.
+
+That leaves reference and object repacking, object pruning, and building the various indexes. All of these can be done in transactions. The new packs, for
+example, can be computed in a snapshot. When committing, the transaction manager can check whether their changes conflict with any other concurrently-committed transaction.
+For example, an object that was pruned in a snapshot could be concurrently referenced from another transaction. If there are conflicts, the transaction manager either:
+
+- Resolves the conflict if possible.
+- Aborts the transaction and retries the housekeeping task.
+
+The transaction manager should keep track of how many packfiles and loose references there are in a repository, and trigger a repack when necessary.
+
+The above allows for almost completely transparent integration with the existing code in Gitaly. We only have to update a couple of write RPCs to include the data in the transaction if it is set. This keeps the migration period manageable with minimal conditional logic spread throughout the code base.
+
+### Performance considerations
+
+The most glaring concern is the cost of snapshotting a repository. We are copying the directory structure of the repository and hard linking the files in
+place before a request is processed. This might not be as problematic as it first sounds because:
+
+- The snapshotting is essentially only creating directory entries. These are quick syscalls. The number of files in the repository increases the number of
+ directory entries and links we need to create in the snapshot. This can be mitigated by maintaining the repositories in good shape by repacking objects
+ and references. Reftables will also eventually help reduce the number of loose references. The write-ahead log only writes objects into the repository
+ as packfiles so loose objects won't be a concern in the future.
+- These will be in-memory operations. They'll target the page cache and don't need to be fsynced.
+- The snapshots can be shared between read-only transactions because they don't perform any modifications in them. This means that we only have to create
+ snapshots for writes, and for reads when a new version was committed after creating the previous read-only snapshot. Writes are relatively rare.
+- The isolation level can be configurable on a per-transaction level for performance. Snapshot isolation is not needed when an RPC fetches a single blob.
+
+Serializing the writes requires them to be committed one by one, which could become a bottleneck. However:
+
+- The data partitioning minimizes this bottleneck:
+ - We only have to serialize writes within a partition.
+ - Most repositories will have their own partition.
+ - Object pools and their borrowers must be in the same partition. This could result in large partitions which may lead to a performance degradation. However:
+ - The object pools are currently undergoing a redesign. See [the blueprint](../object_pools/index.md) for more details.
+ - The partition assignments of the object pools, the origin repository, and the forks are better handled in context of the object deduplication design.
+ Some possible approaches include:
+ - Keeping the origin repository in its own partition. This ensures forking a repository does not lead to performance degradation for the forked repository.
+ - Splitting the forks into multiple partitions with each having their own copy of the object pool. This ensures the forks will retain acceptable
+ performance at the cost of increased storage use due to object pool duplication.
+- Checking for write conflicts can be done entirely in memory because the transaction manager can keep track of which resources have been modified by
+ concurrent transactions. This allows for finer-grained locking than Git supports, especially when it comes to reference deletions.
+
+The snapshot isolation requires us to keep multiple versions of data. This will increase storage usage. The actual impact depends on the amount of the new data written and the open transactions that are holding on to the old data.
+
+On the other hand, the snapshot isolation brings performance benefits:
+
+- `fsync` can be turned off for most writes because they target the snapshots. The writes that are committed to the real repository will be `fsync`ed by the transaction manager.
+- Transactions never block each other because they acquire locks only in their own snapshots. For example, transactions can concurrently delete references because they each have
+ their own `packed-refs` file.
+- Writes into the main repository can be batched together. For example, if multiple reference deletions are committed around the same time, they can be applied to the repository
+ in a single write, resulting in rewriting the `packed-refs` file only once.
+
+Snapshot isolation also enables features that were not previously feasible. These are generally long-running read operations:
+
+- Online checksumming requires that the data doesn't change during the checksumming operation. This would previously require a lock on the repository. This can be done without
+ any blocking because the checksum can be computed from the snapshot.
+- Online (consistent) backups become possible because they can be built from the snapshot.
+
+## Life of a transaction
+
+The diagram below models the flow of a write transaction that updates some references. The diagram shows the key points of how the transactions are handled:
+
+- Each transaction has a snapshot of the repository.
+- The RPC handlers never operate on the repository itself.
+- The changes performed in the snapshot are captured in the transaction.
+- The changes are committed after the RPC has returned successfully.
+- The transaction is asynchronously applied to the repository from the log.
+
+Beginning and committing a transaction may block other transactions. Open transactions proceed concurrently without blocking:
+
+1. A shared lock is acquired on the repository when the snapshot is being created. Multiple snapshots can be taken at the same time, but no changes can be written into
+ the repository.
+1. Transactions run concurrently without any blocking until the commit call where the serializability checks are done.
+1. Log application acquires an exclusive lock on the repository, which blocks snapshotting.
+
+```mermaid
+sequenceDiagram
+ autonumber
+ gRPC Server->>+Transaction Middleware: Request
+ Transaction Middleware->>+Transaction Manager: Begin
+ Transaction Manager->>+Transaction: Open Transaction
+ participant Repository
+ critical Shared Lock on Repository
+ Transaction->>+Snapshot: Create Snapshot
+ end
+ Transaction->>Transaction Manager: Transaction Opened
+ Transaction Manager->>Transaction Middleware: Begun
+ Transaction Middleware->>+RPC Handler: Rewritten Request
+ RPC Handler->>+git update-ref: Update References
+ git update-ref->>Snapshot: Prepare
+ Snapshot->>git update-ref: Prepared
+ git update-ref->>Snapshot: Commit
+ Snapshot->>git update-ref: Committed
+ git update-ref->>+Reference Transaction Hook: Invoke
+ Reference Transaction Hook->>Transaction: Capture Updates
+ Transaction->>Reference Transaction Hook: OK
+ Reference Transaction Hook->>-git update-ref: OK
+ git update-ref->>-RPC Handler: References Updated
+ RPC Handler->>-Transaction Middleware: Success
+ Transaction Middleware->>Transaction: Commit
+ Transaction->>Transaction Manager: Commit
+ critical Serializability Check
+ Transaction Manager->>Transaction Manager: Verify Transaction
+ end
+ Transaction Manager->>Repository: Log Transaction
+ Repository->>Transaction Manager: Transaction Logged
+ Transaction Manager->>Transaction: Committed
+ Transaction->>Snapshot: Remove Snapshot
+ deactivate Snapshot
+ Transaction->>-Transaction Middleware: Committed
+ Transaction Middleware->>-gRPC Server: Success
+ critical Exclusive Lock on Repository
+ Transaction Manager->>-Repository: Apply Transaction
+ end
+```
+
+## Future opportunities
+
+### Expose transactions to clients
+
+Once Gitaly internally has transactions, the next natural step is to expose them to the clients. For example, Rails could run multiple operations in a single transaction. This would
+extend the ACID guarantees to the clients, which would solve a number of issues:
+
+- The clients would have the ability to commit transactions atomically. Either all changes they make are performed or none are.
+- The operations would automatically be guarded against races through the serializability guarantees.
+
+For Gitaly maintainers, extending the transactions to clients enables reducing our API surface. Gitaly has multiple RPCs that perform the same operations. For example, references
+are updated in multiple RPCs. This increases complexity. If the clients can begin, stage changes, and commit a transaction, we can have fewer, more fine-grained RPCs. For
+example, `UserCommitFiles` could be modeled with more fine-grained commands as:
+
+- `Begin`
+- `WriteBlob`
+- `WriteTree`
+- `WriteCommit`
+- `UpdateReference`
+- `Commit`
+
+This makes the API composable because the clients can use the single-purpose RPCs to compose more complex operations. This might lead to a concern that each operation requires
+multiple RPC calls, increasing the latency due to roundtrips. This can be mitigated by providing an API that allows for batching commands.
+
+Other databases provide these features through explicit transactions and a query language.
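+
+If transactions were exposed this way, a client interaction could be sketched roughly as follows in Go. The `Client` interface, the `TxnID` and `OID` types, and the command signatures are hypothetical; only the command names come from the list above.
+
+```go
+package main
+
+import "context"
+
+type (
+    TxnID string
+    OID   string
+)
+
+// Client is a hypothetical Gitaly client exposing the fine-grained commands
+// listed above; none of these RPCs exist today.
+type Client interface {
+    Begin(ctx context.Context) (TxnID, error)
+    WriteBlob(ctx context.Context, tx TxnID, content []byte) (OID, error)
+    WriteTree(ctx context.Context, tx TxnID, entries map[string]OID) (OID, error)
+    WriteCommit(ctx context.Context, tx TxnID, tree OID, message string) (OID, error)
+    UpdateReference(ctx context.Context, tx TxnID, ref string, newOID OID) error
+    Commit(ctx context.Context, tx TxnID) error
+}
+
+// commitFile shows how something like UserCommitFiles could be composed from
+// the smaller commands; all changes become visible atomically on Commit.
+func commitFile(ctx context.Context, c Client, path string, content []byte) error {
+    tx, err := c.Begin(ctx)
+    if err != nil {
+        return err
+    }
+    blob, err := c.WriteBlob(ctx, tx, content)
+    if err != nil {
+        return err
+    }
+    tree, err := c.WriteTree(ctx, tx, map[string]OID{path: blob})
+    if err != nil {
+        return err
+    }
+    commit, err := c.WriteCommit(ctx, tx, tree, "Add "+path)
+    if err != nil {
+        return err
+    }
+    if err := c.UpdateReference(ctx, tx, "refs/heads/main", commit); err != nil {
+        return err
+    }
+    return c.Commit(ctx, tx)
+}
+```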
+
+### Continuous backups with WAL archiving
+
+Incremental backups are currently prohibitively slow because they must always compute the changes between the previous backup and the current state of the repository. Because
+all writes to a partition go through the write-ahead log, it's possible to stream the write-ahead log entries to incrementally back up the repository. For more information,
+see [Repository Backups](../repository_backups/index.md).
+
+### Raft replication
+
+The transactions provide serializability on a single partition. The partition's write-ahead log can be replicated using a consensus algorithm such as Raft. Because Raft
+guarantees linearizability for log entry commits, and the transaction manager ensures serializability of transactions prior to logging them, all operations across the replicas
+get serializability guarantees. For more information, see [epic 8903](https://gitlab.com/groups/gitlab-org/-/epics/8903).
+
+## Alternative solutions
+
+No alternatives have been proposed to the transaction management. The current state of squashing concurrency- and write interruption-related bugs one by one is not scalable.
+
+### Snapshot isolation with reftables
+
+Our preliminary designs for snapshot isolation relied on reftables, a new reference backend in Git. Reftables have been a work in progress for years and there doesn't seem to
+be a clear timeline for when they'll actually land in Git. They have a number of shortcomings compared to the proposed solution here:
+
+- Reftables only cover references in a snapshot. The snapshot design here covers the complete repository, most importantly object database content.
+- Reftables would require heavy integration as each Git invocation would have to be wired to read the correct version of a reftable. The file-system-based snapshot design
+ here requires no changes to the existing Git invocations.
+- The design here gives a complete snapshot of a repository, which enables running multiple RPCs on the same transaction because the transaction's state is stored on the disk
+ during the transaction. Each RPC is able to read the transaction's earlier writes but remain isolated from other transactions. It's unclear how this would be implemented with
+ reftables, especially when it comes to object isolation. This is needed if we want to extend the transaction interface to the clients.
+- The snapshots are independent from each other. This reduces synchronization because each transaction can proceed with staging their changes without being blocked by any
+ other transactions. This enables optimistic locking for better performance.
+
+Reftables are still useful as a more efficient reference backend but they are not needed for snapshot isolation.
diff --git a/doc/architecture/blueprints/gitlab_ci_events/decisions/001_hierarchical_events.md b/doc/architecture/blueprints/gitlab_ci_events/decisions/001_hierarchical_events.md
new file mode 100644
index 00000000000..cec8fa47634
--- /dev/null
+++ b/doc/architecture/blueprints/gitlab_ci_events/decisions/001_hierarchical_events.md
@@ -0,0 +1,62 @@
+---
+owning-stage: "~devops::verify"
+description: 'GitLab CI Events ADR 001: Use hierarchical events'
+---
+
+# GitLab CI Events ADR 001: Use hierarchical events
+
+## Context
+
+We did some brainstorming in [an issue](https://gitlab.com/gitlab-org/gitlab/-/issues/424865)
+with multiple use-cases for running CI pipelines based on subscriptions to CI
+events. The pattern of using hierarchical events emerged; it became clear that
+events may be grouped together by type or by origin.
+
+For example:
+
+```yaml
+annotate:
+ on: issue/created
+ script: ./annotate $[[ event.issue.id ]]
+
+summarize:
+ on: issue/closed
+ script: ./summarize $[[ event.issue.id ]]
+```
+
+When making this decision we didn't focus on the syntax yet, but the grouping
+of events seems to be useful in the majority of use cases.
+
+We considered making it possible for users to subscribe to multiple events in a
+group at once:
+
+```yaml
+audit:
+ on: events/gitlab/gitlab-org/audit/*
+ script: ./audit $[[ event.operation.name ]]
+```
+
+The implication of this is that events within the same group should share the same
+fields / schema definition.
+
+## Decision
+
+Use hierarchical events: events that can be grouped together and that will
+share the same fields following a stable contract. For example: all _issue_
+events will contain an `issue.iid` field.
+
+How we group events has not been decided yet; we can either do that by
+labeling or by grouping with a path-like syntax.
+
+## Consequences
+
+The implication is that we will need to build a system with a stable interface
+describing events' payloads and/or schemas.
+
+## Alternatives
+
+An alternative is not to use hierarchical events, making it necessary to
+subscribe to every event separately, without giving users any guarantees around
+a common schema for different events. This would be especially problematic for
+events that naturally belong to a group and for which users expect a common
+schema, like audit events.
diff --git a/doc/architecture/blueprints/gitlab_ci_events/index.md b/doc/architecture/blueprints/gitlab_ci_events/index.md
index 51d65869dfb..afa7f324111 100644
--- a/doc/architecture/blueprints/gitlab_ci_events/index.md
+++ b/doc/architecture/blueprints/gitlab_ci_events/index.md
@@ -2,9 +2,9 @@
status: proposed
creation-date: "2023-03-15"
authors: [ "@furkanayhan" ]
-owners: [ "@furkanayhan" ]
+owners: [ "@fabiopitino" ]
coach: "@grzesiek"
-approvers: [ "@jreporter", "@cheryl.li" ]
+approvers: [ "@fabiopitino", "@jreporter", "@cheryl.li" ]
owning-stage: "~devops::verify"
participating-stages: [ "~devops::package", "~devops::deploy" ]
---
@@ -46,6 +46,10 @@ Events" blueprint is about making it possible to:
## Proposal
+### Decisions
+
+- [001: Use hierarchical events](decisions/001_hierarchical_events.md)
+
### Requirements
Any accepted proposal should take in consideration the following requirements and characteristics:
diff --git a/doc/architecture/blueprints/gitlab_ml_experiments/index.md b/doc/architecture/blueprints/gitlab_ml_experiments/index.md
index 90adfc41257..e0675bb5be6 100644
--- a/doc/architecture/blueprints/gitlab_ml_experiments/index.md
+++ b/doc/architecture/blueprints/gitlab_ml_experiments/index.md
@@ -69,7 +69,7 @@ Instead of embedding these applications directly into the Rails and/or Sidekiq c
![use services instead of fat containers](https://docs.google.com/drawings/d/e/2PACX-1vSRrPo0TNtXG8Yqj37TO2PaND9PojGZzNRs2rcTA37-vBZm5WZlfxLDCKVJD1vYHTbGy1KY1rDYHwlg/pub?w=1008&h=564)\
[source](https://docs.google.com/drawings/d/1ZPprcSYH5Oqp8T46I0p1Hhr-GD55iREDvFWcpQq9dTQ/edit)
-The service-integration approach has already been used for the [Suggested Reviewers feature](https://gitlab.com/gitlab-com/gl-infra/readiness/-/merge_requests/114) that has been deployed to GitLab.com.
+The service-integration approach has already been used for the [GitLab Duo Suggested Reviewers feature](https://gitlab.com/gitlab-com/gl-infra/readiness/-/merge_requests/114) that has been deployed to GitLab.com.
This approach would have many advantages:
diff --git a/doc/architecture/blueprints/gitlab_observability_backend/index.md b/doc/architecture/blueprints/gitlab_observability_backend/index.md
deleted file mode 100644
index 5b99235e18c..00000000000
--- a/doc/architecture/blueprints/gitlab_observability_backend/index.md
+++ /dev/null
@@ -1,693 +0,0 @@
----
-status: proposed
-creation-date: "2022-11-09"
-authors: [ "@ankitbhatnagar" ]
-coach: "@mappelman"
-approvers: [ "@sebastienpahl", "@nicholasklick" ]
-owning-stage: "~monitor::observability"
-participating-stages: []
----
-
-<!-- vale gitlab.FutureTense = NO -->
-
-# GitLab Observability Backend - Metrics
-
-## Summary
-
-Developing a multi-user system to store & query observability data typically formatted in widely accepted, industry-standard formats using Clickhouse as underlying storage, with support for long-term data retention and aggregation.
-
-## Motivation
-
-From the six pillars of Observability, commonly abbreviated as `TEMPLE` - Traces, Events, Metrics, Profiles, Logs & Errors, Metrics constitute one of the most important pillars of observability data for modern day systems, helping their users gather insights about their operational posture.
-
-Metrics which are commonly structured as timeseries data have the following characteristics:
-
-- indexed by their corresponding timestamps;
-- continuously expanding in size;
-- usually aggregated, down-sampled, and queried in ranges; and
-- have very write-intensive requirements.
-
-Within GitLab Observability Backend, we aim to add the support for our customers to ingest and query observability data around their systems & applications, helping them improve the operational health of their systems.
-
-### Goals
-
-With the development of the proposed system, we have the following goals:
-
-- Scalable, low latency & cost-effective monitoring system backed by Clickhouse whose performance has been proven via repeatable benchmarks.
-
-- Support for long-term storage for Prometheus/OpenTelemetry formatted metrics, ingested via Prometheus remote_write API and queried via Prometheus remote_read API, PromQL or SQL with support for metadata and exemplars.
-
-The aforementioned goals can further be broken down into the following four sub-goals:
-
-#### Ingesting data
-
-- For the system to be capable of ingesting large volumes of writes and reads, we aim to ensure that it must be horizontally scalable & provide durability guarantees to ensure no writes are dropped once ingested.
-
-#### Persisting data
-
-- We aim to support ingesting telemetry/data sent using Prometheus `remote_write` protocol. Any persistence we design for our dataset must be multi-tenant by default, ensuring we can store observability data for multiple tenants/groups/projects within the same storage backend.
-
-- We aim to develop a test suite for data correctness, seeking inspiration from how Prometheus compliance test suite checks the correctness of a given Metrics implementation and running it as a part of our CI setup.
-
-NOTE:
-Although remote_write_sender does not test the correctness of a remote write receiver itself as is our case, it does bring some inspiration to implement/develop one within the scope of this project.
-
-- We aim to also ensure compatibility for special Prometheus data types, for example, Prometheus histogram(s), summary(s).
-
-#### Reading data
-
-- We aim to support querying data using PromQL which means translating PromQL queries into Clickhouse SQL. To do this, [PromQL](https://github.com/prometheus/prometheus/tree/main/promql/parser) or [MetricsQL](https://github.com/VictoriaMetrics/metricsql) parsers are good alternatives.
-
-- We aim to provide additional value by exposing all ingested data via the native Clickhouse SQL interface subject to the following reliability characteristics:
- - query validation, sanitation
- - rate limiting
- - resource limiting - memory, cpu, network bandwidth
-
-- We aim to pass the Prometheus test suites for correctness via the [Prometheus Compliance test suite](https://github.com/prometheus/compliance/tree/main/promql) with a target goal of 100% success rate.
-
-#### Deleting data
-
-- We aim to support being able to delete any ingested data should such a need arise. This is also in addition to us naturally deleting data when a configured TTL expires and/or respective retention policies are enforced. We must, within our schemas, build a way to delete data by labels OR their content, also add to our offering the necessary tooling to do so.
-
-### Non-Goals
-
-With the goals established above, we also want to establish what specific things are non-goals with the current proposal. They are:
-
-- We do not aim to support ingestion using OpenTelemetry/OpenMetrics formats with our first iteration, though our users can still use the Opentelemetry exporters(s) internally consuming the standard Prometheus `remote_write` protocol. More information [here](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/prometheusremotewriteexporter).
-
-- We do not aim to support ingesting Prometheus exemplars in our first iteration, though we do aim to account for them in our design from the beginning.
-
-NOTE:
-Worth noting that we intend to model exemplars the same way we're modeling metric-labels, so building on top of the same data structure should help implement support for metadata/exemplars rather easily.
-
-## Proposal
-
-We intend to use GitLab Observability Backend as a framework for the Metrics implementation so that its lifecycle is also managed via already existing Kubernetes controllers for example, scheduler, tenant-operator.
-
-![Architecture](supported-deployments.png)
-
-From a development perspective, what's been marked as our "Application Server" above needs to be developed as a part of this proposal while the remaining peripheral components either already exist or can be provisioned via existing code in `scheduler`/`tenant-operator`.
-
-**On the write path**, we expect to receive incoming data via `HTTP`/`gRPC` `Ingress` similar to what we do for our existing services, for example, errortracking, tracing.
-
-NOTE:
-Additionally, since we intend to ingest data via Prometheus `remote_write` API, the received data will be Protobuf-encoded, Snappy-compressed. All received data therefore needs to be decompressed & decoded to turn it into a set of `prompb.TimeSeries` objects, which the rest of our components interact with.
-
-We also need to make sure to avoid writing a lot of small writes into Clickhouse, therefore it'd be prudent to batch data before writing it into Clickhouse.
-
-We must also make sure ingestion remains decoupled from `Storage` so as to reduce undue dependence on a given storage implementation. While we do intend to use Clickhouse as our backing storage for any foreseeable future, this ensures we do not tie ourselves into Clickhouse too much should future business requirements warrant the usage of a different backend/technology. A good way to implement this in Go would be our implementations adhering to a standard interface, the following for example:
-
-```go
-type Storage interface {
- Read(
- ctx context.Context,
- request *prompb.ReadRequest
- ) (*prompb.ReadResponse, error)
- Write(
- ctx context.Context,
- request *prompb.WriteRequest
- ) error
-}
-```
-
-NOTE:
-We understand this couples the implementation with Prometheus data format/request types, but adding methods to the interface to support more data formats should be trivial looking forward with minimal changes to code.
-
-**On the read path**, we aim to allow our users to use the Prometheus `remote_read` API and be able to query ingested data via PromQL & SQL. Support for `remote_read` API should be trivial to implement, while supporting PromQL would need translating it into SQL. We can however employ the usage of already existing [PromQL](https://github.com/prometheus/prometheus/tree/main/promql/parser) parsing libraries.
-
-We aim to focus on implementing query validation & sanitation, rate-limiting and regulating resource-consumption to ensure underlying systems, esp. storage, remain in good operational health at all times.
-
-### Supported deployments
-
-In this first iteration of the metrics backend, we intend to support a generic deployment model that makes sure we can capture as much usage as possible and begin dogfooding the product as soon as possible. This is well illustrated in the [aforementioned architecture diagram](#proposal).
-
-In its most vanilla form, metrics support in GitLab Observability Backend can be used via the Prometheus remote read & write APIs. If a user already uses Prometheus as their monitoring abstraction, it can be configured to use this backend directly.
-
-- remote_write: [configuration](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write)
-- remote_read: [configuration](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_read)
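-
-For illustration, a minimal sketch of such a Prometheus configuration follows. The endpoint URLs and the authentication scheme are assumptions for illustration only; the actual tenant-specific endpoints and credentials are not defined in this document.
-
-```yaml
-# prometheus.yml (sketch): point an existing Prometheus at the metrics backend.
-# The URLs and credentials below are placeholders, not real endpoints.
-remote_write:
-  - url: https://observe.example.gitlab.com/<group-id>/prometheus/api/v1/write
-    basic_auth:
-      username: <tenant>
-      password: <token>
-
-remote_read:
-  - url: https://observe.example.gitlab.com/<group-id>/prometheus/api/v1/read
-    read_recent: true
-    basic_auth:
-      username: <tenant>
-      password: <token>
-```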
-
-For users of the system that do not use a Prometheus instance for scraping their telemetry data, they can export their metrics via a multitude of collectors/agents such as the OpenTelemetry collector or the Prometheus Agent for example, all of which can be configured to use our remote_write endpoint. For reads however, we intend to run a Prometheus within GOB (alongside the application server) itself, then hook it up automatically with the GitLab Observability UI (GOUI) preconfigured to consume our remote_read endpoint.
-
-Notably, the ability to use a GOB-run Prometheus instance is applicable while we can only support remote_read API for running queries. Looking forward towards our next iteration, we should be able to get rid of this additional component altogether when we have full support for executing PromQL and/or SQL queries directly from GOUI.
-
-**Per-group deployments**: From a scalability perspective, we deploy an instance of Ingress, a Prometheus instance & the application server per group to make sure we can scale them subject to traffic volumes of the respective tenant. It also helps isolate resource consumption across tenants in an otherwise multi-tenant system.
-
-### Metric collection and storage
-
-It is important to separate metric collection on the client side with the storage we provision at our end.
-
-### State of the art for storage
-
-Existing long-term Prometheus compatible metrics vendors provide APIs that are compatible with Prometheus remote_write.
-
-### State of the art for Prometheus clients
-
-Metric collection clients such as Prometheus itself, Grafana Cloud Agent, Datadog Agent, etc. will scrape metrics endpoints typically from within a firewalled environment, store locally scraped metrics in a [Write Ahead Log (WAL)](https://en.wikipedia.org/wiki/Write-ahead_logging) and then batch send them to an external environment (i.e. the vendor or an internally managed system like Thanos) via the Prometheus `remote_write` protocol.
-
-- A client-side collector is an important part of the overall architecture, though it's owned by the customer/user since it needs to run in their environment. This gives the end user full control over their data because they control how it is collected and to where it is delivered.
-
-- It's **not** feasible to provide an external vendor with credentials to access and scrape endpoints within a user's firewalled environment.
-
-- It's also critically important that our `remote_write` APIs respond correctly with the appropriate rate-limiting status codes so that Prometheus Clients can respect them.
-
-[Here](https://grafana.com/blog/2021/05/26/the-future-of-prometheus-remote-write/) is a good background/history on Prometheus `remote_write` and its importance in Prometheus based observability.
-
-## Design and implementation details
-
-Following are details of how we aim to design & implement the proposed solution. To that end, a reference implementation was also developed to understand the scope of the problem and provide early data to ensure our proposal was drafted around informed decisions and/or results of our experimentation.
-
-## Reference implementation(s)
-
-- [Application server](https://gitlab.com/gitlab-org/opstrace/opstrace/-/merge_requests/1823)
-- [Metrics generator](https://gitlab.com/ankitbhatnagar/metrics-gen/-/blob/main/main.go)
-
-## Target environments
-
-Keeping inline with our current operational structure, we intend to deploy the metrics offering as a part of GitLab Observability Backend, deployed on the following two target environments:
-
-- kind cluster (for local development)
-- GKE cluster (for staging/production environments)
-
-## Schema Design
-
-### **Proposed solution**: Fully normalized tables for decreased redundancy & increased read performance
-
-### primary, denormalized data table
-
-```sql
-CREATE TABLE IF NOT EXISTS samples ON CLUSTER '{cluster}' (
- series_id UUID,
- timestamp DateTime64(3, 'UTC') CODEC(Delta(4), ZSTD),
- value Float64 CODEC(Gorilla, ZSTD)
-) ENGINE = ReplicatedMergeTree()
-PARTITION BY toYYYYMMDD(timestamp)
-ORDER BY (series_id, timestamp)
-```
-
-### metadata table to support timeseries metadata/exemplars
-
-```sql
-CREATE TABLE IF NOT EXISTS samples_metadata ON CLUSTER '{cluster}' (
- series_id UUID,
- timestamp DateTime64(3, 'UTC') CODEC(Delta(4), ZSTD),
- metadata Map(String, String) CODEC(ZSTD),
-) ENGINE = ReplicatedMergeTree()
-PARTITION BY toYYYYMMDD(timestamp)
-ORDER BY (series_id, timestamp)
-```
-
-### lookup table(s)
-
-```sql
-CREATE TABLE IF NOT EXISTS labels_to_series ON CLUSTER '{cluster}' (
-    labels Map(String, String) CODEC(ZSTD),
-    series_id UUID
-) ENGINE = ReplicatedMergeTree
-PRIMARY KEY (labels, series_id)
-```
-
-```sql
-CREATE TABLE IF NOT EXISTS group_to_series ON CLUSTER '{cluster}' (
-    group_id UInt64,
-    series_id UUID
-) ENGINE = ReplicatedMergeTree()
-ORDER BY (group_id, series_id)
-```
-
-### Refinements
-
-- sharding considerations for a given tenant when ingesting/persisting data if we intend to co-locate data specific to multiple tenants within the same database tables. To simplify things, segregating tenant-specific data to their own dedicated set of tables would make a lot of sense.
-
-- structural considerations for "timestamps" when ingesting data across tenants.
-
-- creation_time vs ingestion_time
-
-- No support for transactions in the native client yet, to be able to effectively manage writes across multiple tables.
-
-NOTE:
-Slightly non-trivial but we can potentially investigate the possibility of using ClickHouse/ch-go directly, it supposedly promises a better performance profile too.
-
-### Pros - multiple tables
-
-- Normalised data structuring allows for efficient storage of data, removing any redundancy across multiple samples for a given timeseries. Evidently, for the "samples" schema, we expect to store 32 bytes of data per metric point.
-
-- Better search complexity when filtering timeseries by labels/metadata, via the use of better indexed columns.
-
-- All data is identifiable via a unique identifier, which can be used to maintain data consistency across tables.
-
-### Cons - multiple tables
-
-- Writes are trivially expensive considering writes across multiple tables.
-
-- Writes across tables also need to be implemented as a transaction to guarantee consistency when ingesting data.
-
-### Operational characteristics - multiple tables
-
-### Storage - multiple tables
-
-A major portion of our writes are made into the `samples` schema which contains a tuple containing three data points per metric point written:
-
-| Column | Data type | Byte size |
-|:------------|:-----------|:----------|
-| `series_id` | UUID | 16 bytes |
-| `timestamp` | DateTime64 | 8 bytes |
-| `value` | Float64 | 8 bytes |
-
-Therefore, we estimate to use 32 bytes per sample ingested.
-
-### Compression - multiple tables
-
-Inspecting the amount of compression we're able to get with the given design on our major schemas, we see it as a good starting point. Following measurements for both primary tables:
-
-**Schema**: `labels_to_series` containing close to 12k unique `series_id`, each mapping to a set of 10-12 label string pairs
-
-```sql
-SELECT
- table,
- column,
- formatReadableSize(sum(data_compressed_bytes) AS x) AS compressedsize,
- formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed
-FROM system.parts_columns
-WHERE table LIKE 'labels_to_series_1'
-GROUP BY
- database,
- table,
- column
-ORDER BY x ASC
-
-Query id: 723b4145-14f7-4e74-9ada-01c17c2f1fd5
-
-┌─table──────────────┬─column────┬─compressedsize─┬─uncompressed─┐
-│ labels_to_series_1 │ labels │ 586.66 KiB │ 2.42 MiB │
-│ labels_to_series_1 │ series_id │ 586.66 KiB │ 2.42 MiB │
-└────────────────────┴───────────┴────────────────┴──────────────┘
-```
-
-**Schema**: `samples` containing about 20k metric samples each containing a tuple comprising `series_id` (16 bytes), `timestamp` (8 bytes) and `value` (8 bytes).
-
-```sql
-SELECT
- table,
- column,
- formatReadableSize(sum(data_compressed_bytes) AS x) AS compressedsize,
- formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed
-FROM system.parts_columns
-WHERE table LIKE 'samples_1'
-GROUP BY
- database,
- table,
- column
-ORDER BY x ASC
-
-Query id: 04219cea-06ea-4c5f-9287-23cb23c023d2
-
-┌─table─────┬─column────┬─compressedsize─┬─uncompressed─┐
-│ samples_1 │ value │ 373.21 KiB │ 709.78 KiB │
-│ samples_1 │ timestamp │ 373.21 KiB │ 709.78 KiB │
-│ samples_1 │ series_id │ 373.21 KiB │ 709.78 KiB │
-└───────────┴───────────┴────────────────┴──────────────┘
-```
-
-### Performance - multiple tables
-
-From profiling our reference implementation, it can also be noted that most of our time right now is spent in the application writing data to Clickhouse and/or its related operations. A "top" pprof profile sampled from the implementation looked like:
-
-```shell
-(pprof) top
-Showing nodes accounting for 42253.20kB, 100% of 42253.20kB total
-Showing top 10 nodes out of 58
- flat flat% sum% cum cum%
-13630.30kB 32.26% 32.26% 13630.30kB 32.26% github.com/ClickHouse/clickhouse-go/v2/lib/compress.NewWriter (inline)
-11880.92kB 28.12% 60.38% 11880.92kB 28.12% github.com/ClickHouse/clickhouse-go/v2/lib/compress.NewReader (inline)
- 5921.37kB 14.01% 74.39% 5921.37kB 14.01% bufio.NewReaderSize (inline)
- 5921.37kB 14.01% 88.41% 5921.37kB 14.01% bufio.NewWriterSize (inline)
- 1537.69kB 3.64% 92.04% 1537.69kB 3.64% runtime.allocm
- 1040.73kB 2.46% 94.51% 1040.73kB 2.46% github.com/aws/aws-sdk-go/aws/endpoints.init
- 1024.41kB 2.42% 96.93% 1024.41kB 2.42% runtime.malg
- 768.26kB 1.82% 98.75% 768.26kB 1.82% go.uber.org/zap/zapcore.newCounters
- 528.17kB 1.25% 100% 528.17kB 1.25% regexp.(*bitState).reset
- 0 0% 100% 5927.73kB 14.03% github.com/ClickHouse/clickhouse-go/v2.(*clickhouse).Ping
-```
-
-As is evident above from our preliminary analysis, writing data into Clickhouse can be a potential bottleneck. Therefore, on the write path, it'd be prudent to batch our writes into Clickhouse so as to reduce the amount of work the application server ends up doing making the ingestion path more efficient.
-
-On the read path, it's also possible to parallelize reads for the samples table either by `series_id` OR by blocks of time between the queried start and end timestamps.
-
-### Caveats
-
-- When dropping labels from already existing metrics, we treat their new counterparts as completely new series and hence attribute them to a new `series_id`. This avoids having to merge series data and/or values. The old series, if not actively written into, should eventually fall off their retention and get deleted.
-
-- We have not yet accounted for any data aggregation. Our assumption is that the backing store (in Clickhouse) should allow us to keep a "sufficient" amount of data in its raw form and that we should be able to query against it within our query latency SLOs.
-
-### **Rejected alternative**: Single, centralized table
-
-### single, centralized data table
-
-```sql
-CREATE TABLE IF NOT EXISTS metrics ON CLUSTER '{cluster}' (
- group_id UInt64,
- name LowCardinality(String) CODEC(ZSTD),
- labels Map(String, String) CODEC(ZSTD),
- metadata Map(String, String) CODEC(ZSTD),
- value Float64 CODEC (Gorilla, ZSTD),
- timestamp DateTime64(3, 'UTC') CODEC(Delta(4),ZSTD)
-) ENGINE = ReplicatedMergeTree()
-PARTITION BY toYYYYMMDD(timestamp)
-ORDER BY (group_id, name, timestamp);
-```
-
-### Pros - single table
-
-- Single source of truth, so all metrics data lives in one big table.
-
-- Querying data is easier to express in terms of writing SQL queries without having to query data across multiple tables.
-
-### Cons - single table
-
-- Huge redundancy built into the data structure since attributes such as name, labels, metadata are stored repeatedly for each sample collected.
-
-- Non-trivial complexity to search timeseries with values for labels/metadata given how they're stored when backed by Maps/Arrays.
-
-- High query latencies by virtue of having to scan large amounts of data per query made.
-
-### Operational Characteristics - single table
-
-### Storage - single table
-
-| Column | Data type | Byte size |
-|:------------|:--------------------|:----------|
-| `group_id` | UUID | 16 bytes |
-| `name` | String | - |
-| `labels` | Map(String, String) | - |
-| `metadata` | Map(String, String) | - |
-| `value` | Float64 | 8 bytes |
-| `timestamp` | DateTime64 | 8 bytes |
-
-NOTE:
-Strings are of an arbitrary length, the length is not limited. Their value can contain an arbitrary set of bytes, including null bytes. We will need to regulate what we write into these columns application side.
-
-### Compression - single table
-
-**Schema**: `metrics` containing about 20k metric samples each consisting of a `group_id`, `metric name`, `labels`, `metadata`, `timestamp` & corresponding `value`.
-
-```sql
-SELECT count(*)
-FROM metrics_1
-
-Query id: e580f20b-b422-4d93-bb1f-eb1435761604
-
-┌─count()─┐
-│ 12144 │
-
-
-SELECT
- table,
- column,
- formatReadableSize(sum(data_compressed_bytes) AS x) AS compressedsize,
- formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed
-FROM system.parts_columns
-WHERE table LIKE 'metrics_1'
-GROUP BY
- database,
- table,
- column
-ORDER BY x ASC
-
-Query id: b2677493-3fbc-46c1-a9a7-4524a7a86cb4
-
-┌─table─────┬─column────┬─compressedsize─┬─uncompressed─┐
-│ metrics_1 │ labels │ 283.02 MiB │ 1.66 GiB │
-│ metrics_1 │ metadata │ 283.02 MiB │ 1.66 GiB │
-│ metrics_1 │ group_id │ 283.02 MiB │ 1.66 GiB │
-│ metrics_1 │ value │ 283.02 MiB │ 1.66 GiB │
-│ metrics_1 │ name │ 283.02 MiB │ 1.66 GiB │
-│ metrics_1 │ timestamp │ 283.02 MiB │ 1.66 GiB │
-└───────────┴───────────┴────────────────┴──────────────┘
-```
-
-Though we see a good compression factor for the aforementioned schema, the amount of storage needed to store the corresponding dataset is approximately 300MiB. We also expect to see this footprint increase linearly given the redundancy baked into the schema design itself, also one of the reasons we intend **not** to proceed with this design further.
-
-### Performance - single table
-
-```shell
-(pprof) top
-Showing nodes accounting for 12844.95kB, 100% of 12844.95kB total
-Showing top 10 nodes out of 40
- flat flat% sum% cum cum%
- 2562.81kB 19.95% 19.95% 2562.81kB 19.95% runtime.allocm
- 2561.90kB 19.94% 39.90% 2561.90kB 19.94% github.com/aws/aws-sdk-go/aws/endpoints.init
- 2374.91kB 18.49% 58.39% 2374.91kB 18.49% github.com/ClickHouse/clickhouse-go/v2/lib/compress.NewReader (inline)
- 1696.32kB 13.21% 71.59% 1696.32kB 13.21% bufio.NewWriterSize (inline)
- 1184.27kB 9.22% 80.81% 1184.27kB 9.22% bufio.NewReaderSize (inline)
- 1184.27kB 9.22% 90.03% 1184.27kB 9.22% github.com/ClickHouse/clickhouse-go/v2/lib/compress.NewWriter (inline)
- 768.26kB 5.98% 96.01% 768.26kB 5.98% go.uber.org/zap/zapcore.newCounters
- 512.20kB 3.99% 100% 512.20kB 3.99% runtime.malg
- 0 0% 100% 6439.78kB 50.13% github.com/ClickHouse/clickhouse-go/v2.(*clickhouse).Ping
- 0 0% 100% 6439.78kB 50.13% github.com/ClickHouse/clickhouse-go/v2.(*clickhouse).acquire
-```
-
-Writes against this schema perform much better in terms of compute, given it's concentrated on one table and does not need looking up `series_id` from a side table.
-
-### General storage considerations - Clickhouse
-
-The following sections intend to deep-dive into specific characteristics of our schema design and/or their interaction with Clickhouse - the database system.
-
-- table engines
-
- - [MergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/)
- - [S3 Table Engine](https://clickhouse.com/docs/en/engines/table-engines/integrations/s3/)
-
-- efficient partitioning and/or sharding
-
- - Configuring our schemas with the right partitioning keys so as to have the least amount of blocks scanned when reading back the data.
- - Sharding here would refer to how we design our data placement strategy to make sure the cluster remains optimally balanced at all times.
-
-- data compression
-
-As is visible from the aforementioned preliminary results, we see good compression results with dictionary and delta encoding for strings and floats respectively. When storing labels with a `Map` of `LowCardinality(String)`s, we were able to pack data efficiently.
-
-- materialized views
-
-Can be updated dynamically as the need be, help make read paths performant
-
-- async inserts
-
-- batch inserts
-
-- retention/TTLs
-
-We should only store data for a predetermined period of time, post which we either delete data, aggregate it or ship it to an archival store to reduce operational costs of having to store data for longer periods of time.
-
-- data aggregation/rollups
-
-- index granularity
-
-- skip indexes
-
-- `max_server_memory_usage_to_ram_ratio`
-
-### Data access via SQL
-
-While our corpus of data is PromQL-queryable, it would be prudent to make sure we make the SQL interface
-"generally available" as well. This capability opens up multiple possibilities to query resident data and
-allows our users to slice and dice their datasets whichever way they prefer to and/or need to.
-
-#### Challenges
-
-- Resource/cost profiling.
-- Query validation and sanitation.
-
-### Illustrative example(s) of data access
-
-### Writes
-
-On the write path, we first ensure registering a given set of labels to a unique `series_id` and/or re-using one should we have seen the timeseries already in the past. For example:
-
-```plaintext
-redis{region="us-east-1",'os':'Ubuntu15.10',...} <TIMESTAMP> <VALUE>
-```
-
-**Schema**: labels_to_series
-
-```sql
-SELECT *
-FROM labels_to_series_1
-WHERE series_id = '6d926ae8-c3c3-420e-a9e2-d91aff3ac125'
-FORMAT Vertical
-
-Query id: dcbc4bd8-0bdb-4c35-823a-3874096aab6e
-
-Row 1:
-──────
-labels: {'arch':'x64','service':'1','__name__':'redis','region':'us-east-1','os':'Ubuntu15.10','team':'LON','service_environment':'production','rack':'36','service_version':'0','measurement':'pubsub_patterns','hostname':'host_32','datacenter':'us-east-1a'}
-series_id: 6d926ae8-c3c3-420e-a9e2-d91aff3ac125
-
-1 row in set. Elapsed: 0.612 sec.
-```
-
-Post which, we register each metric point in the `samples` table attributing it to the corresponding `series_id`.
-
-**Schema**: samples
-
-```sql
-SELECT *
-FROM samples_1
-WHERE series_id = '6d926ae8-c3c3-420e-a9e2-d91aff3ac125'
-LIMIT 1
-FORMAT Vertical
-
-Query id: f3b410af-d831-4859-8828-31c89c0385b5
-
-Row 1:
-──────
-series_id: 6d926ae8-c3c3-420e-a9e2-d91aff3ac125
-timestamp: 2022-11-10 12:59:14.939
-value: 0
-```
-
-### Reads
-
-On the read path, we first query all timeseries identifiers by searching for the labels under consideration. Once we have all the `series_id`(s), we then look up all corresponding samples between the query start timestamp and end timestamp.
-
-For example:
-
-```plaintext
-kernel{service_environment=~"prod.*", measurement="boot_time"}
-```
-
-which gets translated into first looking for all related timeseries:
-
-```sql
-SELECT *
-FROM labels_to_series
-WHERE
-((labels['__name__']) = 'kernel') AND
-match(labels['service_environment'], 'prod.*') AND
-((labels['measurement']) = 'boot_time');
-```
-
-yielding a bunch of `series_id`(s) corresponding to the labels just looked up.
-
-**Sidenote**, this mostly-static dataset can also be cached and built up in-memory gradually to reduce paying the latency cost the second time, which should reduce the number of lookups considerably.
-
-To account for newer writes when maintaining this cache:
-
-- Have an out-of-band process/goroutine maintain this cache, so even if a few queries miss the most recent data, subsequent ones eventually catch up.
-
-- Have TTLs on the keys, jittered per key so as to rebuild them frequently enough to account for new writes.
-
-Once we know which timeseries we're querying for, from there, we can easily look up all samples via the following query:
-
-```sql
-SELECT *
-FROM samples
-WHERE series_id IN (
- 'a12544be-0a3a-4693-86b0-c61a4553aea3',
- 'abd42fc4-74c7-4d80-9b6c-12f673db375d',
- …
-)
-AND timestamp >= '1667546789'
-AND timestamp <= '1667633189'
-ORDER BY timestamp;
-```
-
-yielding all timeseries samples we were interested in.
-
-We then render these into an array of `prometheus.QueryResult` object(s) and return back to the caller as a `prometheus.ReadResponse` object.
-
-NOTE:
-The queries have been broken down into multiple queries only during our early experimentation/iteration, it'd be prudent to use subqueries within the same roundtrip to the database going forward into production/benchmarking.
-
-## Production Readiness
-
-### Batching
-
-Considering we'll need to batch data before ingesting large volumes of small writes into Clickhouse, the design must account for app-local persistence to allow it to locally batch incoming data before landing it into Clickhouse in batches of a predetermined size in order to increase performance and allow the table engine to continue to persist data successfully.
-
-We have considered the following alternatives to implement app-local batching:
-
-- In-memory - non durable
-- BadgerDB - durable, embedded, performant
-- Redis - trivial, external dependency
-- Kafka - non-trivial, external dependency but it can augment multiple other use-cases and help other problem domains at GitLab.
-
-**Note**: Similar challenges have also surfaced with the Clickhouse interactions the `errortracking` subsystem has in its current implementation. There have been multiple attempts to solve this problem domain in the past - [this MR](https://gitlab.com/gitlab-org/opstrace/opstrace/-/merge_requests/1660) implemented an in-memory alternative while [this one](https://gitlab.com/gitlab-org/opstrace/opstrace/-/merge_requests/1767) attempted an on-disk alternative.
-
-Any work done in this area of concern would also benefit other subsystems such as errortracking, logging, etc.
-
-### Scalability
-
-We intend to start testing the proposed implementation with 10K metric-points per second to test/establish our initial hypothesis, though ideally, we must design the underlying backend for 1M points ingested per second.
-
-### Benchmarking
-
-We propose the following three dimensions be tested while benchmarking the proposed implementation:
-
-- Data ingest performance
-- On-disk storage requirements (accounting for replication if applicable)
-- Mean query response times
-
-For understanding performance, we'll need to first compile a list of such queries given the data we ingest for our tests. Clickhouse query logging is super helpful while doing this.
-
-NOTE:
-Ideally, we aim to benchmark the system to be able to ingest >1M metric points/sec while consistently serving most queries under <1 sec.
-
-### Past work & references
-
-- [Benchmark ClickHouse for metrics](https://gitlab.com/gitlab-org/opstrace/opstrace/-/issues/1666)
-- [Incubation:APM ClickHouse evaluation](https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/4)
-- [Incubation:APM ClickHouse metrics schema](https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/10)
-- [Our research around TimescaleDB](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/14137)
-- [Current Workload on our Thanos-based setup](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15420#current-workload)
-- [Scaling-200m-series](https://opstrace.com/blog/scaling-200m-series)
-
-### Cost-estimation
-
-- We aim to make sure the system's not too expensive, especially given our biggest footprint is on Clickhouse and the underlying storage.
-
-- We must consider the usage of multiple storage medium(s), especially:
- - Tiered storage
- - Object storage
-
-### Tooling
-
-- We aim to build visibility into high-cardinality metrics to be able to assist with keeping our databases healthy by pruning/dropping unused metrics.
-
-- Similarly, we aim to develop the ability to see unused metrics for the end-user, which can be easily & dynamically built into the system by parsing all read requests and building usage statistics.
-
-- We aim to add monitoring for per-metric scrape frequencies to make sure the end-user is not ingesting data at a volume they do not need and/or find useful.
-
-## Looking forward
-
-### Linkage across telemetry pillars, exemplars
-
-We must build the metrics system in a way to be able to cross-reference ingested data with other telemetry pillars, such as traces, logs and errors, so as to provide a more holistic view of all instrumentation a system sends our way.
-
-### User-defined SQL queries to aggregate data and/or generate materialized views
-
-We should allow users of the system to be able to run user-defined, ad-hoc queries similar to how Prometheus recording rules help generate custom metrics from existing ones.
-
-### Write Ahead Logs (WALs)
-
-We believe that should we feel the need to start buffering data local to the ingestion application and/or move away from Clickhouse for persisting data, on-disk WALs would be a good direction to proceed into given their prevalent usage among other monitoring systems.
-
-### Custom DSLs or query builders
-
-Using PromQL directly could be a steep learning curve for users. It would be really nice to have a query builder (as is common in Grafana) to allow building of the typical queries you'd expect to run and to allow exploration of the available metrics. It also serves as a way to learn the DSL, so more complex queries can be created later.
-
-## Roadmap & Next Steps
-
-The following section enlists how we intend to implement the aforementioned proposal around building Metrics support into GitLab Observability Service. Each corresponding document and/or issue contains further details of how each next step is planned to be executed.
-
-- **DONE** [Research & draft design proposal and/or requirements](https://docs.google.com/document/d/1kHyIoWEcs14sh3CGfKGiI8QbCsdfIHeYkzVstenpsdE/edit?usp=sharing)
-- **IN-PROGRESS** [Submit system/schema designs (proposal) & gather feedback](https://docs.google.com/document/d/1kHyIoWEcs14sh3CGfKGiI8QbCsdfIHeYkzVstenpsdE/edit?usp=sharing)
-- **IN-PROGRESS** [Develop table definitions and/or storage interfaces](https://gitlab.com/gitlab-org/opstrace/opstrace/-/issues/1666)
-- **IN-PROGRESS** [Prototype reference implementation, instrument key metrics](https://gitlab.com/gitlab-org/opstrace/opstrace/-/merge_requests/1823)
-- [Benchmark Clickhouse and/or proposed schemas, gather expert advice from Clickhouse Inc.](https://gitlab.com/gitlab-org/opstrace/opstrace/-/issues/1666)
-- Develop write path(s) - `remote_write` API
-- Develop read path(s) - `remote_read` API, `PromQL`-based querier.
-- Setup testbed(s) for repeatable benchmarking/testing
-- Schema design and/or application server improvements if needed
-- Production Readiness v1.0-alpha/beta
-- Implement vanguarded/staged rollouts
-- Run extended alpha/beta testing
-- Release v1.0
diff --git a/doc/architecture/blueprints/gitlab_observability_backend/supported-deployments.png b/doc/architecture/blueprints/gitlab_observability_backend/supported-deployments.png
deleted file mode 100644
index 9dccc515129..00000000000
--- a/doc/architecture/blueprints/gitlab_observability_backend/supported-deployments.png
+++ /dev/null
Binary files differ
diff --git a/doc/architecture/blueprints/gitlab_services/img/architecture.png b/doc/architecture/blueprints/gitlab_services/img/architecture.png
new file mode 100644
index 00000000000..8ec0852e12b
--- /dev/null
+++ b/doc/architecture/blueprints/gitlab_services/img/architecture.png
Binary files differ
diff --git a/doc/architecture/blueprints/gitlab_services/index.md b/doc/architecture/blueprints/gitlab_services/index.md
new file mode 100644
index 00000000000..c2f1d08a984
--- /dev/null
+++ b/doc/architecture/blueprints/gitlab_services/index.md
@@ -0,0 +1,129 @@
+---
+status: proposed
+creation-date: "2023-08-18"
+authors: [ "@nagyv-gitlab" ]
+coach: "@grzesiek"
+approvers: [ "@shinya.maeda", "@emilybauman" ]
+owning-stage: "~devops::deploy"
+participating-stages: ["~devops::deploy", "~devops::analyze"]
+---
+
+# Services
+
+## Summary
+
+To orthogonally capture modern service-oriented deployment and environment management,
+GitLab needs [services](https://about.gitlab.com/direction/delivery/glossary.html#service) as a first-class concept.
+This blueprint outlines how the service and related entities should be built in the GitLab CD solution.
+
+## Motivation
+
+As GitLab works towards providing a single platform for the whole DevSecOps cycle,
+its offering should not stop at pipelines, but should include deployment and release management, as well as
+observability of user-developed and third-party applications.
+
+While GitLab offers some concepts, like the `environment` syntax in GitLab pipelines,
+it does not offer any concept of what is running in a given environment. While the environment might answer "where"
+something is running, it does not answer "what" is running there. We should
+introduce [service](https://about.gitlab.com/direction/delivery/glossary.html#service) and [release artifact](https://about.gitlab.com/direction/delivery/glossary.html#release) to answer this question. The [Delivery glossary](https://about.gitlab.com/direction/delivery/glossary.html#service) defines
+a service as
+
+> a logical concept that is a (mostly) independently deployable part of an application that is loosely coupled with other services to serve specific functionalities for the application.
+
+A service would connect to the SCM, registry or issues through release artifacts and would be a focused view into the [environments](https://about.gitlab.com/direction/delivery/glossary.html#environment) where
+a specific version of the given release artifact is deployed (or being deployed).
+
+Having a concept of services allows our users to track their applications in production, not only in CI/CD pipelines. This opens up possibilities like cost management.
+The current work in [Analyze:Observability](https://about.gitlab.com/handbook/product/categories/#observability-group) could be integrated into GitLab once GitLab supports services.
+
+### Goals
+
+- Services are defined at the project level.
+- A single project can hold multiple services.
+- We should be able to list services at group and organization levels. We should make sure that our architecture is ready to support group-level features from day one. "Group-level environment views" are a feature customers have been requesting for many years.
+- Services are tied to environments. Every service might be present in multiple environments and every environment might host multiple services. Not every service is expected to be present in every environment and no environment is expected to host all the services.
+- A service is a logical concept that groups several resources.
+- A service is typically deployed independently of other services. A service is typically deployed in its entirety.
+ - Deployments in the user interviews happened using CI via `Helm`, `helmfiles` or Flux and a `HelmRelease`.
+ - Even in Kubernetes, there might be other tools (Kustomize, vanilla manifests) to deploy a service.
+ - Outside of Kubernetes other tools might be used. e.g. Runway deploys using Terraform.
+- We want to connect a [deployment](https://about.gitlab.com/direction/delivery/glossary.html#deployment) of a service to the MRs, containers, packages, linter results included in the [release artifact](https://about.gitlab.com/direction/delivery/glossary.html#release).
+- A service contains a set of links to external (or internal) pages.
+
+![architecture diagram](img/architecture.png)
+
+[src of the architecture diagram](https://docs.google.com/drawings/d/1TJinpfqc48jXZEw7rxe6mB-8AwDOW7o58wTAB_ljSNM/edit?usp=sharing)
+
+(The dotted border for Deployment represents a projection to the Target Infrastructure)
+
+### Non-Goals
+
+- Metrics related to a service should be customizable and configurable by project maintainers (developers?). Metrics might differ from service to service both in the query and in the meaning. (e.g. traffic does not make sense for a queue).
+- Metrics should integrate with various external tools, like OpenTelemetry/Prometheus, Datadog, etc.
+- We don't want to tackle GitLab observability solution built by [Analyze:Observability](https://about.gitlab.com/handbook/product/categories/#observability-group). The proposal here should treat it as one observability integration backend.
+- We don't want to cover alerting, SLOs, SLAs and incident management.
+- Some infrastructures might already have better support within GitLab than others (Kubernetes is supported better than pure AWS). There is no need to discuss functionalities that we provide or plan to provide for Kubernetes and how to achieve feature parity with other infrastructures.
+- Services can be filtered by metadata (e.g. tenant, region). These could vary by customer or even by group.
+
+## Proposal
+
+Introduce a Service model. This is a shallow model that contains the following parameters:
+
+- **Name**: The name of the service (e.g. `Awesome API`)
+- **Description**: Markdown field. It can contain links to external (or internal) pages.
+- (TBD) **Metadata**: User-defined key-value pairs to label the service, which can later be used for filtering at the group or project level.
+ - Example fields:
+ - `Tenant: northwest`
+ - `Component: Redis`
+ - `Region: us-east-1`
+- (TBD) **Deployment sequence**: To allow the promotion from dev to staging to production.
+- (TBD) **Environment variables specific to services**: Like variables within an environment, variables should be definable for services as well.
+
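+To make the shape of this model concrete, a single service record could look roughly like the following sketch. This is illustrative only, assuming the (TBD) fields are kept; none of the field names, values, or the YAML representation itself are final.
+
+```yaml
+# Hypothetical representation of one Service record (field names and structure are not final)
+service:
+  name: Awesome API
+  # Markdown description; the runbook link is a placeholder
+  description: |
+    Public REST API of the product.
+    Runbook: https://example.com/runbooks/awesome-api
+  metadata:
+    tenant: northwest
+    component: Redis
+    region: us-east-1
+  deployment_sequence: [dev, staging, production]
+  variables:
+    LOG_LEVEL: info
+```
+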
+### DORA metrics
+
+Users can observe DORA metrics through Services:
+
+- Today, deployment frequency counts deployments to environments with `environment_tier=production`, or deployments whose job name is `prod` or `production`.
+- This should be clear to end users. It can be a convention, like restricting a pipeline to a single `environment_tier=production` job, or counting only the first `environment_tier=production` job per environment. To be defined later.
+
+### Aggregate environments and services at group level
+
+At the group level, GitLab fetches all of the project-level environments under the specific group
+and groups them by environment **name**. For example:
+
+| | Frontend service | Backend service |
+| ------ | ------ | ------ |
+| dev | Release Artifact v0.x | |
+| development | Release Artifact v0.y | |
+| production | Release Artifact v0.z | Release Artifact v1.x |
+
+### Entity relationships
+
+- Service and Environment have a many-to-many relationship.
+- Deployment and Release Artifact have a many-to-one relationship (for a specific version of the artifact). Focusing on a single environment, Deployment and Release Artifact have a one-to-one relationship.
+- Environment and Deployment have a one-to-many relationship. This allows showing a deployment history (past and running, no outstanding; roll-out status might be included) by Environment.
+- Environment and Release Artifact have a many-to-many relationship through Deployment.
+- Service and Release Artifact have a many-to-many relationship. This allows showing a history of releases (past, running and outstanding) by service.
+- Release Artifact and Artifact have a one-to-many relationship (e.g. chart as artifact => value as artifact => image as artifact).
+
+```mermaid
+classDiagram
+    Group "1" o-- "*" Project : There may be multiple projects with services in a group
+    Project "1" <.. "*" Service : A service is part of a project
+    Project "1" <.. "*" Environment : An environment is part of a project
+    Environment "*" .. "*" Service : A service is linked to 1+ environments
+    Service "1" <|-- "*" ReleaseArtifact : A release artifact packages a specific version of a service
+    ReleaseArtifact "1" <|-- "*" Deployment : A release artifact can be deployed
+    Deployment "1" --|> "1" Environment : Every deployment lives in a specific environment
+```
+
+See [Glossary](https://about.gitlab.com/direction/delivery/glossary.html) for more information.
+
+**Discussion:** It's TBD whether we should reuse existing entities such as the [`Deployment`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/deployment.rb) and [`Environment`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/environment.rb) models. Reusing the existing entities could limit us in the long run; however, users should be able to adopt the new architecture seamlessly without drastically changing their existing CI/CD workflows. This decision should be made when we have a clearer answer on the ideal structure and behavior of the entities, so that we can understand how far the existing entities are from it and how feasible a migration would be.
+
+## Alternative Solutions
+
+- [Add dynamically populated organization-level environments page](https://gitlab.com/gitlab-org/gitlab/-/issues/241506).
+  This approach was concluded as a no-go in favor of the Service concept.
+- There is an alternative proposal to introduce [Group Environment entity](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/129696#note_1557477581) for [Group-level environment views](#aggregate-environments-and-services-at-group-level).
+ \ No newline at end of file
diff --git a/doc/architecture/blueprints/gitlab_steps/index.md b/doc/architecture/blueprints/gitlab_steps/index.md
index d7878445cd0..74c9ba1498d 100644
--- a/doc/architecture/blueprints/gitlab_steps/index.md
+++ b/doc/architecture/blueprints/gitlab_steps/index.md
@@ -3,7 +3,7 @@ status: proposed
creation-date: "2023-08-23"
authors: [ "@ayufan" ]
coach: "@grzegorz"
-approvers: [ "@dhershkovitch", "@DarrenEastman", "@marknuzzo", "@nicolewilliams" ]
+approvers: [ "@dhershkovitch", "@DarrenEastman", "@cheryl.li" ]
owning-stage: "~devops::verify"
participating-stages: [ ]
---
@@ -15,7 +15,7 @@ participating-stages: [ ]
This document describes architecture of a new component called Step Runner, the GitLab Steps syntax it uses,
and how the GitHub Actions support will be achieved.
-The competitive CI products [drone.io](https://drone.io),
+The competitive CI products [drone.io](https://drone.io/),
[GitHub Actions](https://docs.github.com/en/actions/creating-actions)
have a composable CI jobs execution in form of steps, or actions.
@@ -139,4 +139,4 @@ TBD
## References
-- [GitLab Issue #215511](https://gitlab.com/gitlab-org/gitlab/-/issues/215511)
+- [GitLab Epic 11535](https://gitlab.com/groups/gitlab-org/-/epics/11535)
diff --git a/doc/architecture/blueprints/google_artifact_registry_integration/backend.md b/doc/architecture/blueprints/google_artifact_registry_integration/backend.md
new file mode 100644
index 00000000000..8213e3ede32
--- /dev/null
+++ b/doc/architecture/blueprints/google_artifact_registry_integration/backend.md
@@ -0,0 +1,131 @@
+---
+stage: Package
+group: Container Registry
+description: 'Backend changes for Google Artifact Registry Integration'
+---
+
+# Backend changes for Google Artifact Registry Integration
+
+## Client SDK
+
+To interact with GAR we will make use of the official GAR [Ruby client SDK](https://cloud.google.com/ruby/docs/reference/google-cloud-artifact_registry/latest).
+By default, this client will use the [RPC](https://cloud.google.com/artifact-registry/docs/reference/rpc) version of the Artifact Registry API.
+
+To build the client, we will need the [service account key](index.md#authentication).
+
+### Interesting functions
+
+For the scope of this blueprint, we will need to use the following functions from the Ruby client:
+
+- [`#get_repository`](https://github.com/googleapis/google-cloud-ruby/blob/d0ce758a03335b60285a3d2783e4cca7089ee2ea/google-cloud-artifact_registry-v1/lib/google/cloud/artifact_registry/v1/artifact_registry/client.rb#L1244). [API documentation](https://cloud.google.com/artifact-registry/docs/reference/rpc/google.devtools.artifactregistry.v1#getrepositoryrequest). This will return a single [`Repository`](https://cloud.google.com/artifact-registry/docs/reference/rpc/google.devtools.artifactregistry.v1#repository).
+- [`#list_docker_images`](https://github.com/googleapis/google-cloud-ruby/blob/d0ce758a03335b60285a3d2783e4cca7089ee2ea/google-cloud-artifact_registry-v1/lib/google/cloud/artifact_registry/v1/artifact_registry/client.rb#L243). [API documentation](https://cloud.google.com/artifact-registry/docs/reference/rpc/google.devtools.artifactregistry.v1#listdockerimagesrequest). This will return a list of [`DockerImage`](https://cloud.google.com/artifact-registry/docs/reference/rpc/google.devtools.artifactregistry.v1#dockerimage).
+- [`#get_docker_image`](https://github.com/googleapis/google-cloud-ruby/blob/d0ce758a03335b60285a3d2783e4cca7089ee2ea/google-cloud-artifact_registry-v1/lib/google/cloud/artifact_registry/v1/artifact_registry/client.rb#L329). [API documentation](https://cloud.google.com/artifact-registry/docs/reference/rpc/google.devtools.artifactregistry.v1#getdockerimagerequest). This will return a single [`DockerImage`](https://cloud.google.com/artifact-registry/docs/reference/rpc/google.devtools.artifactregistry.v1#dockerimage).
+
+### Limitations
+
+Filtering is not available in `#list_docker_images`. In other words, we can't filter the returned list (for example on a specific name). However, ordering on some columns is available.
+
+In addition, we can't point directly to a specific page, for example, directly accessing page 3 of the list of Docker images without first going through pages 1 and 2.
+We can't build this feature on the GitLab side because it would require walking through all pages, and we could hit a situation where we need to go through a very large number of pages.
+
+### Exposing the client
+
+It would be better to centralize access to the official Ruby client. This way, it is easy to check permissions in a single place.
+
+We suggest having a custom client class located in `Integrations::GoogleCloudPlatform::ArtifactRegistry::Client`. That class will need to require a `User` and an `Integrations::GoogleCloudPlatform::ArtifactRegistry` (see [Project Integration](#project-integration)).
+
+The client will then need to expose three functions: `#repository`, `#docker_images` and `#docker_image`, which map to the similarly named functions of the official client.
+
+Before calling the official client, this class will need to check the user permissions. The given `User` should have `read_gcp_artifact_registry_repository` on the `Project` related to the `Integrations::GoogleCloudPlatform::ArtifactRegistry`.
+
+Lastly, to set up the official client, we will need to properly set:
+
+- the [timeout](https://github.com/googleapis/google-cloud-ruby/blob/a64ed1de61a6f1b5752e7c8e01d6a79365e6de67/google-cloud-artifact_registry-v1/lib/google/cloud/artifact_registry/v1/artifact_registry/operations.rb#L646).
+- the [retry_policy](https://github.com/googleapis/google-cloud-ruby/blob/a64ed1de61a6f1b5752e7c8e01d6a79365e6de67/google-cloud-artifact_registry-v1/lib/google/cloud/artifact_registry/v1/artifact_registry/operations.rb#L652).
+
+For these, we can either use the default values, if they are acceptable, or use fixed values.
+
+## New permission
+
+We will need a new permission on the [Project policy](https://gitlab.com/gitlab-org/gitlab/-/blob/1411076f1c8ec80dd32f5da7518f795014ea5a2b/app/policies/project_policy.rb):
+
+- `read_gcp_artifact_registry_repository` granted to at least reporter users.
+
+## Project Integration
+
+We will need to build a new [project integration](../../../development/integrations/index.md) with the following properties:
+
+- `google_project_id` - the Google project ID. A simple string.
+- `google_location` - the Google location. A simple string.
+- `repositories` - an array of repository names (see below).
+- `json_key` - the service account JSON. A string but displayed as a text area.
+- `json_key_base64` - the service account JSON, encoded with base64. Value set from `json_key`.
+
+We will also have derived properties:
+
+- `repository` - the repository name. Derived from `repositories`.
+
+`repositories` is used to store the repository name in an array. This is to help with a future follow-up where multiple repositories will need to be supported. As such, we store the repository name in an array and create a `repository` property that is the first entry of the array. By having a single `repository` property, we can use the [frontend helpers](../../../development/integrations/index.md#customize-the-frontend-form), as array values are not supported in project integrations.
+
+We also need the base64 version of the `json_key`. This is required for the [`CI/CD variables`](#cicd-variables).
+
+Regarding the class name, we suggest using `Integrations::GoogleCloudPlatform::ArtifactRegistry`. The `Integrations::GoogleCloudPlatform` namespace allows for possible future integrations with other Google Cloud Platform services.
+
+Regarding the [configuration test](../../../development/integrations/index.md#define-configuration-test), we need to get the repository info from the official API (method `#get_repository`). The test is successful if and only if the call is successful and the returned repository has the format `DOCKER`.
+
+## GraphQL APIs
+
+The [UI](ui_ux.md) will have two pages: listing the Docker images of the repository configured in the project integration, and showing the details of a given Docker image.
+
+To support the other repository formats in follow-ups, we choose not to map the official client function names directly to GraphQL fields or methods, but rather to have a more reusable approach.
+
+All GraphQL changes should be marked as [`alpha`](../../../development/api_graphql_styleguide.md#mark-schema-items-as-alpha).
+
+First, on the [`ProjectType`](../../../api/graphql/reference/index.md#project), we will need a new field `google_cloud_platform_artifact_registry_repository_artifacts`. This will return a list of a new [abstract](../../../api/graphql/reference/index.md#abstract-types) type: `Integrations::GoogleCloudPlatform::ArtifactRegistry::ArtifactType`. This list will have pagination support. Ordering options will be available.
+
+We will have `Integrations::GoogleCloudPlatform::ArtifactRegistry::DockerImage` as a concrete type of `Integrations::GoogleCloudPlatform::ArtifactRegistry::ArtifactType` with the following fields:
+
+- `name`. A string.
+- `uri`. A string.
+- `image_size_bytes`. An integer.
+- `upload_time`. A timestamp.
+
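+For illustration, a rough sketch of the concrete type following the GraphQL conventions used in the GitLab codebase (base class, placement, and field options are assumptions; the wiring to the abstract `ArtifactType` is omitted):
+
+```ruby
+module Types
+  module GoogleCloudPlatform
+    module ArtifactRegistry
+      class DockerImageType < BaseObject
+        graphql_name 'GoogleCloudPlatformArtifactRegistryDockerImage'
+        authorize :read_gcp_artifact_registry_repository
+
+        field :name, GraphQL::Types::String, null: false,
+          description: 'Name of the Docker image.'
+        field :uri, GraphQL::Types::String, null: false,
+          description: 'URI to access the Docker image.'
+        field :image_size_bytes, GraphQL::Types::BigInt, null: true,
+          description: 'Size of the Docker image in bytes.'
+        field :upload_time, Types::TimeType, null: true,
+          description: 'Time when the Docker image was uploaded.'
+      end
+    end
+  end
+end
+```
+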
+Then, we will need a new query `Query.google_cloud_platform_registry_registry_artifact_details` that, given the name of an `Integrations::GoogleCloudPlatform::ArtifactRegistry::DockerImage`, will return a single `Integrations::GoogleCloudPlatform::ArtifactRegistry::ArtifactDetailsType` with the following fields:
+
+- all fields of `Integrations::GoogleCloudPlatform::ArtifactRegistry::ArtifactType`.
+- `tags`. An array of strings.
+- `media_type`. A string.
+- `build_time`. A timestamp.
+- `updated_time`. A timestamp.
+
+All GraphQL changes will require users to have the [`read_gcp_artifact_registry_repository` permission](#new-permission).
+
+## CI/CD variables
+
+Similar to the [Harbor](../../../user/project/integrations/harbor.md#configure-gitlab) integration, once a user activates the GAR integration, additional CI/CD variables are automatically made available. These will be set according to the requirements described in the [documentation](https://cloud.google.com/artifact-registry/docs/docker/authentication#json-key):
+
+- `GCP_ARTIFACT_REGISTRY_URL`: This will be set to `https://LOCATION-docker.pkg.dev`, where `LOCATION` is the GCP project location configured for the integration.
+- `GCP_ARTIFACT_REGISTRY_PROJECT_URI`: This will be set to `LOCATION-docker.pkg.dev/PROJECT-ID`. `PROJECT-ID` is the GCP project ID of the GAR repository configured for the integration.
+- `GCP_ARTIFACT_REGISTRY_PASSWORD`: This will be set to the base64-encoded version of the service account JSON key file configured for the integration.
+- `GCP_ARTIFACT_REGISTRY_USER`: This will be set to `_json_key_base64`.
+
+These can then be used to log in using `docker login`:
+
+```shell
+docker login -u $GCP_ARTIFACT_REGISTRY_USER -p $GCP_ARTIFACT_REGISTRY_PASSWORD $GCP_ARTIFACT_REGISTRY_URL
+```
+
+Similarly, these can be used to download images from the repository with `docker pull`:
+
+```shell
+docker pull $GCP_ARTIFACT_REGISTRY_PROJECT_URI/REPOSITORY/myapp:latest
+```
+
+Finally, provided that the configured service account has the `Artifact Registry Writer` role, one can also push images to GAR:
+
+```shell
+docker build -t $GCP_ARTIFACT_REGISTRY_PROJECT_URI/REPOSITORY/myapp:latest .
+docker push $GCP_ARTIFACT_REGISTRY_PROJECT_URI/REPOSITORY/myapp:latest
+```
+
+For forward compatibility reasons, the repository name (`REPOSITORY` in the command above) must be appended to `GCP_ARTIFACT_REGISTRY_PROJECT_URI` by the user. In the first iteration we will only support a single GAR repository, and therefore we could technically provide a variable like `GCP_ARTIFACT_REGISTRY_REPOSITORY_URI` with the repository name already included. However, once we add support for multiple repositories, there is no way we can tell what repository a user will want to target for a specific instruction.
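+
+Following the pattern used by the Harbor integration, these variables could be exposed from a `ci_variables` method on the integration class sketched in the [Project Integration](#project-integration) section (a sketch only; attribute names and variable options are assumptions):
+
+```ruby
+def ci_variables
+  return [] unless activated?
+
+  [
+    { key: 'GCP_ARTIFACT_REGISTRY_URL', value: "https://#{google_location}-docker.pkg.dev" },
+    { key: 'GCP_ARTIFACT_REGISTRY_PROJECT_URI', value: "#{google_location}-docker.pkg.dev/#{google_project_id}" },
+    { key: 'GCP_ARTIFACT_REGISTRY_USER', value: '_json_key_base64' },
+    # Masked and non-public because it carries the service account credentials.
+    { key: 'GCP_ARTIFACT_REGISTRY_PASSWORD', value: json_key_base64, public: false, masked: true }
+  ]
+end
+```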
diff --git a/doc/architecture/blueprints/google_artifact_registry_integration/index.md b/doc/architecture/blueprints/google_artifact_registry_integration/index.md
index adde0f7f587..4c2bfe95c5e 100644
--- a/doc/architecture/blueprints/google_artifact_registry_integration/index.md
+++ b/doc/architecture/blueprints/google_artifact_registry_integration/index.md
@@ -88,49 +88,11 @@ Among the proprietary GAR APIs, the [REST API](https://cloud.google.com/artifact
Last but not least, there is also an [RPC API](https://cloud.google.com/artifact-registry/docs/reference/rpc/google.devtools.artifactregistry.v1), backed by gRPC and Protocol Buffers. This API provides the most functionality, covering all GAR features. From the available operations, we can make use of the [`ListDockerImagesRequest`](https://cloud.google.com/artifact-registry/docs/reference/rpc/google.devtools.artifactregistry.v1#listdockerimagesrequest) and [`GetDockerImageRequest`](https://cloud.google.com/artifact-registry/docs/reference/rpc/google.devtools.artifactregistry.v1#google.devtools.artifactregistry.v1.GetDockerImageRequest) operations. As with the REST API, both responses are composed of [`DockerImage`](https://cloud.google.com/artifact-registry/docs/reference/rpc/google.devtools.artifactregistry.v1#google.devtools.artifactregistry.v1.DockerImage) objects.
-Between the two proprietary API options, we chose the RPC one because it provides support not only for the operations we need today but also offers better coverage of all GAR features, which will be beneficial in future iterations. Finally, we do not intend to make direct use of this API but rather use it through the official Ruby client SDK. Please see [Client SDK](#client-sdk) below for more details.
+Between the two proprietary API options, we chose the RPC one because it provides support not only for the operations we need today but also offers better coverage of all GAR features, which will be beneficial in future iterations. Finally, we do not intend to make direct use of this API but rather use it through the official Ruby client SDK. Please see [Client SDK](backend.md#client-sdk) on the backend page for more details.
#### Backend Integration
-##### Client SDK
-
-To interact with GAR we will make use of the official GAR [Ruby client SDK](https://cloud.google.com/ruby/docs/reference/google-cloud-artifact_registry/latest).
-
-*TODO: Add more details about the client SDK integration and its limitations (no filtering for example).*
-
-##### Database Changes
-
-*TODO: Describe any necessary changes to the database to support this integration.*
-
-##### CI/CD variables
-
-Similar to the [Harbor](../../../user/project/integrations/harbor.md#configure-gitlab) integration, once users activates the GAR integration, additional CI/CD variables will be automatically available if the integration is enabled. These will be set according to the requirements described in the [documentation](https://cloud.google.com/artifact-registry/docs/docker/authentication#json-key):
-
-- `GCP_ARTIFACT_REGISTRY_URL`: This will be set to `https://LOCATION-docker.pkg.dev`, where `LOCATION` is the GCP project location configured for the integration.
-- `GCP_ARTIFACT_REGISTRY_PROJECT_URI`: This will be set to `LOCATION-docker.pkg.dev/PROJECT-ID`. `PROJECT-ID` is the GCP project ID of the GAR repository configured for the integration.
-- `GCP_ARTIFACT_REGISTRY_PASSWORD`: This will be set to the base64-encode version of the service account JSON key file configured for the integration.
-- `GCP_ARTIFACT_REGISTRY_USER`: This will be set to `_json_key_base64`.
-
-These can then be used to log in using `docker login`:
-
-```shell
-docker login -u $GCP_ARTIFACT_REGISTRY_USER -p $GCP_ARTIFACT_REGISTRY_PASSWORD $GCP_ARTIFACT_REGISTRY_URL
-```
-
-Similarly, these can be used to download images from the repository with `docker pull`:
-
-```shell
-docker pull $GCP_ARTIFACT_REGISTRY_PROJECT_URI/REPOSITORY/myapp:latest
-```
-
-Finally, provided that the configured service account has the `Artifact Registry Writer` role, one can also push images to GAR:
-
-```shell
-docker build -t $GCP_ARTIFACT_REGISTRY_REPOSITORY_URI/myapp:latest .
-docker push $GCP_ARTIFACT_REGISTRY_REPOSITORY_URI/myapp:latest
-```
-
-For forward compatibility reasons, the repository name (`REPOSITORY` in the command above) must be appended to `GCP_ARTIFACT_REGISTRY_PROJECT_URI` by the user. In the first iteration we will only support a single GAR repository, and therefore we could technically provide an e.g. `GCP_ARTIFACT_REGISTRY_REPOSITORY_URI` variable with the repository name already included. However, once we add support for multiple repositories, there is no way we can tell what repository a user will want to target for a specific instruction. So it must be the user to tell that.
+This integration will require several changes on the backend side of the Rails project. See the [backend](backend.md) page for additional details.
#### UI/UX
diff --git a/doc/architecture/blueprints/modular_monolith/hexagonal_monolith/index.md b/doc/architecture/blueprints/modular_monolith/hexagonal_monolith/index.md
index f0f689d48ca..f8003a3dd56 100644
--- a/doc/architecture/blueprints/modular_monolith/hexagonal_monolith/index.md
+++ b/doc/architecture/blueprints/modular_monolith/hexagonal_monolith/index.md
@@ -12,7 +12,7 @@ owning-stage: ""
## Summary
**TL;DR:** Change the Rails monolith from a [big ball of mud](https://en.wikipedia.org/wiki/Big_ball_of_mud) state to
-a [modular monolith](https://www.thereformedprogrammer.net/my-experience-of-using-modular-monolith-and-ddd-architectures)
+a [modular monolith](https://www.thereformedprogrammer.net/my-experience-of-using-modular-monolith-and-ddd-architectures/)
that uses an [Hexagonal architecture](https://en.wikipedia.org/wiki/Hexagonal_architecture_(software)) (or ports and adapters architecture).
Extract cohesive functional domains into separate directory structure using Domain-Driven Design practices.
Extract infrastructure code (logging, database tools, instrumentation, etc.) into gems, essentially remove the need for `lib/` directory.
diff --git a/doc/architecture/blueprints/modular_monolith/index.md b/doc/architecture/blueprints/modular_monolith/index.md
index f1e6c119552..e8de9195d86 100644
--- a/doc/architecture/blueprints/modular_monolith/index.md
+++ b/doc/architecture/blueprints/modular_monolith/index.md
@@ -95,7 +95,7 @@ add more important details as we move forward towards the goal:
1. [Deliver modularization proof-of-concepts that will deliver key insights](proof_of_concepts.md).
1. Align modularization plans to the organizational structure by [defining bounded contexts](bounded_contexts.md).
-1. Separate domains into modules that will reflect organizational structure (TODO)
+1. [Separate domains into modules](packages_extraction.md) that will reflect organizational structure.
1. Start a training program for team members on how to work with decoupled domains (TODO)
1. Build tools that will make it easier to build decoupled domains through inversion of control (TODO)
1. [Introduce hexagonal architecture within the monolith](hexagonal_monolith/index.md)
diff --git a/doc/architecture/blueprints/modular_monolith/packages_extraction.md b/doc/architecture/blueprints/modular_monolith/packages_extraction.md
new file mode 100644
index 00000000000..2b9a64e0631
--- /dev/null
+++ b/doc/architecture/blueprints/modular_monolith/packages_extraction.md
@@ -0,0 +1,52 @@
+---
+status: proposed
+creation-date: "2023-09-29"
+authors: [ "@fabiopitino" ]
+coach: [ ]
+approvers: [ ]
+owning-stage: ""
+---
+
+# Convert domain module into packages
+
+The general steps for refactoring existing code towards modularization could be:
+
+1. Use the same namespace for all classes and modules related to the same [bounded context](bounded_contexts.md).
+
+ - **Why?** Without even a rough understanding of the domains at play in the codebase it is difficult to draw a plan.
+ Having well namespaced code that everyone else can follow is also the pre-requisite for modularization.
+ - If a domain is already well namespaced and no similar or related namespaces exist, we can move directly to the
+ next step.
+1. Prepare Rails development for Packwerk packages. This is a **one-off step** with maybe some improvements
+ added over time.
+
+  - We will make the Rails autoloader work with Packwerk's directory structure, as demonstrated in
+ [this PoC](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/129254/diffs#note_1512982957).
+ - We will have [Danger-Packwerk](https://github.com/rubyatscale/danger-packwerk) running in CI for merge requests.
+  - We will possibly have the Packwerk check running in Lefthook on pre-commit or pre-push.
+1. Move files into a Packwerk package.
+
+  - This should consist of creating a Packwerk package and iteratively moving files into the package.
+ - Constants are auto-loaded correctly whether they are in `app/` or `lib/` inside a Packwerk package.
+ - This is a phase where the domain code will be split between the package directory and the Rails directory structure.
+ **We must move quickly here**.
+1. Enforce namespace boundaries by requiring packages declare their [dependencies explicitly](https://github.com/Shopify/packwerk/blob/main/USAGE.md#enforcing-dependency-boundary)
+ and only depend on other packages' [public interface](https://github.com/rubyatscale/packwerk-extensions#privacy-checker).
+
+ - **Why?** Up until now all constants would be public since we have not enforced privacy. By moving existing files
+ into packages without enforcing boundaries we can focus on wrapping a namespace in a package without being distracted
+    by Packwerk privacy violations. By enforcing privacy afterwards we gain an understanding of coupling between various
+ constants and domains.
+ - This way we know what constants need to be made public (as they are used by other packages) and what can
+ remain private (taking the benefit of encapsulation). We will use Packwerk's recorded violations (like Rubocop TODOs)
+ to refactor the code over time.
+  - We can update the dependency graph to see where it fits in the overall architecture.
+1. Work off Packwerk's recorded violations to make refactorings. **This is a long term phase** that the DRIs of the
+ domain need to nurture over time. We will use Packwerk failures and the dependency diagram to influence the modular design.
+
+  - Revisit whether a class should be private instead of public, and create a better interface.
+  - Move constants to a different package if they are too coupled to it.
+ - Join packages if they are too coupled to each other.
+
+Once we have Packwerk configured for the Rails application (step 2 above), emerging domains could be directly implemented
+as Packwerk packages, benefiting from isolation and a clear interface immediately.
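+
+For illustration, a hypothetical package with one public and one private constant could look like this (the package name, paths, and classes are made up for the example):
+
+```ruby
+# Hypothetical layout:
+#
+#   packages/ci/package.yml                         # enforce_dependencies / enforce_privacy
+#   packages/ci/app/public/ci/pipeline_creator.rb   # public constant, callable by other packages
+#   packages/ci/app/services/ci/build_factory.rb    # private constant, internal to the package
+#
+# Once the privacy checker is enabled, only constants under `app/public` stay
+# visible to other packages; outside references to anything else are recorded
+# as violations to be refactored over time.
+module Ci
+  # Private to the `ci` package.
+  class BuildFactory
+    def create
+      { name: 'rspec' }
+    end
+  end
+
+  # Public entry point of the `ci` package.
+  class PipelineCreator
+    def execute
+      [BuildFactory.new.create]
+    end
+  end
+end
+
+Ci::PipelineCreator.new.execute # => [{ name: "rspec" }]
+```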
diff --git a/doc/architecture/blueprints/new_diffs.md b/doc/architecture/blueprints/new_diffs.md
new file mode 100644
index 00000000000..b5aeb9b8aa8
--- /dev/null
+++ b/doc/architecture/blueprints/new_diffs.md
@@ -0,0 +1,103 @@
+---
+status: proposed
+creation-date: "2023-10-10"
+authors: [ "@iamphill" ]
+coach: [ "@ntepluhina" ]
+approvers: [ ]
+owning-stage: "~devops::create"
+participating-stages: []
+---
+
+<!-- Blueprints often contain forward-looking statements -->
+<!-- vale gitlab.FutureTense = NO -->
+
+# New diffs
+
+## Summary
+
+Diffs at GitLab are spread across several places, with each area using its own method. We are aiming
+to develop a single, performant way for diffs to be rendered across the application. Our aim here is
+to improve all areas of diff rendering, from the backend creation of diffs to the frontend rendering
+the diffs.
+
+## Motivation
+
+### Goals
+
+- improved perceived performance
+- improved maintainability
+- consistent coverage of all scenarios
+
+### Non-Goals
+
+<!--
+Listing non-goals helps to focus discussion and make progress. This section is
+optional.
+
+- What is out of scope for this blueprint?
+-->
+
+### Priority of Goals
+
+To provide guidance on which goals are more important than others, and to assist in making
+consistent choices, we defined the following order, even though all goals are important.
+
+**Perceived performance** is above **improved maintainability** is above **consistent coverage**.
+
+Examples:
+
+- a proposal improves maintainability at the cost of perceived performance: ❌ we should consider an alternative.
+- a proposal removes a feature from certain contexts, hurting coverage, and has no impact on perceived performance or maintainability: ❌ we should re-consider.
+- a proposal improves perceived performance but removes features from certain contexts of usage: ✅ it's valid and should be discussed with Product/UX.
+- a proposal guarantees consistent coverage and has no impact on perceived performance or maintainability: ✅ it's valid.
+
+In essence, we'll strive to meet every goal at each decision but prioritise the higher ones.
+
+## Proposal
+
+<!--
+This is where we get down to the specifics of what the proposal actually is,
+but keep it simple! This should have enough detail that reviewers can
+understand exactly what you're proposing, but should not include things like
+API designs or implementation. The "Design Details" section below is for the
+real nitty-gritty.
+
+You might want to consider including the pros and cons of the proposed solution so that they can be
+compared with the pros and cons of alternatives.
+-->
+
+## Design and implementation details
+
+<!--
+This section should contain enough information that the specifics of your
+change are understandable. This may include API specs (though not always
+required) or even code snippets. If there's any ambiguity about HOW your
+proposal will be implemented, this is the place to discuss them.
+
+If you are not sure how many implementation details you should include in the
+blueprint, the rule of thumb here is to provide enough context for people to
+understand the proposal. As you move forward with the implementation, you may
+need to add more implementation details to the blueprint, as those may become
+an important context for important technical decisions made along the way. A
+blueprint is also a register of such technical decisions. If a technical
+decision requires additional context before it can be made, you probably should
+document this context in a blueprint. If it is a small technical decision that
+can be made in a merge request by an author and a maintainer, you probably do
+not need to document it here. The impact a technical decision will have is
+another helpful information - if a technical decision is very impactful,
+documenting it, along with associated implementation details, is advisable.
+
+If it's helpful to include workflow diagrams or any other related images.
+Diagrams authored in GitLab flavored markdown are preferred. In cases where
+that is not feasible, images should be placed under `images/` in the same
+directory as the `index.md` for the proposal.
+-->
+
+## Alternative Solutions
+
+<!--
+It might be a good idea to include a list of alternative solutions or paths considered, although it is not required. Include pros and cons for
+each alternative solution/path.
+
+"Do nothing" and its pros and cons could be included in the list too.
+-->
diff --git a/doc/architecture/blueprints/observability_metrics/index.md b/doc/architecture/blueprints/observability_metrics/index.md
new file mode 100644
index 00000000000..25a3b72a989
--- /dev/null
+++ b/doc/architecture/blueprints/observability_metrics/index.md
@@ -0,0 +1,286 @@
+---
+status: proposed
+creation-date: "2022-11-09"
+authors: [ "@ankitbhatnagar" ]
+coach: "@mappelman"
+approvers: [ "@sguyon", "@nicholasklick" ]
+owning-stage: "~monitor::observability"
+participating-stages: []
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+# GitLab Observability - Metrics
+
+## Summary
+
+Develop a multi-user system to store and query observability data, typically formatted in widely accepted, industry-standard formats such as OpenTelemetry, using ClickHouse as the underlying storage, with support for long-term data retention and aggregation.
+
+## Motivation
+
+Of the six pillars of Observability, commonly abbreviated as `TEMPLE` (Traces, Events, Metrics, Profiles, Logs & Errors), Metrics are among the most important for modern-day systems, helping users gather insights about the operational posture of monitored systems.
+
+Metrics, which are commonly structured as timeseries data, have the following characteristics:
+
+- indexed by their corresponding timestamps;
+- continuously expanding in size;
+- usually aggregated, down-sampled, and queried in ranges; and
+- very write-intensive.
+
+Within GitLab Observability Backend, we aim to add support for our customers to ingest and query observability data about their systems and applications, helping them improve the operational health of those systems.
+
+### Goals
+
+With the development of the proposed system, we have the following goals:
+
+- Scalable, low latency & cost-effective monitoring system backed by Clickhouse whose performance has been proven via repeatable benchmarks.
+
+- Support for long-term storage for metrics, ingested via an OpenTelemetry-compliant agent and queried via GitLab-native UI with probable support for metadata and exemplars.
+
+The aforementioned goals can further be broken down into the following four sub-goals:
+
+#### Ingesting data
+
+- For the system to be capable of handling large volumes of writes and reads, it must be horizontally scalable and provide durability guarantees so that no writes are dropped once ingested.
+
+#### Persisting data
+
+- We aim to support ingesting telemetry/data instrumented using OpenTelemetry specifications. For a first iteration, any persistence we design for our dataset will be multi-tenant by default, ensuring we can store observability data for multiple groups/projects within the same storage backend.
+
+#### Reading data
+
+- We aim to support querying data via a GitLab-native UX which would mean using a custom DSL/Query Builder sending API requests to our backend which would then translate them into Clickhouse SQL. From our internal discussions around this, [Product Analytics Visualisation Designer](https://gitlab.com/gitlab-org/gitlab-services/design.gitlab.com/-/analytics/dashboards/visualization-designer) is a good source of inspiration for this.
+
+#### Deleting data
+
+- We aim to support being able to delete any ingested data should such a need arise. This is in addition to naturally deleting data when a configured TTL expires and/or the respective retention policies are enforced. We must, within our schemas, build a way to delete data by labels OR their content, and add the necessary tooling to do so to our offering.
+
+### Non-Goals
+
+With the goals established above, we also want to establish what specific things are non-goals with the current proposal. They are:
+
+- With our first iteration here, we do not aim to support querying ingested telemetry via [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/), deferring that until such a business need arises. However, users will be able to ingest their metrics using the OpenTelemetry Line Protocol (OTLP), for example via the [Prometheus Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/prometheusreceiver/README.md) in the case of Prometheus metrics.
+
+## Proposal
+
+We intend to use GitLab Observability Backend (GOB) as a framework for the Metrics implementation so that its lifecycle can be managed via already established components of our backend.
+
+![Architecture](metrics_indexing_at_ingestion.png)
+
+As depicted in the diagram above, an OTEL-collector pipeline, indexer & query service are components that need to be developed as proposed here while the remaining peripheral components either already exist or can be provisioned via existing code in our centralised `scheduler` within GOB.
+
+**On the write path**:
+
+- We expect to receive incoming data via `HTTP/JSON` similar to what we do for our existing services, e.g. errortracking, tracing.
+
+- We aim to heavily deduplicate incoming timeseries by indexing/caching per-series metadata to reduce our storage footprint.
+
+- We aim to avoid a large number of small writes into ClickHouse by batching data before writing it.
+
+**On the read path**:
+
+![MetricsReadPath](metrics-read-path.png)
+
+- We aim to allow our users to use GitLab itself to read ingested data, which will necessitate building a dedicated `Query Service` on our backend to be able to service API requests originating from GitLab.
+
+- We aim to implement the necessary query validation, sanitization, and rate-limiting for any resource consumption to ensure the underlying systems remain in good operational health at all times.
+
+### GitLab Observability Tenant
+
+With the recent changes to our backend design, especially around deprecating the use of a Grafana-based UX, we have found opportunities to streamline how we provision tenants within our system. This initiative has led to the development of a custom CR - `GitLabObservabilityTenant` - intended to model a dedicated set of resources **per top-level GitLab namespace**. From a scalability perspective, this means we deploy a dedicated instance of `Ingress` & `Ingester` per top-level GitLab namespace to make sure we can scale each tenant subject to the traffic volumes of its respective groups & projects. It also helps isolate resource consumption across tenants in an otherwise multi-tenant system such as ours.
+
+### Indexing per-series metadata
+
+As an internal part of the `ingester`, we aim to index per-series labels and/or metadata to be able to deduplicate incoming timeseries data and segregate them into metadata and points-data. This helps reduce our storage footprint by an order of magnitude keeping total cost of operation low. This indexed data can also be consumed by the `Query Service` to efficiently compute timeseries for all incoming read requests. This part of our architecture is also described in more detail in [Proposal: Indexing metrics labels for efficiently deduplicating & querying time series data](https://gitlab.com/gitlab-org/opstrace/opstrace/-/issues/2397).
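+
+Conceptually, the deduplication works by hashing each unique label set into a series ID, storing the metadata once, and having points reference only the ID. A small sketch (illustrative only; the actual ingester is not implemented in Ruby and its schema differs):
+
+```ruby
+require 'digest'
+
+class SeriesIndex
+  def initialize
+    @metadata = {} # series_id => labels, written to the metadata table only once
+    @points   = [] # [series_id, timestamp, value] rows, written in batches
+  end
+
+  def ingest(labels, timestamp, value)
+    series_id = Digest::SHA256.hexdigest(labels.sort.to_h.to_s)
+    @metadata[series_id] ||= labels # deduplicated: stored once per unique label set
+    @points << [series_id, timestamp, value]
+  end
+
+  attr_reader :metadata, :points
+end
+
+index = SeriesIndex.new
+2.times { |i| index.ingest({ '__name__' => 'apiserver_request_total', 'instance' => 'a' }, i, i * 1.0) }
+index.metadata.size # => 1 (one series), while index.points has 2 rows
+```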
+
+### Query Service
+
+The `Query Service` consists of two primary components: a request parser and a backend-specific querier implementation. On the request path, once a request is received on the designated endpoint(s), it is handled by a handler that is part of the request parser. The parser's responsibility is to unmarshal incoming query payloads, validate the contents, and produce a `SearchContext` object which describes how this query/request must be processed. Within a `SearchContext` object is a `QueryContext` attribute which further defines one or more `Query` objects - each a completely independent data query against one of our backends.
+
+![QueryServiceInternals](query-service-internals.png)
+
+#### API structure
+
+For the user-facing API, we intend to add support via HTTP/JSON endpoint(s) with user queries marshalled as payloads within a request body. For example, to compute the sum of the per-minute rate of the metric `apiserver_request_total` over all values of the label `instance`, you'd send a POST request to `https://observe.gitlab.com/query/$GROUP/$PROJECT/metrics` with the following body:
+
+```json
+{
+ "queries": {
+ "A": {
+ "type": "metrics",
+ "filters": [
+ {
+ "key": "__name__",
+ "value": "apiserver_request_total",
+ "operator": "eq"
+ }
+ ],
+ "aggregation": {
+ "function": "rate",
+ "interval": "1m"
+ },
+ "groupBy": {
+ "attribute": [
+ "instance"
+ ],
+ "function": "sum"
+ },
+ "sortBy": {},
+ "legend": {}
+ }
+ },
+ "expression": "A"
+}
+```
+
+#### Query representation as an AST
+
+```plaintext
+type SearchContext struct {
+ UserContext *UserContext `json:"authContext"`
+ BackendContext *BackendContext `json:"backendContext"`
+
+ StartTimestamp int64 `json:"start"`
+ EndTimestamp int64 `json:"end"`
+ StepIntervalSeconds int64 `json:"step"`
+
+ QueryContext *QueryContext `json:"queryContext"`
+ CorrelationContext *CorrelationContext `json:"correlationContext"`
+ Variables map[string]interface{} `json:"variables,omitempty"`
+}
+```
+
+Generally speaking:
+
+- `SearchContext` defines how a search must be executed.
+ - It internally contains a `QueryContext` which points to one or more `Query`(s) each targeting a given backend.
+ - Each `Query` must be parsed & processed independently, supplemented by other common attributes within a `QueryContext` or `SearchContext`.
+
+- `Query` defines an AST-like object which describes how a query must be performed.
+ - It is intentionally schema-agnostic allowing it to be serialised and passed around our system(s).
+ - It is also an abstraction that hides details of how we model data internal to our databases from the querying entity.
+ - Assuming an incoming query can be parsed & validated into a `Query` object, a `Querier` can execute a search/query against it.
+
+- `UserContext` defines whether a request has access to the data being searched for.
+ - It is perhaps a good place to model & enforce request quotas, rate-limiting, etc.
+  - Populating parts of this attribute depends on the parser reading other global state via the API gateway or Gatekeeper.
+
+- `BackendContext` defines which backend a request must be processed against.
+ - It helps route requests to an appropriate backend in a multitenant environment.
+ - For this iteration though, we intend to work with only one backend as is the case with our architecture.
+
+- `CorrelationContext` defines how multiple queries can be correlated to each other to build a cohesive view on the frontend.
+ - For this iteration though, we intend to keep it empty and only work on adding correlation vectors later.
+
+## Intended target-environments
+
+Keeping in line with our current operational structure, we intend to deploy the metrics offering as a part of GitLab Observability Backend, on the following two target environments:
+
+- kind cluster (for local development)
+- GKE cluster (for staging/production environments)
+
+## Production Readiness
+
+### Batching
+
+Considering we need to avoid ingesting large volumes of small writes into ClickHouse, the design must account for app-local persistence so that incoming data can be batched locally and landed into ClickHouse in batches of a predetermined size, increasing performance and allowing the table engine to continue to persist data successfully.
+
+We have considered the following alternatives to implement app-local batching:
+
+- In-memory - non durable
+- BadgerDB - durable, embedded, performant
+- Redis - trivial, external dependency
+- Kafka - non-trivial, external dependency but it can augment multiple other use-cases and help other problem domains at GitLab.
+
+**Note**: Similar challenges have also surfaced with the ClickHouse interactions the `errortracking` subsystem has in its current implementation. There have been multiple attempts to solve this problem domain in the past - [this MR](https://gitlab.com/gitlab-org/opstrace/opstrace/-/merge_requests/1660) implemented an in-memory alternative while [this one](https://gitlab.com/gitlab-org/opstrace/opstrace/-/merge_requests/1767) attempted an on-disk alternative.
+
+Any work done in this area of concern would also benefit other subsystems such as errortracking, logging, etc.
+
+### Scalability
+
+We intend to start testing the proposed implementation with 10K metric-points per second to test/establish our initial hypothesis, though ideally, we must design the underlying backend for 1M points ingested per second.
+
+### Benchmarking
+
+We propose the following three dimensions be tested while benchmarking the proposed implementation:
+
+- Data ingest performance (functional)
+- Mean query response times (functional)
+- Storage requirements (operational)
+
+To understand query performance, we'll first need to compile a list of representative queries given the data we ingest for our tests. ClickHouse query logging is very helpful while doing this.
+
+NOTE:
+Ideally, we aim to benchmark the system to be able to ingest >1M metric points/sec while consistently serving most queries under <1 sec.
+
+### Past work & references
+
+- [Benchmark ClickHouse for metrics](https://gitlab.com/gitlab-org/opstrace/opstrace/-/issues/1666)
+- [Incubation:APM ClickHouse evaluation](https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/4)
+- [Incubation:APM ClickHouse metrics schema](https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/10)
+- [Our research around TimescaleDB](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/14137)
+- [Current Workload on our Thanos-based setup](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15420#current-workload)
+- [Scaling-200m-series](https://opstrace.com/blog/scaling-200m-series)
+
+### Cost-estimation
+
+- We aim to make sure the system is cost-effective for our users for ingesting & querying telemetry data. One of the more significant factors affecting underlying costs is how we model & store ingested data, which the intended proposal must optimize for through measures such as reducing data redundancy, pruning unused metrics, etc.
+
+- We must consider the usage of multiple storage medium(s), especially:
+ - Tiered storage
+ - Object storage
+
+### Tooling
+
+As an overarching outcome here, we aim to build the necessary tooling and/or telemetry around ingested data to enable all user personas to have visibility into high cardinality metrics to help prune or drop unused metrics. It'd be prudent to have usage statistics e.g. per-metric scrape frequencies, to make sure our end-users are not ingesting data at a volume they do not need and/or find useful.
+
+## Future iterations
+
+### Linkage across telemetry pillars, exemplars
+
+We must build the metrics system in a way that allows cross-referencing ingested data with other telemetry pillars, such as traces, logs, and errors, so as to provide a more holistic view of all instrumentation a system sends our way.
+
+### Support for user-defined SQL queries to aggregate data and/or generate materialized views
+
+We should allow users of the system to run user-defined, ad-hoc queries, similar to how Prometheus recording rules help generate custom metrics from existing ones.
+
+### Support for scalable data ingestion
+
+We believe that, should we feel the need to start buffering data locally in the ingestion application and/or move away from ClickHouse for persisting data, on-disk WALs would be a good direction to proceed in, given their prevalent usage among other monitoring systems.
+
+### Query Service features
+
+- Adding support for compound queries and/or expressions.
+- Consolidation of querying capabilities for tracing, logs & errortracking via the query engine.
+- Using the query engine to build integrations such as alerting.
+- Adding support for other monitoring/querying standards such as PromQL, MetricQL, OpenSearch, etc
+- Adding automated insights around metric cardinality & resource consumption.
+
+## Planned roadmap
+
+The following section lists how we intend to implement the aforementioned proposal for building Metrics support into GitLab Observability Service. Each corresponding document and/or issue contains further details of how each step is planned to be executed.
+
+### 16.5
+
+- Research & draft design proposal and/or requirements.
+- Produce architectural blueprint, open for feedback.
+
+### 16.6
+
+- Develop support for OpenTelemetry-based ingestion.
+- Develop support for querying data; begin with an API to list all ingested metrics scoped to a given tenant.
+- Develop support for displaying a list of ingested metrics within GitLab UI.
+- Release Experimental version.
+
+### 16.7
+
+- Develop support for querying data, add metrics search endpoints for supported metric-types.
+- Develop our first iteration of the query builder, enable querying backend APIs.
+- Develop a metrics details page with the ability to graph data returned via backend APIs.
+- Setup testing, ensure repeatable benchmarking/testing can be performed.
+- Release Beta version, open for early usage by internal and external customers.
+
+### 16.9 (Gap to allow for user feedback for GA release)
+
+- Develop end-to-end testing, complete necessary production readiness, address feedback from users.
+- Release GA version.
diff --git a/doc/architecture/blueprints/observability_metrics/metrics-read-path.png b/doc/architecture/blueprints/observability_metrics/metrics-read-path.png
new file mode 100644
index 00000000000..c94e947079b
--- /dev/null
+++ b/doc/architecture/blueprints/observability_metrics/metrics-read-path.png
Binary files differ
diff --git a/doc/architecture/blueprints/observability_metrics/metrics_indexing_at_ingestion.png b/doc/architecture/blueprints/observability_metrics/metrics_indexing_at_ingestion.png
new file mode 100644
index 00000000000..cafabac25c0
--- /dev/null
+++ b/doc/architecture/blueprints/observability_metrics/metrics_indexing_at_ingestion.png
Binary files differ
diff --git a/doc/architecture/blueprints/observability_metrics/query-service-internals.png b/doc/architecture/blueprints/observability_metrics/query-service-internals.png
new file mode 100644
index 00000000000..de43f812fa8
--- /dev/null
+++ b/doc/architecture/blueprints/observability_metrics/query-service-internals.png
Binary files differ
diff --git a/doc/architecture/blueprints/observability_tracing/index.md b/doc/architecture/blueprints/observability_tracing/index.md
index 71e03d81bcf..4c95d23e6bd 100644
--- a/doc/architecture/blueprints/observability_tracing/index.md
+++ b/doc/architecture/blueprints/observability_tracing/index.md
@@ -45,14 +45,14 @@ To release a generally available distributed tracing feature as part of GitLab.c
Specific goals:
-- An HTTPS write API implemented in the [GitLab Observability Backend](https://GitLab.com/GitLab-org/opstrace/opstrace) project which receives spans sent to GitLab using [OTLP (OpenTelemetry Protocol)](https://opentelemetry.io/docs/specs/otel/protocol/). Users can collect and send distributed traces using either the [OpenTelemetry SDK](https://opentelemetry.io/docs/collector/deployment/no-collector/) or the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/).
+- An HTTPS write API implemented in the [GitLab Observability Backend](https://gitlab.com/gitlab-org/opstrace/opstrace) project which receives spans sent to GitLab using [OTLP (OpenTelemetry Protocol)](https://opentelemetry.io/docs/specs/otel/protocol/). Users can collect and send distributed traces using either the [OpenTelemetry SDK](https://opentelemetry.io/docs/collector/deployment/no-collector/) or the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/).
- UI to list and filter/search for traces by ID, service, attributes or time
- UI to show a detail view of a trace and its corresponding spans
- Apply sensible ingestion and storage limits per top-level namespace for all GitLab tiers
## Timeline
-In order to achieve the group objectives, the following timelines must be met for [GitLab phased rollout](https://about.GitLab.com/handbook/product/GitLab-the-product/#experiment-beta-ga) of Tracing.
+In order to achieve the group objectives, the following timelines must be met for [GitLab phased rollout](https://about.gitlab.com/handbook/product/gitlab-the-product/#experiment-beta-ga) of Tracing.
- **Tracing Experiment Release**: 16.2
- **Tracing Beta Release**: 16.3
@@ -114,7 +114,7 @@ The scope of effort for GA would include two APIs:
### Authentication and Authorization
<!-- markdownlint-disable-next-line MD044 -->
-GitLab Observability Backend utilizes an [instance-wide trusted GitLab OAuth](https://docs.GitLab.com/ee/integration/OAuth_provider.html#create-an-instance-wide-application) token to perform a seamless OAuth flow that authenticates the GitLab user against the GitLab Observability Backend (GOB). GOB creates an auth session and stores the session identifier in an http-only, secure cookie. This mechanism has already been examined and approved by AppSec. Now that the Observability UI will be native within the UI hosted at GitLab.com, a few small adjustments must be made for authentication to work against the new UI domain vs the embedded iframe that we previously relied upon (GitLab.com instead of observe.gitLab.com).
+GitLab Observability Backend utilizes an [instance-wide trusted GitLab OAuth](../../../integration/oauth_provider.md#create-an-instance-wide-application) token to perform a seamless OAuth flow that authenticates the GitLab user against the GitLab Observability Backend (GOB). GOB creates an auth session and stores the session identifier in an http-only, secure cookie. This mechanism has already been examined and approved by AppSec. Now that the Observability UI will be native within the UI hosted at GitLab.com, a few small adjustments must be made for authentication to work against the new UI domain vs the embedded iframe that we previously relied upon (GitLab.com instead of observe.gitlab.com).
A hidden iframe will be embedded in the GitLab UI only on pages where GOB authenticated APIs must be consumed. This allows GitLab.com UI to directly communicate with GOB APIs without the need for an intermediate proxy layer in rails and without relying on the less secure shared token between proxy and GOB. This iframe will be hidden and its sole purpose is to perform the OAuth flow and assign the http-only secure cookie containing the GOB user session. This flow is seamless and can be fully hidden from the user since its a **trusted** GitLab OAuth flow. Sessions currently expire after 30 days which is configurable in GOB deployment terraform.
diff --git a/doc/architecture/blueprints/organization/index.md b/doc/architecture/blueprints/organization/index.md
index 0955d53313d..258a624e371 100644
--- a/doc/architecture/blueprints/organization/index.md
+++ b/doc/architecture/blueprints/organization/index.md
@@ -108,11 +108,13 @@ The Organization MVC will contain the following functionality:
- Organization Owner. The creation of an Organization appoints that User as the Organization Owner. Once established, the Organization Owner can appoint other Organization Owners.
- Organization Users. A User is managed by one Organization, but can be part of multiple Organizations. Users are able to navigate between the different Organizations they are part of.
- Setup settings. Containing the Organization name, ID, description, and avatar. Settings are editable by the Organization Owner.
-- Setup flow. Users are able to build new Organizations and transfer existing top-level Groups into them. They can also create new top-level Groups in an Organization.
+- Setup flow. Users are able to build new Organizations. They can also create new top-level Groups in an Organization.
- Visibility. Initially, Organizations can only be `public`. Public Organizations can be seen by everyone. They can contain public and private Groups and Projects.
- Organization settings page with the added ability to remove an Organization. Deletion of the default Organization is prevented.
- Groups. This includes the ability to create, edit, and delete Groups, as well as a Groups overview that can be accessed by the Organization Owner and Users.
- Projects. This includes the ability to create, edit, and delete Projects, as well as a Projects overview that can be accessed by the Organization Owner and Users.
+- Personal Namespaces. Users get [a personal Namespace in each Organization](../cells/impacted_features/personal-namespaces.md) they interact with.
+- User Profile. Each [User Profile will be scoped to the Organization](../cells/impacted_features/user-profile.md).
### Organization Access
@@ -324,13 +326,12 @@ In iteration 2, an Organization MVC Experiment will be released. We will test th
### Iteration 3: Organization MVC Beta (FY25Q1)
-In iteration 3, the Organization MVC Beta will be released. Users will be able to transfer existing top-level Groups into an Organization.
+In iteration 3, the Organization MVC Beta will be released.
- Multiple Organization Owners can be assigned.
- Organization avatars can be changed in the Organization settings.
- Organization Owners can create, edit and delete Groups from the Groups overview.
- Organization Owners can create, edit and delete Projects from the Projects overview.
-- Top-level Groups can be transferred into an Organization.
- The Organization URL path can be changed.
### Iteration 4: Organization MVC GA (FY25Q2)
@@ -341,6 +342,7 @@ In iteration 4, the Organization MVC will be rolled out.
After the initial rollout of Organizations, the following functionality will be added to address customer needs relating to their implementation of GitLab:
+1. [Users can transfer existing top-level Groups into Organizations](https://gitlab.com/groups/gitlab-org/-/epics/11711).
1. [Organizations can invite Users](https://gitlab.com/gitlab-org/gitlab/-/issues/420166).
1. Internal visibility will be made available on Organizations that are part of GitLab.com.
1. Restrict inviting Users outside of the Organization.
diff --git a/doc/architecture/blueprints/permissions/index.md b/doc/architecture/blueprints/permissions/index.md
index ab66733803d..c131c372550 100644
--- a/doc/architecture/blueprints/permissions/index.md
+++ b/doc/architecture/blueprints/permissions/index.md
@@ -179,6 +179,6 @@ Cons:
## Resources
-- [Custom Roles MVC announcement](https://github.blog/changelog/2021-10-27-enterprise-organizations-can-now-create-custom-repository-roles)
+- [Custom Roles MVC announcement](https://github.blog/changelog/2021-10-27-enterprise-organizations-can-now-create-custom-repository-roles/)
- [Custom Roles lunch and learn notes](https://docs.google.com/document/d/1x2ExhGJl2-nEibTaQE_7e5w2sDCRRHiakrBYDspPRqw/edit#)
- [Discovery on auto-generating documentation for permissions](https://gitlab.com/gitlab-org/gitlab/-/issues/352891#note_989392294).
diff --git a/doc/architecture/blueprints/remote_development/index.md b/doc/architecture/blueprints/remote_development/index.md
index d64fbfc8b55..cc66c3b5416 100644
--- a/doc/architecture/blueprints/remote_development/index.md
+++ b/doc/architecture/blueprints/remote_development/index.md
@@ -747,7 +747,7 @@ You can read more about this decision in this [issue](https://gitlab.com/gitlab-
## Links
-- [Remote Development direction](https://about.gitlab.com/direction/create/editor/remote_development)
+- [Remote Development direction](https://about.gitlab.com/direction/create/ide/remote_development/)
- [Remote Development presentation](https://docs.google.com/presentation/d/1XHH_ZilZPufQoWVWViv3evipI-BnAvRQrdvzlhBuumw/edit#slide=id.g131f2bb72e4_0_8)
- [Category Strategy epic](https://gitlab.com/groups/gitlab-org/-/epics/7419)
- [Minimal Maturity epic](https://gitlab.com/groups/gitlab-org/-/epics/9189)
@@ -760,5 +760,4 @@ You can read more about this decision in this [issue](https://gitlab.com/gitlab-
- [Browser runtime](https://gitlab.com/groups/gitlab-org/-/epics/8291)
- [GitLab-hosted infrastructure](https://gitlab.com/groups/gitlab-org/-/epics/8292)
- [Browser runtime spike](https://gitlab.com/gitlab-org/gitlab-web-ide/-/merge_requests/58)
-- [Ideal user journey](https://about.gitlab.com/direction/create/editor/remote_development/#ideal-user-journey)
- [Building container images for workspaces](https://gitlab.com/gitlab-org/gitlab/-/issues/396300#note_1375061754)
diff --git a/doc/architecture/blueprints/runway/img/runway-architecture.png b/doc/architecture/blueprints/runway/img/runway-architecture.png
index e577eb7fd15..4ab4cf882c2 100644
--- a/doc/architecture/blueprints/runway/img/runway-architecture.png
+++ b/doc/architecture/blueprints/runway/img/runway-architecture.png
Binary files differ
diff --git a/doc/architecture/blueprints/runway/img/runway_vault_4_.drawio.png b/doc/architecture/blueprints/runway/img/runway_vault_4_.drawio.png
index b56e326c8c4..2a35df44aa8 100644
--- a/doc/architecture/blueprints/runway/img/runway_vault_4_.drawio.png
+++ b/doc/architecture/blueprints/runway/img/runway_vault_4_.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/secret_manager/decisions/001_envelop_encryption.md b/doc/architecture/blueprints/secret_manager/decisions/001_envelop_encryption.md
new file mode 100644
index 00000000000..909b70ad4c2
--- /dev/null
+++ b/doc/architecture/blueprints/secret_manager/decisions/001_envelop_encryption.md
@@ -0,0 +1,69 @@
+---
+owning-stage: "~devops::verify"
+description: 'GitLab Secrets Manager ADR 001: Use envelope encryption'
+---
+
+# GitLab Secrets Manager ADR 001: Use envelope encryption
+
+## Context
+
+To store secrets securely in the GitLab Secrets Manager, we need a system that can prevent unencrypted secrets from being leaked
+in the event of a security breach of a GitLab system.
+
+## Decision
+
+Use envelope encryption. GitLab Rails will store the encrypted secret at rest, along with an encrypted data key.
+In order to decrypt the secret, GitLab Rails will need to make a decryption request to the GCP key manager through GitLab
+Secrets Service and obtain the decrypted data key. The data key is then used to decrypt the encrypted secret.
+
+```mermaid
+sequenceDiagram
+ participant A as Client
+ participant B as GitLab Rails
+ participant C as GitLab Secrets Service
+
+ Note over B,C: Initialize vault for project/group/organization
+
+ B->>C: Initialize vault - create key pair
+ C->>B: Returns vault public key
+ B->>B: Stores vault public key
+
+ Note over A,C: Creating a new secret
+
+ A->>B: Create new secret
+ B->>B: Generate new symmetric data key
+ B->>B: Encrypts secret with data key
+ B->>B: Encrypts data key with vault public key
+ B->>B: Stores envelope (encrypted secret + encrypted data key)
+ B-->>B: Discards plain-text data key
+ B->>A: Success
+
+ Note over A,C: Retrieving a secret
+
+ A->>B: Get secret
+ B->>B: Retrieves envelope (encrypted secret + encrypted data key)
+ B->>C: Decrypt data key
+ C->>C: Decrypt data key using vault private key
+ C->>B: Returns plain-text data key
+ B->>B: Decrypts secret
+ B-->>B: Discards plain-text data key
+ B->>A: Returns secret
+```
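+
+For illustration, the flow above maps to standard envelope encryption primitives. A conceptual sketch (illustrative only: in the real design the vault private key lives in GCP Key Management and the data key is decrypted by GitLab Secrets Service, not in Rails memory):
+
+```ruby
+require 'openssl'
+
+vault_key = OpenSSL::PKey::RSA.new(2048)    # per-vault asymmetric key pair
+
+# Creating a secret (GitLab Rails side).
+data_key = OpenSSL::Random.random_bytes(32) # single-use symmetric data key
+cipher = OpenSSL::Cipher.new('aes-256-gcm')
+cipher.encrypt
+cipher.key = data_key
+iv = cipher.random_iv
+encrypted_secret = cipher.update('my-secret-value') + cipher.final
+auth_tag = cipher.auth_tag
+encrypted_data_key = vault_key.public_key.public_encrypt(data_key)
+# Stored envelope: encrypted_secret, iv, auth_tag, encrypted_data_key.
+# The plain-text data_key is discarded.
+
+# Retrieving a secret (data-key decryption happens in GitLab Secrets Service).
+plain_data_key = vault_key.private_decrypt(encrypted_data_key)
+decipher = OpenSSL::Cipher.new('aes-256-gcm')
+decipher.decrypt
+decipher.key = plain_data_key
+decipher.iv = iv
+decipher.auth_tag = auth_tag
+puts decipher.update(encrypted_secret) + decipher.final # => "my-secret-value"
+```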
+
+## Consequences
+
+With this approach, an actor that gains access to the GitLab database containing the envelope will not be able to
+decrypt the content of the secret as the private key required is not stored with it.
+
+We also need to consider how to securely generate and store the asymmetric keypair used for each vault.
+
+In addition, the following resources would be required:
+
+1. Multiple asymmetric keypairs. A unique asymmetric keypair is needed per vault, belonging to a project, group or an organization.
+1. Multiple symmetric keys. A unique key is needed per secret.
+
+## Alternatives
+
+We considered performing the encryption and decryption of the secret in the GitLab Secrets Service, while storing the
+encrypted data in GitLab Rails. However, this means that there would be a time when the secret and the encryption keys
+exist at the same time in GitLab Secrets Service.
diff --git a/doc/architecture/blueprints/secret_manager/index.md b/doc/architecture/blueprints/secret_manager/index.md
new file mode 100644
index 00000000000..2a840f8d846
--- /dev/null
+++ b/doc/architecture/blueprints/secret_manager/index.md
@@ -0,0 +1,139 @@
+---
+status: proposed
+creation-date: "2023-08-07"
+authors: [ "@alberts-gitlab" ]
+coach: [ "@grzesiek" ]
+approvers: [ "@jocelynjane", "@shampton" ]
+owning-stage: "~devops::verify"
+participating-stages: []
+---
+
+<!-- Blueprints often contain forward-looking statements -->
+<!-- vale gitlab.FutureTense = NO -->
+
+# GitLab Secrets Manager
+
+## Summary
+
+GitLab users need a secure and easy-to-use solution to
+store their sensitive credentials that should be kept confidential ("secret").
+GitLab Secrets Manager is the desired system that allows GitLab users
+to meet that need without having to access third-party tools.
+
+## Motivation
+
+The current de-facto approach used by many to store a sensitive credential in GitLab is
+using a [Masked Variable](../../../ci/variables/index.md#mask-a-cicd-variable) or a
+[File Variable](../../../ci/variables/index.md#use-file-type-cicd-variables).
+However, data stored in variables (masked or file variables) can be inadvertently exposed even with masking.
+A more secure solution would be to use native integration
+with external secret managers such as HashiCorp Vault or Azure Key Vault.
+
+Integration with external secret managers requires GitLab to maintain the integration
+with the third-party products and to assist customers in troubleshooting configuration issues.
+In addition, customers' engineering teams using these external secret managers
+may need to maintain these systems themselves, adding to the operational burden.
+
+Having a GitLab-native secrets manager would provide customers with a secure method to store and access secrets
+without the overhead of third-party tools, as well as leverage the tight integration with other GitLab features.
+
+### Goals
+
+Provide GitLab users with a way to:
+
+- Securely store secrets in GitLab
+- Use the stored secrets in GitLab components (for example, CI Runner)
+- Use the stored secrets in external environments (for example, production infrastructure).
+- Manage access to secrets across a root namespace, subgroups and projects.
+- Seal/unseal secrets vault on demand.
+
+#### Non-functional requirements
+
+- Security
+- Compliance
+- Auditability
+
+### Non-Goals
+
+This blueprint does not cover the following:
+
+- Secrets such as access tokens created within GitLab to allow external resources to access GitLab, for example, personal access tokens.
+
+## Proposal
+
+The secrets manager feature will consist of three core components:
+
+1. GitLab Rails
+1. GitLab Secrets Service
+1. GCP Key Management
+
+At a high level, secrets will be stored using unique encryption keys in order to achieve isolation
+across GitLab. Each service should also be isolated such that in the event
+one of the components is compromised, the likelihood of secrets leaking is minimized.
+
+![Secrets Manager Overview](secrets-manager-overview.png)
+
+**1. GitLab Rails**
+
+GitLab Rails would be the main interface that users would interact with when creating secrets using the Secrets Manager feature.
+
+This component performs the following role:
+
+1. Storing unique encryption public keys per organization.
+1. Encrypting and storing secrets using envelope encryption.
+
+The plain-text secret would be encrypted using a single-use data key.
+The data key is then encrypted using the public key belonging to the group or project.
+Both the encrypted secret and the encrypted data key are stored in the database.
+
+**2. GitLab Secrets Service**
+
+GitLab Secrets Service will be a new component in the overall GitLab architecture. This component serves the following purposes:
+
+1. Correlating GitLab identities into GCP identities for access control.
+1. Acting as a proxy over GCP Key Management for decryption operations.
+
+**3. GCP Key Management**
+
+We choose to leverage GCP Key Management to build on the security and trust that GCP provides on cryptographic operations.
+In particular, we would be using GCP Key Management to store the private keys that will be used to decrypt
+the data keys mentioned above.
+
+### Implementation detail
+
+- [Secrets Manager](secrets_manager.md)
+
+### Further investigations required
+
+1. Management of identities stored in GCP Key Management.
+We need to investigate how we can correlate and de-multiplex GitLab identities into
+GCP identities that are used to allow access to cryptographic operations on GCP Key Management.
+1. Authentication of clients. Clients to the Secrets Manager could be GitLab Runner or external clients.
+For each of these, we need a secure and reliable method to authenticate requests to decrypt a secret.
+1. Assignment of GCP backed private keys to each identity.
+
+### Availability on SaaS and Self-Managed
+
+To begin with, the proposal above is intended for the GitLab SaaS environment. GitLab SaaS is deployed on Google Cloud Platform.
+Hence, GCP Key Management is the natural choice for a cloud-based key management service.
+
+To extend this service to self-managed GitLab instances, we would consider using GitLab Cloud Connector as a proxy between
+self-managed GitLab instances and the GitLab Secrets Manager.
+
+## Decision Records
+
+- [001: Use envelope encryption](decisions/001_envelop_encryption.md)
+
+## Alternative Solutions
+
+Other solutions we have explored:
+
+- Separating secrets from CI/CD variables as a separate model with limited access, to avoid unintended exposure of the secret.
+- [Secure Files](../../../ci/secure_files/index.md)
+
+## References
+
+The following links provide additional information that may be relevant to secret management concepts.
+
+- [OWASP Secrets Management Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html)
+- [OWASP Key Management Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Key_Management_Cheat_Sheet.html)
diff --git a/doc/architecture/blueprints/secret_manager/secrets-manager-overview.png b/doc/architecture/blueprints/secret_manager/secrets-manager-overview.png
new file mode 100644
index 00000000000..4e3985cc30e
--- /dev/null
+++ b/doc/architecture/blueprints/secret_manager/secrets-manager-overview.png
Binary files differ
diff --git a/doc/architecture/blueprints/secret_manager/secrets_manager.md b/doc/architecture/blueprints/secret_manager/secrets_manager.md
new file mode 100644
index 00000000000..7e9488243bb
--- /dev/null
+++ b/doc/architecture/blueprints/secret_manager/secrets_manager.md
@@ -0,0 +1,14 @@
+---
+status: proposed
+creation-date: "2023-08-07"
+authors: [ "@alberts-gitlab" ]
+coach: [ "@grzesiek" ]
+approvers: [ "@jocelynjane", "@shampton" ]
+owning-stage: "~devops::verify"
+participating-stages: []
+---
+
+<!-- Blueprints often contain forward-looking statements -->
+<!-- vale gitlab.FutureTense = NO -->
+
+# GitLab Secrets Manager - Implementation Detail (Placeholder)
diff --git a/doc/architecture/blueprints/work_items/index.md b/doc/architecture/blueprints/work_items/index.md
index 6f5b48fffcb..e12bb4d8773 100644
--- a/doc/architecture/blueprints/work_items/index.md
+++ b/doc/architecture/blueprints/work_items/index.md
@@ -66,25 +66,25 @@ All Work Item types share the same pool of predefined widgets and are customized
### Work Item widget types (updating)
-| Widget | Description | feature flag |
-|---|---|---|
-| [WorkItemWidgetAssignees](../../../api/graphql/reference/index.md#workitemwidgetassignees) | List of work item assignees | |
-| [WorkItemWidgetAwardEmoji](../../../api/graphql/reference/index.md#workitemwidgetawardemoji) | Emoji reactions added to work item, including support for upvote/downvote counts | |
-| [WorkItemWidgetCurrentUserTodos](../../../api/graphql/reference/index.md#workitemwidgetcurrentusertodos) | User todo state of work item | |
-| [WorkItemWidgetDescription](../../../api/graphql/reference/index.md#workitemwidgetdescription) | Description of work item, including support for edited state, timestamp, and author | |
-| [WorkItemWidgetHealthStatus](../../../api/graphql/reference/index.md#workitemwidgethealthstatus) | Health status assignment support for work item | |
-| [WorkItemWidgetHierarchy](../../../api/graphql/reference/index.md#workitemwidgethierarchy) | Hierarchy of work items, including support for boolean representing presence of children. **Note:** Hierarchy is currently available only for OKRs. | `okrs_mvc` |
-| [WorkItemWidgetIteration](../../../api/graphql/reference/index.md#workitemwidgetiteration) | Iteration assignment support for work item | |
-| [WorkItemWidgetLabels](../../../api/graphql/reference/index.md#workitemwidgetlabels) | List of labels added to work items, including support for checking whether scoped labels are supported |
-| [WorkItemWidgetLinkedItems](../../../api/graphql/reference/index.md#workitemwidgetlinkeditems) | List of work items added as related to a given work item, with possible relationship types being `relates_to`, `blocks`, and `blocked_by`. Includes support for individual counts of blocked status, blocked by, blocking, and related to. | `linked_work_items` |
-| [WorkItemWidgetMilestone](../../../api/graphql/reference/index.md#workitemwidgetmilestone) | Milestone assignment support for work item | |
-| [WorkItemWidgetNotes](../../../api/graphql/reference/index.md#workitemwidgetnotes) | List of discussions within a work item | |
-| [WorkItemWidgetNotifications](../../../api/graphql/reference/index.md#workitemwidgetnotifications) | Notifications subscription status of a work item for current user | |
-| [WorkItemWidgetProgress](../../../api/graphql/reference/index.md#workitemwidgetprogress) | Progress value of a work item. **Note:** Progress is currently available only for OKRs. | `okrs_mvc` |
-| [WorkItemWidgetStartAndDueDate](../../../api/graphql/reference/index.md#workitemwidgetstartandduedate) | Set start and due dates for a work item | |
-| [WorkItemWidgetStatus](../../../api/graphql/reference/index.md#workitemwidgetstatus) | Status of a work item when type is Requirement, with possible status types being `unverified`, `satisfied`, or `failed` | |
-| [WorkItemWidgetTestReports](../../../api/graphql/reference/index.md#workitemwidgettestreports) | Test reports associated with a work item | |
-| [WorkItemWidgetWeight](../../../api/graphql/reference/index.md#workitemwidgetweight) | Set weight of a work item | |
+| Widget | Description | Feature flag | Write permission | GraphQL Subscription Support |
+|---|---|---|---|---|
+| [WorkItemWidgetAssignees](../../../api/graphql/reference/index.md#workitemwidgetassignees) | List of work item assignees | |`Guest`|Yes|
+| [WorkItemWidgetAwardEmoji](../../../api/graphql/reference/index.md#workitemwidgetawardemoji) | Emoji reactions added to work item, including support for upvote/downvote counts | |Anyone who can view|No|
+| [WorkItemWidgetCurrentUserTodos](../../../api/graphql/reference/index.md#workitemwidgetcurrentusertodos) | User todo state of work item | |Anyone who can view|No|
+| [WorkItemWidgetDescription](../../../api/graphql/reference/index.md#workitemwidgetdescription) | Description of work item, including support for edited state, timestamp, and author | |`Reporter`|No|
+| [WorkItemWidgetHealthStatus](../../../api/graphql/reference/index.md#workitemwidgethealthstatus) | Health status assignment support for work item | |`Reporter`|No|
+| [WorkItemWidgetHierarchy](../../../api/graphql/reference/index.md#workitemwidgethierarchy) | Hierarchy of work items, including support for boolean representing presence of children. **Note:** Hierarchy is currently available only for OKRs. | `okrs_mvc` |`Guest`|No|
+| [WorkItemWidgetIteration](../../../api/graphql/reference/index.md#workitemwidgetiteration) | Iteration assignment support for work item | |`Reporter`|No|
+| [WorkItemWidgetLabels](../../../api/graphql/reference/index.md#workitemwidgetlabels) | List of labels added to work items, including support for checking whether scoped labels are supported | |`Reporter`|Yes|
+| [WorkItemWidgetLinkedItems](../../../api/graphql/reference/index.md#workitemwidgetlinkeditems) | List of work items added as related to a given work item, with possible relationship types being `relates_to`, `blocks`, and `blocked_by`. Includes support for individual counts of blocked status, blocked by, blocking, and related to. | `linked_work_items`|`Guest`|No|
+| [WorkItemWidgetMilestone](../../../api/graphql/reference/index.md#workitemwidgetmilestone) | Milestone assignment support for work item | |`Reporter`|No|
+| [WorkItemWidgetNotes](../../../api/graphql/reference/index.md#workitemwidgetnotes) | List of discussions within a work item | |`Guest`|Yes|
+| [WorkItemWidgetNotifications](../../../api/graphql/reference/index.md#workitemwidgetnotifications) | Notifications subscription status of a work item for current user | |Anyone who can view|No|
+| [WorkItemWidgetProgress](../../../api/graphql/reference/index.md#workitemwidgetprogress) | Progress value of a work item. **Note:** Progress is currently available only for OKRs. | `okrs_mvc` |`Reporter`|No|
+| [WorkItemWidgetStartAndDueDate](../../../api/graphql/reference/index.md#workitemwidgetstartandduedate) | Set start and due dates for a work item | |`Reporter`|No|
+| [WorkItemWidgetStatus](../../../api/graphql/reference/index.md#workitemwidgetstatus) | Status of a work item when type is Requirement, with possible status types being `unverified`, `satisfied`, or `failed` | | |No|
+| [WorkItemWidgetTestReports](../../../api/graphql/reference/index.md#workitemwidgettestreports) | Test reports associated with a work item | | | |
+| [WorkItemWidgetWeight](../../../api/graphql/reference/index.md#workitemwidgetweight) | Set weight of a work item | |`Reporter`|No|
### Work item relationships