
gitlab.com/gitlab-org/gitlab-foss.git
author    GitLab Bot <gitlab-bot@gitlab.com> 2023-11-14 11:41:52 +0300
committer GitLab Bot <gitlab-bot@gitlab.com> 2023-11-14 11:41:52 +0300
commit    585826cb22ecea5998a2c2a4675735c94bdeedac (patch)
tree      5b05f0b30d33cef48963609e8a18a4dff260eab3 /doc/architecture
parent    df221d036e5d0c6c0ee4d55b9c97f481ee05dee8 (diff)
Add latest changes from gitlab-org/gitlab@16-6-stable-eev16.6.0-rc42
Diffstat (limited to 'doc/architecture')
-rw-r--r--  doc/architecture/blueprints/cdot_orders/index.md  265
-rw-r--r--  doc/architecture/blueprints/cells/impacted_features/personal-access-tokens.md  28
-rw-r--r--  doc/architecture/blueprints/cells/index.md  2
-rw-r--r--  doc/architecture/blueprints/ci_pipeline_components/img/catalogs.png  bin 30325 -> 0 bytes
-rw-r--r--  doc/architecture/blueprints/ci_pipeline_components/index.md  59
-rw-r--r--  doc/architecture/blueprints/cloud_connector/decisions/001_lb_entry_point.md  52
-rw-r--r--  doc/architecture/blueprints/cloud_connector/index.md  12
-rw-r--r--  doc/architecture/blueprints/container_registry_metadata_database/index.md  10
-rw-r--r--  doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md  2
-rw-r--r--  doc/architecture/blueprints/email_ingestion/index.md  2
-rw-r--r--  doc/architecture/blueprints/feature_flags_usage_in_dev_and_ops/index.md  285
-rw-r--r--  doc/architecture/blueprints/gitlab_ml_experiments/index.md  67
-rw-r--r--  doc/architecture/blueprints/gitlab_steps/gitlab-ci.md  247
-rw-r--r--  doc/architecture/blueprints/gitlab_steps/index.md  15
-rw-r--r--  doc/architecture/blueprints/gitlab_steps/step-definition.md  368
-rw-r--r--  doc/architecture/blueprints/gitlab_steps/steps-syntactic-sugar.md  66
-rw-r--r--  doc/architecture/blueprints/google_artifact_registry_integration/index.md  2
-rw-r--r--  doc/architecture/blueprints/new_diffs.md  29
-rw-r--r--  doc/architecture/blueprints/observability_logging/diagrams.drawio  1
-rw-r--r--  doc/architecture/blueprints/observability_logging/index.md  632
-rw-r--r--  doc/architecture/blueprints/observability_logging/system_overview.png  bin 0 -> 76330 bytes
-rw-r--r--  doc/architecture/blueprints/organization/diagrams/organization-isolation-broken.drawio.png  bin 0 -> 57795 bytes
-rw-r--r--  doc/architecture/blueprints/organization/diagrams/organization-isolation.drawio.png  bin 0 -> 56021 bytes
-rw-r--r--  doc/architecture/blueprints/organization/index.md  3
-rw-r--r--  doc/architecture/blueprints/organization/isolation.md  152
-rw-r--r--  doc/architecture/blueprints/runner_admission_controller/index.md  97
-rw-r--r--  doc/architecture/blueprints/secret_detection/index.md  124
-rw-r--r--  doc/architecture/blueprints/secret_manager/decisions/002_gcp_kms.md  101
-rw-r--r--  doc/architecture/blueprints/secret_manager/decisions/003_go_service.md  37
-rw-r--r--  doc/architecture/blueprints/secret_manager/decisions/004_staleless_kms.md  49
-rw-r--r--  doc/architecture/blueprints/secret_manager/index.md  18
-rw-r--r--  doc/architecture/blueprints/work_items/index.md  32
32 files changed, 2586 insertions, 171 deletions
diff --git a/doc/architecture/blueprints/cdot_orders/index.md b/doc/architecture/blueprints/cdot_orders/index.md
new file mode 100644
index 00000000000..924a50d2b8a
--- /dev/null
+++ b/doc/architecture/blueprints/cdot_orders/index.md
@@ -0,0 +1,265 @@
+---
+status: proposed
+creation-date: "2023-10-12"
+authors: [ "@tyleramos" ]
+coach: "@fabiopitino"
+approvers: [ "@tgolubeva", "@jameslopez" ]
+owning-stage: "~devops::fulfillment"
+participating-stages: []
+---
+
+# Align CustomersDot Orders with Zuora Orders
+
+## Summary
+
+The [GitLab Customers Portal](https://customers.gitlab.com/) is an application separate from the GitLab product that allows GitLab Customers to manage their account and subscriptions, and to perform tasks like purchasing additional seats. More information about the Customers Portal can be found in [the GitLab docs](../../../subscriptions/customers_portal.md). Internally, the application is known as [CustomersDot](https://gitlab.com/gitlab-org/customers-gitlab-com) (also known as CDot).
+
+GitLab uses [Zuora's platform](https://about.gitlab.com/handbook/business-technology/enterprise-applications/guides/zuora/) to manage their subscription-based services. CustomersDot integrates directly with Zuora Billing and treats [Zuora Billing](https://about.gitlab.com/handbook/finance/accounting/finance-ops/billing-ops/zuora-billing/) as the single source of truth for subscription data.
+
+CustomersDot stores some subscription and order data locally, in the form of the `orders` database table, which at times can be out of sync with Zuora Billing. The main objective for this blueprint is to lay out a plan for improving the integration with Zuora Billing, making it more reliable, accurate, and performant.
+
+## Motivation
+
+Working with the `Order` model in CustomersDot has been a challenge for Fulfillment engineers. It is difficult to trust `Order` data as it can get out of sync with the single source of truth for subscription data, Zuora Billing. This has led to bugs, confusion and delays in feature development. An [epic exists for aligning CustomersDot Orders with Zuora objects](https://gitlab.com/groups/gitlab-org/-/epics/9748) which lists a variety of issues related to these data integrity problems. The motivation of this blueprint is to develop a better data architecture in CustomersDot for Subscriptions and associated data models which builds trust and reduces bugs.
+
+### Goals
+
+This re-architecture project has several multifaceted objectives.
+
+- Increase the accuracy of CustomersDot data pertaining to Subscriptions and their entitlements. This data is stored as `Order` records in CustomersDot - it is not granular enough to represent what the customer has purchased, and it is error-prone as shown by the following issues:
+ - [Multiple order records for the same subscription](https://gitlab.com/gitlab-org/customers-gitlab-com/-/issues/6971)
+ - [Multiple subscriptions active for the same namespace](https://gitlab.com/gitlab-org/customers-gitlab-com/-/issues/6972)
+ - [Support Multiple Active Orders on a Namespace](https://gitlab.com/groups/gitlab-org/-/epics/9486)
+- Continue to align with Zuora Billing being the SSoT for Subscription and Order data.
+- Decrease dependency and reliance on Zuora Billing uptime.
+- Improve CustomersDot performance by storing relevant Subscription data locally and keeping it in sync with Zuora Billing. This could be a key piece to making Seat Link more efficient and reliable.
+- Eliminate confusion between CustomersDot Orders, which contain data more closely resembling a Subscription, and [Zuora Orders](https://knowledgecenter.zuora.com/Zuora_Billing/Manage_subscription_transactions/Orders), which represent a transaction between a customer and merchant and can apply to multiple Subscriptions.
+ - The CustomersDot `orders` table contains a mixture of Zuora Subscription data and trials, along with GitLab-specific metadata like sync timestamps with GitLab.com. GitLab does not store trial subscriptions in Zuora at this time.
+
+## Proposal
+
+As the list of goals above shows, there are a good number of desired outcomes we would like to see at the end of implementation. To reach these goals, we will break this work up into smaller iterations.
+
+1. [Phase one: Zuora Subscription Cache](#phase-one-zuora-subscription-cache)
+
+ The first iteration focuses on adding a local cache for Zuora Subscription objects, including Rate Plans, Rate Plan Charges, and Rate Plan Charge Tiers, in CustomersDot.
+
+1. [Phase two: Utilize Zuora Cache Models](#phase-two-utilize-zuora-cache-models)
+
+ The second phase involves using the Zuora cache models introduced in phase one. Any code in CustomersDot that makes a read request to Zuora for Subscription data should be replaced with an ActiveRecord query. This should result in a big performance improvement.
+
+1. [Phase three: Transition from `Order` to `Subscription`](#phase-three-transition-from-order-to-subscription)
+
+ The next iteration focuses on transitioning away from the CustomersDot `Order` model to a new model for Subscription.
+
+## Design and implementation details
+
+### Phase one: Zuora Subscription Cache
+
+The first phase for this blueprint focuses on adding new models for caching Zuora Subscription data locally in CustomersDot. These local data models will allow CustomersDot to query the local database for Zuora Subscriptions. Currently, this requires querying Zuora directly, which can be problematic if Zuora is experiencing downtime. Zuora also has rate limits for API usage, which we want to avoid hitting as CustomersDot continues to scale.
+
+This phase will consist of creating the new data models, building the mechanisms to keep the local data in sync with Zuora, and backfilling the existing data. It will be important that the local cache models are read-only for most of the application to ensure the data is always in sync. Only the syncing mechanism should have the ability to write to these models.
+
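+As a rough illustration of the read-only constraint, the cache model might refuse writes at the
+model level. This is a minimal sketch, assuming hypothetical class and table names derived from
+the schema below:
+
+```ruby
+module Zuora
+  class Subscription < ApplicationRecord
+    self.table_name = 'zuora_subscriptions'
+    # Use Zuora's own identifier as the primary key (see the notes below).
+    self.primary_key = :zuora_id
+
+    has_many :rate_plans, class_name: 'Zuora::RatePlan', foreign_key: :subscription_id
+
+    # Writes from the application at large raise ActiveRecord::ReadOnlyRecord.
+    # The sync mechanism can still write through bulk helpers such as
+    # `upsert_all`, which skip instantiation and therefore skip this check.
+    def readonly?
+      true
+    end
+  end
+end
+```
+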
+#### Proposed DB schema
+
+```mermaid
+erDiagram
+ "Zuora::Subscription" ||--|{ "Zuora::RatePlan" : "has many"
+ "Zuora::RatePlan" ||--|{ "Zuora::RatePlanCharge" : "has many"
+ "Zuora::RatePlanCharge" ||--|{ "Zuora::RatePlanChargeTier" : "has many"
+
+ "Zuora::Subscription" {
+ string(64) zuora_id PK "`id` field on Zuora Subscription"
+ string(64) account_id
+ string name
+ string(64) previous_subscription_id
+ string status
+ date term_start_date
+ date term_end_date
+ int version
+ boolean auto_renew "null:false default:false"
+ date cancelled_date
+ string(64) created_by_id
+ integer current_term
+ string current_term_period_type
+ string eoa_starter_bronze_offer_accepted__c
+ string external_subscription_id__c
+ string external_subscription_source__c
+ string git_lab_namespace_id__c
+ string git_lab_namespace_name__c
+ integer initial_term
+ string(64) invoice_owner_id
+ string notes
+ string opportunity_id__c
+ string(64) original_id
+ string(64) ramp_id
+ string renewal_subscription__c__c
+ integer renewal_term
+ date subscription_end_date
+ date subscription_start_date
+ string turn_on_auto_renew__c
+ string turn_on_cloud_licensing__c
+ string turn_on_operational_metrics__c
+ string turn_on_seat_reconciliation__c
+ datetime created_date
+ datetime updated_date
+ datetime created_at
+ datetime updated_at
+ }
+
+ "Zuora::RatePlan" {
+ string(64) zuora_id PK "`id` field on Zuora RatePlan"
+ string(64) subscription_id FK
+ string name
+ string(64) product_rate_plan_id
+ datetime created_date
+ datetime updated_date
+ datetime created_at
+ datetime updated_at
+ }
+
+ "Zuora::RatePlanCharge" {
+ string(64) zuora_id PK "`id` field on Zuora RatePlanCharge"
+ string(64) rate_plan_id FK
+ string(64) product_rate_plan_charge_id
+ int quantity
+ date effective_start_date
+ date effective_end_date
+ string price_change_option
+ string charge_number
+ string charge_type
+ boolean is_last_segment "null:false default:false"
+ int segment
+ int mrr
+ int tcv
+ int dmrc
+ int dtcv
+ string(64) subscription_id
+ string(64) subscription_owner_id
+ int version
+ datetime created_date
+ datetime updated_date
+ datetime created_at
+ datetime updated_at
+ }
+
+ "Zuora::RatePlanChargeTier" {
+ string zuora_id PK "`id` field on Zuora RatePlanChargeTier"
+ string rate_plan_charge_id FK
+ string price
+ datetime created_date
+ datetime updated_date
+ datetime created_at
+ datetime updated_at
+ }
+```
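+
+As a sketch of how the leading `Zuora::Subscription` columns might translate into a Rails
+migration (an illustrative subset only; names and types are taken from the diagram above):
+
+```ruby
+class CreateZuoraSubscriptions < ActiveRecord::Migration[7.0]
+  def change
+    # `id: false` so Zuora's identifier, not an auto-generated `id`, is the PK.
+    create_table :zuora_subscriptions, id: false do |t|
+      t.string :zuora_id, limit: 64, primary_key: true
+      t.string :account_id, limit: 64
+      t.string :name
+      t.string :status
+      t.integer :version
+      t.boolean :auto_renew, null: false, default: false
+      t.date :term_start_date
+      t.date :term_end_date
+      t.datetime :created_date
+      t.datetime :updated_date
+      t.timestamps
+    end
+  end
+end
+```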
+
+#### Notes
+
+- The namespace `Zuora` is already taken by the classes used to extend `IronBank` resource classes. It was decided to move these to the namespace `Zuora::Remote` to indicate these are intended to reach out to Zuora. This frees up the `Zuora` namespace to be used to group the models related to Zuora cached data.
+- All versions of Zuora Subscriptions will be stored in this table to be able to support display of current as well as future purchases when Zuora is down. One of the guiding principles from the Architecture Review meeting on 2023-08-06 was "Customers should be able to view and access what they purchased even if Zuora is down". Given that customers can make future-dated purchases, CustomersDot needs to store current and future versions of Subscriptions.
+- `zuora_id` would be the primary key given we want to avoid the field name `id` which is magical in ActiveRecord.
+- The timezone for Zuora Billing is configured as Pacific Time. Let's account for this timezone as we sync data from Zuora into CDot's cached models to allow for more accurate comparisons.
+
+#### Keeping data in sync with Zuora
+
+CDot currently receives and processes `Order Processed` Zuora callouts for Order actions like `Update Product` ([full list](https://gitlab.com/gitlab-org/customers-gitlab-com/-/blob/64c5d17bac38bef1156e9a15008cc7d2b9aa46a9/lib/zuora/order.rb#L26)). These callouts help to keep CustomersDot in sync with Zuora and trigger provisioning events. These callouts will be important to keeping `Zuora::Subscription` and related cached models in sync with changes in Zuora.
+
+These existing callouts would not be sufficient to cover all changes to a Zuora Subscription, though. In particular, changes to custom fields may not be captured by them. We will need to create custom events and callouts for any custom field cached in CustomersDot to ensure CDot stays in sync with Zuora. This only affects `Zuora::Subscription`, as CustomersDot does not use custom fields on any of the other proposed cached resources at this time.
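+
+In rough terms, a callout handler might refresh the cache by re-fetching the subscription from
+Zuora and upserting it. This is a sketch; the service name is hypothetical, and
+`Zuora::Remote::Subscription` refers to the renamed IronBank-backed resource classes mentioned
+in the notes above:
+
+```ruby
+class Zuora::SubscriptionCacheRefreshService
+  def initialize(zuora_subscription_id)
+    @zuora_subscription_id = zuora_subscription_id
+  end
+
+  def execute
+    # Re-read the authoritative record from Zuora.
+    remote = Zuora::Remote::Subscription.find(@zuora_subscription_id)
+
+    # `upsert` bypasses the read-only guard, which only applies to instantiated
+    # records. Each Zuora Subscription version has its own ID, so every version
+    # ends up cached.
+    Zuora::Subscription.upsert(
+      {
+        zuora_id: remote.id,
+        account_id: remote.account_id,
+        name: remote.name,
+        status: remote.status,
+        version: remote.version
+      },
+      unique_by: :zuora_id
+    )
+  end
+end
+```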
+
+#### Rollout of Zuora Cache models
+
+With the first iteration of introducing the cached Zuora data models, we will take an iterative approach to the rollout. There should be no impact to existing functionality as we build out the models, start populating the data through callouts, and backfill these models. Once this is in place, we will iteratively update existing features to use these cached data models instead of querying Zuora directly.
+
+We will make this transition using many small, scoped feature flags, rather than one large feature flag gating all of the new logic that uses these cache models. This will help us deliver more quickly and reduce how long feature flag logic must be maintained and test cases retained.
+
+Testing can be performed before the cached models are used in the codebase to ensure their data integrity.
+
+### Phase two: Utilize Zuora Cache Models
+
+This second phase of the Orders re-architecture focuses on utilizing the new Zuora cache data models introduced in phase one. Querying Zuora for Subscription data is fundamental to CustomersDot, so there are plenty of places that will need to be updated. Wherever CDot reads Subscription data from Zuora, the call can be replaced by a query against the local cache data models instead. This should result in a big performance boost by avoiding third-party requests, particularly in components like the Seat Link Service.
+
+This transition will be completed using many small, scoped feature flags, rather than one large feature flag gating all of the new logic that uses these cache models. This will help us deliver more quickly and reduce how long feature flag logic must be maintained and test cases retained.
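+
+For example, a single read path might be gated behind its own small flag (the flag, method, and
+shape of the remote call are all hypothetical):
+
+```ruby
+def latest_subscription_version(subscription_name)
+  if Feature.enabled?(:read_subscription_from_cache)
+    # New path: query the local cache models.
+    Zuora::Subscription.where(name: subscription_name).order(version: :desc).first
+  else
+    # Existing path: query Zuora directly.
+    Zuora::Remote::Subscription.where(name: subscription_name).max_by(&:version)
+  end
+end
+```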
+
+### Phase three: Transition from `Order` to `Subscription`
+
+The third and final phase for this blueprint focuses on transitioning away from the CustomersDot `Order` model to a new model for `Subscription`. This phase will consist of creating a new model for `Subscription`, supporting both models during the transition period, updating existing code to use `Subscription`, and finally removing the `Order` model once it is no longer needed.
+
+Replacing the `Order` model with a `Subscription` model should address the goal of eliminating confusion around the `Order` model. The data stored in the CustomersDot `Order` model does not correspond to a Zuora Order. It more closely resembles a Zuora Subscription with some additional metadata about syncing with GitLab.com. The transition to a `Subscription` model, along with the local cache layer in phase one, should address the goal of better data accuracy and building trust in CustomersDot data.
+
+#### Proposed DB schema
+
+```mermaid
+erDiagram
+ Subscription ||--|{ "Zuora::Subscription" : "has many"
+
+ Subscription {
+ bigint id PK
+ bigint billing_account_id
+ string(64) zuora_account_id
+ string(64) zuora_subscription_id
+ string zuora_subscription_name
+ string gitlab_namespace_id
+ string gitlab_namespace_name
+ datetime last_extra_ci_minutes_sync_at
+ datetime increased_billing_rate_notified_at
+ boolean reconciliation_accepted "null:false default:false"
+ datetime seat_overage_notified_at
+ datetime auto_renew_error_notified_at
+ date monthly_seat_digest_notified_on
+ datetime created_at
+ datetime updated_at
+ }
+
+ "Zuora::Subscription" {
+ string(64) zuora_id PK "`id` field on Zuora Subscription"
+ string(64) account_id
+ string name
+ }
+```
+
+#### Notes
+
+- The name for this model is up for debate given a `Subscription` model already exists. The existing model could be renamed with the hope of eventually replacing it with the new model.
+- This model serves as a record of the Subscription that is modifiable by the CDot application, whereas the `Zuora::Subscription` table should be read-only.
+- `zuora_account_id` could be added as a convenience but could also be fetched via the `billing_account`.
+- There will be one `Subscription` record per actual subscription instead of one per Subscription version (see the association sketch after this list).
+ - This has the advantage of avoiding duplication of fields like `gitlab_namespace_id` or `last_extra_ci_minutes_sync_at`.
+ - The `zuora_subscription_id` column could be removed or kept as a reference to the latest Zuora Subscription version.
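+
+Given these notes, the association between the two models might hinge on the subscription name,
+which is stable across Zuora Subscription versions (a sketch; attribute names follow the diagram
+above):
+
+```ruby
+class Subscription < ApplicationRecord
+  # All cached Zuora versions of this subscription share its name.
+  has_many :zuora_subscriptions,
+           class_name: 'Zuora::Subscription',
+           primary_key: :zuora_subscription_name,
+           foreign_key: :name
+end
+```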
+
+#### Keeping data in sync with Zuora
+
+The `Subscription` model should stay in sync with Zuora as subscriptions are created or updated. This model will be synced when we sync `Zuora::Subscription` records, similar to how the cached models are synced when processing Zuora callouts as described in phase one. When saving a new version of a `Zuora::Subscription`, an update could be made to the `Subscription` record with the matching `zuora_subscription_name`, or a `Subscription` could be created if one does not exist. The `zuora_subscription_id` would be set to the latest version on typical updates. Most of the data on `Subscription` is GitLab metadata (e.g. `last_extra_ci_minutes_sync_at`), so it wouldn't need to be updated.
+
+The exceptions to this update rule are the `zuora_account_id` and `billing_account_id` attributes. Let's consider the current behavior when processing an `Order Processed` callout in CDot if the `zuora_account_id` changes for a Zuora Subscription:
+
+1. The Billing Account Membership is updated to the new Billing Account for the CDot `Customer` matching the Sold To email address.
+1. CDot attempts to find the CDot `Order` with the new `billing_account_id` and `subscription_name`.
+1. If an `Order` isn't found matching this criteria, a new `Order` is created. This leads to two `Order` records for the same Zuora Subscription.
+
+This scenario should be avoided for the new `Subscription` model. One `Subscription` should exist for a unique `Zuora::Subscription` name. If the Zuora Subscription transfers Accounts, the `Subscription` should as well.
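+
+A sync step honoring this rule might look as follows (a sketch; `billing_account_for` is a
+hypothetical helper resolving the Billing Account from the Zuora Account):
+
+```ruby
+def sync_subscription!(zuora_subscription)
+  # Exactly one Subscription per unique Zuora Subscription name.
+  subscription = Subscription.find_or_initialize_by(
+    zuora_subscription_name: zuora_subscription.name
+  )
+
+  subscription.update!(
+    zuora_subscription_id: zuora_subscription.zuora_id, # latest version
+    zuora_account_id: zuora_subscription.account_id,    # follows account transfers
+    billing_account_id: billing_account_for(zuora_subscription).id
+  )
+end
+```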
+
+#### Unknowns
+
+Several unknowns are outlined below. As we get further into implementation, these unknowns should become clearer.
+
+##### Trial data in Subscription?
+
+The CDot `Order` model contains paid subscription data as well as trials. For `Subscription`, we could choose to continue to have paid subscription and trial data together in the same table, or break them into their own models.
+
+The `orders` table has fields for `customer_id` and `trial` which only really concern trials. Should these fields be added to the `Subscription` table? Should `Subscription` contain trial information if it doesn't exist in Zuora?
+
+If trial orders were broken out into their own table, these are the columns likely needed for a (SaaS) `trials` table:
+
+- `customer_id`
+- `product_rate_plan_id` (or rename to `plan_id` or use `plan_code`)
+- `quantity`
+- `start_date`
+- `end_date`
+- `gl_namespace_id`
+- `gl_namespace_name`
+
+### Resources
+
+- [FY24Q3 OKR - Create plan to align CustomersDot Orders to Zuora Orders](https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/3378)
+- [Epic &9748 - Align CustomersDot Orders to Zuora objects](https://gitlab.com/groups/gitlab-org/-/epics/9748)
diff --git a/doc/architecture/blueprints/cells/impacted_features/personal-access-tokens.md b/doc/architecture/blueprints/cells/impacted_features/personal-access-tokens.md
index 3aca9f1e116..a493a1c4395 100644
--- a/doc/architecture/blueprints/cells/impacted_features/personal-access-tokens.md
+++ b/doc/architecture/blueprints/cells/impacted_features/personal-access-tokens.md
@@ -17,13 +17,37 @@ we can document the reasons for not choosing this approach.
## 1. Definition
-Personal Access Tokens associated with a User are a way for Users to interact with the API of GitLab to perform operations.
-Personal Access Tokens today are scoped to the User, and can access all Groups that a User has access to.
+Personal Access Tokens (PATs) associated with a User are a way for Users to interact with the API of GitLab to perform operations.
+PATs today are scoped to the User, and can access all Groups that a User has access to.
## 2. Data flow
## 3. Proposal
+### 3.1. Organization-scoped PATs
+
+Pros:
+
+- Can be managed entirely from Rails application.
+- Increased security. PAT is limited only to Organization.
+
+Cons:
+
+- Different PAT needed for different Organizations.
+- Cannot tell at a glance if PAT will apply to a certain Project/Namespace.
+
+### 3.2. Cluster-wide PATs
+
+Pros:
+
+- User does not have to worry about which scope the PAT applies to.
+
+Cons:
+
+- User has to worry about wide-ranging scope of PAT (e.g. separation of personal items versus work items).
+- Organization cannot limit scope of PAT to only their Organization.
+- Increases complexity. All cluster-wide data likely will be moved to a separate [data access layer](../../cells/index.md#1-data-access-layer).
+
## 4. Evaluation
## 4.1. Pros
diff --git a/doc/architecture/blueprints/cells/index.md b/doc/architecture/blueprints/cells/index.md
index 1366d308487..c9a03830a4a 100644
--- a/doc/architecture/blueprints/cells/index.md
+++ b/doc/architecture/blueprints/cells/index.md
@@ -338,6 +338,7 @@ Below is a list of known affected features with preliminary proposed solutions.
- [Cells: Global Search](impacted_features/global-search.md)
- [Cells: GraphQL](impacted_features/graphql.md)
- [Cells: Organizations](impacted_features/organizations.md)
+- [Cells: Personal Access Tokens](impacted_features/personal-access-tokens.md)
- [Cells: Personal Namespaces](impacted_features/personal-namespaces.md)
- [Cells: Secrets](impacted_features/secrets.md)
- [Cells: Snippets](impacted_features/snippets.md)
@@ -354,7 +355,6 @@ The following list of impacted features only represents placeholders that still
- [Cells: Group Transfer](impacted_features/group-transfer.md)
- [Cells: Issues](impacted_features/issues.md)
- [Cells: Merge Requests](impacted_features/merge-requests.md)
-- [Cells: Personal Access Tokens](impacted_features/personal-access-tokens.md)
- [Cells: Project Transfer](impacted_features/project-transfer.md)
- [Cells: Router Endpoints Classification](impacted_features/router-endpoints-classification.md)
- [Cells: Schema changes (Postgres and Elasticsearch migrations)](impacted_features/schema-changes.md)
diff --git a/doc/architecture/blueprints/ci_pipeline_components/img/catalogs.png b/doc/architecture/blueprints/ci_pipeline_components/img/catalogs.png
deleted file mode 100644
index 8c83aede186..00000000000
--- a/doc/architecture/blueprints/ci_pipeline_components/img/catalogs.png
+++ /dev/null
Binary files differ
diff --git a/doc/architecture/blueprints/ci_pipeline_components/index.md b/doc/architecture/blueprints/ci_pipeline_components/index.md
index 46b8f361949..9fdbf8cb70b 100644
--- a/doc/architecture/blueprints/ci_pipeline_components/index.md
+++ b/doc/architecture/blueprints/ci_pipeline_components/index.md
@@ -105,6 +105,7 @@ identifying abstract concepts and are subject to changes as we refine the design
allows components to be pinned to a specific revision.
- **Step** is a type of component that contains a collection of instructions for job execution.
- **Template** is a type of component that contains a snippet of CI/CD configuration that can be [included](../../../ci/yaml/includes.md) in a project's pipeline configuration.
+- **Publishing** is the act of listing a version of the resource (for example, a project release) on the Catalog.
## Definition of pipeline component
@@ -524,17 +525,26 @@ spec:
The CI Catalog is an index of resources that users can leverage in CI/CD. It initially
contains a list of components repositories that users can discover and use in their pipelines.
+The user sees only resources based on their permissions and project visibility level.
+Unauthenticated users will only see public resources.
+
+Project admins are responsible for setting the project to private or public.
+The CI Catalog should not provide security functionalities like preventing projects from appearing in the Community Catalog.
+If the project is public, it's visible to the world anyway.
+
+The Catalog page can provide different filters to refine the user's search, including
+predefined filters such as resources from groups the user is a member of.
In the future, the Catalog could contain also other types of resources (for example:
-integrations, project templates, etc.).
+integrations, project templates, container images, etc.).
To list a components repository in the Catalog we need to mark the project as being a
-catalog resource. We do that initially with an API endpoint, similar to changing a project setting.
+catalog resource. We do that initially with a project setting.
-Once a project is marked as a "catalog resource" it can be displayed in the Catalog.
+Once a project is marked as a "catalog resource" it can eventually be displayed in the Catalog.
-We could create a database record when the API endpoint is used and remove the record when
-the same is disabled/removed.
+We could create a database record when the setting is enabled and modify the record's state when
+the same is disabled.
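+
+For instance, flipping the setting might map onto a catalog resource record like this (a sketch;
+the model and state names are assumptions):
+
+```ruby
+def set_catalog_resource(project, enabled:)
+  if enabled
+    # Create (or revive) the record when the setting is turned on.
+    Ci::Catalog::Resource.find_or_create_by!(project: project)
+  else
+    # Keep the record but change its state when the setting is turned off.
+    project.catalog_resource&.update!(state: :unlisted)
+  end
+end
+```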
## Catalog resource
@@ -552,9 +562,6 @@ Other properties of a catalog resource:
- indicators of popularity (stars, forks).
- categorization: user should select a category and or define search tags
-As soon as a components repository is marked as being a "catalog resource"
-we should be seeing the resource listed in the Catalog.
-
Initially for the resource, the project may not have any released tags.
Users would be able to use the components repository by specifying a branch name or
commit SHA for the version. However, these types of version qualifiers should not
@@ -564,10 +571,14 @@ be listed in the catalog resource's page for various reasons:
- Branches and tags may not be meaningful for the end-user.
- Branches and tags don't communicate versioning thoroughly.
+To list a catalog resource in the Catalog we first need to create a release for
+the project.
+
## Releasing new resource versions to the Catalog
-The versions that should be displayed for the resource should be the project [releases](../../../user/project/releases/index.md).
-Creating project releases is an official act of versioning a resource.
+The versions that will be published for the resource should be the project
+[releases](../../../user/project/releases/index.md). Creating project releases is an official
+act of versioning a resource.
A resource page would have:
@@ -599,29 +610,6 @@ For example: index the content of `spec:` section for CI components.
See an [example of development workflow](dev_workflow.md) for a components repository.
-## Availability of CI catalog as a feature
-
-We plan to introduce 2 features of CI catalog as separate views:
-
-1. **Namespace Catalog (GitLab Ultimate):** allows organizations to share and discover catalog resources
- created inside the top-level namespace.
- Users will be able to access the Namespace Catalog from a project or subgroup inside the top-level
- namespace.
-1. **Community Catalog (GitLab free):** allows anyone in a GitLab instance to share and discover catalog
- resources. The Community Catalog presents only resources/projects that are public.
-
-If a resource in a Namespace Catalog is made public (changing the project's visibility) the resource is
-available in both Namespace Catalog (because it comes from there) as well as the Community Catalog
-(because it's public).
-
-![Namespace and Community Catalogs](img/catalogs.png)
-
-There is only 1 CI catalog. The Namespace and Community Catalogs are different views of the CI catalog.
-
-Project admins are responsible for setting the project private or public.
-The CI Catalog should not provide security functionalities like prevent projects from appearing in the Community Catalog.
-If the project is public it's visible to the world anyway.
-
## Note about future resource types
In the future, to support multiple types of resources in the Catalog we could
@@ -673,6 +661,8 @@ metadata:
## Iterations
+The first plan of iterations consisted of:
+
1. Experimentation phase
- Build an MVC behind a feature flag with `namespace` actor.
- Enable the feature flag only for `gitlab-com` and `gitlab-org` namespaces to initiate the dogfooding.
@@ -691,6 +681,9 @@ metadata:
components from GitLab.com or from repository exports.
- Iterate on feedback.
+In October 2023, after releasing the namespace view (previously called the private catalog view) as an Experiment, we changed
+focus, moving away from 2 separate views (namespace view and global view) and combining the UX into a single global view.
+
## Limits
Any MVC that exposes a feature should be added with limitations from the beginning.
diff --git a/doc/architecture/blueprints/cloud_connector/decisions/001_lb_entry_point.md b/doc/architecture/blueprints/cloud_connector/decisions/001_lb_entry_point.md
new file mode 100644
index 00000000000..d49b702be94
--- /dev/null
+++ b/doc/architecture/blueprints/cloud_connector/decisions/001_lb_entry_point.md
@@ -0,0 +1,52 @@
+---
+owning-stage: "~devops::data stores"
+description: 'Cloud Connector ADR 001: Use load balancer as single entry point'
+---
+
+# Cloud Connector ADR 001: Load balancer as single entry point
+
+## Context
+
+The original iteration of the blueprint suggested standing up a dedicated Cloud Connector edge service,
+through which all traffic that uses features under the Cloud Connector umbrella would pass.
+
+The primary reasons we wanted this to be a dedicated service were to:
+
+1. **Provide a single entry point for customers.** We identified the ability for any GitLab instance
+ around the world to consume Cloud Connector features through a single endpoint such as
+ `cloud.gitlab.com` as a must-have property.
+1. **Have the ability to execute custom logic.** There was a desire from product to create a space where we can
+ run cross-cutting business logic such as application-level rate limiting, which is hard or impossible to
+ do using a traditional load balancer such as HAProxy.
+
+## Decision
+
+We decided to take a smaller incremental step toward having a "smart router" by focusing on
+the ability to provide a single endpoint through which Cloud Connector traffic enters our
+infrastructure. This can be accomplished using simpler means than deploying dedicated services, specifically
+by pulling in a load balancing layer listening at `cloud.gitlab.com` that can also perform simple routing
+tasks to forward traffic into feature backends.
+
+Our reasons for this decision were:
+
+1. **Unclear requirements for custom logic to run.** We are still exploring how and to what extent we would
+ apply rate limiting logic at the Cloud Connector level. This is being explored in
+ [issue 429592](https://gitlab.com/gitlab-org/gitlab/-/issues/429592). Because we need to have a single
+ entry point by January, and because we think we will not be ready by then to implement such logic at the
+ Cloud Connector level, a web service is not required yet.
+1. **New use cases found that are not suitable to run through a dedicated service.** We started to work with
+ the Observability group to see how we can bring the GitLab Observability Backend (GOB) to Cloud Connector
+ customers in [MR 131577](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/131577).
+ In this discussion it became clear that due to the large amounts of traffic and data volume passing
+ through GOB each day, putting another service in front of this stack does not provide a sensible
+ risk/benefit trade-off. Instead, we will probably split traffic and make Cloud Connector components
+ available through other means for special cases like these (for example, through a Cloud Connector library).
+
+We are exploring several options for load-balancing this new endpoint in [issue 429818](https://gitlab.com/gitlab-org/gitlab/-/issues/429818)
+and are working with the `Infrastructure:Foundations` team to deploy this in [issue 24711](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24711).
+
+## Consequences
+
+We have not yet discarded the plan to build a smart router eventually, either as a service or
+through other means, but have delayed this decision in face of uncertainty at both a product
+and technical level. We will reassess how to proceed in Q1 2024.
diff --git a/doc/architecture/blueprints/cloud_connector/index.md b/doc/architecture/blueprints/cloud_connector/index.md
index 840e17a438a..9aef8bc7a98 100644
--- a/doc/architecture/blueprints/cloud_connector/index.md
+++ b/doc/architecture/blueprints/cloud_connector/index.md
@@ -68,7 +68,7 @@ Introducing a dedicated edge service for Cloud Connector serves the following go
we do not currently support.
- **Independently scalable.** For reasons of fault tolerance and scalability, it is beneficial to have all SM traffic go
through a separate service. For example, if an excess of unexpected requests arrive from SM instances due to a bug
- in a milestone release, this traffic could be absorbed at the CC gateway level without cascading downstream, thus leaving
+ in a milestone release, this traffic could be absorbed at the CC gateway level without cascading further, thus leaving
SaaS users unaffected.
### Non-goals
@@ -82,6 +82,10 @@ Introducing a dedicated edge service for Cloud Connector serves the following go
other systems using public key cryptographic checks. We may move some of the code around that currently implements this,
however.
+## Decisions
+
+- [ADR-001: Use load balancer as single entry point](decisions/001_lb_entry_point.md)
+
## Proposal
We propose to make two major changes to the current architecture:
@@ -133,7 +137,7 @@ The new service would be made available at `cloud.gitlab.com` and act as a "smar
It will have the following responsibilities:
1. **Request handling.** The service will make decisions about whether a particular request is handled
- in the service itself or forwarded to a downstream service. For example, a request to `/ai/code_suggestions/completions`
+ in the service itself or forwarded to other backends. For example, a request to `/ai/code_suggestions/completions`
could be handled by forwarding this request to an appropriate endpoint in the AI gateway unchanged, while a request
to `/-/metrics` could be handled by the service itself. As mentioned in [non-goals](#non-goals), the latter would not
include domain logic as it pertains to an end user feature, but rather cross-cutting logic such as telemetry, or
@@ -141,14 +145,14 @@ It will have the following responsibilities:
When handling requests, the service should be unopinionated about which protocol is used, to the extent possible.
Reasons for injecting custom logic could be setting additional HTTP header fields. A design principle should be
- to not require CC service deployments if a downstream service merely changes request payload or endpoint definitions. However,
+ to not require CC service deployments if a backend service merely changes request payload or endpoint definitions. However,
supporting more protocols on top of HTTP may require adding support in the CC service itself.
1. **Authentication/authorization.** The service will be the first point of contact for authenticating clients and verifying
they are authorized to use a particular CC feature. This will include fetching and caching public keys served from GitLab SaaS
and CustomersDot to decode JWT access tokens sent by GitLab instances, including matching token scopes to feature endpoints
to ensure an instance is eligible to consume this feature. This functionality will largely be lifted out of the AI gateway
where it currently lives. To maintain a ZeroTrust environment, the service will implement a more lightweight auth/z protocol
- with internal services downstream that merely performs general authenticity checks but forgoes billing and permission
+ with internal backends that merely performs general authenticity checks but forgoes billing and permission
related scoping checks. What this protocol will look like is to be decided, and might be further explored in
[Discussion: Standardized Authentication and Authorization between internal services and GitLab Rails](https://gitlab.com/gitlab-org/gitlab/-/issues/421983).
1. **Organization-level rate limits.** It is to be decided if this is needed, but there could be value in having application-level rate limits
diff --git a/doc/architecture/blueprints/container_registry_metadata_database/index.md b/doc/architecture/blueprints/container_registry_metadata_database/index.md
index 243270afdb2..c9f7f1c0d27 100644
--- a/doc/architecture/blueprints/container_registry_metadata_database/index.md
+++ b/doc/architecture/blueprints/container_registry_metadata_database/index.md
@@ -30,7 +30,7 @@ graph LR
R -- Write/read metadata --> B
```
-Client applications (for example, GitLab Rails and Docker CLI) interact with the Container Registry through its [HTTP API](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md). The most common operations are pushing and pulling images to/from the registry, which require a series of HTTP requests in a specific order. The request flow for these operations is detailed in the [Request flow](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs-gitlab/push-pull-request-flow.md).
+Client applications (for example, GitLab Rails and Docker CLI) interact with the Container Registry through its [HTTP API](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/gitlab/api.md). The most common operations are pushing and pulling images to/from the registry, which require a series of HTTP requests in a specific order. The request flow for these operations is detailed in the [Request flow](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/push-pull-request-flow.md).
The registry supports multiple [storage backends](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/configuration.md#storage), including Google Cloud Storage (GCS) which is used for the GitLab.com registry. In the storage backend, images are stored as blobs, deduplicated, and shared across repositories. These are then linked (like a symlink) to each repository that relies on them, giving them access to the central storage location.
@@ -69,7 +69,7 @@ Please refer to the [Docker documentation](https://docs.docker.com/registry/spec
##### Push and Pull
-Push and pull commands are used to upload and download images, more precisely manifests and blobs. The push/pull flow is described in the [documentation](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs-gitlab/push-pull-request-flow.md).
+Push and pull commands are used to upload and download images, more precisely manifests and blobs. The push/pull flow is described in the [documentation](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/push-pull-request-flow.md).
#### GitLab Rails
@@ -86,7 +86,7 @@ The single entrypoint for the registry is the [HTTP API](https://gitlab.com/gitl
| [Check if manifest exists](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#existing-manifests) | **{check-circle}** Yes | **{dotted-circle}** No | Used to get the digest of a manifest by tag. This is then used to pull the manifest and show the tag details in the UI. |
| [Pull manifest](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#pulling-an-image-manifest) | **{check-circle}** Yes | **{dotted-circle}** No | Used to show the image size and the manifest digest in the tag details UI. |
| [Pull blob](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#pulling-a-layer) | **{check-circle}** Yes | **{dotted-circle}** No | Used to show the configuration digest and the creation date in the tag details UI. |
-| [Delete tag](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#deleting-a-tag) | **{check-circle}** Yes | **{check-circle}** Yes | Used to delete a tag from the UI and in background (cleanup policies). |
+| [Delete tag](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#delete-tag) | **{check-circle}** Yes | **{check-circle}** Yes | Used to delete a tag from the UI and in background (cleanup policies). |
A valid authentication token is generated in GitLab Rails and embedded in all these requests before sending them to the registry.
@@ -154,7 +154,7 @@ Following the GitLab [Go standards and style guidelines](../../../development/go
The design and development of the registry database adhere to the GitLab [database guidelines](../../../development/database/index.md). Being a Go application, the required tooling to support the database will have to be developed, such as for running database migrations.
-Running *online* and [*post deployment*](../../../development/database/post_deployment_migrations.md) migrations is already supported by the registry CLI, as described in the [documentation](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs-gitlab/database-migrations.md).
+Running *online* and [*post deployment*](../../../development/database/post_deployment_migrations.md) migrations is already supported by the registry CLI, as described in the [documentation](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/database-migrations.md).
#### Partitioning
@@ -224,7 +224,7 @@ This is a list of all the registry HTTP API operations and how they depend on th
| [Check API version](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#api-version-check) | `GET` | `/v2/` | **{dotted-circle}** No | **{dotted-circle}** No | **{check-circle}** Yes |
| [List repositories](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#listing-repositories) | `GET` | `/v2/_catalog` | **{check-circle}** Yes | **{dotted-circle}** No | **{dotted-circle}** No |
| [List repository tags](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#listing-image-tags) | `GET` | `/v2/<name>/tags/list` | **{check-circle}** Yes | **{dotted-circle}** No | **{check-circle}** Yes |
-| [Delete tag](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#deleting-a-tag) | `DELETE` | `/v2/<name>/tags/reference/<reference>` | **{check-circle}** Yes | **{dotted-circle}** No | **{check-circle}** Yes |
+| [Delete tag](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#delete-tag) | `DELETE` | `/v2/<name>/manifests/<reference>` | **{check-circle}** Yes | **{dotted-circle}** No | **{check-circle}** Yes |
| [Check if manifest exists](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#existing-manifests) | `HEAD` | `/v2/<name>/manifests/<reference>` | **{check-circle}** Yes | **{dotted-circle}** No | **{check-circle}** Yes |
| [Pull manifest](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#pulling-an-image-manifest) | `GET` | `/v2/<name>/manifests/<reference>` | **{check-circle}** Yes | **{dotted-circle}** No | **{check-circle}** Yes |
| [Push manifest](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#pushing-an-image-manifest) | `PUT` | `/v2/<name>/manifests/<reference>` | **{check-circle}** Yes | **{dotted-circle}** No | **{dotted-circle}** No |
diff --git a/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md b/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md
index 84a95e3e7c3..d91f2fdddbf 100644
--- a/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md
+++ b/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md
@@ -160,7 +160,7 @@ import which would lead to greater consistency across all storage driver impleme
### The Import Tool
-The [import tool](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs-gitlab/database-import-tool.md)
+The [import tool](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/database-import-tool.md)
is a well-validated component of the Container Registry project that we have used
from the beginning as a way to perform local testing. This tool is a thin wrapper
over the core import functionality — the code which handles the import logic has
diff --git a/doc/architecture/blueprints/email_ingestion/index.md b/doc/architecture/blueprints/email_ingestion/index.md
index 9579a903133..59086aed86a 100644
--- a/doc/architecture/blueprints/email_ingestion/index.md
+++ b/doc/architecture/blueprints/email_ingestion/index.md
@@ -36,7 +36,7 @@ The current implementation lacks scalability and requires significant infrastruc
Because we are using a fork of the `mail_room` gem ([`gitlab-mail_room`](https://gitlab.com/gitlab-org/ruby/gems/gitlab-mail_room)), which contains some GitLab specific features that won't be ported upstream, we have a notable maintenance overhead.
-The [Service Desk Single-Engineer-Group (SEG)](https://about.gitlab.com/handbook/engineering/incubation/service-desk/) started work on [customizable email addresses for Service Desk](https://gitlab.com/gitlab-org/gitlab/-/issues/329990) and [released the first iteration in beta in `16.4`](https://about.gitlab.com/releases/2023/09/22/gitlab-16-4-released/#custom-email-address-for-service-desk). As a [MVC we introduced a `Forwarding & SMTP` mode](https://gitlab.com/gitlab-org/gitlab/-/issues/329990#note_1201344150) where administrators set up email forwarding from their custom email address to the projects' `incoming_mail` email address. They also provide SMTP credentials so GitLab can send emails from the custom email address on their behalf. We don't need any additional email ingestion other than the existing mechanics for this approach to work.
+The [Service Desk Single-Engineer-Group (SEG)](https://about.gitlab.com/handbook/engineering/development/incubation/service-desk/) started work on [customizable email addresses for Service Desk](https://gitlab.com/gitlab-org/gitlab/-/issues/329990) and [released the first iteration in beta in `16.4`](https://about.gitlab.com/releases/2023/09/22/gitlab-16-4-released/#custom-email-address-for-service-desk). As a [MVC we introduced a `Forwarding & SMTP` mode](https://gitlab.com/gitlab-org/gitlab/-/issues/329990#note_1201344150) where administrators set up email forwarding from their custom email address to the projects' `incoming_mail` email address. They also provide SMTP credentials so GitLab can send emails from the custom email address on their behalf. We don't need any additional email ingestion other than the existing mechanics for this approach to work.
As a second iteration we'd like to add Microsoft Graph support for custom email addresses for Service Desk as well. Therefore we need a way to ingest more than the two system-defined addresses. We will explore a solution path for Microsoft Graph support where privileged users can connect a custom email account and we can [receive messages via a Microsoft Graph webhook (`Outlook message`)](https://learn.microsoft.com/en-us/graph/webhooks#supported-resources). GitLab would need a public endpoint to receive updates on emails. That might not work for Self-managed instances, so we'll need direct email ingestion for Microsoft customers as well. But using the webhook approach could improve performance and efficiency for GitLab SaaS where we potentially have thousands of mailboxes to poll.
diff --git a/doc/architecture/blueprints/feature_flags_usage_in_dev_and_ops/index.md b/doc/architecture/blueprints/feature_flags_usage_in_dev_and_ops/index.md
new file mode 100644
index 00000000000..ad6dd755607
--- /dev/null
+++ b/doc/architecture/blueprints/feature_flags_usage_in_dev_and_ops/index.md
@@ -0,0 +1,285 @@
+---
+status: proposed
+creation-date: "2023-11-01"
+authors: [ "@rymai" ]
+coach: "@DylanGriffith"
+approvers: []
+owning-stage: "~devops::non_devops"
+participating-stages: []
+---
+
+# Feature Flags usage in GitLab development and operations
+
+This blueprint builds upon [the Development Feature Flags Architecture blueprint](../feature_flags_development/index.md).
+
+## Summary
+
+Feature flags are critical both in developing and operating GitLab, but in the current state
+of the process, they can lead to production issues, and introduce a lot of manual and maintenance work.
+
+The goal of this blueprint is to make the process safer, more maintainable, lightweight, automated, and transparent.
+
+## Motivations
+
+### Feature flag use-cases
+
+Feature flags can be used for different purposes:
+
+- De-risking GitLab.com deployments (most feature flags): Allows us to quickly enable/disable
+ a feature flag in production in the event of a production incident.
+- Work-in-progress feature: Some features are complex and need to be implemented through several MRs. Until they're fully implemented, they need
+ to be hidden from everyone. In that case, the feature flag allows all the changes to be merged to the main branch without actually exposing
+ the feature yet.
+- Beta features: We might
+ [not be confident we'll be able to scale, support, and maintain a feature](https://about.gitlab.com/handbook/product/gitlab-the-product/#experiment-beta-ga)
+ in its current form for every designed use case ([example](https://gitlab.com/gitlab-org/gitlab/-/issues/336070#note_1523983444)).
+ There are also scenarios where a feature is not complete enough to be considered an MVC.
+ Providing a flag in this case allows engineers and customers to disable the new feature until it's performant enough.
+- Operations: Site reliability or Support engineers can use these flags to
+ disable potentially resource-heavy features in order to bring the instance back to a
+ more stable and available state. Another example is SaaS-only features.
+- Experiment: A/B testing on GitLab.com.
+- Worker (special `ops` feature flag): Used for controlling Sidekiq worker behavior, such as deferring Sidekiq jobs.
+
+We need to better categorize our feature flags.
+
+### Production incidents related to feature flags
+
+Feature flags have caused production incidents on GitLab.com ([1](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5289), [2](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4155), [3](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16366)).
+
+We need to prevent this for the sake of GitLab.com stability.
+
+### Technical debt caused by feature flags
+
+Feature flags are also becoming an ever-growing source of technical debt: there are currently
+[591 feature flags in the GitLab codebase](../../../user/feature_flags.md).
+
+We need to reduce the feature flag count for the sake of the long-term maintainability and quality of the GitLab codebase.
+
+## Goal
+
+The goal of this blueprint is to improve the feature flag process by making it:
+
+- safer
+- more maintainable
+- more lightweight & automated
+- more transparent
+
+## Challenges
+
+### Complex feature flag rollout process
+
+The feature flag rollout process is currently:
+
+- Complex: Rollout issues are very manual and include a lot of checkboxes
+ (including non-relevant ones).
+ Engineers often don't use these issues, which tend to become stale and forgotten over time.
+- Not very transparent: Feature flag changes are logged in several places far from the rollout
+ issue, which makes it hard to understand the latest feature flag state.
+- Far from production processes: Rollout issues are created in the `gitlab-org/gitlab` project
+ (far from the production issue tracker).
+- There is no consistent path to rolling out feature flags: we leave it to the judgement of the
+ engineer to trade off between speed and safety. There should be a standardized set of rollout
+ steps.
+
+### Technical debt and codebase complexity
+
+[The challenges from the Development Feature Flags Architecture blueprint still stand](../feature_flags_development/index.md#challenges).
+
+Additionally, there are new challenges:
+
+- If a feature flag is enabled by default, and is disabled in an on-premise installation,
+ then when the feature flag is removed, the feature suddenly becomes enabled on the
+ on-premise instance and cannot be rolled back to the previous behavior.
+
+### Multiple sources of truth for feature flag default states and observability
+
+We currently show the feature flag default states in several places, for different intended audiences:
+
+**GitLab customers**
+
+- [User documentation](../../../user/feature_flags.md):
+ List all feature flags and their metadata so that GitLab customers can tweak feature flags on
+ their instance. Also useful for GitLab.com users that want to check the default state of a feature flag.
+
+**Site reliability and Delivery engineers**
+
+- [Internal GitLab.com feature flag state change issues](https://gitlab.com/gitlab-com/gl-infra/feature-flag-log/-/issues):
+ For each change of a feature flag state on GitLab.com, an issue is created in this project.
+- [Internal GitLab.com feature flag state change logs](https://nonprod-log.gitlab.net):
+ Filter logs with `source: feature` and `env: gprd` to see feature flag state change logs.
+
+**GitLab Engineering & Infra/Quality Directors / VPs, and CTO**
+
+- [Internal Sisense dashboard](https://app.periscopedata.com/app/gitlab/792066/Engineering-::-Feature-Flags):
+ Feature flag metrics over time, grouped per DevOps groups.
+
+**GitLab Engineering and Product managers**
+
+- ["Feature flags requiring attention" monthly reports](https://gitlab.com/gitlab-org/quality/triage-reports/-/issues/?sort=created_date&state=opened&search=Feature%20flags&in=TITLE&assignee_id=None&first_page_size=100):
+ Same data as the above Internal Sisense dashboard but for a specific DevOps
+ group, presented in an issue and assigned to the group's Engineering managers.
+
+**Anyone who wants to check feature flag default states**
+
+- [Unofficial feature flags dashboard](https://samdbeckham.gitlab.io/feature-flags/):
+ A user-friendly dashboard which provides useful filtering.
+
+This leads to confusion for almost all feature flag stakeholders (Development engineers, Engineering managers, Site reliability engineers, and Delivery engineers).
+
+## Proposal
+
+### Improve feature flags implementation and usage
+
+- [Reduce the likelihood of mis-configuration and human-error at the implementation step](https://gitlab.com/groups/gitlab-org/-/epics/11553)
+ - Remove the "percentage of time" strategy in favor of "percentage of actors"
+- [Improve the feature flag development documentation](https://gitlab.com/groups/gitlab-org/-/epics/5324)
+
+### Introduce new feature flag `type`s
+
+It's clear that the `development` feature flag type actually includes several use-cases:
+
+- GitLab.com deployment de-risking. YAML value: `gitlab_com_derisk`.
+- Work-in-progress feature. YAML value: `wip`. Once the feature is complete, the feature flag type can be changed to `beta`
+ if there are still some doubts about the scalability of the feature.
+- Beta features. YAML value: `beta`.
+
+Notes:
+
+- These new types replace the broad `development` type, which shouldn't be used going forward.
+- Backward compatibility will be kept until there are no `development` feature flags left in the codebase.
+
+### Introduce constraints per feature flag type
+
+Each feature flag type will be assigned specific constraints regarding:
+
+- Allowed values for the `default_enabled` attribute
+- Maximum Lifespan (MLS): the duration starting at the introduction of the feature flag (that is, when it's merged into `master`).
+  We don't start the lifespan at the global GitLab.com enablement (or `default_enabled: true` when
+  applicable) so that there's an incentive to roll out and delete feature flags as quickly as possible.
+
+The MLS will be enforced through automation, reporting & regular review meetings at the section level.
+
+Following are the constraints for each feature flag type (an example definition follows the list):
+
+- `gitlab_com_derisk`
+ - `default_enabled` **must not** be set to `true`. This kind of feature flag is meant to lower the risk on GitLab.com, thus
+ there's no need to keep the flag in the codebase after it's been enabled on GitLab.com.
+ **`default_enabled: true` will not have any effect for this type of feature flag.**
+ - Maximum Lifespan: 2 months.
+ - Additional note: This type of feature flag won't be documented in the [All feature flags in GitLab](../../../user/feature_flags.md)
+ page given they're short-lived and deployment-related.
+- `wip`
+ - `default_enabled` **must not** be set to `true`. If needed, this type can be changed to `beta` once the feature is complete.
+ - Maximum Lifespan: 4 months.
+- `beta`
+ - `default_enabled` can be set to `true` so that a feature can be "released" to everyone in Beta with the possibility to disable
+ it in the case of scalability issues (ideally it should only be disabled for this reason on specific on-premise installations).
+ - Maximum Lifespan: 6 months.
+- `ops`
+ - `default_enabled` can be set to `true`.
+ - Maximum Lifespan: Unlimited.
+ - Additional note: Remember that using this type should follow a conscious decision not to introduce an instance setting.
+- `experiment`
+ - `default_enabled` **must not** be set to `true`.
+ - Maximum Lifespan: 6 months.
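+
+For illustration, here is a hypothetical `wip` feature flag definition consistent with the
+constraints above (all field values are invented):
+
+```yaml
+---
+name: my_wip_feature
+introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/123456
+rollout_issue_url: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/12345
+milestone: '16.6'
+type: wip
+group: group::pipeline execution
+# `default_enabled: true` is not allowed for the `wip` type, so the attribute is
+# omitted and the flag stays disabled by default.
+```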
+
+### Introduce a new `feature_issue_url` field
+
+Keeping the URL to the original feature issue will allow automated cross-linking from the rollout
+and logging issues. The new field for this information is `feature_issue_url`.
+
+For instance:
+
+```yaml
+---
+name: auto_devops_banner_disabled
+feature_issue_url: https://gitlab.com/gitlab-org/gitlab/-/issues/12345
+introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/678910
+rollout_issue_url: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/9876
+milestone: '16.5'
+type: gitlab_com_derisk
+group: group::pipeline execution
+```
+
+```yaml
+---
+name: ai_mr_creation
+feature_issue_url: https://gitlab.com/gitlab-org/gitlab/-/issues/12345
+introduced_by_url: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/14218
+rollout_issue_url: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/83652
+milestone: '16.3'
+type: beta
+group: group::code review
+default_enabled: true
+```
+
+### Streamline the feature flag rollout process
+
+1. (Process) Transition to **create rollout issues in the
+ [Production issue tracker](https://gitlab.com/gitlab-com/gl-infra/production/-/issues)** and adapt the
+ template to be closer to the
+ [Change management issue template](https://gitlab.com/gitlab-com/gl-infra/production/-/blob/master/.gitlab/issue_templates/change_management.md)
+   (see [this issue](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2780) for inspiration).
+   That way, the rollout issue would only concern the actual production changes (that is, enablement or disablement
+   of the flag on production) and should be closed as soon as the production change is confirmed to work as expected.
+1. (Automation) Automate most rollout steps, such as:
+ - (Done) [Let the author know that their feature has been deployed to staging / canary / production environments](https://gitlab.com/gitlab-org/quality/triage-ops/-/issues/1403)
+ - (Done) [Cross-link actual feature flag state change (from Chatops project) to rollout issues](https://gitlab.com/gitlab-org/gitlab/-/issues/290770)
+ - (Done) [Let the author know that their `default_enabled: true` MR has been deployed to production and that the feature flag can be removed from production](https://gitlab.com/gitlab-org/quality/triage-ops/-/merge_requests/2482)
+   - Automate the creation of rollout issues when a feature flag is first introduced in a merge request,
+     and provide a diff suggestion to fill the `rollout_issue_url` field (Danger)
+ - Check and enforce feature flag definition constraints in merge requests (Danger)
+ - Provide a diff suggestion to correct the `milestone` field when it's not the same value as
+ the MR milestone (Danger)
+ - Upon feature flag state change, notify on Slack the group responsible for it (chatops)
+ - 7 days before the Maximum Lifespan of a feature flag is reached, automatically create a "cleanup MR" with the group label set, and
+ assigned to the feature flag author (if they're still with GitLab). We could take advantage of the [automation of repetitive developer tasks](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/134487)
+ - Enforce Maximum Lifespan of feature flags through automated reporting & regular review at the section level
+1. (Documentation/process) Ensure the rollout DRI stays online for a few hours after enabling a feature flag (ideally they'd enable the flag at the
+ beginning of their day) in case of any issue with the feature flag
+1. (Process) Provide a standardized set of rollout steps. Trade-offs to consider include:
+ - Likelihood of errors occurring
+   - Total actors (users / requests / projects / groups) affected by the feature flag rollout.
+     For example, even a 1% rollout is severe if it prevents 100,000 users from logging in.
+ - How long to wait between each step. Some feature flags only need to wait 10 minutes per step, some
+ flags should wait 24 hours. Ideally there should be automation to actively verify there
+ is no adverse effect for each step.
+
+### Provide a better SSOT for feature flag default states, current states, and state changes on GitLab.com
+
+**GitLab customers**
+
+- [User documentation](../../../user/feature_flags.md):
+ Keep the current page but add filtering and sorting, similarly to the
+ [unofficial feature flags dashboard](https://samdbeckham.gitlab.io/feature-flags/).
+
+**Site reliability and Delivery engineers**
+
+We [assessed the usefulness of feature flag state change logging strategies](https://gitlab.com/gitlab-org/quality/engineering-productivity/team/-/issues/309)
+and it appears that both
+[internal GitLab.com feature flag state change issues](https://gitlab.com/gitlab-com/gl-infra/feature-flag-log/-/issues)
+and [internal GitLab.com feature flag state change logs](https://nonprod-log.gitlab.net) are useful for different
+audiences.
+
+**GitLab Engineering & Infra/Quality Directors / VPs, and CTO**
+
+- [Internal Sisense dashboard](https://app.periscopedata.com/app/gitlab/792066/Engineering-::-Feature-Flags):
+ Streamline the current dashboard to be more useful for its stakeholders.
+
+**GitLab Engineering and Product managers**
+
+- ["Feature flags requiring attention" monthly reports](https://gitlab.com/gitlab-org/quality/triage-reports/-/issues/?sort=created_date&state=opened&search=Feature%20flags&in=TITLE&assignee_id=None&first_page_size=100):
+  Make the current reports more actionable by linking to automatically created MRs for removing feature flags, as well as by improving documentation and best practices around feature flags.
+
+## Iterations
+
+This work is being done as part of a dedicated epic:
+[Improve internal usage of Feature Flags](https://gitlab.com/groups/gitlab-org/-/epics/3551).
+This epic describes the meta reasons for making these changes.
+
+## Resources
+
+- [What Are Feature Flags?](https://launchdarkly.com/blog/what-are-feature-flags/#:~:text=Feature%20flags%20are%20a%20software,portions%20of%20code%20are%20executed)
+- [Feature Flags Best Practices](https://featureflags.io/feature-flags-best-practices/)
+- [Short-lived or Long-lived Flags? Explaining Feature Flag lifespans](https://configcat.com/blog/2022/07/08/how-long-should-you-keep-feature-flags/)
diff --git a/doc/architecture/blueprints/gitlab_ml_experiments/index.md b/doc/architecture/blueprints/gitlab_ml_experiments/index.md
index e0675bb5be6..b9830778902 100644
--- a/doc/architecture/blueprints/gitlab_ml_experiments/index.md
+++ b/doc/architecture/blueprints/gitlab_ml_experiments/index.md
@@ -120,51 +120,46 @@ However, Service-Integration will establish certain necessary and optional requi
###### Ease of Use, Ownership Requirements
-1. <a name="R100">`R100`</a>: Required: the platform should be easy to use: imagine Heroku with [GitLab Production Readiness-approved](https://about.gitlab.com/handbook/engineering/infrastructure/production/readiness/) defaults.
-1. <a name="R110">`R110`</a>: Required: with the exception of an Infrastructure-led onboarding process, services are owned, deployed and managed by stage-group teams. In other words,services follow a "You Build It, You Run It" model of ownership.
-1. <a name="R120">`R120`</a>: Required: programming-language agnostic: no requirements for services. Services should be packaged as container images.
-1. <a name="R130">`R130`</a>: Recommended: Each service should be evaluated against the GitLab.com [Service Maturity Model](https://about.gitlab.com/handbook/engineering/infrastructure/service-maturity-model/).
-1. <a name="R140">`R140`</a>: Recommended: services using the platform have expedited production-readiness processes.
- 1. Production-readiness requirements graded by service maturity: low-traffic, low-maturity experimental services will have lower requirement thresholds than more mature services.
- 1. By default, the platform should provide services with defaults that would pass production-readiness review for the lowest service maturity-level.
- 1. At introduction, lowest maturity services can be deployed without production readiness, provided the meet certain automatically validated requirements. This removes Infrastructure gate-keeping from being a blocker to experimental service delivery.
+| ID | Required | Detail | Epic/Issue | Done? |
+|---|---|---|---|---|
+| `R100` | Required | The platform should be easy to use: imagine Heroku with [GitLab Production Readiness-approved](https://about.gitlab.com/handbook/engineering/infrastructure/production/readiness/) defaults. | [Runway to [BETA] : Increased Adoption and Self Service](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1115) | **{dotted-circle}** No |
+| `R110` | Required | With the exception of an Infrastructure-led onboarding process, services are owned, deployed and managed by stage-group teams. In other words, services follow a "You Build It, You Run It" model of ownership. | [[Paused] Discussion: Tiered Support Model for Runway](https://gitlab.com/gitlab-com/gl-infra/platform/runway/team/-/issues/97) | **{dotted-circle}** No |
+| `R120` | Required | Programming-language agnostic: no requirements for services. Services should be packaged as container images. | [Runway to [BETA] : Increased Adoption and Self Service](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1115) | **{dotted-circle}** No |
+| `R130` | Recommended | Each service should be evaluated against the GitLab.com [Service Maturity Model](https://about.gitlab.com/handbook/engineering/infrastructure/service-maturity-model/). | [Discussion: Introduce an 'Infrastructure Well-Architected Service Framework'](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2537) | **{dotted-circle}** No |
+| `R140` | Recommended | Services using the platform have expedited production-readiness processes. {::nomarkdown}<ol><li>Production-readiness requirements graded by service maturity: low-traffic, low-maturity experimental services will have lower requirement thresholds than more mature services.</li><li>By default, the platform should provide services with defaults that would pass production-readiness review for the lowest service maturity-level.</li><li>At introduction, lowest maturity services can be deployed without production readiness, provided they meet certain automatically validated requirements. This removes Infrastructure gate-keeping from being a blocker to experimental service delivery.</li></ol>{:/} | | |
###### Observability Requirements
-1. <a name="R200">`R200`</a>: Required: the platform must provide SLIs for services out-of-the-box.
- 1. While it is recommended that services expose internal metrics, it is not mandatory. The platform will provide monitoring from the load-balancer. This is to speed up deployment by removing barriers to experimentation.
- 1. For services that provide internal metrics scrape endpoints, the platform must be configurable to collect these.
- 1. The platform must provide generic load-balancer level SLIs for all services. Service owners must be able to select from constructing SLIs from internal application metrics, the platform-provided external SLIs, or a combination of both.
-1. <a name="R210">`R210`</a>: Required: Observability dashboards, rules, alerts (with per-term routing) must be generated from a manifest.
-1. <a name="R220">`R220`</a>:Required: standardized logging infrastructure.
- 1. Mandate that all logging emitted from services must be Structured JSON. Text logs are permitted but not recommended.
- 1. See [Common Service Libraries](#common-service-libraries) for more details of building common SDKs for observability.
+| ID | Required | Detail | Epic/Issue | Done? |
+|---|---|---|---|---|
+| `R200` | Required | The platform must provide SLIs for services out-of-the-box.{::nomarkdown}<ol><li>While it is recommended that services expose internal metrics, it is not mandatory. The platform will provide monitoring from the load-balancer. This is to speed up deployment by removing barriers to experimentation.</li><li>For services that provide internal metrics scrape endpoints, the platform must be configurable to collect these.</li><li>The platform must provide generic load-balancer level SLIs for all services. Service owners must be able to select from constructing SLIs from internal application metrics, the platform-provided external SLIs, or a combination of both.</li></ol>{:/} | [Observability: Default Metrics](https://gitlab.com/gitlab-com/gl-infra/platform/runway/team/-/issues/72), [Observability: Custom Metrics](https://gitlab.com/gitlab-com/gl-infra/platform/runway/team/-/issues/67) | **{check-circle}** Yes |
+| `R210` | Required | Observability dashboards, rules, alerts (with per-term routing) must be generated from a manifest. | [Observability: Metrics Catalog](https://gitlab.com/gitlab-com/gl-infra/platform/runway/team/-/issues/74) | **{check-circle}** Yes |
+| `R220` | Required | Standardized logging infrastructure.{::nomarkdown}<ol><li>Mandate that all logging emitted from services must be Structured JSON. Text logs are permitted but not recommended.</li><li>See <a href="#common-service-libraries">Common Service Libraries</a> for more details of building common SDKs for observability.</li></ol>{:/} | [Observability: Logs in Elasticsearch for model-gateway](https://gitlab.com/gitlab-com/gl-infra/platform/runway/team/-/issues/75), [Observability: Runway logs available to users](https://gitlab.com/gitlab-com/gl-infra/platform/runway/team/-/issues/84) | |
###### Deployment Requirements
-1. <a name="R300">`R300`</a>: Required: No secrets stored in CI/CD.
- 1. Authentication with Cloud Provider Resources should be exclusively via OIDC, managed as part of the platform.
- 1. Secrets should be stored in the Infrastructure-provided Hashicorp Vault for the environment and passed to applications through files or environment variables.
- 1. Generation and management of service account tokens should be done declaratively, without manual interaction.
-1. <a name="R310">`R310`</a>: Required: multiple environment should be supported, eg Staging and Production.
-1. <a name="R320">`R320`</a>: Required: the platform should be cost-effective. Kubernetes clusters should support multiple services and teams.
-1. <a name="R330">`R330`</a>: Recommended: gradual rollouts, rollbacks, blue-green deployments.
-1. <a name="R340">`R340`</a>: Required: services should be isolated from one another.
-1. <a name="R350">`R350`</a>: Recommended: services should have the ability to specify node characteristic requirements (eg, GPU).
-1. <a name="R360">`R360`</a>: Required: Developers should not need knowledge of Helm, Kubernetes, Prometheus in order to deploy. All required values are configured and validated in project-hosted manifest before generating Kubernetes manifests, Prometheus rules, etc.
-1. <a name="R370">`R370`</a>: Initially services should be synchronous only - using REST or GRPC requests.
- 1. This does not however preclude long-running HTTP(s) requests, for example long-polling or Websocket requests.
-1. <a name="R390">`R390`</a>: Each service hosted in its own GitLab repository with deployment manifest stored in the repository.
- 1. Continuous deployments that are initiated from the CI pipeline of the corresponding GitLab repository.
+| ID | Required | Detail | Epic/Issue | Done? |
+|---|---|---|---|---|
+| `R300` | Required | No secrets stored in CI/CD. {::nomarkdown}<ol><li>Authentication with Cloud Provider Resources should be exclusively via OIDC, managed as part of the platform (see the sketch after this table).</li><li>Secrets should be stored in the Infrastructure-provided HashiCorp Vault for the environment and passed to applications through files or environment variables.</li><li>Generation and management of service account tokens should be done declaratively, without manual interaction.</li></ol>{:/} | [Secrets Management](https://gitlab.com/gitlab-com/gl-infra/platform/runway/team/-/issues/52) | **{dotted-circle}** No |
+| `R310` | Required | Multiple environments should be supported, for example Staging and Production. | | **{check-circle}** Yes |
+| `R320` | Required | The platform should be cost-effective. Kubernetes clusters should support multiple services and teams. | | |
+| `R330` | Recommended | Gradual rollouts, rollbacks, blue-green deployments. | | |
+| `R340` | Required | Services should be isolated from one another. | | |
+| `R350` | Recommended | Services should have the ability to specify node characteristic requirements (for example, GPU). | | |
+| `R360` | Required | Developers should not need knowledge of Helm, Kubernetes, Prometheus in order to deploy. All required values are configured and validated in a project-hosted manifest before generating Kubernetes manifests, Prometheus rules, etc. | | |
+| `R370` | | Initially services should be synchronous only, using REST or gRPC requests.{::nomarkdown}<ol><li>This does not however preclude long-running HTTP(s) requests, for example long-polling or Websocket requests.</li></ol>{:/} | | |
+| `R390` | | Each service hosted in its own GitLab repository with deployment manifest stored in the repository. {::nomarkdown}<ol><li>Continuous deployments that are initiated from the CI pipeline of the corresponding GitLab repository.</li></ol>{:/} | | |
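+
+As an illustration of the OIDC requirement in `R300` above, the following sketch shows how a
+GitLab CI job could authenticate to Vault without storing long-lived secrets in CI/CD variables,
+using the `id_tokens:` and `secrets:` keywords. The audience, Vault path, and variable names are
+assumptions for illustration; Runway's actual mechanism may differ.
+
+```yaml
+deploy:
+  # Request a short-lived OIDC token from GitLab instead of storing a
+  # cloud-provider or Vault credential in CI/CD variables.
+  id_tokens:
+    VAULT_ID_TOKEN:
+      aud: https://vault.example.com  # hypothetical audience
+  # Exchange the OIDC token for a secret read from a hypothetical Vault path.
+  secrets:
+    DATABASE_PASSWORD:
+      vault: production/db/password@secrets
+      token: $VAULT_ID_TOKEN
+  script:
+    - ./deploy.sh
+```
+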
##### Security Requirements
-1. <a name="R400">`R400`</a>: stateful services deployed on the platform that utilize their own stateful storage (for example, custom deployed Postgres instance), must not store application security tokens, cloud-provider service keys or other long-lived security tokens in their stateful stores.
-1. <a name="R410">`R410`</a>: long-lived shared secrets are discouraged, and should be referenced in the service manifest as such, to allow for accounting and monitoring.
-1. <a name="R420">`R420`</a>: services using long-lived shared secrets should ensure that secret rotation can take place without downtime.
- 1. During a rotation, old and new generations of secrets should pass authentication, allowing gradual roll-out of new secrets.
+| ID | Required | Detail | Epic/Issue | Done? |
+|---|---|---|---|---|
+| `R400` | | Stateful services deployed on the platform that utilize their own stateful storage (for example, custom deployed Postgres instance), must not store application security tokens, cloud-provider service keys or other long-lived security tokens in their stateful stores. | | |
+| `R410` | | Long-lived shared secrets are discouraged, and should be referenced in the service manifest as such, to allow for accounting and monitoring. | | |
+| `R420` | | Services using long-lived shared secrets should ensure that secret rotation can take place without downtime. {::nomarkdown}<ol><li>During a rotation, old and new generations of secrets should pass authentication, allowing gradual roll-out of new secrets.</li></ol>{:/} | | |
##### Common Service Libraries
-1. <a name="R500">`R500`</a>: Experimental services would be strongly encouraged to adopt and use [LabKit](https://gitlab.com/gitlab-org/labkit) (for Go services), or [LabKit-Ruby](https://gitlab.com/gitlab-org/ruby/gems/labkit-ruby) for observability, context, correlation, FIPs verification, etc.
- 1. At present, there is no LabKit-Python library, but some experiments will run in Python, so building a library to providing observability, context, correlation services in Python will be required.
+| ID | Required | Detail | Epic/Issue | Done? |
+|---|---|---|---|---|
+| `R500` | Required | Experimental services would be strongly encouraged to adopt and use [LabKit](https://gitlab.com/gitlab-org/labkit) (for Go services), or [LabKit-Ruby](https://gitlab.com/gitlab-org/ruby/gems/labkit-ruby) for observability, context, correlation, FIPS verification, etc. {::nomarkdown}<ol><li>At present, there is no LabKit-Python library, but some experiments will run in Python, so building a library to provide observability, context, and correlation services in Python will be required.</li></ol>{:/} | | |
diff --git a/doc/architecture/blueprints/gitlab_steps/gitlab-ci.md b/doc/architecture/blueprints/gitlab_steps/gitlab-ci.md
new file mode 100644
index 00000000000..8f97c307b37
--- /dev/null
+++ b/doc/architecture/blueprints/gitlab_steps/gitlab-ci.md
@@ -0,0 +1,247 @@
+---
+owning-stage: "~devops::verify"
+description: Usage of the [GitLab Steps](index.md) with [`.gitlab-ci.yml`](../../../ci/yaml/index.md).
+---
+
+# Usage of the [GitLab Steps](index.md) with [`.gitlab-ci.yml`](../../../ci/yaml/index.md)
+
+This document describes how [GitLab Steps](index.md) are integrated into the `.gitlab-ci.yml`.
+
+GitLab Steps will be integrated using a three-stage execution cycle
+and replace `before_script:`, `script:` and `after_script:`.
+
+- `setup:`: Execution stage responsible for provisioning the environment,
+  including cloning the repository, restoring artifacts, or installing all dependencies.
+  This stage will replace the implicit cloning, artifact restoring, and cache download.
+- `run:`: Execution stage responsible for running a test, build,
+ or any other main command required by that job.
+- `teardown:`: Execution stage responsible for cleaning the environment,
+ uploading artifacts, or storing cache. This stage will replace implicit
+ artifacts and cache uploads.
+
+Before we can achieve three-stage execution we will ship minimal initial support
+that does not require any prior GitLab integration.
+
+## Phase 1: Initial support
+
+Initially the Step Runner will be used externally, without any prior dependencies
+on GitLab:
+
+- The `step-runner` will be provided as part of a container image.
+- The `step-runner` will be explicitly run in the `script:` section.
+- The `$STEPS` environment variable will be executed as [`type: steps`](step-definition.md#the-steps-step-type).
+
+```yaml
+hello-world:
+ image: registry.gitlab.com/gitlab-org/step-runner
+ variables:
+ STEPS: |
+ - step: gitlab.com/josephburnett/component-hello-steppy@master
+ inputs:
+ greeting: "hello world"
+ script:
+ - /step-runner ci
+```
+
+## Phase 2: The addition of `run:` to `.gitlab-ci.yml`
+
+In Phase 2 we will add `run:` as a first class way to use GitLab Steps:
+
+- `run:` will use a [`type: steps`](step-definition.md#the-steps-step-type) syntax.
+- `run:` will replace usage of `before_script`, `script` and `after_script`.
+- All existing functions to support Git cloning, artifacts, and cache would continue to be supported.
+- It is yet to be defined how we would support `after_script`, which is executed unconditionally
+ or when the job is canceled.
+- `run:` will not be allowed to be combined with `before_script:`, `script:` or `after_script:`.
+- GitLab Rails would not parse `run:`, instead it would only perform static validation
+ with a JSON schema provided by the Step Runner.
+
+```yaml
+hello-world:
+ image: registry.gitlab.com/gitlab-org/step-runner
+ run:
+ - step: gitlab.com/josephburnett/component-hello-steppy@master
+ inputs:
+ greeting: "hello world"
+```
+
+The following example would **fail** syntax validation:
+
+```yaml
+hello-world:
+ image: registry.gitlab.com/gitlab-org/step-runner
+ run:
+ - step: gitlab.com/josephburnett/component-hello-steppy@master
+ inputs:
+ greeting: "hello world"
+  script: echo "This is an ambiguous and invalid example"
+```
+
+### Transitioning from `before_script:`, `script:` and `after_script:`
+
+GitLab Rails would automatically convert the `*script:` syntax into relevant `run:` specification:
+
+- Today `before_script:` and `script:` are joined together as a single script for execution.
+- The `after_script:` section is always executed in a separate context, representing a separate step to be executed.
+- It is yet to be defined how we would retain the existing behavior of `after_script`, which is always executed
+ regardless of the job status or timeout, and uses a separate timeout.
+- We would retain all implicit behavior which defines all environment variables when translating `script:`
+ into step-based execution.
+
+For example, this CI/CD configuration:
+
+```yaml
+hello-world:
+ before_script:
+ - echo "Run before_script"
+ script:
+ - echo "Run script"
+ after_script:
+ - echo "Run after_script"
+```
+
+Could be translated into this equivalent specification:
+
+```yaml
+hello-world:
+ run:
+ - step: gitlab.com/gitlab-org/components/steps/legacy/script@v1.0
+ inputs:
+ script:
+ - echo "Run before_script"
+ - echo "Run script"
+ - step: gitlab.com/gitlab-org/components/steps/legacy/script@v1.0
+ inputs:
+ script:
+ - echo "Run after_script"
+ when: always
+```
+
+## Phase 3: The addition of `setup:` and `teardown:` to `.gitlab-ci.yml`
+
+The addition of `setup:` and `teardown:` will replace the implicit functions
+provided by GitLab Runner: Git clone, artifacts and cache handling:
+
+- The usage of `setup:` would stop GitLab Runner from implicitly cloning the repository.
+- `artifacts:` and `cache:`, when specified, would be translated and appended to `setup:` and `teardown:`
+ to provide backward compatibility for the old syntax.
+- `release:`, when specified, would be translated and appended to `teardown:`
+ to provide backward compatibility for the old syntax.
+- `setup:` and `teardown:` could be used in `default:` to simplify support
+ of common workflows like where the repository is cloned, or how the artifacts are handled.
+- The split into three-stage execution additionally improves composability of steps with `extends:`.
+- The `hooks:pre_get_sources_script` would be implemented similarly to [`script:`](#transitioning-from-before_script-script-and-after_script)
+  and be prepended to `setup:`.
+
+For example, this CI/CD configuration:
+
+```yaml
+rspec:
+ script:
+ - echo "This job uses a cache."
+ artifacts:
+ paths: [binaries/, .config]
+ cache:
+ key: binaries-cache
+ paths: [binaries/*.apk, .config]
+```
+
+Could be translated into this equivalent specification executed by a step runner:
+
+```yaml
+rspec:
+ setup:
+ - step: gitlab.com/gitlab-org/components/git/clone@v1.0
+ - step: gitlab.com/gitlab-org/components/artifacts/download@v1.0
+ - step: gitlab.com/gitlab-org/components/cache/restore@v1.0
+ inputs:
+ key: binaries-cache
+ run:
+ - step: gitlab.com/gitlab-org/components/steps/legacy/script@v1.0
+ inputs:
+ script:
+ - echo "This job uses a cache."
+ teardown:
+ - step: gitlab.com/gitlab-org/components/artifacts/upload@v1.0
+ inputs:
+ paths: [binaries/, .config]
+    - step: gitlab.com/gitlab-org/components/cache/save@v1.0
+ inputs:
+ key: binaries-cache
+ paths: [binaries/*.apk, .config]
+```
+
+### Inheriting common operations with `default:`
+
+`setup:` and `teardown:` are likely to become very verbose over time. One way to simplify them
+is to allow inheriting the common `setup:` and `teardown:` operations
+with `default:`.
+
+The previous example could be simplified to:
+
+```yaml
+default:
+ setup:
+ - step: gitlab.com/gitlab-org/components/git/clone@v1.0
+ - step: gitlab.com/gitlab-org/components/artifacts/download@v1.0
+ - step: gitlab.com/gitlab-org/components/cache/restore@v1.0
+ inputs:
+ key: binaries-cache
+ teardown:
+ - step: gitlab.com/gitlab-org/components/artifacts/upload@v1.0
+ inputs:
+ paths: [binaries/, .config]
+    - step: gitlab.com/gitlab-org/components/cache/save@v1.0
+ inputs:
+ key: binaries-cache
+ paths: [binaries/*.apk, .config]
+
+rspec:
+ run:
+ - step: gitlab.com/gitlab-org/components/steps/legacy/script@v1.0
+ inputs:
+ script:
+ - echo "This job uses a cache."
+
+linter:
+ run:
+ - step: gitlab.com/gitlab-org/components/steps/legacy/script@v1.0
+ inputs:
+ script:
+ - echo "Run linting"
+```
+
+### Parallel jobs and `setup:`
+
+With the introduction of `setup:`, at some point in the future we will introduce
+an efficient way to parallelize jobs:
+
+- `setup:` would define all steps required to provision the environment.
+- The result of `setup:` would be snapshotted and distributed as the base
+  for all parallel jobs, if `parallel: N` is used.
+- The `run:` and `teardown:` would be run on top of the cloned job, and all its services.
+- The runner would control and intelligently distribute all parallel
+  jobs, significantly cutting the resource requirements for fixed
+  parts of the job (Git clone, artifacts, installing dependencies).
+
+```yaml
+rspec-parallel:
+ image: ruby:3.2
+ services: [postgres, redis]
+ parallel: 10
+ setup:
+ - step: gitlab.com/gitlab-org/components/git/clone@v1.0
+ - step: gitlab.com/gitlab-org/components/artifacts/download@v1.0
+ inputs:
+ jobs: [setup-all]
+ - script: bundle install --without production
+ run:
+ - script: bundle exec knapsack
+```
+
+Potential GitLab Runner flow:
+
+1. Runner receives the `rspec-parallel` job with `setup:` and `parallel:` configured.
+1. Runner executes a job on top of a Kubernetes cluster using block volumes, up to the end of the `setup:` stage.
+1. Runner then runs 10 parallel jobs in Kubernetes, overlaying the block volume from step 2,
+   and continues execution of `run:` and `teardown:`.
diff --git a/doc/architecture/blueprints/gitlab_steps/index.md b/doc/architecture/blueprints/gitlab_steps/index.md
index 74c9ba1498d..5e3becfec19 100644
--- a/doc/architecture/blueprints/gitlab_steps/index.md
+++ b/doc/architecture/blueprints/gitlab_steps/index.md
@@ -33,12 +33,12 @@ shows a need for a better way to define CI job execution.
## Motivation
-Even though the current [`.gitlab-ci.yml`](../../../ci/yaml/gitlab_ci_yaml.md) is reasonably flexible, it easily becomes very
+Even though the current [`.gitlab-ci.yml`](../../../ci/index.md#the-gitlab-ciyml-file) is reasonably flexible, it easily becomes very
complex when trying to support complex workflows. This complexity is represented
with repetitive patterns, a purpose-specific syntax, or a complex sequence of commands
to execute.
-This is particularly challenging, because the [`.gitlab-ci.yml`](../../../ci/yaml/gitlab_ci_yaml.md)
+This is particularly challenging, because the [`.gitlab-ci.yml`](../../../ci/index.md#the-gitlab-ciyml-file)
is inflexible on more complex workflows that require fine-tuning or special behavior
for the CI job execution. Its prescriptive approach to how Git cloning is handled,
when artifacts are downloaded, or how the shell script is executed quite often
@@ -46,7 +46,7 @@ results in the need to work around the system for pipelines that are not "standa
or when new features are requested.
This proves especially challenging when trying to add a new syntax to the
-[`.gitlab-ci.yml`](../../../ci/yaml/gitlab_ci_yaml.md)
+[`.gitlab-ci.yml`](../../../ci/index.md#the-gitlab-ciyml-file)
to support a specific feature, like [`secure files`](../../../ci/secure_files/index.md)
or `release:` keyword. Adding these special features on a syntax level
results in a more complex config, which is harder to maintain, and more complex
@@ -131,7 +131,14 @@ TBD
## Proposal
-TBD
+### GitLab Steps definition and syntax
+
+- [Step Definition](step-definition.md).
+- [Syntactic Sugar extensions](steps-syntactic-sugar.md).
+
+### Integration of GitLab Steps in `.gitlab-ci.yml`
+
+- [Usage of the GitLab Steps with `.gitlab-ci.yml`](gitlab-ci.md).
## Design and implementation details
diff --git a/doc/architecture/blueprints/gitlab_steps/step-definition.md b/doc/architecture/blueprints/gitlab_steps/step-definition.md
new file mode 100644
index 00000000000..08ca1ab7c31
--- /dev/null
+++ b/doc/architecture/blueprints/gitlab_steps/step-definition.md
@@ -0,0 +1,368 @@
+---
+owning-stage: "~devops::verify"
+description: The Step Definition for [GitLab Steps](index.md).
+---
+
+# The Step definition
+
+A step is the minimum executable unit that a user can provide and is defined in a `step.yml` file.
+
+The following step definition describes the minimal syntax supported.
+The syntax is extended with [syntactic sugar](steps-syntactic-sugar.md).
+
+A step definition consists of two documents. The purpose of the document split is
+to distinguish between the declaration and implementation:
+
+1. [Specification / Declaration](#step-specification):
+
+   Provides the specification which describes step inputs and outputs,
+   as well as any other metadata that might be needed by the step in the future (license, author, etc.).
+ In programming language terms, this is similar to a function declaration with arguments and return values.
+
+1. [Implementation](#step-implementation):
+
+   The implementation part of the document describes how to execute the step, including how the environment
+   has to be configured and how the actions are run.
+
+## Example step that prints a message to stdout
+
+In the following step example:
+
+1. The declaration specifies that the step accepts a single input named `message`.
+ The `message` is a required argument that needs to be provided when running the step
+ because it does not define `default:`.
+1. The implementation section specifies that the step is of type `exec`. When run, the step
+ will execute an `echo` command with a single argument (the `message` value).
+
+```yaml
+# .gitlab/ci/steps/exec-echo.yaml
+spec:
+ inputs:
+ message:
+---
+type: exec
+exec:
+ command: [echo, "${{inputs.message}}"]
+```
+
+## Step specification
+
+The step specification currently only defines inputs and outputs:
+
+- Inputs:
+ - Can be required or optional.
+ - Have a name and can have a description.
+ - Can contain a list of accepted options. Options limit what value can be provided for the input.
+ - Can define matching regexp. The matching regexp limits what value can be provided for the input.
+ - Can be expanded with the usage of syntax `${{ inputs.input_name }}`.
+- All **input values** can be accessed when `type: exec` is used,
+  by decoding the `$STEP_JSON` file, which provides information about the execution context.
+- Outputs:
+ - Have a name and can have a description.
+ - Can be set by writing to a special [dotenv](https://github.com/bkeepers/dotenv) file named:
+ `$OUTPUT_FILE` with a format of `output_name=VALUE` per output.
+
+For example:
+
+```yaml
+spec:
+ inputs:
+ message_with_default:
+ default: "Hello World"
+ message_that_is_required:
+ description: "This description explains that the input is required, because it does not specify a default:"
+ type_with_limited_options:
+ options: [bash, powershell, detect]
+ type_with_default_and_limited_options:
+ default: bash
+ options: [bash, powershell, detect]
+ description: "Since the options are provided, the default: needs to be one of the options"
+ version_with_matching_regexp:
+ match: ^v\d+\.\d+$
+ description: "The match pattern only allows values similar to `v1.2`"
+ outputs:
+ code_coverage:
+ description: "Measured code coverage that was calculated as part of the step"
+---
+type: steps
+steps:
+ - step: ./bash-script.yaml
+ inputs:
+ script: "echo Code Coverage = 95.4% >> $OUTPUT_FILE"
+```
+
+## Step Implementation
+
+The step definition can use the following types to implement the step:
+
+- `type: exec`: Run a binary command, using STDOUT/STDERR for tracing the executed process.
+- `type: steps`: Run a sequence of steps.
+- `type: parallel` (Planned): Run all steps in parallel, waiting for all of them to finish.
+- `type: grpc` (Planned): Run a binary command but use gRPC for intra-process communication.
+- `type: container` (Planned): Run a nested Step Runner in a container image of choice,
+ transferring all execution flow.
+
+### The `exec` step type
+
+The ability to run binary commands is one of the primitive functions:
+
+- The command to execute is defined by the `exec:` section.
+- The result of the execution is the exit code of the command to be executed, unless the default behavior is overridden.
+- The default working directory in which the command is executed is the directory in which the
+ step is located.
+- By default, the command is not time-limited, but can be time-limited during job execution with `timeout:`.
+
+For example, an `exec` step with no inputs:
+
+```yaml
+spec:
+---
+type: exec
+exec:
+ command: [/bin/bash, ./my-script.sh]
+ timeout: 30m
+ workdir: /tmp
+```
+
+#### Example step that executes user-defined command
+
+The following example is a minimal step definition that executes a user-provided command:
+
+- The declaration section specifies that the step accepts a single input named `script`.
+- The `script` input is a required argument that needs to be provided when running the step
+ because no `default:` is defined.
+- The implementation section specifies that the step is of type `exec`. When run, the step
+ will execute in `bash` passing the user command with `-c` argument.
+- The command to be executed will be prefixed with `set -veo pipefail` to print the execution
+ to the job log and exit on the first failure.
+
+```yaml
+# .gitlab/ci/steps/exec-script.yaml
+
+spec:
+ inputs:
+ script:
+ description: 'Run user script.'
+---
+type: exec
+exec:
+ command: [/usr/bin/env, bash, -c, "set -veo pipefail; ${{inputs.script}}"]
+```
+
+### The `steps` step type
+
+The ability to run multiple steps in sequence is one of the primitive functions:
+
+- A sequence of steps is defined by an array of step references: `steps: []`.
+- The next step is run only if the previous step succeeded, unless the default behavior is overridden.
+- The result of the execution is either:
+ - A failure at the first failed step.
+ - Success if all steps in sequence succeed.
+
+#### Steps that use other steps
+
+The `steps` type depends extensively on being able to use other steps.
+Each item in a sequence can reference other external steps, for example:
+
+```yaml
+spec:
+---
+type: steps
+steps:
+ - step: ./.gitlab/ci/steps/ruby/install.yml
+ inputs:
+ version: 3.1
+ env:
+ HTTP_TIMEOUT: 10s
+ - step: gitlab.com/gitlab-org/components/bash/script@v1.0
+ inputs:
+ script: echo Hello World
+```
+
+The `step:` value is a string that describes where the step definition is located:
+
+- **Local**: The definition can be retrieved from a local source with `step: ./path/to/local/step.yml`.
+ A local reference is used when the path starts with `./` or `../`.
+ The resolved path to another local step is always **relative** to the location of the current step.
+ There is no limitation where the step is located in the repository.
+- **Remote**: The definition can also be retrieved from a remote source with `step: gitlab.com/gitlab-org/components/bash/script@v1.0`.
+ Using a FQDN makes the Step Runner pull the repository or archive containing
+ the step, using the version provided after the `@`.
+
+The `inputs:` section is a list of key-value pairs. The `inputs:` specify values
+that are passed and matched against the [step specification](#step-specification).
+
+The `env:` section is a list of key-value pairs. `env:` exposes the given environment
+variables to all child steps, including [`type: exec`](#the-exec-step-type) or [`type: steps`](#the-steps-step-type).
+
+#### Remote Steps
+
+To use remote steps with `step: gitlab.com/gitlab-org/components/bash/script@v1.0`
+the step definitions must be stored in a structured-way. The step definitions:
+
+- Must be stored in the `steps/` folder.
+- Can be nested in sub-directories.
+- Can be referenced by the directory name alone if the step definition
+ is stored in a `step.yml` file.
+
+For example, the file structure for a repository cloned with `git clone https://gitlab.com/gitlab-org/components.git`:
+
+```plaintext
+└── steps/
+    ├── secret_detection.yml
+    ├── sast/
+    │   └── step.yml
+    └── dast/
+        ├── java.yml
+        └── ruby.yml
+```
+
+This structure exposes the following steps:
+
+- `step: gitlab.com/gitlab-org/components/secret_detection@v1.0`: From the definition stored at `steps/secret_detection.yml`.
+- `step: gitlab.com/gitlab-org/components/sast@v1.0`: From the definition stored at `steps/sast/step.yml`.
+- `step: gitlab.com/gitlab-org/components/dast/java@v1.0`: From the definition stored at `steps/dast/java.yml`.
+- `step: gitlab.com/gitlab-org/components/dast/ruby@v1.0`: From the definition stored at `steps/dast/ruby.yml`.
+
+#### Example step that runs other steps
+
+The following example is a minimal step definition that
+runs other steps that are local to the current step.
+
+- The declaration specifies that the step accepts two inputs, each with
+ a default value.
+- The implementation section specifies that the step is of type `steps`, meaning
+ the step will execute the listed steps in sequence. The usage of a top-level
+ `env:` makes the `HTTP_TIMEOUT` variable available in all executed steps.
+
+```yaml
+spec:
+ inputs:
+ ruby_version:
+ default: 3.1
+ http_timeout:
+ default: 10s
+---
+type: steps
+env:
+ HTTP_TIMEOUT: ${{inputs.http_timeout}}
+steps:
+ - step: ./.gitlab/ci/steps/exec-echo.yaml
+ inputs:
+ message: "Installing Ruby ${{inputs.ruby_version}}..."
+ - step: ./.gitlab/ci/ruby/install.yaml
+ inputs:
+ version: ${{inputs.ruby_version}}
+```
+
+## Context and interpolation
+
+Every step definition is executed in a context object which
+stores the following information that can be used by the step definition:
+
+- `inputs`: The list of inputs, including user-provided or default.
+- `outputs`: The list of expected outputs.
+- `env`: The current environment variable values.
+- `job`: The metadata about the current job being executed.
+ - `job.project`: Information about the project, for example ID, name, or full path.
+ - `job.variables`: All [CI/CD Variables](../../../ci/variables/predefined_variables.md) as provided by the CI/CD execution,
+ including project variables, predefined variables, etc.
+  - `job.pipeline`: Information about the current executed pipeline, like the ID, name, or full path.
+- `step`: Information about the current executed step, like the location of the step, the version used, or the [specification](#step-specification).
+- `steps` (only for `type: exec`): Information about each step in the sequence to be run, containing information about the
+  result of the step execution, like status or trace log.
+  - `steps.<name-of-the-step>.status`: The status of the step, like `success` or `failed`.
+  - `steps.<name-of-the-step>.outputs.<output-name>`: Fetches the output provided by the step.
+
+The context object is used to enable support for the interpolation in the form of `${{ <value> }}`.
+
+Interpolation:
+
+- Is forbidden in the [step specification](#step-specification) section.
+  The specification is static configuration that should not be affected by the runtime environment.
+- Can be used in the [step implementation](#step-implementation) section. The implementation
+  describes the runtime set of instructions for how the step should be executed.
+- Is applied to every value of the hash of each data structure.
+- Is possible only for the *values* of each hash (for now). The interpolation of *keys* is forbidden.
+- Is done when executing and passing control to a given step, instead of running
+ it once when the configuration is loaded. This enables chaining outputs to inputs, or making steps depend on the execution
+ of earlier steps.
+
+For example:
+
+```yaml
+# .gitlab/ci/steps/exec-echo.yaml
+spec:
+ inputs:
+ timeout:
+ default: 10s
+ bash_support_version:
+---
+type: steps
+env:
+ HTTP_TIMEOUT: ${{inputs.timeout}}
+ PROJECT_ID: ${{job.project.id}}
+steps:
+ - step: ./my/local/step/to/echo.yml
+ inputs:
+ message: "I'm currently building a project: ${{job.project.full_path}}"
+ - step: gitlab.com/gitlab-org/components/bash/script@v${{inputs.bash_support_version}}
+```
+
+## Reference data structures describing YAML document
+
+```go
+package main
+
+import "time"
+
+type StepEnvironment map[string]string
+
+type StepSpecInput struct {
+ Default *string `yaml:"default"`
+ Description string `yaml:"description"`
+ Options *[]string `yaml:"options"`
+ Match *string `yaml:"match"`
+}
+
+type StepSpecOutput struct {
+}
+
+type StepSpecInputs map[string]StepSpecInput
+type StepSpecOutputs map[string]StepSpecOutput
+
+type StepSpec struct {
+	Inputs  StepSpecInputs  `yaml:"inputs"`
+ Outputs StepSpecOutputs `yaml:"outputs"`
+}
+
+type StepSpecDoc struct {
+ Spec StepSpec `yaml:"spec"`
+}
+
+type StepType string
+
+const StepTypeExec StepType = "exec"
+const StepTypeSteps StepType = "steps"
+
+type StepDefinition struct {
+ Def StepSpecDoc `yaml:"-"`
+ Env StepEnvironment `yaml:"env"`
+ Steps *StepDefinitionSequence `yaml:"steps"`
+ Exec *StepDefinitionExec `yaml:"exec"`
+}
+
+type StepDefinitionExec struct {
+ Command []string `yaml:"command"`
+	WorkingDir *string `yaml:"workdir"`
+ Timeout *time.Duration `yaml:"timeout"`
+}
+
+type StepDefinitionSequence []StepReference
+
+type StepReferenceInputs map[string]string
+
+type StepReference struct {
+ Step string `yaml:"step"`
+ Inputs StepReferenceInputs `yaml:"inputs"`
+ Env StepEnvironment `yaml:"env"`
+}
+```
diff --git a/doc/architecture/blueprints/gitlab_steps/steps-syntactic-sugar.md b/doc/architecture/blueprints/gitlab_steps/steps-syntactic-sugar.md
new file mode 100644
index 00000000000..3ca54a45477
--- /dev/null
+++ b/doc/architecture/blueprints/gitlab_steps/steps-syntactic-sugar.md
@@ -0,0 +1,66 @@
+---
+owning-stage: "~devops::verify"
+description: The Syntactic Sugar extensions to the Step Definition
+---
+
+# The Syntactic Sugar extensions to the Step Definition
+
+[The Step Definition](step-definition.md) describes the minimal required syntax
+to be supported. To aid common workflows, the following syntactic sugar is used
+to extend different parts of that document.
+
+## Syntactic Sugar for Step Reference
+
+Each of the syntactic sugar extensions is converted into a simple
+[step reference](step-definition.md#steps-that-use-other-steps).
+
+### Easily execute scripts in a target environment
+
+`script:` is a shorthand syntax to aid execution of simple scripts. It cannot be combined with `step:`
+and is run by an externally stored step component provided by GitLab.
+
+The GitLab-provided step component performs shell auto-detection unless overridden,
+similar to how GitLab Runner does it today: based on the running system.
+
+`inputs:` and `env:` can be used for additional control of some aspects of that step component.
+
+For example:
+
+```yaml
+spec:
+---
+type: steps
+steps:
+ - script: bundle exec rspec
+ - script: bundle exec rspec
+ inputs:
+ shell: sh # Force runner to use `sh` shell, instead of performing auto-detection
+```
+
+This syntax example translates into the following equivalent syntax for
+execution by the Step Runner:
+
+```yaml
+spec:
+---
+type: steps
+steps:
+ - step: gitlab.com/gitlab-org/components/steps/script@v1.0
+ inputs:
+ script: bundle exec rspec
+ - step: gitlab.com/gitlab-org/components/steps/script@v1.0
+ inputs:
+ script: bundle exec rspec
+ shell: sh # Force runner to use `sh` shell, instead of performing auto-detection
+```
+
+This syntax example is **invalid** (and ambiguous) because `script:` and `step:` cannot be used together:
+
+```yaml
+spec:
+---
+type: steps
+steps:
+ - step: gitlab.com/my-component/ruby/install@v1.0
+ script: bundle exec rspec
+```
diff --git a/doc/architecture/blueprints/google_artifact_registry_integration/index.md b/doc/architecture/blueprints/google_artifact_registry_integration/index.md
index 4c2bfe95c5e..ef66ae33b2a 100644
--- a/doc/architecture/blueprints/google_artifact_registry_integration/index.md
+++ b/doc/architecture/blueprints/google_artifact_registry_integration/index.md
@@ -116,6 +116,6 @@ One alternative solution considered was to use the Docker/OCI API provided by GA
- **Multiple Requests**: To retrieve all the required information about each image, multiple requests to different endpoints (listing tags, obtaining image manifests, and image configuration blobs) would have been necessary, leading to a `1+N` performance issue.
-GitLab had previously faced significant challenges with the last two limitations, prompting the development of a custom [GitLab Container Registry API](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs-gitlab/api.md) to address them. Additionally, GitLab decided to [deprecate support](../../../update/deprecations.md#use-of-third-party-container-registries-is-deprecated) for connecting to third-party container registries using the Docker/OCI API due to these same limitations and the increased cost of maintaining two solutions in parallel. As a result, there is an ongoing effort to replace the use of the Docker/OCI API endpoints with custom API endpoints for all container registry functionalities in GitLab.
+GitLab had previously faced significant challenges with the last two limitations, prompting the development of a custom [GitLab Container Registry API](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/gitlab/api.md) to address them. Additionally, GitLab decided to [deprecate support](../../../update/deprecations.md#use-of-third-party-container-registries-is-deprecated) for connecting to third-party container registries using the Docker/OCI API due to these same limitations and the increased cost of maintaining two solutions in parallel. As a result, there is an ongoing effort to replace the use of the Docker/OCI API endpoints with custom API endpoints for all container registry functionalities in GitLab.
Considering these factors, the decision was made to build the GAR integration from scratch using the proprietary GAR API. This approach provides more flexibility and control over the integration and can serve as a foundation for future expansions, such as support for other GAR artifact formats.
diff --git a/doc/architecture/blueprints/new_diffs.md b/doc/architecture/blueprints/new_diffs.md
index b5aeb9b8aa8..af1e4679c14 100644
--- a/doc/architecture/blueprints/new_diffs.md
+++ b/doc/architecture/blueprints/new_diffs.md
@@ -68,6 +68,35 @@ compared with the pros and cons of alternatives.
## Design and implementation details
+### Workspace & Artifacts
+
+- We will store implementation details like metrics, budgets, and development & architectural patterns here in the docs
+- We will store large bodies of research, the results of audits, etc. in the [wiki](https://gitlab.com/gitlab-com/create-stage/new-diffs/-/wikis/home) of the [New Diffs project](https://gitlab.com/gitlab-com/create-stage/new-diffs)
+- We will store audio & video recordings on the public YouTube channel in the Code Review / New Diffs playlist
+- We will store drafts, meeting notes, and other temporary documents in public Google docs
+
+### Definitions
+
+#### Maintainability
+
+Maintainable projects are _simple_ projects.
+
+Simplicity is the opposite of complexity. This uses a definition of simple and complex [described by Rich Hickey in "Simple Made Easy"](https://www.infoq.com/presentations/Simple-Made-Easy/) (Strange Loop, 2011).
+
+- Maintainable code is simple (single task, single concept, separate from other things).
+- Maintainable projects expand on simple code by having simple structure (folders define classes of behaviors, e.g. you can be assured that a component directory will never initiate a network call, because that would be complecting visual display with data access).
+- Maintainable applications flow out of simple organization and simple code. The old saying is a cluttered desk is representative of a cluttered mind. Rigorous discipline on simplicity will be represented in our output (the product). By being strict about working simply, we will naturally produce applications where our users can more easily reason about their behavior.
+
+#### Done
+
+GitLab has an existing [definition of done](/ee/development/contributing/merge_request_workflow.md#definition-of-done), which is geared primarily toward identifying when an MR is ready to be merged.
+
+In addition to the items in the GitLab definition of done, work on new diffs should also adhere to the following requirements:
+
+- Meets or exceeds all metrics
+ - Meets or exceeds our minimum accessibility metrics (these are explicitly not part of our defined priorities, since they are non-negotiable)
+- All work is fully documented for engineers (user documentation is a requirement of the standard definition of done)
+
<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
diff --git a/doc/architecture/blueprints/observability_logging/diagrams.drawio b/doc/architecture/blueprints/observability_logging/diagrams.drawio
new file mode 100644
index 00000000000..79b05247437
--- /dev/null
+++ b/doc/architecture/blueprints/observability_logging/diagrams.drawio
@@ -0,0 +1 @@
+<mxfile host="Electron" modified="2023-10-29T14:03:45.654Z" agent="5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/20.7.4 Chrome/106.0.5249.199 Electron/21.3.3 Safari/537.36" etag="mgCNcxJzZIj4Fii1_swS" version="20.7.4" type="device"><diagram id="eudcs1I04LxSKviLHc7n" name="Page-1">7VrZcuMoFP0aP9olCS3Wo+MsPVPpjnuSqkk/TWGJ2FSwcSG8zdcPWEiWQF5aXuRUzZPhglgOh8Pl4hboT1ZPDM7G32mMSMux4lUL3Lccx3aBK36kZZ1aur6fGkYMx6rS1vCK/0XKaCnrHMcoKVXklBKOZ2VjRKdTFPGSDTJGl+VqH5SUe53BETIMrxEkpvVvHPOxmoUTbO3fEB6Ns55tP0xLJjCrrGaSjGFMlwUTeGiBPqOUp6nJqo+IBC/DJf3ucUdpPjCGpvyYD+gIg8WPhP75GFv47m244iuvrVpZQDJXE/5jOmIoSdSY+ToDgtH5NEayLbsF7pZjzNHrDEaydCmWXtjGfEJUsTm2rCPEOFoVTGqsT4hOEGdrUSWnjsJNEQcEKr/cLoOdYTsuLIGvbFCt/ChveguOSCh8qrF6/t5++Qn+wqHDrX8GsfVu9WAFVn2Co89vdJ4gAy6x2jOZjNYEC9wYOAzaMEX4eZgbYPQ52uD+MueiGaTsSbpLbC9H2oC1AvydSAeOhrTnmUhbFUiHl0LaMZD+OUebIb8itsCRCXcNdp6EmeuEJcxyvSpgFlRAZoNLYeYZmO3k5AdBq57URoEFmsYqeR8RmCQ4KmO1Bdba1BZDfVdFm8wvWdLxsuz9qljzfp3lVpi/F9K/CuntJzKzzogtcwPEsEAHMVXjAxPSp4SyzXwAsB4fQSDXW4G+3RAoNmTd0KKEzlmEDm94DtkI8UN0NalSoIJXQYXMxhCBHC/Kw62ih+phQLGYSM5EPyjvXkP/0mmqr4rng9aQB3bIQNZQioPR0Iat+bTrE9g3CPzy9vAsLGLFiTjaxao7PuFSF2VqJFNvaAo3c+g1rgigW1YEF1ScV/t4cHZFCK6qCFYn9Iqi0LY6Vi1Z6Hg7hWGfBmxa1gXjQsLgmMKw27tqShgCfT936wqDpZ113esKQ/cUYbi7OWHwrKaFITwF0B+nAQpZpC55rnUZfP3GhTfzBfcpb0Flp1R69ncxTMY5igXEpH0AuRC16cbiWPIukXBGP/MrqWPosiaTLkKe45b1uiTX9rFSncvzfh/OOq/ihke6Yt0mFdfVrqxuXVfMDbWGbK2hCyuubV51L+A6aBR98OwQlCla75Jh/4Y3cbqDsC+scpCu/v90PQddzXjBl6VrcNN8bVReneAAzY7lKwi1q5p/XYfWBrfIV6cOYe1bZmt4U2x1z8RWD1yZra7B1l56R9hB2fmEpBXAnXTscQTJMxwiMqAJ5phKL3ZIOaeTQoUewSNZwKl2X6Bp5LufP++c6dagx7wrXhdABTku97hwlfhtURPUtaNODCYRlONZt0NCo89mNnhKzaZ2ONDdnroBFsMR0x8JLr3DzdBr8+fRnsPIr+vtXzNG6B/L4Zs6pL6sS3WVYPfvUXhfSKWdv3HdMoe7X4LDXY3DTl0Z1uLcbnBlDpuB7kY43LHdo0ODbbcDdjPZuR0qHxtAbJbKukehOwJHU9nWqHzlJxvbfGJ4wpzAobC9DBPEFnCICeYS2oGA7IOyiUhGZJ5IJhisX+IJgRt39fLvNfofj/J8gQN+WEUC3f874nIgstt/gKXob/9HBx7+Aw==</diagram></mxfile> \ No newline at end of file
diff --git a/doc/architecture/blueprints/observability_logging/index.md b/doc/architecture/blueprints/observability_logging/index.md
new file mode 100644
index 00000000000..d8259e0a736
--- /dev/null
+++ b/doc/architecture/blueprints/observability_logging/index.md
@@ -0,0 +1,632 @@
+---
+status: proposed
+creation-date: "2023-10-29"
+authors: [ "@vespian_gl" ]
+coach: "@mappelman"
+approvers: [ "@sguyon", "@nicholasklick" ]
+owning-stage: "~monitor::observability"
+participating-stages: []
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+# GitLab Observability - Logging
+
+## Summary
+
+This design document outlines a system for storing and querying logs, which will be a part of GitLab Observability Backend (GOB), together with [tracing](../observability_tracing/index.md) and [metrics](../observability_metrics/index.md).
+At its core, the system leverages the [OpenTelemetry logging](https://opentelemetry.io/docs/specs/otel/logs/) specification for data ingestion and the ClickHouse database for storage.
+Users will interact with the data through the GitLab UI.
+The system itself is multi-tenant and offers our users a way to store their application logs, query them, and, in future iterations, correlate them with other observability signals (traces, errors, metrics, etc.).
+
+## Motivation
+
+After [tracing](../observability_tracing/index.md) and [metrics](../observability_metrics/index.md), logging is the last observability signal that we need to support to be able to provide our users with a fully-fledged observability solution.
+
+One could argue that logging itself is also the most important observability signal because it is so widespread.
+It predates metrics and tracing in the history of application observability and is usually implemented as one of the first things during development.
+
+Without logging support, it would be very hard, if not impossible, for our users to fully understand the performance and operation of the applications they develop with the help of our platform.
+
+### Goals
+
+- **multi-tenant**: each user and their data should be isolated from others that are using the platform.
+ Users may query only the data that they have sent to the platform.
+- **follows OpenTelemetry standards**: logs ingestion should follow the [OpenTelemetry protocol](https://opentelemetry.io/docs/specs/otel/logs/data-model/).
+ Apart from being able to re-use the tooling and know-how that was already developed for OpenTelemetry protocol, we will not have to reinvent the wheel when it comes to wire protocol and data storage format.
+- **uses ClickHouse as a data storage backend**: ClickHouse has become the go-to solution for observability data at GitLab for a plethora of reasons.
+ Our tracing and metrics solutions already use it, so logging should be consistent with it and not introduce new dependencies.
+- **Users can query their data using reasonably complex queries**: storing logs by itself will not bring much value to our users; they also need to be able to search and filter them.
+
+### Non-Goals
+
+- **complex query support and logs analytics** - at least in the first iteration we do not plan to support complex queries, in particular `GROUP BY` queries that users may want to use for quantitative logs analytics.
+ Supporting it is not trivial and requires some more research and work in the area of query language syntax.
+- **advanced data retention** - logs differ from traces and metrics concerning legal requirements.
+ Authorities may request logs stored by us as part of e.g. ongoing investigations.
+ In the initial iteration, we need to caution our users that our system is not ready for that and they need a secondary system for now if they intend to store e.g. access logs.
+ We will need more work around logs/data integrity and long-term storage policies to handle this use case.
+- **data deletion** - apart from the case where the data simply expires after a predefined storage period, we do not plan to support deleting individual logs by users.
+ This is left for later iterations.
+- **linking logs to traces** - we do not intend to support linking logs to traces in the first iteration, at least not in the UI.
+- **logs sampling** - for traces we expect users to sample their data before sending it to us while we focus only on enforcing the limits/quotas.
+ Logs should follow this pattern.
+ The log sampling implementation seems immature as well - a log sampler is [implemented in OTEL Collector](https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/14920), but it is not clear if it can work together with traces sampling, and there is no official specification ([issue](https://github.com/open-telemetry/opentelemetry-specification/issues/2237), [pull request](https://github.com/open-telemetry/opentelemetry-specification/pull/2482)).
+
+## Proposal
+
+The architecture of logs ingestion follows the patterns outlined in the [tracing](../observability_tracing/index.md) and [metrics](../observability_metrics/index.md) proposals:
+
+![System Overview](system_overview.png)
+
+We re-use the components introduced by these proposals, so no new services will be added.
+Each top-level GitLab namespace has its own OTEL collector to which ingestion requests are directed by the cluster-wide Ingress.
+On the other hand, there is a single, cluster-wide query service that handles queries from users.
+The query service is tenant-aware.
+Rate-limiting of the user requests is done at the Ingress level.
+The cluster-wide Ingress is currently implemented using Traefik and is shared with all other services in the cluster.
+
+### Ingestion path
+
+We receive Log objects from customers in the JSON format over HTTP.
+The request arrives at the cluster-wide Ingress which routes the request to the appropriate OTEL collector.
+The collector then processes this request and executes INSERT statements against ClickHouse.
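+
+For illustration, a minimal OTLP/HTTP JSON payload sent to the logs endpoint could look like the following (a sketch; the attribute names and values are examples, not a prescribed format):
+
+```json
+{
+  "resourceLogs": [
+    {
+      "resource": {
+        "attributes": [
+          { "key": "service.name", "value": { "stringValue": "checkout" } }
+        ]
+      },
+      "scopeLogs": [
+        {
+          "logRecords": [
+            {
+              "timeUnixNano": "1698590400000000000",
+              "severityNumber": 9,
+              "severityText": "INFO",
+              "body": { "stringValue": "order 1234 created" },
+              "attributes": [
+                { "key": "order.id", "value": { "stringValue": "1234" } }
+              ]
+            }
+          ]
+        }
+      ]
+    }
+  ]
+}
+```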
+
+### Read path
+
+GOB exposes an HTTP/JSON API that, for example, the GitLab UI uses to query and then render logs.
+The cluster-wide Ingress routes the requests to the query service, which in turn parses the API request and executes an SQL query against ClickHouse.
+The results are then formatted into a JSON response and sent back to the client.
+
+## Design and implementation details
+
+### Legacy code
+
+Unlike trace and metric signals, handling logging signals is heavily influenced by the large amount of legacy code that needs to be supported.
+For metrics and tracing, OpenTelemetry specification defines new APIs and SDKs that can be leveraged.
+With logs, OpenTelemetry acts more like a bridge and enables legacy libraries/code to send their data to us.
+
+Users may create Log signals from plain log files using [filelogreceiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver) or [fluentd](https://docs.fluentbit.io/manual/pipeline/outputs/opentelemetry).
+Existing log libraries may use [Log Bridge API](https://opentelemetry.io/docs/specs/otel/logs/bridge-api/) to emit logs using OTEL protocol.
+In time the ecosystem will most probably develop and the number of options will grow.
+The assumption is made that _how_ logs are ingested is up to the user.
+
+Hence we expose only an HTTP endpoint that accepts logs in OTEL format and assume that logs are already properly parsed and formatted.
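+
+As an illustration, a minimal OTEL Collector configuration that tails a plain log file and forwards it to GOB over OTLP/HTTP might look like this (a sketch; the endpoint URL is a placeholder, not the actual GOB ingestion path):
+
+```yaml
+receivers:
+  filelog:
+    include: [ /var/log/myapp/*.log ]
+
+exporters:
+  otlphttp:
+    # Placeholder endpoint; GOB exposes a per-namespace OTLP/HTTP ingestion endpoint.
+    endpoint: https://observe.gitlab.example/ingest
+
+service:
+  pipelines:
+    logs:
+      receivers: [filelog]
+      exporters: [otlphttp]
+```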
+
+### Logs, Events, and Span Events
+
+Log messages can be sent using three different objects according to the OTEL specification:
+
+- [Log](https://opentelemetry.io/docs/specs/otel/logs/)
+- [Event](https://opentelemetry.io/docs/specs/otel/logs/event-api/)
+- [Span Event](https://opentelemetry.io/docs/concepts/signals/traces/#span-events)
+
+At least in the first iteration, we can support only one of Logs, Events, or Span Events.
+
+We can't use Span Events, as there is a lot of legacy code that cannot or will not implement tracing for various reasons.
+
+Even though Events use the same data model internally, their semantics differ.
+Logs have a mandatory severity level as a first-class parameter that Events do not need to have, and Events have a mandatory `event.name` and optional `event.domain` keys in the `Attributes` field of the Log record.
+Further, logs typically have messages in string form and events have data in the form of key-value pairs.
+There is a [discussion](https://github.com/open-telemetry/oteps/blob/main/text/0202-events-and-logs-api.md) to separate Log and Event APIs.
+More information on the differences between these two can be found [here](https://github.com/open-telemetry/oteps/blob/main/text/0202-events-and-logs-api.md#subtle-differences-between-logs-and-events).
+
+From the perspective of a developer/potential user, there seems to be no logging use case that couldn't be modeled as a Log record instead of sending an Event explicitly.
+Examples that the community gives e.g. [here](https://github.com/open-telemetry/opentelemetry-specification/issues/3254) or [here](https://github.com/open-telemetry/oteps/blob/main/text/0202-events-and-logs-api.md#subtle-differences-between-logs-and-events) are not convincing enough and could simply be modeled as Log records.
+
+Hence the decision to only support Log objects seems like a boring and simple solution.
+
+### Rate-limiting
+
+Similar to traces, rate-limiting of logging data ingestion will be done at the Ingress level.
+As part of [the forward-auth](https://doc.traefik.io/traefik/middlewares/http/forwardauth/) flow, Traefik will forward the request to Gatekeeper which in turn leverages Redis for counting.
+This is currently done only for [the ingestion path](https://gitlab.com/gitlab-org/opstrace/opstrace/-/merge_requests/2236).
+Please check the MR description for more details on how it works.
+The read path rate limiting implementation is tracked [here](https://gitlab.com/gitlab-org/opstrace/opstrace/-/issues/2356).
+
+### Database schema
+
+[OpenTelemetry specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md) defines a set of fields that are required by the implementations.
+There are some small discrepancies between the documented schema and the [protobuf definition](https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/logs/v1/logs.proto); namely, `TraceFlags` is defined as an 8-bit field in the documentation, whereas it is a 32-bit field in the proto definition.
+The remaining 24 bits are reserved.
+The Log message body may be any object and there is [no size limitation for the record](https://github.com/open-telemetry/opentelemetry-specification/issues/1251).
+For the purpose of this design document, we will assume that it is going to be an arbitrary string, either plain text or e.g. JSON, without length limits.
+
+#### Data filtering
+
+The schema uses Bloom Filters extensively.
+They prevent false negatives, but false positives are still possible; hence we will not be able to provide `!=` queries to users.
+The `Body` field is a special case, as it uses [`tokenbf_v1` tokenized Bloom Filter](https://clickhouse.com/docs/en/optimize/skipping-indexes#bloom-filter-types).
+The `tokenbf_v1` skipping index seems like a simpler and more lightweight approach than the `ngrambf_v1` index.
+Based on the very preliminary benchmarks below, the `ngrambf_v1` index would also be much more difficult to tune.
+The limitation, though, is that our users will be able to search only for full words for now.
+We (gu)estimate that there may be up to 10,000 different words in a given granule, and we aim for a 0.1% probability of false positives.
+Using [this tool](https://krisives.github.io/bloom-calculator/), the optimal size of the filter was calculated at 143776 bits with 10 hash functions.
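+
+As a sanity check, these numbers follow from the standard Bloom filter sizing formulas, using n = 10,000 distinct words per granule and p = 0.001 false-positive probability:
+
+```plaintext
+m = n * |ln p| / (ln 2)^2 = 10000 * 6.90776 / 0.48045 ≈ 143776 bits
+k = (m / n) * ln 2        = 14.3776 * 0.69315  ≈ 10 hash functions
+```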
+
+#### Skipping indexes, `==`, `!=` and `LIKE` operators
+
+Skipping indexes only narrow down the set of granules that need to be scanned.
+`==` and `LIKE` queries work as they should; `!=` always results in a full scan due to Bloom Filter limitations.
+At least in the first iteration, we will not make the `!=` operator available to users.
+
+Based on the data, it may be much easier for us to tune the `tokenbf_v1` filter in the first iteration than the `ngrambf_v1`, because with `ngrambf_v1`, queries almost always result in a full scan for any reasonably big dataset.
+The reason is that the number of n-grams in the index is much higher than the number of tokens, hence matches are more frequent for data with a high cardinality of words/symbols.
+
+A very preliminary benchmark was conducted to verify these assumptions.
+
+As testing data, we used the following table schemas and inserts/functions.
+They simulate a single tenant, as we want to focus only on the `Body` field.
+Normally the primary index would allow us to skip granules where there is no data for a given tenant.
+
+`tokenbf_v1` version of the table:
+
+```plaintext
+CREATE TABLE tbl2
+(
+ `Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
+ `TraceId` String CODEC(ZSTD(1)),
+ `ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
+ `Duration` UInt8 CODEC(ZSTD(1)),
+ `SpanName` LowCardinality(String) CODEC(ZSTD(1)),
+ `Body` String CODEC(ZSTD(1)),
+ INDEX idx_body Body TYPE tokenbf_v1(143776, 10, 0) GRANULARITY 1
+)
+ENGINE = MergeTree
+PARTITION BY toDate(Timestamp)
+ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp), TraceId)
+SETTINGS index_granularity = 8192
+```
+
+`ngrambf_v1` version of the table:
+
+```plaintext
+CREATE TABLE tbl3
+(
+ `Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
+ `TraceId` String CODEC(ZSTD(1)),
+ `ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
+ `Duration` UInt8 CODEC(ZSTD(1)),
+ `SpanName` LowCardinality(String) CODEC(ZSTD(1)),
+ `Body` String CODEC(ZSTD(1)),
+ INDEX idx_body Body TYPE ngrambf_v1(4, 143776, 10, 0) GRANULARITY 1
+)
+ENGINE = MergeTree
+PARTITION BY toDate(Timestamp)
+ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp), TraceId)
+SETTINGS index_granularity = 8192
+```
+
+In both cases, their `Body` fields were filled with data that simulates a JSON map object:
+
+```plaintext
+CREATE FUNCTION genmap AS (n) -> arrayMap (x-> (x::String, (x*(rand()%40000+1))::String), range(1, n));
+
+INSERT INTO tbl(2|3)
+SELECT
+ now() - randUniform(1, 1_000_000) as Timestamp,
+ randomPrintableASCII(2) as TraceId,
+ randomPrintableASCII(2) as ServiceName,
+ rand32() as Duration,
+ randomPrintableASCII(2) as SpanName,
+ toJSONString(genmap(rand()%40+1)::Map(String, String)) as Body
+FROM numbers(10_000_000);
+```
+
+In the case of the `tokenbf_v1` table, we have:
+
+- `==` equality works; the skipping index resulted in 224/1264 granules scanned:
+
+```plaintext
+zara.engel.vespian.net :) explain indexes=1 select count(*) from tbl2 where Body == '{"1":"14732","2":"29464","3":"44196","4":"58928","5":"73660","6":"88392","7":"103124","8":"117856","9":"132588","10":"147320","11":"162052"}'
+
+EXPLAIN indexes = 1
+SELECT count(*)
+FROM tbl2
+WHERE Body = '{"1":"14732","2":"29464","3":"44196","4":"58928","5":"73660","6":"88392","7":"103124","8":"117856","9":"132588","10":"147320","11":"162052"}'
+
+Query id: 60827945-a9b0-42f9-86a8-dfe77758a6b1
+
+┌─explain───────────────────────────────────────────┐
+│ Expression ((Projection + Before ORDER BY)) │
+│ Aggregating │
+│ Expression (Before GROUP BY) │
+│ Filter (WHERE) │
+│ ReadFromMergeTree (logging.tbl2) │
+│ Indexes: │
+│ MinMax │
+│ Condition: true │
+│ Parts: 69/69 │
+│ Granules: 1264/1264 │
+│ Partition │
+│ Condition: true │
+│ Parts: 69/69 │
+│ Granules: 1264/1264 │
+│ PrimaryKey │
+│ Condition: true │
+│ Parts: 69/69 │
+│ Granules: 1264/1264 │
+│ Skip │
+│ Name: idx_body │
+│ Description: tokenbf_v1 GRANULARITY 1 │
+│ Parts: 62/69 │
+│ Granules: 224/1264 │
+└───────────────────────────────────────────────────┘
+
+23 rows in set. Elapsed: 0.019 sec.
+```
+
+- `!=` inequality works as well, but results in a full scan - all granules were scanned:
+
+```plaintext
+zara.engel.vespian.net :) explain indexes=1 select count(*) from tbl2 where Body != '{"1":"14732","2":"29464","3":"44196","4":"58928","5":"73660","6":"88392","7":"103124","8":"117856","9":"132588","10":"147320","11":"162052"}'
+
+EXPLAIN indexes = 1
+SELECT count(*)
+FROM tbl2
+WHERE Body != '{"1":"14732","2":"29464","3":"44196","4":"58928","5":"73660","6":"88392","7":"103124","8":"117856","9":"132588","10":"147320","11":"162052"}'
+
+Query id: 01584696-30d8-4711-8469-44d4f2629c98
+
+┌─explain───────────────────────────────────────────┐
+│ Expression ((Projection + Before ORDER BY)) │
+│ Aggregating │
+│ Expression (Before GROUP BY) │
+│ Filter (WHERE) │
+│ ReadFromMergeTree (logging.tbl2) │
+│ Indexes: │
+│ MinMax │
+│ Condition: true │
+│ Parts: 69/69 │
+│ Granules: 1264/1264 │
+│ Partition │
+│ Condition: true │
+│ Parts: 69/69 │
+│ Granules: 1264/1264 │
+│ PrimaryKey │
+│ Condition: true │
+│ Parts: 69/69 │
+│ Granules: 1264/1264 │
+│ Skip │
+│ Name: idx_body │
+│ Description: tokenbf_v1 GRANULARITY 1 │
+│ Parts: 69/69 │
+│ Granules: 1264/1264 │
+└───────────────────────────────────────────────────┘
+
+23 rows in set. Elapsed: 0.017 sec.
+```
+
+- `LIKE` queries work, 271/1264 granules scanned:
+
+```plaintext
+zara.engel.vespian.net :) explain indexes=1 select * from tbl2 where Body like '%"11":"162052"%';
+
+EXPLAIN indexes = 1
+SELECT *
+FROM tbl2
+WHERE Body LIKE '%"11":"162052"%'
+
+Query id: 86e99d7a-6567-4000-badc-d0b8b2dc8936
+
+┌─explain─────────────────────────────────────┐
+│ Expression ((Projection + Before ORDER BY)) │
+│ ReadFromMergeTree (logging.tbl2) │
+│ Indexes: │
+│ MinMax │
+│ Condition: true │
+│ Parts: 69/69 │
+│ Granules: 1264/1264 │
+│ Partition │
+│ Condition: true │
+│ Parts: 69/69 │
+│ Granules: 1264/1264 │
+│ PrimaryKey │
+│ Condition: true │
+│ Parts: 69/69 │
+│ Granules: 1264/1264 │
+│ Skip │
+│ Name: idx_body │
+│ Description: tokenbf_v1 GRANULARITY 1 │
+│ Parts: 64/69 │
+│ Granules: 271/1264 │
+└─────────────────────────────────────────────┘
+
+20 rows in set. Elapsed: 0.047 sec.
+```
+
+The `ngrambf_v1` index will be much harder to tune and use correctly:
+
+- equality using n-gram indexes works as well, but due to the high number of n-grams in the Bloom Filter, we aren't skipping many granules:
+
+```plaintext
+zara.engel.vespian.net :) explain indexes=1 select count(*) from tbl3 where Body == '{"1":"14732","2":"29464","3":"44196","4":"58928","5":"73660","6":"88392","7":"103124","8":"117856","9":"132588","10":"147320","11":"162052"}'
+
+EXPLAIN indexes = 1
+SELECT count(*)
+FROM tbl3
+WHERE Body = '{"1":"14732","2":"29464","3":"44196","4":"58928","5":"73660","6":"88392","7":"103124","8":"117856","9":"132588","10":"147320","11":"162052"}'
+
+Query id: 22836e2d-5e49-4f51-b23c-facf5a3102c2
+
+┌─explain───────────────────────────────────────────┐
+│ Expression ((Projection + Before ORDER BY)) │
+│ Aggregating │
+│ Expression (Before GROUP BY) │
+│ Filter (WHERE) │
+│ ReadFromMergeTree (logging.tbl3) │
+│ Indexes: │
+│ MinMax │
+│ Condition: true │
+│ Parts: 60/60 │
+│ Granules: 1257/1257 │
+│ Partition │
+│ Condition: true │
+│ Parts: 60/60 │
+│ Granules: 1257/1257 │
+│ PrimaryKey │
+│ Condition: true │
+│ Parts: 60/60 │
+│ Granules: 1257/1257 │
+│ Skip │
+│ Name: idx_body │
+│ Description: ngrambf_v1 GRANULARITY 1 │
+│ Parts: 60/60 │
+│ Granules: 1239/1257 │
+└───────────────────────────────────────────────────┘
+
+23 rows in set. Elapsed: 0.025 sec.
+```
+
+- inequality here also results in a full scan:
+
+```plaintext
+zara.engel.vespian.net :) explain indexes=1 select count(*) from tbl3 where Body != '{"1":"14732","2":"29464","3":"44196","4":"58928","5":"73660","6":"88392","7":"103124","8":"117856","9":"132588","10":"147320","11":"162052"}'
+
+EXPLAIN indexes = 1
+SELECT count(*)
+FROM tbl3
+WHERE Body != '{"1":"14732","2":"29464","3":"44196","4":"58928","5":"73660","6":"88392","7":"103124","8":"117856","9":"132588","10":"147320","11":"162052"}'
+
+Query id: 2378c885-65b0-4be0-9564-fa7ba7c79172
+
+┌─explain───────────────────────────────────────────┐
+│ Expression ((Projection + Before ORDER BY)) │
+│ Aggregating │
+│ Expression (Before GROUP BY) │
+│ Filter (WHERE) │
+│ ReadFromMergeTree (logging.tbl3) │
+│ Indexes: │
+│ MinMax │
+│ Condition: true │
+│ Parts: 60/60 │
+│ Granules: 1257/1257 │
+│ Partition │
+│ Condition: true │
+│ Parts: 60/60 │
+│ Granules: 1257/1257 │
+│ PrimaryKey │
+│ Condition: true │
+│ Parts: 60/60 │
+│ Granules: 1257/1257 │
+│ Skip │
+│ Name: idx_body │
+│ Description: ngrambf_v1 GRANULARITY 1 │
+│ Parts: 60/60 │
+│ Granules: 1257/1257 │
+└───────────────────────────────────────────────────┘
+
+23 rows in set. Elapsed: 0.022 sec.
+```
+
+- `LIKE` statements work, but result in an almost-full scan, as the n-grams match nearly all the granules:
+
+```plaintext
+zara.engel.vespian.net :) explain indexes=1 select * from tbl3 where Body like '%"11":"162052"%';
+
+EXPLAIN indexes = 1
+SELECT *
+FROM tbl3
+WHERE Body LIKE '%"11":"162052"%'
+
+Query id: 957d8c98-819e-4487-93ac-868ffe0485ec
+
+┌─explain─────────────────────────────────────┐
+│ Expression ((Projection + Before ORDER BY)) │
+│ ReadFromMergeTree (logging.tbl3) │
+│ Indexes: │
+│ MinMax │
+│ Condition: true │
+│ Parts: 60/60 │
+│ Granules: 1257/1257 │
+│ Partition │
+│ Condition: true │
+│ Parts: 60/60 │
+│ Granules: 1257/1257 │
+│ PrimaryKey │
+│ Condition: true │
+│ Parts: 60/60 │
+│ Granules: 1257/1257 │
+│ Skip │
+│ Name: idx_body │
+│ Description: ngrambf_v1 GRANULARITY 1 │
+│ Parts: 60/60 │
+│ Granules: 1251/1257 │
+└─────────────────────────────────────────────┘
+
+20 rows in set. Elapsed: 0.023 sec.
+```
+
+#### Data Deduplication
+
+To provide a cost-efficient service to our users, we need to think about deduplicating the data we get from them.
+ClickHouse [ReplacingMergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree) deduplicates data automatically based on the primary key.
+We can't include all the relevant `Log` entry fields in the primary key, hence the idea of a Fingerprint as the very last part of the primary key.
+We do not normally use it for indexing; it is there only to prevent unique records from being merged away as duplicates.
+The fingerprint calculation algorithm and length have not been chosen yet; we may use the same one that `metrics` uses to calculate its Fingerprint.
+For now, we assume that it is 128 bits wide (16 8-bit chars).
+The columns we use for fingerprint calculation are the columns that are not present in the primary key: `Body`, `ResourceAttributes`, and `LogAttributes`.
+The fingerprint, due to its very high cardinality, will need to go in the last place in the primary index.
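+
+A minimal sketch of how such a fingerprint could be computed in ClickHouse, assuming a SipHash-based 128-bit hash is an acceptable stand-in for the yet-to-be-chosen algorithm:
+
+```plaintext
+-- Hypothetical fingerprint expression over the columns not covered by the primary key.
+-- sipHash128 returns a FixedString(16), that is, 128 bits wide.
+SELECT sipHash128(Body, toString(ResourceAttributes), toString(LogAttributes)) AS Fingerprint
+FROM logs
+```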
+
+#### Data Retention
+
+There is a legal question of how long logs need to be stored and whether we allow for their deletion (e.g. due to the leak of some private data or data related to an investigation).
+In some jurisdictions, logs need to be kept for years and there must be no way to delete them.
+This affects deduplication unless we include the ObservedTimestamp in the fingerprint.
+As pointed out in the `Non-Goals` section, this is an issue we are going to tackle in future iterations.
+
+#### Ingestion-time fields
+
+I am intentionally not pulling [semantic convention fields](https://opentelemetry.io/docs/specs/semconv/general/logs/) into separate columns, as users will use countless log formats, and it will probably not be possible to identify properties worth becoming a column.
+
+The `ObservedTimestamp` field is set by the collector during ingestion.
+Users query by the `Timestamp` field, and log pruning is driven by the `ObservedTimestamp` field.
+The disadvantage of this approach is that `TTL DELETE` may not remove parts as early as we would like, because the primary index and the TTL column differ, so the data may not be localized.
+This seems like a good tradeoff though.
+We will offer users a predefined storage period that starts with the ingestion.
+If users ingested logs with timestamps in the future or the past, the pruning of old logs could start too early or too late.
+Users could also abuse the claimed log timestamp to delay pruning.
+The `ObservedTimestamp` approach does not have these issues.
+
+During ingestion, the `SeverityText` field is parsed into `SeverityNumber` if the `SeverityNumber` field has not been set.
+Queries will use the `SeverityNumber` field, as it is more efficient than plain text and offers higher granularity.
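+
+For reference, the OpenTelemetry log data model maps severity text to these numeric ranges:
+
+```plaintext
+TRACE: 1-4   DEBUG: 5-8   INFO: 9-12   WARN: 13-16   ERROR: 17-20   FATAL: 21-24
+```
+
+The proposed table schema then looks as follows: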
+
+```plaintext
+DROP TABLE if exists logs;
+CREATE TABLE logs
+(
+ `ProjectId` String CODEC(ZSTD(1)),
+ `Fingerprint` FixedString(16) CODEC(ZSTD(1)),
+ `Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
+ `ObservedTimestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
+ `TraceId` FixedString(16) CODEC(ZSTD(1)),
+ `SpanId` FixedString(8) CODEC(ZSTD(1)),
+ `TraceFlags` UInt32 CODEC(ZSTD(1)),
+ `SeverityText` LowCardinality(String) CODEC(ZSTD(1)),
+ `SeverityNumber` UInt8 CODEC(ZSTD(1)),
+ `ServiceName` String CODEC(ZSTD(1)),
+ `Body` String CODEC(ZSTD(1)),
+ `ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
+ `LogAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
+ INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
+ INDEX idx_span_id SpanId TYPE bloom_filter(0.001) GRANULARITY 1,
+ INDEX idx_trace_flags TraceFlags TYPE set(2) GRANULARITY 1,
+ INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
+ INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
+ INDEX idx_log_attr_key mapKeys(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
+ INDEX idx_log_attr_value mapValues(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
+ INDEX idx_body Body TYPE tokenbf_v1(143776, 10, 0) GRANULARITY 1
+)
+ENGINE = ReplacingMergeTree
+PARTITION BY toDate(Timestamp)
+ORDER BY (ProjectId, ServiceName, SeverityNumber, toUnixTimestamp(Timestamp), TraceId, Fingerprint)
+TTL toDateTime(ObservedTimestamp) + toIntervalDay(30)
+SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1;
+```
+
+### Query API, querying UI
+
+The main idea behind the query API/workflow introduced by this proposal is to give users the freedom to query, while at the same time limiting both query complexity and query resource usage/execution time.
+We can't foresee how users are going to query their data, nor how the data will look exactly - some will use Attributes, some will just focus on the log level, and so on.
+
+In ClickHouse, individual queries [may have settings](https://clickhouse.com/docs/knowledgebase/configure-a-user-setting), which include [query complexity settings](https://clickhouse.com/docs/en/operations/settings/query-complexity).
+The query limits would be appended to each query automatically by the query service when constructing SQL statements.
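+
+A sketch of what the query service could emit, with placeholder limit values appended via a `SETTINGS` clause:
+
+```plaintext
+SELECT Timestamp, SeverityNumber, ServiceName, Body
+FROM logs
+WHERE ProjectId = '42' AND SeverityNumber >= 17
+ORDER BY Timestamp DESC
+LIMIT 100
+SETTINGS max_execution_time = 10, max_rows_to_read = 10000000
+```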
+
+Fulltext queries against the Log entry `Body` field would be handled transparently by the query service as well, thanks to ClickHouse optimizing `LIKE` queries using Bloom Filters and tokenization of the search term.
+In future iterations we may want to consider n-gram tokenization; for now, queries will be limited to full words only.
+
+It is up for debate whether we want to deduplicate log entries in the UI in case the user ingests duplicates.
+We could use the `max(ObservedTimestamp)` function to avoid duplicated entries in the window between record ingestion and ReplacingMergeTree's eventual deduplication kicking in.
+We will definitely not do this in the first iteration though.
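+
+If we did, a sketch of such a query could collapse rows that share the full sorting key, keeping the most recently observed copy:
+
+```plaintext
+SELECT
+    argMax(Body, ObservedTimestamp) AS Body,
+    max(ObservedTimestamp) AS LastObserved
+FROM logs
+GROUP BY ProjectId, ServiceName, SeverityNumber, toUnixTimestamp(Timestamp), TraceId, Fingerprint
+```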
+
+The query service would also transparently translate the `SeverityText` attributes of the query into `SeverityNumber` while constructing the query.
+
+#### Query Service API schema
+
+We can't allow the UI to send us SQL queries, as that would open the system up to abuse by users.
+We are also unable to support all the use cases that users could come up with when given the full flexibility of the SQL query language.
+So the idea is for the UI to provide a simple, creator-like experience that guides users.
+Something very similar to what GitLab currently has for searching MRs and Issues.
+The UI code would then translate the query that the user came up with into JSON and send it to the query service for processing.
+Based on the JSON received, the query service would then template an SQL query together with the query limits we mentioned above.
+
+For now, the UI and the JSON API would support only a basic set of operations on given fields:
+
+- Timestamp: `>`, `<`, `==`
+- TraceId: `==`, later iterations `in`
+- SpanId: `==`, later iterations `in`
+- TraceFlags: `==`, `!=`, later iterations:`in`, `notIn`
+- SeverityText: `==`, `!=`, later iterations: `in`, `notIn`
+- SeverityNumber: `<`,`>`, `==`, `!=`, later iterations: `in`, `notIn`
+- ServiceName: `==`, `!=`, later iterations: `in`, `notIn`
+- Body: `==`, `CONTAINS`
+- ResourceAttributes: `key==value`, `mapContains(key)`
+- LogAttributes: `key==value`, `mapContains(key)`
+
+The format of the intermediate JSON could look like the following:
+
+```yaml
+{
+  "query": [
+    {
+      "type": "()|AND|OR",
+      "operands": [...]
+    },
+    {
+      "type": "==|!=|<|>|CONTAINS",
+      "column": "...",
+      "val": "..."
+    }
+  ]
+}
+```
+
+The `==|!=|<|>|CONTAINS` operators are non-nesting operands; they operate on concrete columns and result in `WHERE` conditions after being processed by the query service.
+The `()|AND|OR` are nesting operands and can only include other non-nesting operands.
+We may defer the implementation of the nesting operands for later iterations.
+There is an implicit AND between the operands at the top level of the query structure.
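+
+A hypothetical concrete instance (column names come from the schema above; the values are illustrative) that the query service would translate to `WHERE SeverityNumber > 16 AND ServiceName = 'checkout'`:
+
+```yaml
+{
+  "query": [
+    { "type": ">", "column": "SeverityNumber", "val": "16" },
+    { "type": "==", "column": "ServiceName", "val": "checkout" }
+  ]
+}
+```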
+
+The query schema is intentionally kept simple compared to [the one used in the metrics proposal](../observability_metrics/index.md#api-structure).
+We may add fields like `QueryContext`, `BackendContext`, etc... in later iterations once a need arises.
+For now, we keep the schema as simple as possible and just make sure that the API is versioned so that we can change it easily in the future.
+
+## Open questions
+
+### Logging SDK Maturity
+
+The OTEL standard does not intend to provide a standalone SDK for logging like it did for e.g. tracing.
+It may consider doing so only for a programming language that does not have its own logging libraries, which should be a pretty rare thing.
+All the existing logging libraries should instead use the [Bridge API](https://opentelemetry.io/docs/specs/otel/logs/bridge-api/) to interact with the OTEL collector and send logs using the OTEL Logs standard.
+
+The majority of languages have already made the required adjustments, except for Go.
+There is only very minimal support for Go ([repo](https://github.com/agoda-com/opentelemetry-go), [repo](https://github.com/agoda-com/opentelemetry-logs-go)).
+The official Uber Zap repository has little more than an [issue](https://github.com/uber-go/zap/issues/654) about emitting events in spans.
+The OpenTelemetry [status page](https://opentelemetry.io/docs/instrumentation/go/) states that Go logging support is not implemented yet.
+
+The lack of native OTEL SDK support for emitting logs in Go may be an issue for us if we want to dogfood logging.
+We could work around these limitations to a large extent by parsing log files using [filelogreceiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver) or [fluentd](https://docs.fluentbit.io/manual/pipeline/outputs/opentelemetry).
+Contributing and improving the support of Go in OTEL is also a valid option.
+
+## Future work
+
+### Support for != operators in queries
+
+Bloom filters that we use in schemas do not allow for testing if the given term is NOT present in the log entry's body/attributes.
+This is a small but valid use case.
+A solution for that may be [inverted indexes](https://clickhouse.com/blog/clickhouse-search-with-inverted-indices) but this is still an experimental feature.
+
+### Documentation
+
+As part of the documentation effort, we may want to provide examples of how sending data to GOB can be done in different languages (uber-zap, logrus, log4j, etc...) just like we do for error tracking.
+Some applications can't be easily modified to send data to us (e.g. systemd/journald), so log tailing/parsing needs to be employed using [filelogreceiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver) or [fluentd](https://docs.fluentbit.io/manual/pipeline/outputs/opentelemetry).
+We could probably address both cases above by instrumenting our infrastructure and linking to our code from the documentation.
+This way we can dogfood our solution, save some money (as the GCE logging solution is pretty expensive), and give users real-life examples of how they can instrument their infrastructure.
+
+This could be one of the follow-up tasks once we are done with the implementation.
+
+### User query resource usage monitoring
+
+Long-term, we will need a way to monitor the number of user queries that failed due to limits enforcement and resource usage in general to fine-tune the query limits and make sure that users are not too aggressively restricted.
+
+## Iterations
+
+Please refer to [Observability Group planning epic](https://gitlab.com/groups/gitlab-org/opstrace/-/epics/92) and its linked issues for up-to-date information.
diff --git a/doc/architecture/blueprints/observability_logging/system_overview.png b/doc/architecture/blueprints/observability_logging/system_overview.png
new file mode 100644
index 00000000000..30c6510c3dc
--- /dev/null
+++ b/doc/architecture/blueprints/observability_logging/system_overview.png
Binary files differ
diff --git a/doc/architecture/blueprints/organization/diagrams/organization-isolation-broken.drawio.png b/doc/architecture/blueprints/organization/diagrams/organization-isolation-broken.drawio.png
new file mode 100644
index 00000000000..cd1301bb0bc
--- /dev/null
+++ b/doc/architecture/blueprints/organization/diagrams/organization-isolation-broken.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/organization/diagrams/organization-isolation.drawio.png b/doc/architecture/blueprints/organization/diagrams/organization-isolation.drawio.png
new file mode 100644
index 00000000000..a9ff4ae5165
--- /dev/null
+++ b/doc/architecture/blueprints/organization/diagrams/organization-isolation.drawio.png
Binary files differ
diff --git a/doc/architecture/blueprints/organization/index.md b/doc/architecture/blueprints/organization/index.md
index 258a624e371..49bf18442e9 100644
--- a/doc/architecture/blueprints/organization/index.md
+++ b/doc/architecture/blueprints/organization/index.md
@@ -323,6 +323,7 @@ In iteration 2, an Organization MVC Experiment will be released. We will test th
- Organizations can be deleted.
- Organization Owners can access the Activity page for the Organization.
- Forking across Organizations will be defined.
+- [Organization Isolation](isolation.md) will be finished to meet the requirements of the initial set of customers.
### Iteration 3: Organization MVC Beta (FY25Q1)
@@ -333,6 +334,7 @@ In iteration 3, the Organization MVC Beta will be released.
- Organization Owners can create, edit and delete Groups from the Groups overview.
- Organization Owners can create, edit and delete Projects from the Projects overview.
- The Organization URL path can be changed.
+- [Organization Isolation](isolation.md) is available.
### Iteration 4: Organization MVC GA (FY25Q2)
@@ -398,3 +400,4 @@ See [Organization: Frequently Asked Questions](organization-faq.md).
- [Cells blueprint](../cells/index.md)
- [Cells epic](https://gitlab.com/groups/gitlab-org/-/epics/7582)
- [Namespaces](../../../user/namespace/index.md)
+- [Organization Isolation](isolation.md)
diff --git a/doc/architecture/blueprints/organization/isolation.md b/doc/architecture/blueprints/organization/isolation.md
new file mode 100644
index 00000000000..238269c4329
--- /dev/null
+++ b/doc/architecture/blueprints/organization/isolation.md
@@ -0,0 +1,152 @@
+---
+status: ongoing
+creation-date: "2023-10-11"
+authors: [ "@DylanGriffith" ]
+coach:
+approvers: [ "@lohrc", "@alexpooley" ]
+owning-stage: "~devops::data stores"
+participating-stages: []
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+# Organization Isolation
+
+This blueprint details requirements for Organizations to be isolated.
+Watch a [video introduction](https://www.youtube.com/watch?v=kDinjEHVVi0) that summarizes what Organization isolation is and why we need it.
+Read more about what an Organization is in [Organization](index.md).
+
+## What?
+
+<img src="diagrams/organization-isolation.drawio.png" width="800">
+
+All Cell-local data and functionality in GitLab (all data except the few
+things that need to exist on all Cells in the cluster) must be isolated.
+Isolation means that data or features can never cross Organization boundaries.
+Many features in GitLab can link data together.
+A few examples of things that would be disallowed by Organization Isolation are:
+
+1. [Related issues](../../../user/project/issues/related_issues.md): Users would not be able to take an issue in one Project in `Organization A` and relate that issue to another issue in a Project in `Organization B`.
+1. [Share a project/group with a group](../../../user/group/manage.md#share-a-group-with-another-group): Users would not be allowed to share a Group or Project in `Organization A` with another Group or Project in `Organization B`.
+1. [System notes](../../../user/project/system_notes.md): Users would not get a system note added to an issue in `Organization A` if it is mentioned in a comment on an issue in `Organization B`.
+
+## Why?
+
+<img src="diagrams/organization-isolation-broken.drawio.png" width="800">
+
+[GitLab Cells](../cells/index.md) depend on using the Organization as the sharding key, which will allow us to shard data between different Cells.
+Initially, when we start rolling out Organizations, we will be working with a single Cell `Cell 1`.
+`Cell 1` is our current GitLab.com deployment.
+Newly created Organizations will be created on `Cell 1`.
+Once Cells are ready, we will deploy `Cell 2` and begin migrating Organizations from `Cell 1` to `Cell 2`.
+Migrating workloads off `Cell 1` will be critical to allowing us to rebalance our data across a fleet of servers and eventually run much smaller GitLab instances (and databases).
+
+If today we allowed users to create Organizations that linked to data in other Organizations, these links would suddenly break when an Organization is moved to a different Cell (because it won't know about the other Organization).
+For this reason we need to ensure from the very beginning of rolling out Organizations to customers that it is impossible to create any links that cross the Organization boundary, even when Organizations are still on the same Cell.
+If we don't, we will create even more entangled data that cannot be migrated between Cells.
+Not fulfilling the requirement of isolation means we risk creating a new top-level data wrapper (Organization) that cannot actually be used as a sharding key.
+
+The Cells project initially started with the assumption that we'd be able to shard by top-level Groups.
+We quickly learned that there were no constraints in the application that isolated top-level Groups.
+Many users (including ourselves) had created multiple top-level Groups and linked data across them.
+So we decided that the only way to create a viable sharding key was to create another wrapper around top-level Groups.
+Organizations were something our customers already wanted, to gain more administrative capabilities like those available in self-managed instances and to aggregate data across multiple top-level Groups, so this became a logical choice.
+Again, this led us to realize that we cannot allow multiple Organizations to get mixed together the same way top-level Groups did, otherwise we will end up back where we started.
+
+## How?
+
+Multiple POCs have been implemented to demonstrate how we will provide robust developer-facing and customer-facing constraints in the GitLab application and database that enforce the described isolation.
+These are:
+
+1. [Enforce Organization Isolation based on `project_id` and `namespace_id` column on every table](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/133576)
+1. [Enforce Organization Isolation based on `organization_id` on every table](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/129889)
+1. [Validate if a top-level group is isolated to be migrated to an Organization](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/131968)
+
+The major constraint these POCs were trying to overcome was that there is no standard way in the GitLab application or database to even determine what Organization (or Project or namespace) a piece of data belongs to.
+This means that the first step is to implement a standard way to efficiently find the parent Organization for any model or row in the database.
+
+The proposed solution is ensuring that every single table that exists in the `gitlab_main_cell` and `gitlab_ci_cell` (Cell-local) databases must include a valid sharding key that is either `project_id` or `namespace_id`.
+At first we considered enforcing everything to have an `organization_id`, but we determined that this would be too expensive to update for customers that need to migrate large Groups out of the default Organization.
+An added benefit is that more than half of our tables already have one of these columns.
+Additionally, if we can't consistently attribute data to a top-level Group, then we won't be able to validate if a top-level Group is safe to be moved to a new Organization.
+
+Once we have consistent sharding keys, we can use them to validate on insert that data does not cross any Organization boundaries (a sketch of such a guard follows the list below).
+We can also use these sharding keys to help us decide whether:
+
+- Existing namespaces in the default Organization can be moved safely to a new Organization, because the namespace is already isolated.
+- The namespace owner would need to remove some links before migrating to a new Organization.
+- A set of namespaces is isolated as a group and could be moved together in bulk to a new Organization.
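+
+A minimal Ruby sketch of such an insert-time guard, assuming each record's sharding key can be resolved to an `organization_id` (the error class and method name are hypothetical, not the actual GitLab implementation):
+
+```ruby
+# Hypothetical insert-time guard for cross-Organization links.
+CrossOrganizationLinkError = Class.new(StandardError)
+
+def validate_same_organization!(source, target)
+  # Both records resolve their sharding key to an Organization.
+  return if source.organization_id == target.organization_id
+
+  raise CrossOrganizationLinkError,
+    "#{source.class.name} ##{source.id} and #{target.class.name} ##{target.id} " \
+    "belong to different Organizations"
+end
+```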
+
+## Detailed steps
+
+1. Implement developer facing documentation explaining the requirement to add these sharding keys and how they should choose between `project_id` and `namespace_id`.
+1. Add a way to declare a sharding key in `db/docs` and automatically populate it for all tables that already have a sharding key (see the sketch after this list).
+1. Implement automation in our CI pipelines and/or DB migrations that makes it impossible to create new tables without a sharding key.
+1. Implement a way for people to declare a desired sharding key in `db/docs` as
+ well as a path to the parent table from which it is migrated. This will only be
+ needed temporarily for tables that don't have a sharding key.
+1. Attempt to populate as many "desired sharding key" entries as possible in an
+ automated way and delegate the MRs to other teams.
+1. Fan out issues to other teams to manually populate the remaining "desired
+ sharding key" entries.
+1. Start manually creating, then automating, the creation of migrations for
+ tables to populate sharding keys from "desired sharding key".
+1. Once all tables have sharding keys or "desired sharding key", we ship an
+ evolved version of the
+ [POC](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/133576), which
+ will enforce that newly inserted data cannot cross Organization boundaries.
+ This may need to be expanded to more than just foreign keys, and should also
+ include loose foreign keys and possibly any relationships described in
+ models. It can temporarily depend on inferring, at runtime, the sharding key
+ from the "desired sharding key" which will be a less performant option while
+ we backfill the sharding keys to all tables but allow us to unblock
+ implementing the isolation rules and user experience of isolation.
+1. Finish migration of ~300 tables that are missing a sharding key:
+ 1. The Tenant Scale team migrates the first few tables.
+ 1. We build a dashboard showing our progress and continue to create
+ automated MRs for the sharding keys that can be automatically inferred,
+ and automate creating issues for all the sharding keys that can't be
+ automatically inferred.
+1. Validate that all existing `project_id` and `namespace_id` columns on all Cell-local tables can reliably be assumed to be the sharding key. This requires assigning issues to teams to confirm that these columns aren't used for some other purpose that would actually not be suitable. If there is an issue with a table we need to migrate and rename these columns, and then add a new `project_id` or `namespace_id` column with the correct sharding key.
+1. We allow customers to create new Organizations without the option to migrate namespaces into them. All namespaces need to be newly created in their new Organization.
+1. Implement new functionality in GitLab similar to the [POC](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/131968), which allows a namespace owner to see if their namespace is fully isolated.
+1. Implement functionality that allows namespace owners to migrate an existing namespace from one Organization to another. Most likely this will be existing customers that want to migrate their namespace out of the default Organization into a newly created Organization. Only isolated namespaces as implemented in the previous step will be allowed to move.
+1. Expand functionality to validate if a namespace is isolated, so that users can select multiple namespaces they own and validate that the selected group of namespaces is isolated. Links between the selected namespaces would stay intact.
+1. Implement functionality that allows namespace owners to migrate multiple existing namespaces from one Organization to another. Only isolated namespaces as implemented in the previous step will be allowed to move.
+1. We build better tooling to help namespace owners with cleaning up unwanted links outside of their namespace to allow more customers to migrate to a new Organization. This step would be dependent on the number of existing customers that actually have links to clean up.
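+
+A hypothetical `db/docs` entry with such a declaration might look like the following (the `sharding_key` field name and its format are illustrative, not a settled convention):
+
+```yaml
+---
+table_name: issues
+classes:
+- Issue
+feature_categories:
+- team_planning
+gitlab_schema: gitlab_main_cell
+# Illustrative only: sharding key column -> table it references.
+sharding_key:
+  namespace_id: namespaces
+```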
+
+The implementation of this effort will be tracked in [#11670](https://gitlab.com/groups/gitlab-org/-/epics/11670).
+
+## Alternatives considered
+
+### Add any data that need to cross Organizations to cluster-wide tables
+
+We plan on having some data at the cluster level in our Cells architecture (for example
+Users), so it might stand to reason that any data that needs to cross Organization
+boundaries could be made cluster-wide, and that this would solve the problem.
+
+This could be an option for a limited set of features and may turn out to be
+necessary for some critical workflows.
+However, this should not become the default option, because it will ultimately lead to the Cells architecture not achieving the horizontal scaling goals.
+Features like [sharing a group with a group](../../../user/group/manage.md#share-a-group-with-another-group) are very tightly connected to some of the worst performing functionality in our
+application with regard to scalability.
+We are hoping that by splitting up our databases in Cells we will be able to unlock more scaling headroom and reduce the problems associated with supporting these features.
+
+### Do nothing and treat these anomalies as an acceptable edge case
+
+This idea hasn't been explored deeply but is rejected on the basis that these
+anomalies will appear as data loss while moving customer data between Cells.
+Data loss is a very serious kind of bug, especially when customers are not opting into being moved between servers.
+
+### Solve these problems feature by feature
+
+This could be done, for example, by implementing an application rule that
+prevents users from adding an issue link between Projects in different Organizations.
+We would need to find all such features by asking teams, and
+they would need to fix them all as special-case business rules.
+
+This may be a viable, less robust option, but it does not give us a lot of confidence in our system.
+Without a robust way to ensure that all Organization data is isolated, we would have to trust that each feature we implement has been manually checked.
+This creates a real risk that we miss something, and again we would end up with customer data loss.
+Another challenge here is that if we are not confident in our isolation constraints, then we may end up attributing various unrelated bugs to possible data loss.
+As such it could become a rabbit hole to debug all kinds of unrelated bugs.
diff --git a/doc/architecture/blueprints/runner_admission_controller/index.md b/doc/architecture/blueprints/runner_admission_controller/index.md
index 92c824527ec..21dc1d53303 100644
--- a/doc/architecture/blueprints/runner_admission_controller/index.md
+++ b/doc/architecture/blueprints/runner_admission_controller/index.md
@@ -1,7 +1,7 @@
---
status: proposed
creation-date: "2023-03-07"
-authors: [ "@ajwalker" ]
+authors: [ "@ajwalker", "@johnwparent" ]
coach: [ "@ayufan" ]
approvers: [ "@DarrenEastman", "@engineering-manager" ]
owning-stage: "~devops::<stage>"
@@ -14,7 +14,7 @@ The GitLab `admission controller` (inspired by the [Kubernetes admission control
An admission controller can be registered to the GitLab instance and receive a payload containing jobs to be created. Admission controllers can be _mutating_, _validating_, or both.
-- When _mutating_, mutatable job information can be modified and sent back to the GitLab instance. Jobs can be modified to conform to organizational policy, security requirements, or have, for example, their tag list modified so that they're routed to specific runners.
+- When _mutating_, mutable job information can be modified and sent back to the GitLab instance. Jobs can be modified to conform to organizational policy, security requirements, or have, for example, their tag list modified so that they're routed to specific runners.
- When _validating_, a job can be denied execution.
## Motivation
@@ -35,12 +35,12 @@ Before going further, it is helpful to level-set the current job handling mechan
- On the request from a runner to the API for a job, the database is queried to verify that the job parameters matches that of the runner. In other words, when runners poll a GitLab instance for a job to execute they're assigned a job if it matches the specified criteria.
- If the job matches the runner in question, then the GitLab instance connects the job to the runner and changes the job state to running. In other words, GitLab connects the `job` object with the `Runner` object.
- A runner can be configured to run un-tagged jobs. Tags are the primary mechanism used today to enable customers to have some control of which Runners run certain types of jobs.
-- So while runners are scoped to the instance, group, or project, there are no additional access control mechanisms today that can easily be expanded on to deny access to a runner based on a user or group identifier.
+- So while runners are scoped to the instance, group, or project, there are no additional access control mechanisms today that can be expanded on to deny access to a runner based on a user or group identifier.
-The current CI jobs queue logic is as follows. **Note - in the code ww still use the very old `build` naming construct, but we've migrated from `build` to `job` in the product and documentation.
+The current CI jobs queue logic is as follows. **Note:** in the code we still use the very old `build` naming construct, but we've migrated from `build` to `job` in the product and documentation.
```ruby
-jobs =
+jobs =
if runner.instance_type?
jobs_for_shared_runner
elsif runner.group_type?
@@ -96,22 +96,31 @@ Each runner has a tag such as `zone_a`, `zone_b`. In this scenario the customer
1. When a job is created the `project information` (`project_id`, `job_id`, `api_token`) will be used to query GitLab for specific details.
1. If the `user_id` matches then the admissions controller modifies the job tag list. `zone_a` is added to the tag list as the controller has detected that the user triggering the job should have their jobs run IN `zone_a`.
+**Scenario 3**: Runner pool with specific tag scheme, user only has access to a specific subset
+
+Each runner has a tag identifier unique to that runner, e.g. `DiscoveryOne`, `tugNostromo`, `MVSeamus`, etc. Users have arbitrary access to these runners; however, we don't want to fail a job on access denial. Instead, we want to prevent the job from being executed on runners to which the user does not have access, without otherwise reducing the pool of runners the job can run on.
+
+1. Configure an admissions controller to mutate jobs based on `user_id`.
+1. When a job is created the `project information` (`project_id`, `job_id`, `api_token`) will be used to query GitLab for specific details.
+1. The admission controller queries available runners with the `user_id` and collects all runners on which the job cannot run. If this is _all_ runners, the admission controller rejects the job, which is dropped. No tags are modified, and a message is included indicating the reasoning. If there are runners for which the user has permissions, the admission controller filters out the runners for which there are no permissions.
+
### MVC
#### Admission controller
1. A single admission controller can be registered at the instance level only.
-1. The admission controller must respond within 30 seconds.
-1. The admission controller will receive an array of individual jobs. These jobs may or may not be related to each other. The response must contain only responses to the jobs made as part of the request.
+1. The admission controller must respond within 1 hour.
+1. The admission controller will receive individual jobs. The response must contain only the response to that job.
+1. The admission controller will receive an API callback for rejection and acceptance, with the acceptance callback accepting mutation parameters.
#### Job Lifecycle
-1. The lifecycle of a job will be updated to include a new `validating` state.
+1. The `preparing` job state will be expanded to include the validation process prerequisite.
```mermaid
stateDiagram-v2
- created --> validating
- state validating {
+ created --> preparing
+ state preparing {
[*] --> accept
[*] --> reject
}
@@ -127,10 +136,12 @@ Each runner has a tag such as `zone_a`, `zone_b`. In this scenario the customer
executed --> created: retry
```
-1. When the state is `validating`, the mutating webhook payload is sent to the admission controller.
-1. For jobs where the webhook times out (30 seconds) their status should be set as though the admission was denied. This should
+1. When the state is `preparing`, the mutating webhook payload is sent to the admission controller asynchronously. This will be retried a number of times as needed.
+1. The `preparing` state will wait for a response from the webhook or until timeout.
+1. The UI should be updated with the current status of the job prerequisites and admission.
+1. For jobs where the webhook times out (1 hour), their status should be set as though the admission was denied, with a timeout reason. This should
be rare in typical circumstances.
-1. Jobs with denied admission can be retried. Retried jobs will be resent to the admission controller along with any mutations that they received previously.
+1. Jobs with denied admission can be retried. Retried jobs will be resent to the admission controller without resetting tag mutations or runner filtering.
1. [`allow_failure`](../../../ci/yaml/index.md#allow_failure) should be updated to support jobs that fail on denied admissions, for example:
```yaml
@@ -141,8 +152,8 @@ be rare in typical circumstances.
on_denied_admission: true
```
-1. The UI should be updated to display the reason for any job mutations (if provided).
-1. A table in the database should be created to store the mutations. Any changes that were made, like tags, should be persisted and attached to `ci_builds` with `acts_as_taggable :admission_tags`.
+1. The UI should be updated to display the reason for any job mutations (if provided) or rejection.
+1. Tag modifications applied by the Admission Controller should be persisted by the system, with associated reasoning for any modifications, acceptances, or rejections.
#### Payload
@@ -153,8 +164,10 @@ be rare in typical circumstances.
1. The response payload is comprised of individual job entries consisting of:
- Job ID.
- Admission state: `accepted` or `denied`.
- - Mutations: Only `tags` is supported for now. The tags provided replaces the original tag list.
+ - Mutations: `additions` and `removals`. `additions` supplements the existing set of tags; `removals` removes tags from the current tag list.
- Reason: A controller can provide a reason for admission and mutation.
+ - Accepted Runners: runners to be considered for job matching; can be empty to consider all runners.
+ - Rejected Runners: runners that should not be considered for job matching; can be empty to exclude no runners.
##### Example request
@@ -170,7 +183,9 @@ be rare in typical circumstances.
...
},
"tags": [ "docker", "windows" ]
- },
+ }
+]
+[
{
"id": 245,
"variables": {
@@ -180,7 +195,9 @@ be rare in typical circumstances.
...
},
"tags": [ "linux", "eu-west" ]
- },
+ }
+]
+[
{
"id": 666,
"variables": {
@@ -202,20 +219,29 @@ be rare in typical circumstances.
"id": 123,
"admission": "accepted",
"reason": "it's always-allow-day-wednesday"
- },
+ }
+]
+[
{
"id": 245,
"admission": "accepted",
- "mutations": {
- "tags": [ "linux", "us-west" ]
+ "tags": {
+ "add": [ "linux", "us-west" ],
+ "remove": [...]
},
- "reason": "user is US employee: retagged region"
- },
+ "runners": {
+ "accepted_ids": ["822993167"],
+ "rejected_ids": ["822993168"]
+ },
+ "reason": "user is US employee: retagged region; user only has uid on runner 822993167"
+ }
+]
+[
{
"id": 666,
"admission": "rejected",
"reason": "you have no power here"
- },
+ }
]
```
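+
+For illustration, a minimal Go sketch of an admission controller implementing this contract follows. The callback URL, the `X-Admission-Token` header, and the exact field names are assumptions for illustration only; this blueprint defines just the request and response shapes shown above.
+
+```go
+package main
+
+import (
+	"bytes"
+	"encoding/json"
+	"log"
+	"net/http"
+)
+
+// Job carries the request payload fields used here; the full payload
+// also includes variables.
+type Job struct {
+	ID   int64    `json:"id"`
+	Tags []string `json:"tags"`
+}
+
+// TagMutations and Decision mirror the response payload shown above.
+type TagMutations struct {
+	Add    []string `json:"add"`
+	Remove []string `json:"remove"`
+}
+
+type Decision struct {
+	ID        int64         `json:"id"`
+	Admission string        `json:"admission"` // "accepted" or "rejected"
+	Reason    string        `json:"reason,omitempty"`
+	Tags      *TagMutations `json:"tags,omitempty"`
+}
+
+func handleAdmission(w http.ResponseWriter, r *http.Request) {
+	var jobs []Job
+	if err := json.NewDecoder(r.Body).Decode(&jobs); err != nil {
+		http.Error(w, err.Error(), http.StatusBadRequest)
+		return
+	}
+	// Capture the one-time auth token before the handler returns.
+	token := r.Header.Get("X-Admission-Token")
+	w.WriteHeader(http.StatusAccepted) // the decision is made asynchronously
+
+	go func() {
+		decisions := make([]Decision, 0, len(jobs))
+		for _, j := range jobs {
+			decisions = append(decisions, Decision{
+				ID:        j.ID,
+				Admission: "accepted",
+				Reason:    "policy ok",
+				Tags:      &TagMutations{Add: []string{"zone_a"}},
+			})
+		}
+		body, _ := json.Marshal(decisions)
+		req, _ := http.NewRequest(http.MethodPost,
+			"https://gitlab.example.com/api/v4/jobs/admission_callback",
+			bytes.NewReader(body))
+		req.Header.Set("Content-Type", "application/json")
+		req.Header.Set("X-Admission-Token", token)
+		if _, err := http.DefaultClient.Do(req); err != nil {
+			log.Println("callback failed:", err)
+		}
+	}()
+}
+
+func main() {
+	http.HandleFunc("/admission", handleAdmission)
+	log.Fatal(http.ListenAndServe(":8080", nil))
+}
+```
+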
@@ -229,13 +255,32 @@ be rare in typical circumstances.
### Implementation Details
-1. _placeholder for steps required to code the admissions controller MVC_
+#### GitLab
+
+1. Expand the `preparing` state to engage the validation process via the `prerequisite` interface.
+1. Amend the `preparing` state to indicate to the user, via the UI and API, the status of job preparation with regard to the job prerequisites
+ 1. Should indicate the status of each prerequisite resource for the job separately, as they are asynchronous
+ 1. Should indicate the overall prerequisite status
+1. Introduce a 1 hour timeout for the entire `preparing` state
+1. Add an `AdmissionValidation` prerequisite to the `preparing` status dependencies via `Gitlab::Ci::Build::Prerequisite::Factory`
+1. Convert the Prerequisite factory and `preparing` status to operate asynchronously
+1. Convert `PreparingBuildService` to operate asynchronously
+1. `PreparingBuildService` transitions the job from `preparing` to `failed` or `pending`, depending on the outcome of validation.
+1. `AdmissionValidation` performs a reasonable number of retries when sending the request
+1. Add an API endpoint for the webhook/admission controller response callback
+ 1. Accepts parameters:
+ - Acceptance/Rejection
+ - Reason String
+ - Tag mutations (if accepted, otherwise ignored)
+ 1. The callback encodes a one-time auth token
+1. Introduce a new failure reason for validation rejection
+1. Admission controller impacts on the job should be persisted
+1. Add per-job runner selection filtering as a function of the response from the admission controller (mutating webhook)
## Technical issues to resolve
| Issue | Resolution |
| ------ | ------ |
-|We may have conflicting tag-sets as mutating controller can make it possible to define AND, OR and NONE logical definition of tags. This can get quite complex quickly. | |
|Rule definition for the queue webhook| |
|What data to send to the admissions controller? Is it a subset or all of the [predefined variables](../../../ci/variables/predefined_variables.md)?| |
|Is the `queueing web hook` able to run at GitLab.com scale? On GitLab.com we would trigger millions of webhooks per second and the concern is that would overload Sidekiq or be used to abuse the system.| |
diff --git a/doc/architecture/blueprints/secret_detection/index.md b/doc/architecture/blueprints/secret_detection/index.md
index fc97ca71d7f..76bf6dd4088 100644
--- a/doc/architecture/blueprints/secret_detection/index.md
+++ b/doc/architecture/blueprints/secret_detection/index.md
@@ -26,28 +26,22 @@ job logs, and project management features such as issues, epics, and MRs.
### Goals
-- Support asynchronous secret detection for the following scan targets:
- - push events
- - issuable creation
- - issuable updates
- - issuable comments
+- Support platform-wide detection of tokens to avoid secret leaks
+- Prevent exposure by rejecting detected secrets
+- Provide a scalable means of detection without harming the end-user experience
-### Non-Goals
+See [target types](#target-types) for scan target priorities.
-The current proposal is limited to asynchronous detection and alerting only.
+### Non-Goals
-**Blocking** secrets on push events is high-risk to a critical path and
-would require extensive performance profiling before implementing. See
-[a recent example](https://gitlab.com/gitlab-org/gitlab/-/issues/246819#note_1164411983)
-of a customer incident where this was attempted.
+The initial proposal is limited to detection and alerting across the platform, with rejection only
+during [pre-receive Git interactions and browser-based detection](#iterations).
Secret revocation and rotation is also beyond the scope of this new capability.
For scanned object types beyond the scope of this MVC,
-- Media types (JPEGs, PDFs,...)
-- Snippets
-- Wikis
+see [target types](#target-types) for scan target priorities.
#### Management UI
@@ -69,7 +63,13 @@ which remain focused on active detection.
## Proposal
-To achieve scalable secret detection for a variety of domain objects a dedicated
+The first iteration of the experimental capability will feature a blocking
+pre-receive hook implemented within the Rails application. This iteration
+will be released in an experimental state to select users and provide an
+opportunity for the team to profile the capability before considering extraction
+into a dedicated service.
+
+In the future state, to achieve scalable secret detection for a variety of domain objects, a dedicated
scanning service must be created and deployed alongside the GitLab distribution.
This is referred to as the `SecretScanningService`.
@@ -94,10 +94,10 @@ as self-managed instances.
The critical paths as outlined under [goals above](#goals) cover two major object
types: Git blobs (corresponding to push events) and arbitrary text blobs.
-The detection flow for push events relies on subscribing to the PostReceive hook
-to enqueue Sidekiq requests to the `SecretScanningService`. The `SecretScanningService`
-service fetches enqueued refs, queries Gitaly for the ref blob contents, scans
-the commit contents, and notifies the Rails application when a secret is detected.
+The detection flow for push events relies on subscribing to the PreReceive hook
+to scan commit data using the [PushCheck interface](https://gitlab.com/gitlab-org/gitlab/blob/3f1653f5706cd0e7bbd60ed7155010c0a32c681d/lib/gitlab/checks/push_check.rb). The `SecretScanningService`
+fetches the specified blob contents from Gitaly, scans
+the commit contents, and rejects the push when a secret is detected.
See [Push event detection flow](#push-event-detection-flow) for sequence.
The detection flow for arbitrary text blobs, such as issue comments, relies on
@@ -112,13 +112,33 @@ storage. See discussion [in this issue](https://gitlab.com/groups/gitlab-org/-/e
around scanning during streaming and the added complexity in buffering lookbacks
for arbitrary trace chunks.
-In any case of detection, the Rails application manually creates a vulnerability
+In the case of a push detection, the commit is rejected and an error is returned to the end user.
+In any other case of detection, the Rails application manually creates a vulnerability
using the `Vulnerabilities::ManuallyCreateService` to surface the finding in the
existing Vulnerability Management UI.
See [technical discovery](https://gitlab.com/gitlab-org/gitlab/-/issues/376716)
for further background exploration.
+### Target types
+
+Target object types refer to the scanning targets prioritized for detection of leaked secrets.
+
+In order of priority this includes:
+
+1. non-binary Git blobs
+1. job logs
+1. issuable creation (issues, MRs, epics)
+1. issuable updates (issues, MRs, epics)
+1. issuable comments (issues, MRs, epics)
+
+Targets out of scope for the initial phases include:
+
+- Media types (JPEG, PDF, ...)
+- Snippets
+- Wikis
+- Container images
+
### Token types
The existing Secret Detection configuration covers ~100 rules across a variety
@@ -135,16 +155,17 @@ Token types to identify in order of importance:
### Detection engine
-Our current secret detection offering utilizes [Gitleaks](https://github.com/zricethezav/gitleaks/)
+Our current secret detection offering uses [Gitleaks](https://github.com/zricethezav/gitleaks/)
for all secret scanning in pipeline contexts. By using its `--no-git` configuration
we can scan arbitrary text blobs outside of a repository context and continue to
-utilize it for non-pipeline scanning.
+use it for non-pipeline scanning.
-Given our existing familiarity with the tool and its extensibility, it should
-remain our engine of choice. Changes to the detection engine are out of scope
-unless benchmarking unveils performance concerns.
+In the case of pre-receive detection, we rely on a combination of keyword/substring matches
+for pre-filtering and `re2` for regex detections. See the [spike issue](https://gitlab.com/gitlab-org/gitlab/-/issues/423832) for initial benchmarks.
-Notable alternatives include high-performance regex engines such as [hyperscan](https://github.com/intel/hyperscan) or it's portable fork [vectorscan](https://github.com/VectorCamp/vectorscan).
+Changes to the detection engine are out of scope until benchmarking unveils performance concerns.
+
+Notable alternatives include high-performance regex engines such as [Hyperscan](https://github.com/intel/hyperscan) or its portable fork [Vectorscan](https://github.com/VectorCamp/vectorscan).
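+
+A rough sketch of the keyword pre-filter plus RE2 approach described above; the rule name and pattern are illustrative, and Go's `regexp` package is used here because it implements RE2 semantics:
+
+```go
+package main
+
+import (
+	"fmt"
+	"regexp"
+	"strings"
+)
+
+// rule pairs a cheap substring gate with a regex that only runs
+// when the gate matches.
+type rule struct {
+	name    string
+	keyword string
+	pattern *regexp.Regexp
+}
+
+// Illustrative rule only; the real ruleset covers ~100 token types.
+var rules = []rule{
+	{"gitlab_pat", "glpat-", regexp.MustCompile(`glpat-[0-9A-Za-z_\-]{20}`)},
+}
+
+func scan(blob string) []string {
+	var findings []string
+	for _, r := range rules {
+		if !strings.Contains(blob, r.keyword) {
+			continue // pre-filter: skip the expensive regex
+		}
+		if r.pattern.MatchString(blob) {
+			findings = append(findings, r.name)
+		}
+	}
+	return findings
+}
+
+func main() {
+	fmt.Println(scan("token = glpat-" + strings.Repeat("x", 20)))
+}
+```
+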
### High-level architecture
@@ -167,37 +188,42 @@ for past discussion around scaling approaches.
sequenceDiagram
autonumber
actor User
- User->>+Workhorse: git push
+ User->>+Workhorse: git push with-secret
+ Workhorse->>+Gitaly: tcp
+ Gitaly->>+Rails: PreReceive
+ Rails->>-Gitaly: ListAllBlobs
+ Gitaly->>-Rails: ListAllBlobsResponse
+
+ Rails->>+GitLabSecretDetection: Scan(blob)
+ GitLabSecretDetection->>-Rails: found
+
+ Rails->>User: rejected: secret found
+
+ User->>+Workhorse: git push without-secret
Workhorse->>+Gitaly: tcp
- Gitaly->>+Rails: grpc
- Sidekiq->>+Rails: poll job
- Rails->>-Sidekiq: PostReceive worker
- Sidekiq-->>+Sidekiq: enqueue PostReceiveSecretScanWorker
-
- Sidekiq->>+Rails: poll job
- loop PostReceiveSecretScanWorker
- Rails->>-Sidekiq: PostReceiveSecretScanWorker
- Sidekiq->>+SecretScanningSvc: ScanBlob(ref)
- SecretScanningSvc->>+Sidekiq: accepted
- Note right of SecretScanningSvc: Scanning job enqueued
- Sidekiq-->>+Rails: done
- SecretScanningSvc->>+Gitaly: retrieve blob
- SecretScanningSvc->>+SecretScanningSvc: scan blob
- SecretScanningSvc->>+Rails: secret found
- end
+ Gitaly->>+Rails: PreReceive
+ Rails->>-Gitaly: ListAllBlobs
+ Gitaly->>-Rails: ListAllBlobsResponse
+
+ Rails->>+GitLabSecretDetection: Scan(blob)
+ GitLabSecretDetection->>-Rails: not_found
+
+ Rails->>User: OK
```
## Iterations
- ✓ Define [requirements for detection coverage and actions](https://gitlab.com/gitlab-org/gitlab/-/issues/376716)
-- ✓ Implement [Clientside detection of GitLab tokens in comments/issues](https://gitlab.com/gitlab-org/gitlab/-/issues/368434)
-- PoC of secret scanning service
- - Benchmarking of issuables, comments, job logs and blobs to gain confidence that the total costs will be viable
- - Capacity planning for addition of service component to Reference Architectures headroom
- - Service capabilities
+- ✓ Implement [Browser-based detection of GitLab tokens in comments/issues](https://gitlab.com/gitlab-org/gitlab/-/issues/368434)
+- ✓ [PoC of secret scanning service](https://gitlab.com/gitlab-org/secure/pocs/secret-detection-go-poc/)
+- ✓ [PoC of secret scanning gem](https://gitlab.com/gitlab-org/gitlab/-/issues/426823)
+- [Pre-Production Performance Profiling for pre-receive PoCs](https://gitlab.com/gitlab-org/gitlab/-/issues/428499)
+ - Profiling service capabilities
+ - ✓ [Benchmarking regex performance between Ruby and Go approaches](https://gitlab.com/gitlab-org/gitlab/-/issues/423832)
- gRPC commit retrieval from Gitaly
- - blob scanning
+ - transfer latency, CPU, and memory footprint
- Implementation of secret scanning service MVC (targeting individual commits)
+- Capacity planning for addition of service component to Reference Architectures headroom
- Security and readiness review
- Deployment and monitoring
- Implementation of secret scanning service MVC (targeting arbitrary text blobs)
diff --git a/doc/architecture/blueprints/secret_manager/decisions/002_gcp_kms.md b/doc/architecture/blueprints/secret_manager/decisions/002_gcp_kms.md
new file mode 100644
index 00000000000..c750164632f
--- /dev/null
+++ b/doc/architecture/blueprints/secret_manager/decisions/002_gcp_kms.md
@@ -0,0 +1,101 @@
+---
+owning-stage: "~devops::verify"
+description: 'GitLab Secrets Manager ADR 002: Use GCP Key Management Service'
+---
+
+# GitLab Secrets Manager ADR 002: Use GCP Key Management Service
+
+## Context
+
+Following from [ADR 001: Use envelope encryption](001_envelop_encryption.md), we need to find a solution to securely
+store asymmetric keys belonging to each vault.
+
+## Decision
+
+We decided to rely on Google Cloud Platform (GCP) Key Management Service (KMS) to manage the asymmetric keys
+used by the GitLab Secrets Manager vaults.
+
+Using GCP provides a few advantages:
+
+1. Avoid implementing our own secure storage of cryptographic keys.
+1. Support for Hardware Security Modules (HSM).
+
+```mermaid
+sequenceDiagram
+ participant A as Client
+ participant B as GitLab Rails
+ participant C as GitLab Secrets Service
+ participant D as GCP Key Management Service
+
+ Note over B,D: Initialize vault for project/group/organization
+
+ B->>C: Initialize vault - create key pair
+
+ Note over D: Incurs cost per key
+ C->>D: Create new asymmetric key
+ D->>C: Returns public key
+ C->>B: Returns vault public key
+ B->>B: Stores vault public key
+
+ Note over A,C: Creating a new secret
+
+ A->>B: Create new secret
+ B->>B: Generate new symmetric data key
+ B->>B: Encrypts secret with data key
+ B->>B: Encrypts data key with vault public key
+ B->>B: Stores envelope (encrypted secret + encrypted data key)
+ B-->>B: Discards plain-text data key
+ B->>A: Success
+
+ Note over A,D: Retrieving a secret
+
+ A->>B: Get secret
+ B->>B: Retrieves envelope (encrypted secret + encrypted data key)
+ B->>C: Decrypt data key
+ Note over D: Incurs cost per decryption request
+ C->>D: Decrypt data key
+ D->>C: Returns plain-text data key
+ C->>B: Returns plain-text data key
+ B->>B: Decrypts secret
+ B-->>B: Discards plain-text data key
+ B->>A: Returns secret
+```
+
+For security purposes, we decided to use Hardware Security Modules (HSM) to protect the keys in GCP KMS.
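+
+To make the flow above concrete, here is a minimal sketch of the envelope encryption steps, using a locally generated RSA key pair as a stand-in for the vault key held in GCP KMS. The algorithms and key sizes are illustrative assumptions:
+
+```go
+package main
+
+import (
+	"crypto/aes"
+	"crypto/cipher"
+	"crypto/rand"
+	"crypto/rsa"
+	"crypto/sha256"
+	"fmt"
+)
+
+func main() {
+	// Stand-in for the vault key pair that GCP KMS would hold.
+	vaultKey, _ := rsa.GenerateKey(rand.Reader, 2048)
+
+	// 1. Generate a single-use symmetric data key.
+	dataKey := make([]byte, 32)
+	rand.Read(dataKey)
+
+	// 2. Encrypt the secret with the data key (AES-256-GCM).
+	block, _ := aes.NewCipher(dataKey)
+	gcm, _ := cipher.NewGCM(block)
+	nonce := make([]byte, gcm.NonceSize())
+	rand.Read(nonce)
+	encryptedSecret := gcm.Seal(nonce, nonce, []byte("my-secret"), nil)
+
+	// 3. Encrypt the data key with the vault public key.
+	encryptedDataKey, _ := rsa.EncryptOAEP(
+		sha256.New(), rand.Reader, &vaultKey.PublicKey, dataKey, nil)
+
+	// 4. Store the envelope; the plain-text data key is discarded.
+	fmt.Printf("envelope: %d-byte secret, %d-byte encrypted data key\n",
+		len(encryptedSecret), len(encryptedDataKey))
+}
+```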
+
+## Consequences
+
+### Authentication
+
+With keys stored in GCP KMS, we need to de-multiplex between identities configured in GCP KMS and
+identities defined in GitLab so that decryption requests can be authenticated accordingly.
+
+### Cost
+
+With the use of GCP KMS, we need to account for the following cost:
+
+1. Number of keys required
+1. Number of key operations
+1. HSM Protection level
+
+The number of keys required depends on the number of projects, groups, and organizations using this feature.
+A single asymmetric key is required for each project, group, or organization.
+
+Each cryptographic key operation also incurs a cost that varies by protection level.
+Based on the proposed design above, this would incur cost at each secret decryption request.
+
+We may implement multiple protection tiers, supporting different protection types for different users.
+
+See the [GCP KMS pricing table](https://cloud.google.com/kms/pricing).
+
+### Feature availability for Self-Managed customers
+
+Using GCP KMS as a backend means that this solution cannot be deployed into self-managed environments.
+To make this feature available to Self-Managed customers, this feature needs to be a GitLab Cloud Connector feature.
+
+## Alternatives
+
+We considered generating and storing private keys within GitLab Secrets Service,
+but this would not meet the requirements for [FIPS Compliance](../../../../development/fips_compliance.md).
+
+On the other hand, GCP HSM Keys comply with [FIPS 140-2 Level 3](https://cloud.google.com/docs/security/key-management-deep-dive#fips_140-2_validation).
diff --git a/doc/architecture/blueprints/secret_manager/decisions/003_go_service.md b/doc/architecture/blueprints/secret_manager/decisions/003_go_service.md
new file mode 100644
index 00000000000..561a1bde24e
--- /dev/null
+++ b/doc/architecture/blueprints/secret_manager/decisions/003_go_service.md
@@ -0,0 +1,37 @@
+---
+owning-stage: "~devops::verify"
+description: 'GitLab Secrets Manager ADR 003: Implement Secrets Manager in Go'
+---
+
+# GitLab Secrets Manager ADR 003: Implement Secrets Manager in Go
+
+Following [ADR-002](002_gcp_kms.md) highlighting the need to integrate with GCP
+services, we need to decide which tech stack will be used to build
+GitLab Secrets Manager Service (GSMS).
+
+## Context
+
+At GitLab, we usually build satellite services around GitLab Rails in Go.
+This is an especially good choice of technology for services that heavily
+leverage concurrency and caching, where the cache can be invalidated or refreshed
+asynchronously.
+
+The Go-based [GCP KMS](https://cloud.google.com/kms/docs/reference/libraries#client-libraries-usage-go)
+client library also seems to expose a reliable interface to access KMS.
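+
+For illustration, a minimal sketch of an asymmetric decryption call with this client library; the key resource name and ciphertext below are placeholders:
+
+```go
+package main
+
+import (
+	"context"
+	"fmt"
+	"log"
+
+	kms "cloud.google.com/go/kms/apiv1"
+	"cloud.google.com/go/kms/apiv1/kmspb"
+)
+
+func main() {
+	ctx := context.Background()
+	client, err := kms.NewKeyManagementClient(ctx)
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer client.Close()
+
+	resp, err := client.AsymmetricDecrypt(ctx, &kmspb.AsymmetricDecryptRequest{
+		// Placeholder resource name for a vault's asymmetric key version.
+		Name:       "projects/p/locations/l/keyRings/r/cryptoKeys/k/cryptoKeyVersions/1",
+		Ciphertext: []byte("..."), // encrypted data key from the envelope
+	})
+	if err != nil {
+		log.Fatal(err)
+	}
+	fmt.Printf("plain-text data key: %d bytes\n", len(resp.Plaintext))
+}
+```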
+
+## Decision
+
+Implement GitLab Secrets Manager Service in Go. Use
+[labkit](https://gitlab.com/gitlab-org/labkit) as a minimalist library to
+provide common functionality shared by satellite services.
+
+## Consequences
+
+The team that is going to own the GitLab Secrets Manager feature will need to gain
+more Go expertise.
+
+## Alternatives
+
+We considered implementing GitLab Secrets Manager Service in Ruby, but we
+concluded that using Ruby would not allow us to build a service that would be
+efficient enough.
diff --git a/doc/architecture/blueprints/secret_manager/decisions/004_staleless_kms.md b/doc/architecture/blueprints/secret_manager/decisions/004_staleless_kms.md
new file mode 100644
index 00000000000..3de8adfd3a7
--- /dev/null
+++ b/doc/architecture/blueprints/secret_manager/decisions/004_staleless_kms.md
@@ -0,0 +1,49 @@
+---
+owning-stage: "~devops::verify"
+description: 'GitLab Secrets Manager ADR 004: Stateless Key Management Service'
+---
+
+# GitLab Secrets Manager ADR 004: Stateless Key Management Service
+
+In [ADR-002](002_gcp_kms.md) we decided that we want to use Google's Cloud Key
+Management Service to store private encryption keys. This will allow us to meet
+various compliance requirements easier.
+
+In this ADR we are going to describe the desired architecture of GitLab Secrets
+Management Service, making it a stateless service that is not connected to a
+persistent datastore other than ephemeral local storage.
+
+## Context
+
+## Decision
+
+Make GitLab Secrets Management Service a stateless application that is not
+connected to a global data store, such as a relational or NoSQL database.
+
+We are only going to support local block storage, presumably only for caching
+purposes.
+
+In order to manage decryption cost wisely, we would need to implement
+multi-tier protection layers and in-memory, per-instance
+[symmetric decryption key](001_envelop_encryption.md) caching, with the cache TTL
+depending on the protection tier. A hardware or software key can be used in
+Google's Cloud KMS, also depending on the tier.
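+
+A minimal sketch of such a per-instance cache; the 24 hour cap matches the consequence listed below, and all names are illustrative:
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync"
+	"time"
+)
+
+type entry struct {
+	key       []byte
+	expiresAt time.Time
+}
+
+// keyCache is an in-memory, per-instance cache of decrypted data keys.
+type keyCache struct {
+	mu      sync.Mutex
+	entries map[string]entry
+}
+
+func newKeyCache() *keyCache {
+	return &keyCache{entries: map[string]entry{}}
+}
+
+// put caches a decrypted data key. The TTL is capped at 24 hours;
+// the highest protection tier should bypass the cache entirely.
+func (c *keyCache) put(id string, key []byte, ttl time.Duration) {
+	if ttl > 24*time.Hour {
+		ttl = 24 * time.Hour
+	}
+	c.mu.Lock()
+	defer c.mu.Unlock()
+	c.entries[id] = entry{key: key, expiresAt: time.Now().Add(ttl)}
+}
+
+// get returns a cached key, evicting it if the TTL has expired.
+func (c *keyCache) get(id string) ([]byte, bool) {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+	e, ok := c.entries[id]
+	if !ok || time.Now().After(e.expiresAt) {
+		delete(c.entries, id)
+		return nil, false
+	}
+	return e.key, true
+}
+
+func main() {
+	c := newKeyCache()
+	c.put("org-1", []byte("plain-text-data-key"), time.Hour)
+	if key, ok := c.get("org-1"); ok {
+		fmt.Printf("cache hit: %d bytes\n", len(key))
+	}
+}
+```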
+
+## Consequences
+
+1. All private keys are going to be stored in Google's Cloud KMS.
+1. Multi-tier protection will be implemented, with higher tiers offering more protection.
+1. The protection tier will be defined at the organization level on the GitLab Rails Service side.
+1. Depending on the protection level used, symmetric decryption keys can be in-memory cached.
+1. The symmetric key's cache must not be valid for more than 24 hours.
+1. The highest protection tier will use a Hardware Security Module and no caching.
+1. The GitLab Secrets Management Service will not store access-control metadata.
+1. Identity de-multiplexing will happen on GitLab Rails Service side.
+1. Decryption requests will be signed with an organization's private key.
+1. The service will verify the decryption requestor's identity by checking the signature.
+
+## Alternatives
+
+We considered using a relational database, or a NoSQL database, both
+self-managed and managed by a Cloud Provider, but concluded that this would add
+a lot of complexity and would weaken the security posture of the service.
diff --git a/doc/architecture/blueprints/secret_manager/index.md b/doc/architecture/blueprints/secret_manager/index.md
index 2a840f8d846..ac30f3399d8 100644
--- a/doc/architecture/blueprints/secret_manager/index.md
+++ b/doc/architecture/blueprints/secret_manager/index.md
@@ -59,12 +59,18 @@ This blueprint does not cover the following:
- Secrets such as access tokens created within GitLab to allow external resources to access GitLab, for example, personal access tokens.
+## Decisions
+
+- [ADR-001: Use envelope encryption](decisions/001_envelop_encryption.md)
+- [ADR-002: Use GCP Key Management Service](decisions/002_gcp_kms.md)
+- [ADR-003: Build Secrets Manager in Go](decisions/003_go_service.md)
+
## Proposal
The secrets manager feature will consist of three core components:
1. GitLab Rails
-1. GitLab Secrets Service
+1. GitLab Secrets Manager Service
1. GCP Key Management
At a high level, secrets will be stored using unique encryption keys in order to achieve isolation
@@ -86,13 +92,15 @@ The plain-text secret would be encrypted using a single use data key.
The data key is then encrypted using the public key belonging to the group or project.
Both, the encrypted secret and the encrypted data key, are being stored in the database.
-**2. GitLab Secrets Manager**
+**2. GitLab Secrets Manager Service**
-GitLab Secrets Manager will be a new component in the GitLab overall architecture. This component serves the following purpose:
+GitLab Secrets Manager Service will be a new component in the overall GitLab architecture. This component serves the following purposes:
1. Correlating GitLab identities into GCP identities for access control.
1. A proxy over GCP Key Management for decrypting operations.
+[The service will use a Go-based tech stack](decisions/003_go_service.md) and [labkit](https://gitlab.com/gitlab-org/labkit).
+
**3. GCP Key Management**
We choose to leverage GCP Key Management to build on the security and trust that GCP provides on cryptographic operations.
@@ -120,10 +128,6 @@ Hence, GCP Key Management is the natural choice for a cloud-based key management
To extend this service to self-managed GitLab instances, we would consider using GitLab Cloud Connector as a proxy between
self-managed GitLab instances and the GitLab Secrets Manager.
-## Decision Records
-
-- [001: Use envelope encryption](decisions/001_envelop_encryption.md)
-
## Alternative Solutions
Other solutions we have explored:
diff --git a/doc/architecture/blueprints/work_items/index.md b/doc/architecture/blueprints/work_items/index.md
index e12bb4d8773..74690d34088 100644
--- a/doc/architecture/blueprints/work_items/index.md
+++ b/doc/architecture/blueprints/work_items/index.md
@@ -64,7 +64,7 @@ You can also refer to fields of [Work Item](../../../api/graphql/reference/index
All Work Item types share the same pool of predefined widgets and are customized by which widgets are active on a specific type. The list of widgets for any certain Work Item type is currently predefined and is not customizable. However, in the future we plan to allow users to create new Work Item types and define a set of widgets for them.
-### Work Item widget types (updating)
+### Widget types (updating)
| Widget | Description | Feature flag | Write permission | GraphQL Subscription Support |
|---|---|---|---|---|
@@ -86,6 +86,36 @@ All Work Item types share the same pool of predefined widgets and are customized
| [WorkItemWidgetTestReports](../../../api/graphql/reference/index.md#workitemwidgettestreports) | Test reports associated with a work item | | | |
| [WorkItemWidgetWeight](../../../api/graphql/reference/index.md#workitemwidgetweight) | Set weight of a work item | |`Reporter`|No|
+#### Widget availability (updating)
+
+| Widget | Epic | Issue | Task | Objective | Key Result |
+|---|---|---|---|---|---|
+| [WorkItemWidgetAssignees](../../../api/graphql/reference/index.md#workitemwidgetassignees) | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [WorkItemWidgetAwardEmoji](../../../api/graphql/reference/index.md#workitemwidgetawardemoji) | ✅ | ✔️ | ✅ | ✅ | ✅ |
+| [WorkItemWidgetCurrentUserTodos](../../../api/graphql/reference/index.md#workitemwidgetcurrentusertodos) | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [WorkItemWidgetDescription](../../../api/graphql/reference/index.md#workitemwidgetdescription) | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [WorkItemWidgetHealthStatus](../../../api/graphql/reference/index.md#workitemwidgethealthstatus) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [WorkItemWidgetHierarchy](../../../api/graphql/reference/index.md#workitemwidgethierarchy) | ✔️ | ✔️ | ❌ | ✅ | ❌ |
+| [WorkItemWidgetIteration](../../../api/graphql/reference/index.md#workitemwidgetiteration) | ❌ | ✅ | ✅ | ❌ | ❌ |
+| [WorkItemWidgetLabels](../../../api/graphql/reference/index.md#workitemwidgetlabels) | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [WorkItemWidgetLinkedItems](../../../api/graphql/reference/index.md#workitemwidgetlinkeditems) | ✔️ | ✔️ | ✔️ | ✅ | ✅ |
+| [WorkItemWidgetMilestone](../../../api/graphql/reference/index.md#workitemwidgetmilestone) | 🔍 | ✅ | ✅ | ✅ | ❌ |
+| [WorkItemWidgetNotes](../../../api/graphql/reference/index.md#workitemwidgetnotes) | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [WorkItemWidgetNotifications](../../../api/graphql/reference/index.md#workitemwidgetnotifications) | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [WorkItemWidgetProgress](../../../api/graphql/reference/index.md#workitemwidgetprogress) | ❌ | ❌ | ❌ | ✅ | ✅ |
+| [WorkItemWidgetStartAndDueDate](../../../api/graphql/reference/index.md#workitemwidgetstartandduedate) | 🔍 | ✅ | ✅ | ❌ | ✅ |
+| [WorkItemWidgetStatus](../../../api/graphql/reference/index.md#workitemwidgetstatus) | ❓ | ❓ | ❓ | ❓ | ❓ |
+| [WorkItemWidgetTestReports](../../../api/graphql/reference/index.md#workitemwidgettestreports) | ❌ | ❌ | ❌ | ❌ | ❌ |
+| [WorkItemWidgetWeight](../../../api/graphql/reference/index.md#workitemwidgetweight) | 🔍 | ✅ | ✅ | ❌ | ❌ |
+
+##### Legend
+
+- ✅ - Widget available
+- ✔️ - Widget planned to be available
+- ❌ - Widget not available
+- ❓ - Widget pending for consideration
+- 🔍 - Alternative widget planned
+
### Work item relationships
Work items can be related to other work items in a number of different ways: