gitlab.com/gitlab-org/gitlab-foss.git - Unnamed repository; edit this file 'description' to name the repository.
Diffstat (limited to 'doc/architecture/blueprints')
-rw-r--r-- doc/architecture/blueprints/_template.md | 8
-rw-r--r-- doc/architecture/blueprints/ci_data_decay/index.md | 20
-rw-r--r-- doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md | 94
-rw-r--r-- doc/architecture/blueprints/ci_pipeline_components/index.md | 17
-rw-r--r-- doc/architecture/blueprints/ci_scale/index.md | 149
-rw-r--r-- doc/architecture/blueprints/cloud_native_build_logs/index.md | 2
-rw-r--r-- doc/architecture/blueprints/cloud_native_gitlab_pages/index.md | 4
-rw-r--r-- doc/architecture/blueprints/composable_codebase_using_rails_engines/index.md | 8
-rw-r--r-- doc/architecture/blueprints/consolidating_groups_and_projects/index.md | 2
-rw-r--r-- doc/architecture/blueprints/container_registry_metadata_database/index.md | 2
-rw-r--r-- doc/architecture/blueprints/database/scalability/patterns/index.md | 2
-rw-r--r-- doc/architecture/blueprints/database/scalability/patterns/read_mostly.md | 2
-rw-r--r-- doc/architecture/blueprints/database/scalability/patterns/time_decay.md | 6
-rw-r--r-- doc/architecture/blueprints/database_testing/index.md | 4
-rw-r--r-- doc/architecture/blueprints/feature_flags_development/index.md | 4
-rw-r--r-- doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md | 2
-rw-r--r-- doc/architecture/blueprints/image_resizing/index.md | 2
-rw-r--r-- doc/architecture/blueprints/pods/index.md | 170
-rw-r--r-- doc/architecture/blueprints/pods/iteration0-organizations-introduction.png | bin 0 -> 326285 bytes
-rw-r--r-- doc/architecture/blueprints/pods/term-cluster.png | bin 0 -> 271291 bytes
-rw-r--r-- doc/architecture/blueprints/pods/term-organization.png | bin 0 -> 22575 bytes
-rw-r--r-- doc/architecture/blueprints/pods/term-pod.png | bin 0 -> 16104 bytes
-rw-r--r-- doc/architecture/blueprints/pods/term-top-level-namespace.png | bin 0 -> 11451 bytes
-rw-r--r-- doc/architecture/blueprints/rate_limiting/index.md | 6
-rw-r--r-- doc/architecture/blueprints/runner_scaling/index.md | 280
-rw-r--r-- doc/architecture/blueprints/work_items/index.md | 130
26 files changed, 713 insertions, 201 deletions
diff --git a/doc/architecture/blueprints/_template.md b/doc/architecture/blueprints/_template.md
index e99ce61970a..7637c3bf5fa 100644
--- a/doc/architecture/blueprints/_template.md
+++ b/doc/architecture/blueprints/_template.md
@@ -1,4 +1,7 @@
---
+stage: none
+group: unassigned
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
status: proposed
creation-date: yyyy-mm-dd
authors: [ "@username" ]
@@ -52,7 +55,10 @@ good title can help communicate what the blueprint is and should be considered
as part of any review.
-->
-[[_TOC_]]
+<!--
+For long pages, consider creating a table of contents.
+The `[_TOC_]` function is not supported on docs.gitlab.com.
+-->
## Summary
diff --git a/doc/architecture/blueprints/ci_data_decay/index.md b/doc/architecture/blueprints/ci_data_decay/index.md
index 23c8e9df1bb..221c2364f79 100644
--- a/doc/architecture/blueprints/ci_data_decay/index.md
+++ b/doc/architecture/blueprints/ci_data_decay/index.md
@@ -102,9 +102,9 @@ Epic: [Reduce the rate of builds metadata table growth](https://gitlab.com/group
### Partition CI/CD pipelines database tables
-After we move CI/CD metadata to a different store, or reduce the rate of
+Even if we move CI/CD metadata to a different store, or reduce the rate of
metadata growth in a different way, the problem of having billions of rows
-describing pipelines, builds and artifacts, remains. We still need to keep
+describing pipelines, builds and artifacts, remains. We still may need to keep
reference to the metadata we might store in object storage and we still do need
to be able to retrieve this information reliably in bulk (or search through
it).
@@ -123,12 +123,12 @@ multiple smaller ones, using PostgreSQL partitioning features.
There are a few approaches we can take to partition CI/CD data. A promising one
is using list-based partitioning where a partition number is assigned to a
pipeline, and gets propagated to all resources that are related to this
-pipeline. We assign the partition number based on when the pipeline was created
-or when we observed the last processing activity in it. This is very flexible
-because we can extend this partitioning strategy at will; for example with this
-strategy we can assign an arbitrary partition number based on multiple
-partitioning keys, combining time-decay-based partitioning with tenant-based
-partitioning on the application level.
+pipeline. We will assign a partition number using a
+[uniform logical partition ID](pipeline_partitioning.md#why-do-we-want-to-use-explicit-logical-partition-ids).
+This is very flexible because we can extend this partitioning strategy at will;
+for example with this strategy we can assign an arbitrary partition number
+based on multiple partitioning keys, combining time-decay-based partitioning
+with tenant-based partitioning on the application level if desired.
Partitioning rarely accessed data should also follow the policy defined for
builds archival, to make it consistent and reliable.
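A minimal sketch of the propagation idea in this hunk, assuming plain Ruby structs in place of real models. The names `Pipeline`, `Build`, `create_pipeline`, `create_build`, and the partition number `101` are hypothetical and chosen only for illustration; this is not GitLab's implementation.

```ruby
# Hypothetical structs standing in for CI records; not GitLab's models.
Pipeline = Struct.new(:id, :partition_id, keyword_init: true)
Build    = Struct.new(:id, :pipeline_id, :partition_id, keyword_init: true)

# The partition number is chosen once, when the pipeline is created.
def create_pipeline(id:, current_partition_id:)
  Pipeline.new(id: id, partition_id: current_partition_id)
end

# Every related resource copies the partition number from its pipeline.
def create_build(id:, pipeline:)
  Build.new(id: id, pipeline_id: pipeline.id, partition_id: pipeline.partition_id)
end

pipeline = create_pipeline(id: 123_456, current_partition_id: 101)
builds = [1, 2].map { |n| create_build(id: n, pipeline: pipeline) }
```

Because the number is copied rather than derived, the rule that picks `current_partition_id` can change later without touching the resources that already carry one.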
@@ -177,7 +177,7 @@ everyone to understand the vision described in this architectural blueprint.
### Removing pipeline data
-While it might be tempting to simply remove old or archived data from our
+While it might be tempting to remove old or archived data from our
databases this should be avoided. It is usually not desired to permanently
remove user data unless consent is given to do so. We can, however, move data
to a different data store, like object storage.
@@ -245,6 +245,7 @@ In progress.
- 2022-04-15: Partitioned pipeline data associations PoC [shipped](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/84071).
- 2022-04-30: Additional [benchmarking started](https://gitlab.com/gitlab-org/gitlab/-/issues/361019) to evaluate impact.
- 2022-06-31: [Pipeline partitioning design](pipeline_partitioning.md) document [merge request](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/87683) merged.
+- 2022-09-01: Engineering effort started to implement partitioning.
## Who
@@ -273,6 +274,7 @@ Domain experts:
|------------------------------|------------------------|
| Verify / Pipeline execution | Fabio Pitino |
| Verify / Pipeline execution | Marius Bobin |
+| Verify / Pipeline insights | Maxime Orefice |
| PostgreSQL Database | Andreas Brandl |
<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md b/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md
index baec14e3f0f..5f907ecdaa4 100644
--- a/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md
+++ b/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md
@@ -60,7 +60,7 @@ out of a database to a different place when data is no longer relevant or
needed. Our dataset is extremely large (tens of terabytes), so moving such a
high volume of data is challenging. When time-decay is implemented using
partitioning, we can archive the entire partition (or set of partitions) by
-simply updating a single record in one of our database tables. It is one of the
+updating a single record in one of our database tables. It is one of the
least expensive ways to implement time-decay patterns at a database level.
![decomposition_partitioning_comparison.png](decomposition_partitioning_comparison.png)
@@ -87,6 +87,7 @@ incidents, over the last couple of months, for example:
- S2: 2022-04-12 [Transactions detected that have been running for more than 10m](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6821)
- S2: 2022-04-06 [Database contention plausibly caused by excessive `ci_builds` reads](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6773)
- S2: 2022-03-18 [Unable to remove a foreign key on `ci_builds`](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6642)
+- S2: 2022-10-10 [The queuing_queries_duration SLI apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7852#note_1130123525)
We have approximately 50 `ci_*` prefixed database tables, and some of them
would benefit from partitioning.
@@ -259,7 +260,7 @@ smart enough to move rows between partitions on its own.
A partitioned table is called a **routing** table and it will use the `p_`
prefix which should help us with building automated tooling for query analysis.
-A table partition will be simply called **partition** and it can use the a
+A table partition will be called **partition** and it can use a
physical partition ID as suffix, preceded by the letter `p`, for example
`ci_builds_p101`. Existing CI tables will become **zero partitions** of the
new routing tables. Depending on the chosen
@@ -278,6 +279,20 @@ also find information about which logical partitions are "active" or
"archived", which will help us to implement a time-decay pattern using database
declarative partitioning.
+Doing that will also allow us to use a Unified Resource Identifier for
+partitioned resources that contains a pointer to a pipeline ID, which we could
+then use to efficiently look up the partition the resource is stored in. It might
+be important when a resource can be directly referenced by a URL, in the UI or
+API. We could use an ID like `1e240-5ba0` for pipeline `123456`, build `23456`.
+Using a dash `-` can prevent an identifier from being highlighted and copied
+with a mouse double-click. If we want to avoid this problem, we can use any
+character that is not a digit in the base-16 numeral system: any letter from
+`g` to `z` in the Latin alphabet, for example `x`. In that
+case an example of a URI would look like `1e240x5ba0`. If we decide to update
+the primary identifier of a partitioned resource (today it is just a big
+integer) it is important to design a system that is resilient to migrating data
+between partitions, to avoid changing identifiers when rebalancing happens.
+
`ci_partitions` table will store information about a partition identifier,
pipeline ids range it is valid for and whether the partitions have been
archived or not. Additional columns with timestamps may be helpful too.
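The identifier format proposed in this hunk (hexadecimal pipeline ID, a separator, hexadecimal build ID) can be sketched as follows. The helper names `encode_resource_id` and `decode_resource_id` are hypothetical; the blueprint only describes the format, not an API.

```ruby
# Encode a pipeline ID and a build ID as base-16 strings joined by a separator.
# The separator must not be a hexadecimal digit (0-9, a-f), so `x` is safe.
def encode_resource_id(pipeline_id, build_id, separator: 'x')
  [pipeline_id, build_id].map { |n| n.to_s(16) }.join(separator)
end

# Split on the separator and parse both halves back from base 16.
def decode_resource_id(identifier, separator: 'x')
  identifier.split(separator, 2).map { |part| Integer(part, 16) }
end

encode_resource_id(123_456, 23_456)                 # => "1e240x5ba0"
encode_resource_id(123_456, 23_456, separator: '-') # => "1e240-5ba0"
decode_resource_id('1e240x5ba0')                    # => [123456, 23456]
```

Decoding yields the pipeline ID first, which is the value needed to find the partition the resource is stored in.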
@@ -304,7 +319,7 @@ of storing archived data in PostgreSQL will be reduced significantly this way.
There are some technical details here that are out of the scope of this
description, but by using this strategy we can "archive" data, and make it much
-less expensive to reside in our PostgreSQL cluster by simply toggling a boolean
+less expensive to reside in our PostgreSQL cluster by toggling a boolean
column value.
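As a rough illustration of archiving by toggling a flag, the sketch below models the `ci_partitions` table in memory. The `Partition` struct, the `PartitionRegistry` class, and the ID ranges are assumptions made for the example; the blueprint does not define the table's exact columns.

```ruby
# Hypothetical in-memory stand-in for rows of the `ci_partitions` table.
Partition = Struct.new(:id, :pipeline_ids, :archived, keyword_init: true)

class PartitionRegistry
  def initialize(partitions)
    @partitions = partitions
  end

  # Find the logical partition whose pipeline ID range covers the pipeline.
  def for_pipeline(pipeline_id)
    @partitions.find { |p| p.pipeline_ids.cover?(pipeline_id) }
  end

  # "Archiving" flips one flag on one record; no pipeline rows are moved.
  def archive!(partition_id)
    partition = @partitions.find { |p| p.id == partition_id }
    raise KeyError, "unknown partition #{partition_id}" unless partition

    partition.archived = true
    partition
  end

  # Partitions that regular queries are still allowed to read.
  def active
    @partitions.reject(&:archived)
  end
end

registry = PartitionRegistry.new([
  Partition.new(id: 100, pipeline_ids: 1..1_000_000, archived: false),
  Partition.new(id: 101, pipeline_ids: 1_000_001..2_000_000, archived: false)
])
registry.archive!(100)
```

After the call, pipeline `123456` still resolves to partition `100`, but that partition no longer appears in `registry.active`.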
## Accessing partitioned data
@@ -317,7 +332,7 @@ with its `partition_id`, and we will be able to find the partition that the
pipeline data is stored in.
We will need to constrain access to searching through pipelines, builds,
-artifacts etc. Search can not be done through all partitions, as it would not
+artifacts etc. Search cannot be done through all partitions, as it would not
be efficient enough, hence we will need to find a better way of searching
through archived pipelines data. It will be necessary to have different access
patterns to access archived data in the UI and API.
@@ -343,7 +358,7 @@ has_many :builds, -> (pipeline) { where(partition_id: pipeline.partition_id) }
```
The problem with this approach is that it makes preloading much more difficult
-as instance dependent associations can not be used with preloads:
+as instance dependent associations cannot be used with preloads:
```plaintext
ArgumentError: The association scope 'builds' is instance dependent (the
@@ -351,6 +366,33 @@ scope block takes an argument). Preloading instance dependent scopes is not
supported.
```
+### Primary key
+
+The primary key must include the partitioning key column to partition the table.
+
+We first create a unique index on `(id, partition_id)`.
+Then, we drop the primary key constraint and use the new index created to set
+the new primary key constraint.
+
+`ActiveRecord` [does not support](https://github.com/rails/rails/blob/6-1-stable/activerecord/lib/active_record/attribute_methods/primary_key.rb#L126)
+composite primary keys, so we must force it to treat the `id` column as a primary key:
+
+```ruby
+class Model < ApplicationRecord
+  self.primary_key = 'id'
+end
+```
+
+The application layer is now ignorant of the database structure and all of the
+existing queries from `ActiveRecord` continue to use the `id` column to access
+the data. There is some risk to this approach because it is possible to
+construct application code that results in duplicate models with the same `id`
+value, but on a different `partition_id`. To mitigate this risk we must ensure
+that all inserts use the database sequence to populate the `id`, since sequences are
+[guaranteed](https://www.postgresql.org/docs/12/sql-createsequence.html#id-1.9.3.81.7)
+to allocate distinct values, and we must rewrite the access patterns to include the
+`partition_id` value. Manually assigning the ids during inserts must be avoided.
+
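The duplicate-`id` risk described in the primary key section can be illustrated with a small in-memory model, assuming a table whose uniqueness is enforced only on the `(id, partition_id)` pair. `PartitionedTable` and its methods are hypothetical and do not use `ActiveRecord` or a real database.

```ruby
# Hypothetical model of a table keyed by the composite (id, partition_id).
class PartitionedTable
  def initialize
    @rows = {}
    @sequence = 0
  end

  # Stand-in for a database sequence: every call returns a new value.
  def next_id
    @sequence += 1
  end

  # Uniqueness is checked on the composite key only, as the database would.
  def insert(partition_id:, id: next_id, **attrs)
    key = [id, partition_id]
    raise ArgumentError, "duplicate key #{key.inspect}" if @rows.key?(key)

    @rows[key] = attrs.merge(id: id, partition_id: partition_id)
  end

  # Lookup by `id` alone, the way the application still addresses records.
  def where_id(id)
    @rows.values.select { |row| row[:id] == id }
  end
end

table = PartitionedTable.new
table.insert(partition_id: 100, name: 'first')         # id 1 from the sequence
table.insert(partition_id: 101, name: 'second')        # id 2 from the sequence
table.insert(id: 1, partition_id: 101, name: 'manual') # accepted: (1, 101) is a new key
table.where_id(1).size                                 # => 2
```

The manual insert passes the composite uniqueness check, yet a lookup by `id` now returns two rows, which is why ids should always come from the sequence.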
### Foreign keys
Foreign keys must reference columns that either are a primary key or form a
@@ -403,7 +445,7 @@ partition, `auto_canceled_by_partition_id`, and the FK becomes:
```sql
ALTER TABLE ONLY p_ci_pipelines
- ADD CONSTRAINT fk_cancel_redundant_pieplines
+ ADD CONSTRAINT fk_cancel_redundant_pipelines
FOREIGN KEY (auto_canceled_by_id, auto_canceled_by_partition_id)
REFERENCES p_ci_pipelines(id, partition_id) ON DELETE SET NULL;
```
@@ -610,6 +652,40 @@ application-wide outage.
1. Make it possible to create partitions in an automatic way.
1. Deliver the new architecture to self-managed instances.
+The diagram below visualizes this plan on a Gantt chart. The dates
+on the chart are only estimates that help visualize the plan; they are
+not deadlines and can change at any time.
+
+```mermaid
+gantt
+ title CI Data Partitioning Timeline
+ dateFormat YYYY-MM-DD
+ axisFormat %m-%y
+
+ section Phase 0
+ Build data partitioning strategy :done, 0_1, 2022-06-01, 90d
+ section Phase 1
+ Partition biggest CI tables :1_1, after 0_1, 140d
+ Biggest table partitioned :milestone, metadata, 2022-12-01, 1min
+ Tables larger than 100GB partitioned :milestone, 100gb, after 1_1, 1min
+ section Phase 2
+ Add partitioning keys to SQL queries :2_1, after 1_1, 120d
+ Emergency partition detachment possible :milestone, detachment, 2023-04-01, 1min
+ All SQL queries are routed to partitions :milestone, routing, after 2_1, 1min
+ section Phase 3
+ Build new data access patterns :3_1, 2023-03-01, 120d
+ New API endpoint created for inactive data :milestone, api1, 2023-05-01, 1min
+ Filtering added to existing API endpoints :milestone, api2, 2023-07-01, 1min
+ section Phase 4
+ Introduce time-decay mechanisms :4_1, 2023-06-01, 120d
+ Inactive partitions are not being read :milestone, part1, 2023-08-01, 1min
+ Performance of the database cluster improves :milestone, part2, 2023-09-01, 1min
+ section Phase 5
+ Introduce auto-partitioning mechanisms :5_1, 2023-07-01, 120d
+ New partitions are being created automatically :milestone, part3, 2023-10-01, 1min
+ Partitioning is made available on self-managed :milestone, part4, 2023-11-01, 1min
+```
+
## Conclusions
We want to build a solid strategy for partitioning CI/CD data. We are aware of
@@ -637,8 +713,8 @@ Authors:
Recommenders:
-| Role | Who |
-|------------------------|-----------------|
-| Distingiushed Engineer | Kamil Trzciński |
+| Role | Who |
+|-------------------------------|-----------------|
+| Senior Distinguished Engineer | Kamil Trzciński |
<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/ci_pipeline_components/index.md b/doc/architecture/blueprints/ci_pipeline_components/index.md
index 94ec3e2f894..115f6909d2d 100644
--- a/doc/architecture/blueprints/ci_pipeline_components/index.md
+++ b/doc/architecture/blueprints/ci_pipeline_components/index.md
@@ -1,7 +1,7 @@
---
stage: Stage
group: Pipeline Authoring
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
comments: false
description: 'Create a catalog of shareable pipeline constructs'
---
@@ -107,7 +107,18 @@ identifying abstract concepts and are subject to changes as we refine the design
- **Catalog** is the collection of projects that are set to contain components.
- **Version** is the release name of a tag in the project, which allows components to be pinned to a specific revision.
-## Characteristics of a component
+## Definition of pipeline component
+
+A pipeline component is a reusable single-purpose building block that abstracts away a single pipeline configuration unit. Components are used to compose part of a pipeline configuration or an entire one.
+A component can optionally take input parameters and set output data to be adaptable and reusable in different pipeline contexts,
+while encapsulating and isolating implementation details.
+
+Components allow a pipeline to be assembled by using abstractions instead of having all the details defined in one place.
+When using a component in a pipeline, a user shouldn't need to know the implementation details of the component and should
+only rely on the provided interface. The interface will have a version / revision, so that users understand which revision they are interfacing with.
+
+A pipeline component defines its type which indicates in which context of the pipeline configuration the component can be used.
+For example, a component of type X can only be used according to the type X use-case.
For best experience with any systems made of components it's fundamental that components are single purpose,
isolated, reusable and resolvable.
@@ -118,7 +129,7 @@ isolated, reusable and resolvable.
- **Reusability:** a component is designed to be used in different pipelines.
Depending on the assumptions it's built on, a component can be more or less generic.
Generic components are more reusable but may require more customization.
-- **Resolvable:** When a component depends on another component, this dependency needs to be explicit and trackable. Hidden dependencies can lead to myriads of problems.
+- **Resolvable:** When a component depends on another component, this dependency must be explicit and trackable.
## Proposal
diff --git a/doc/architecture/blueprints/ci_scale/index.md b/doc/architecture/blueprints/ci_scale/index.md
index 75c4d05c334..c02fb35974b 100644
--- a/doc/architecture/blueprints/ci_scale/index.md
+++ b/doc/architecture/blueprints/ci_scale/index.md
@@ -17,11 +17,15 @@ and has become [one of the most beloved CI/CD solutions](https://about.gitlab.co
GitLab CI/CD has come a long way since the initial release, but the design of
the data storage for pipeline builds remains almost the same since 2012. We
store all the builds in PostgreSQL in `ci_builds` table, and because we are
-creating more than [2 million builds each day on GitLab.com](https://docs.google.com/spreadsheets/d/17ZdTWQMnTHWbyERlvj1GA7qhw_uIfCoI5Zfrrsh95zU),
-we are reaching database limits that are slowing our development velocity down.
+creating more than 5 million builds each day on GitLab.com, we are reaching
+database limits that are slowing our development velocity down.
-On February 1st, 2021, GitLab.com surpassed 1 billion CI/CD builds created and the number of
-builds continues to grow exponentially.
+On February 1st, 2021, GitLab.com surpassed 1 billion CI/CD builds created. In
+February 2022 we reached 2 billion CI/CD builds stored in the database. The
+number of builds continues to grow exponentially.
+
+The screenshot below shows our forecast created at the beginning of 2021, which
+turned out to be quite accurate.
![CI builds cumulative with forecast](ci_builds_cumulative_forecast.png)
@@ -34,9 +38,9 @@ builds continues to grow exponentially.
The current state of CI/CD product architecture needs to be updated if we want
to sustain future growth.
-### We are running out of the capacity to store primary keys
+### We were running out of the capacity to store primary keys: DONE
-The primary key in `ci_builds` table is an integer generated in a sequence.
+The primary key in `ci_builds` table is an integer value, generated in a sequence.
Historically, Rails used to use [integer](https://www.postgresql.org/docs/14/datatype-numeric.html)
type when creating primary keys for a table. We did use the default when we
[created the `ci_builds` table in 2012](https://gitlab.com/gitlab-org/gitlab/-/blob/046b28312704f3131e72dcd2dbdacc5264d4aa62/db/ci/migrate/20121004165038_create_builds.rb).
@@ -45,34 +49,32 @@ since the release of Rails 5. The framework is now using `bigint` type that is 8
bytes long, however we have not migrated primary keys for `ci_builds` table to
`bigint` yet.
-We will run out of the capacity of the integer type to store primary keys in
-`ci_builds` table before December 2021. When it happens without a viable
-workaround or an emergency plan, GitLab.com will go down.
-
-`ci_builds` is just one of the tables that are running out of the primary keys
-available in Int4 sequence. There are multiple other tables storing CI/CD data
-that have the same problem.
+In early 2021 we had estimated that we would run out of the capacity of the integer
+type to store primary keys in `ci_builds` table before December 2021. If it had
+happened without a viable workaround or an emergency plan, GitLab.com would have gone
+down. `ci_builds` was just one of many tables that were running out of the
+primary keys available in Int4 sequence.
-Primary keys problem will be tackled by our Database Team.
+Before October 2021, our Database team had managed to migrate all the risky
+tables' primary keys to big integers.
-**Status**: In October 2021, the primary keys in CI tables were migrated
-to big integers. See the [related Epic](https://gitlab.com/groups/gitlab-org/-/epics/5657) for more details.
+See the [related Epic](https://gitlab.com/groups/gitlab-org/-/epics/5657) for more details.
-### The table is too large
+### Some CI/CD database tables are too large: IN PROGRESS
-There is more than a billion rows in `ci_builds` table. We store more than 2
-terabytes of data in that table, and the total size of indexes is more than 1
-terabyte (as of February 2021).
+There are more than two billion rows in `ci_builds` table. We store many
+terabytes of data in that table, and the total size of indexes is measured in
+terabytes as well.
-This amount of data contributes to a significant performance problems we
-experience on our primary PostgreSQL database.
+This amount of data contributes to a significant number of performance
+problems we experience on our CI PostgreSQL database.
-Most of the problem are related to how PostgreSQL database works internally,
+Most of the problems are related to how PostgreSQL database works internally,
and how it is making use of resources on a node the database runs on. We are at
-the limits of vertical scaling of the primary database nodes and we frequently
-see a negative impact of the `ci_builds` table on the overall performance,
-stability, scalability and predictability of the database GitLab.com depends
-on.
+the limits of vertical scaling of the CI primary database nodes and we
+frequently see a negative impact of the `ci_builds` table on the overall
+performance, stability, scalability and predictability of the CI database
+GitLab.com depends on.
The size of the table also hinders development velocity because queries that
seem fine in the development environment may not work on GitLab.com. The
@@ -90,41 +92,40 @@ environment.
We also expect a significant, exponential growth in the upcoming years.
One of the forecasts done using [Facebook's Prophet](https://facebook.github.io/prophet/)
-shows that in the first half of
-2024 we expect seeing 20M builds created on GitLab.com each day. In comparison
-to around 2M we see created today, this is 10x growth our product might need to
-sustain in upcoming years.
+shows that in the first half of 2024 we expect to see 20M builds created on
+GitLab.com each day, compared to around 5M created today. This is
+10x growth from the numbers we saw in 2021.
![CI builds daily forecast](ci_builds_daily_forecast.png)
**Status**: As of October 2021 we reduced the growth rate of `ci_builds` table
-by writing build options and variables to `ci_builds_metadata` table. We plan
-to ship further improvements that will be described in a separate blueprint.
+by writing build options and variables to `ci_builds_metadata` table. We are
+also working on partitioning the largest CI/CD database tables using
+[time decay pattern](../ci_data_decay/index.md).
-### Queuing mechanisms are using the large table
+### Queuing mechanisms were using the large table: DONE
-Because of how large the table is, mechanisms that we use to build queues of
-pending builds (there is more than one queue), are not very efficient. Pending
-builds represent a small fraction of what we store in the `ci_builds` table,
-yet we need to find them in this big dataset to determine an order in which we
-want to process them.
+Because of how large the table is, mechanisms that we used to build queues of
+pending builds (there is more than one queue), were not very efficient. Pending
+builds represented a small fraction of what we store in the `ci_builds` table,
+yet we needed to find them in this big dataset to determine an order in which we
+wanted to process them.
-This mechanism is very inefficient, and it has been causing problems on the
-production environment frequently. This usually results in a significant drop
-of the CI/CD Apdex score, and sometimes even causes a significant performance
+This mechanism was very inefficient, and it had been causing problems on the
+production environment frequently. This usually resulted in a significant drop
+of the CI/CD Apdex score, and sometimes even caused a significant performance
degradation in the production environment.
-There are multiple other strategies that can improve performance and
-reliability. We can use [Redis queuing](https://gitlab.com/gitlab-org/gitlab/-/issues/322972), or
-[a separate table that will accelerate SQL queries used to build queues](https://gitlab.com/gitlab-org/gitlab/-/issues/322766)
-and we want to explore them.
+There were multiple other strategies that we considered to improve performance and
+reliability. We evaluated using [Redis queuing](https://gitlab.com/gitlab-org/gitlab/-/issues/322972), or
+[a separate table that would accelerate SQL queries used to build queues](https://gitlab.com/gitlab-org/gitlab/-/issues/322766).
+We decided to proceed with the latter.
-**Status**: As of October 2021 the new architecture
-[has been implemented on GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/5909#note_680407908).
-The following epic tracks making it generally available:
-[Make the new pending builds architecture generally available](https://gitlab.com/groups/gitlab-org/-/epics/6954).
+In October 2021 we finished shipping the new architecture of builds queuing
+[on GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/5909#note_680407908).
+We then made the new architecture [generally available](https://gitlab.com/groups/gitlab-org/-/epics/6954).
-### Moving big amounts of data is challenging
+### Moving big amounts of data is challenging: IN PROGRESS
We store a significant amount of data in `ci_builds` table. Some of the columns
in that table store serialized user-provided data. Column `ci_builds.options`
@@ -144,24 +145,27 @@ described in a separate architectural blueprint.
## Proposal
-Making GitLab CI/CD product ready for the scale we expect to see in the
-upcoming years is a multi-phase effort.
-
-First, we want to focus on things that are urgently needed right now. We need
-to fix primary keys overflow risk and unblock other teams that are working on
-database partitioning and sharding.
-
-We want to improve known bottlenecks, like
-builds queuing mechanisms that is using the large table, and other things that
-are holding other teams back.
-
-Extending CI/CD metrics is important to get a better sense of how the system
-performs and to what growth should we expect. This will make it easier for us
-to identify bottlenecks and perform more advanced capacity planning.
-
-Next step is to better understand how we can leverage strong time-decay
-characteristic of CI/CD data. This might help us to partition CI/CD dataset to
-reduce the size of CI/CD database tables.
+Below you can find the original proposal made in early 2021 about how we wanted
+to move forward with the CI scaling effort:
+
+> Making GitLab CI/CD product ready for the scale we expect to see in the
+> upcoming years is a multi-phase effort.
+>
+> First, we want to focus on things that are urgently needed right now. We need
+> to fix primary keys overflow risk and unblock other teams that are working on
+> database partitioning and sharding.
+>
+> We want to improve known bottlenecks, like
+> builds queuing mechanisms that is using the large table, and other things that
+> are holding other teams back.
+>
+> Extending CI/CD metrics is important to get a better sense of how the system
+> performs and to what growth should we expect. This will make it easier for us
+> to identify bottlenecks and perform more advanced capacity planning.
+>
+> Next step is to better understand how we can leverage strong time-decay
+> characteristic of CI/CD data. This might help us to partition CI/CD dataset to
+> reduce the size of CI/CD database tables.
## Iterations
@@ -170,15 +174,12 @@ Work required to achieve our next CI/CD scaling target is tracked in the
1. ✓ Migrate primary keys to big integers on GitLab.com.
1. ✓ Implement the new architecture of builds queuing on GitLab.com.
-1. [Make the new builds queuing architecture generally available](https://gitlab.com/groups/gitlab-org/-/epics/6954).
+1. ✓ [Make the new builds queuing architecture generally available](https://gitlab.com/groups/gitlab-org/-/epics/6954).
1. [Partition CI/CD data using time-decay pattern](../ci_data_decay/index.md).
## Status
-|-------------|--------------|
-| Created at | 21.01.2021 |
-| Approved at | 26.04.2021 |
-| Updated at | 28.02.2022 |
+Created at 21.01.2021, approved at 26.04.2021.
Status: In progress.
diff --git a/doc/architecture/blueprints/cloud_native_build_logs/index.md b/doc/architecture/blueprints/cloud_native_build_logs/index.md
index 3a06d73141b..b77d7998fc8 100644
--- a/doc/architecture/blueprints/cloud_native_build_logs/index.md
+++ b/doc/architecture/blueprints/cloud_native_build_logs/index.md
@@ -1,7 +1,7 @@
---
stage: none
group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
comments: false
description: 'Next iteration of build logs architecture at GitLab'
---
diff --git a/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md b/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
index 431bc19ad84..127badabb71 100644
--- a/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
+++ b/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
@@ -1,7 +1,7 @@
---
stage: none
group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
comments: false
description: 'Making GitLab Pages a Cloud Native application - architecture blueprint.'
---
@@ -20,7 +20,7 @@ company behind the project.
This effort is described in more detail
[in the infrastructure team handbook page](https://about.gitlab.com/handbook/engineering/infrastructure/production/kubernetes/gitlab-com/).
-GitLab Pages is tightly coupled with NFS and in order to unblock Kubernetes
+GitLab Pages is tightly coupled with NFS and to unblock Kubernetes
migration a significant change to GitLab Pages' architecture is required. This
is ongoing work that we started more than a year ago. This blueprint
might be useful to understand why it is important and what the roadmap is.
diff --git a/doc/architecture/blueprints/composable_codebase_using_rails_engines/index.md b/doc/architecture/blueprints/composable_codebase_using_rails_engines/index.md
index 5f0f0a7aa63..4111e2ef056 100644
--- a/doc/architecture/blueprints/composable_codebase_using_rails_engines/index.md
+++ b/doc/architecture/blueprints/composable_codebase_using_rails_engines/index.md
@@ -1,7 +1,7 @@
---
stage: none
group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
comments: false
description: 'Making a GitLab codebase composable - allowing to run parts of the application'
---
@@ -50,7 +50,7 @@ codebase without clear boundaries results in a number of problems and inefficien
we usually need to run a whole test suite to confidently know which parts are affected. This to
some extent can be improved by building a heuristic to aid this process, but it is prone to errors and hard
to keep accurate at all times
-- All components need to be loaded at all times in order to run only parts of the application
+- All components need to be loaded at all times to run only parts of the application
- Increased resource usage, as we load parts of the application that are rarely used in a given context
- The high memory usage results in slowing the whole application as it increases GC cycles duration
creating significantly longer latency for processing requests or worse cache usage of CPUs
@@ -208,7 +208,7 @@ graph LR
### Application Layers on GitLab.com
-Due to its scale, GitLab.com requires much more attention to run. This is needed in order to better manage resources
+Due to its scale, GitLab.com requires much more attention to run. This is needed to better manage resources
and provide SLAs for different functional parts. The chart below provides a simplistic view of GitLab.com application layers.
It does not include all components, like Object Storage nor Gitaly nodes, but shows the GitLab Rails dependencies between
different components and how they are configured on GitLab.com today:
@@ -543,7 +543,7 @@ Controllers, Serializers, some presenters and some of the Grape:Entities are als
Potential challenges with moving Controllers:
-- We needed to extend `Gitlab::Patch::DrawRoute` in order to support `engines/web_engine/config/routes` and `engines/web_engine/ee/config/routes` in case when `web_engine` is loaded. Here is potential [solution](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/53720#note_506957398).
+- We needed to extend `Gitlab::Patch::DrawRoute` to support `engines/web_engine/config/routes` and `engines/web_engine/ee/config/routes` when `web_engine` is loaded. Here is a potential [solution](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/53720#note_506957398).
- `Gitlab::Routing.url_helpers` paths are used in models and services that could be used by Sidekiq (for example `Gitlab::Routing.url_helpers.project_pipelines_path` is used by [ExpirePipelineCacheService](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/ci/expire_pipeline_cache_service.rb#L20) in [ExpirePipelineCacheWorker](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/workers/expire_pipeline_cache_worker.rb#L18))
### Packwerk
diff --git a/doc/architecture/blueprints/consolidating_groups_and_projects/index.md b/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
index df8686ed0aa..433c23bf188 100644
--- a/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
+++ b/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
@@ -1,7 +1,7 @@
---
stage: none
group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
comments: false
description: Consolidating groups and projects
---
diff --git a/doc/architecture/blueprints/container_registry_metadata_database/index.md b/doc/architecture/blueprints/container_registry_metadata_database/index.md
index 9d40593d7ce..58d59fe5737 100644
--- a/doc/architecture/blueprints/container_registry_metadata_database/index.md
+++ b/doc/architecture/blueprints/container_registry_metadata_database/index.md
@@ -1,7 +1,7 @@
---
stage: Package
group: Package
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
comments: false
description: 'Container Registry metadata database'
---
diff --git a/doc/architecture/blueprints/database/scalability/patterns/index.md b/doc/architecture/blueprints/database/scalability/patterns/index.md
index 4a9bb003763..ec00d757377 100644
--- a/doc/architecture/blueprints/database/scalability/patterns/index.md
+++ b/doc/architecture/blueprints/database/scalability/patterns/index.md
@@ -1,7 +1,7 @@
---
stage: Data Stores
group: Database
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
comments: false
description: 'Learn how to scale the database through the use of best-of-class database scalability patterns'
---
diff --git a/doc/architecture/blueprints/database/scalability/patterns/read_mostly.md b/doc/architecture/blueprints/database/scalability/patterns/read_mostly.md
index 0780ae3c4d5..6cf8e17edeb 100644
--- a/doc/architecture/blueprints/database/scalability/patterns/read_mostly.md
+++ b/doc/architecture/blueprints/database/scalability/patterns/read_mostly.md
@@ -1,7 +1,7 @@
---
stage: Data Stores
group: Database
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
comments: false
description: 'Learn how to scale operating on read-mostly data at scale'
---
diff --git a/doc/architecture/blueprints/database/scalability/patterns/time_decay.md b/doc/architecture/blueprints/database/scalability/patterns/time_decay.md
index 7a64f0cb7c6..ff5f7c25ea1 100644
--- a/doc/architecture/blueprints/database/scalability/patterns/time_decay.md
+++ b/doc/architecture/blueprints/database/scalability/patterns/time_decay.md
@@ -1,7 +1,7 @@
---
stage: Data Stores
group: Database
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
comments: false
description: 'Learn how to operate on large time-decay data'
---
@@ -27,7 +27,7 @@ application.
Let's first consider entities with no inherent time-related bias for their data.
A record for a user or a project may be equally important and frequently accessed, irrelevant to when
-it was created. We can not predict by using a user's `id` or `created_at` how often the related
+it was created. We cannot predict by using a user's `id` or `created_at` how often the related
record is accessed or updated.
On the other hand, a good example for datasets with extreme time-decay effects are logs and time
@@ -91,7 +91,7 @@ a maximum of a month of events, restricted to 6 months in the past.
### Immutability
The third characteristic of time-decay data is that their **time-decay status does not change**.
-Once they are considered "old", they can not switch back to "new" or relevant again.
+Once they are considered "old", they cannot switch back to "new" or relevant again.
This definition may sound trivial, but we have to be able to make operations over "old" data **more**
expensive (for example, by archiving or moving them to less expensive storage) without having to worry about
diff --git a/doc/architecture/blueprints/database_testing/index.md b/doc/architecture/blueprints/database_testing/index.md
index 5bc9528d568..3f8041ea416 100644
--- a/doc/architecture/blueprints/database_testing/index.md
+++ b/doc/architecture/blueprints/database_testing/index.md
@@ -1,7 +1,7 @@
---
stage: none
group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
comments: false
description: 'Database Testing'
---
@@ -84,7 +84,7 @@ The short-term focus is on testing regular migrations (typically schema changes)
In order to secure this process and meet compliance goals, the runner environment is treated as a *production* environment and similarly locked down, monitored and audited. Only Database Maintainers have access to the CI pipeline and its job output. Everyone else can only see the results and statistics posted back on the merge request.
-We implement a secured CI pipeline on <https://ops.gitlab.net> that adds the execution steps outlined above. The goal is to secure this pipeline in order to solve the following problem:
+We implement a secured CI pipeline on <https://ops.gitlab.net> that adds the execution steps outlined above. The goal is to secure this pipeline to solve the following problem:
Make sure we strongly protect production data, even though we allow everyone (GitLab team/developers) to execute arbitrary code on the thin-clone which contains production data.
diff --git a/doc/architecture/blueprints/feature_flags_development/index.md b/doc/architecture/blueprints/feature_flags_development/index.md
index 08253ac883c..866be9d8a70 100644
--- a/doc/architecture/blueprints/feature_flags_development/index.md
+++ b/doc/architecture/blueprints/feature_flags_development/index.md
@@ -1,12 +1,12 @@
---
stage: none
group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
comments: false
description: 'Internal usage of Feature Flags for GitLab development'
---
-# Usage of Feature Flags for GitLab development
+# Architectural discussion of feature flags
Usage of feature flags has become crucial for the development of GitLab. The
feature flags are a convenient way to ship changes early, and safely roll out
diff --git a/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md b/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
index b22636ac1d9..19fd995bead 100644
--- a/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
+++ b/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
@@ -1,7 +1,7 @@
---
stage: Configure
group: Configure
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
comments: false
description: 'GitLab to Kubernetes communication'
---
diff --git a/doc/architecture/blueprints/image_resizing/index.md b/doc/architecture/blueprints/image_resizing/index.md
index f2fd7543b90..dd7ce27f459 100644
--- a/doc/architecture/blueprints/image_resizing/index.md
+++ b/doc/architecture/blueprints/image_resizing/index.md
@@ -1,7 +1,7 @@
---
stage: none
group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
comments: false
description: 'Image Resizing'
---
diff --git a/doc/architecture/blueprints/pods/index.md b/doc/architecture/blueprints/pods/index.md
index fc33a4f441b..01d56c483ea 100644
--- a/doc/architecture/blueprints/pods/index.md
+++ b/doc/architecture/blueprints/pods/index.md
@@ -22,12 +22,14 @@ We use the following terms to describe components and properties of the Pods arc
### Pod
-A Pod is a set of infrastructure components that contains multiple workspaces that belong to different organizations. The components include both datastores (PostgreSQL, Redis etc.) and stateless services (web etc.). The infrastructure components provided within a Pod are shared among workspaces but not shared with other Pods. This isolation of infrastructure components means that Pods are independent from each other.
+A Pod is a set of infrastructure components that contains multiple top-level namespaces that belong to different organizations. The components include both datastores (PostgreSQL, Redis etc.) and stateless services (web etc.). The infrastructure components provided within a Pod are shared among organizations and their top-level namespaces but not shared with other Pods. This isolation of infrastructure components means that Pods are independent from each other.
+
+![Term Pod](term-pod.png)
#### Pod properties
- Each pod is independent from the others
-- Infrastructure components are shared by workspaces within a Pod
+- Infrastructure components are shared by organizations and their top-level namespaces within a Pod
- More Pods can be provisioned to provide horizontal scalability
- A failing Pod does not lead to failure of other Pods
- Noisy neighbor effects are limited to within a Pod
@@ -36,23 +38,50 @@ A Pod is a set of infrastructure components that contains multiple workspaces th
Discouraged synonyms: GitLab instance, cluster, shard
-### Workspace
+### Cluster
+
+A cluster is a collection of Pods.
+
+![Term Cluster](term-cluster.png)
+
+#### Cluster properties
+
+- A cluster holds cluster-wide metadata, for example Users, Routes, Settings.
+
+Discouraged synonyms: whale
+
+### Organizations
+
+GitLab references [Organizations in the initial set up](../../../topics/set_up_organization.md) and users can add a (free text) organization to their profile. There is no Organization entity established in the GitLab codebase.
+
+As part of delivering Pods, we propose the introduction of an `organization` entity. Organizations would represent billable entities or customers.
+
+Organizations are a known concept, present for example in [AWS](https://docs.aws.amazon.com/whitepapers/latest/organizing-your-aws-environment/core-concepts.html) and [GCP](https://cloud.google.com/resource-manager/docs/cloud-platform-resource-hierarchy#organizations).
+
+Organizations work under the following assumptions:
+
+1. Users care about what happens within their organizations.
+1. Features need to work within an organization.
+1. Only a few features need to work across organizations.
+1. Users understand that the majority of pages they view are only scoped to a single organization at a time.
+1. Organizations are located on a single pod.
-A [workspace](../../../user/workspace/index.md) is the name for the top-level namespace that is used by organizations to manage everything GitLab. It will provide similar administrative capabilities to a self-managed instance.
+![Term Organization](term-organization.png)
-See more in the [workspace group overview](https://about.gitlab.com/direction/manage/workspace/#overview).
+#### Organization properties
-#### Workspace properties
+- Top-level namespaces belong to organizations
+- Users can be members of different organizations
+- Organizations are isolated from each other by default, meaning that cross-namespace features will only work for namespaces that exist within a single organization
+- User namespaces must not belong to an organization
-- Workspaces are isolated from each other by default
-- A workspace is located on a single Pod
-- Workspaces share the resources provided by a Pod
+Discouraged synonyms: Billable entities, customers
### Top-Level namespace
A top-level namespace is the logical object container in the code that represents all groups, subgroups and projects that belong to an organization.
-A top-level namespace is the root of nested collection namespaces and projects. The namespace and its related entities form a tree-like hierarchy: Namespaces are the nodes of the tree, projects are the leaves. An organization usually contains a single top-level namespace, called a workspace.
+A top-level namespace is the root of nested collection namespaces and projects. The namespace and its related entities form a tree-like hierarchy: Namespaces are the nodes of the tree, projects are the leaves.
Example:
@@ -61,21 +90,30 @@ Example:
- `gitlab-org` is a `top-level namespace`; the root for all groups and projects of an organization
- `gitlab` is a `project`; a project of the organization.
+Top-level namespaces may [be replaced by workspaces](https://gitlab.com/gitlab-org/gitlab/-/issues/368237#high-level-goals). This proposal only uses the term top-level namespaces as the workspace definition is ongoing.
+
Discouraged synonyms: Root-level namespace
+![Term Top-level Namespace](term-top-level-namespace.png)
+
#### Top-level namespace properties
-Same as workspaces.
+- Top-level namespaces belonging to an organization are located on the same Pod
+- Top-level namespaces can interact with other top-level namespaces that belong to the same organization
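
The two properties above can be sketched as a small model. This is only an illustration under stated assumptions: the `TopLevelNamespace` class, the `ORGANIZATION_POD` table, the `Example Corp` organization, and both helper functions are hypothetical names that do not exist in the GitLab codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopLevelNamespace:
    path: str          # for example "gitlab-org"
    organization: str  # hypothetical organization identifier


# Hypothetical placement table: each organization lives on exactly one Pod.
ORGANIZATION_POD = {"GitLab": "pod_0", "Example Corp": "pod_0"}


def pod_for(namespace: TopLevelNamespace) -> str:
    """A top-level namespace inherits its Pod from its organization."""
    return ORGANIZATION_POD[namespace.organization]


def can_interact(a: TopLevelNamespace, b: TopLevelNamespace) -> bool:
    """Top-level namespaces may interact only within the same organization."""
    return a.organization == b.organization


gitlab_org = TopLevelNamespace("gitlab-org", "GitLab")
gitlab_com = TopLevelNamespace("gitlab-com", "GitLab")
other = TopLevelNamespace("example", "Example Corp")
```

Because the Pod is looked up through the organization rather than stored per namespace, two namespaces of one organization cannot end up on different Pods in this model.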
### Users
-Users are available globally and not restricted to a single Pod. Users can create multiple workspaces and they may be members of several workspaces and contribute to them. Because users' activity is not limited to an individual Pod, their activity needs to be aggregated across Pods to reflect all their contributions (for example TODOs). This means, the Pods architecture may need to provide a central dashboard.
+Users are available globally and not restricted to a single Pod. Users can be members of many different organizations with varying permissions. Inside organizations, users can create multiple top-level namespaces. User activity is not limited to a single organization but their contributions (for example TODOs) are only aggregated within an organization. This avoids the need for aggregating across pods.
#### User properties
- Users are shared globally across all Pods
-- Users can create multiple workspaces
-- Users can be a member of multiple workspaces
+- Users can create multiple top-level namespaces
+- Users can be a member of multiple top-level namespaces
+- Users can be a member of multiple organizations
+- Users can administrate organizations
+- User activity is aggregated within an organization
+- Every user has one personal namespace
## Goals
@@ -87,7 +125,7 @@ Pods provide a horizontally scalable solution because additional Pods can be cre
### Increased availability
-A major challenge for shared-infrastructure architectures is a lack of isolation between workspaces. This can lead to noisy neighbor effects. A organization's behavior inside a workspace can impact all other workspaces. This is highly undesirable. Pods provide isolation at the pod level. A group of organizations is fully isolated from other organizations located on a different Pod. This minimizes noisy neighbor effects while still benefiting from the cost-efficiency of shared infrastructure.
+A major challenge for shared-infrastructure architectures is a lack of isolation between top-level namespaces. This can lead to noisy neighbor effects. An organization's behavior inside a top-level namespace can impact all other organizations. This is highly undesirable. Pods provide isolation at the pod level. A group of organizations is fully isolated from other organizations located on a different Pod. This minimizes noisy neighbor effects while still benefiting from the cost-efficiency of shared infrastructure.
Additionally, Pods provide a way to implement disaster recovery capabilities. Entire Pods may be replicated to read-only standbys with automatic failover capabilities.
@@ -105,35 +143,113 @@ Pods would provide a solution for organizations in the small to medium business
(See [segmentation definitions](https://about.gitlab.com/handbook/sales/field-operations/gtm-resources/#segmentation).)
Larger organizations may benefit substantially from [GitLab Dedicated](../../../subscriptions/gitlab_dedicated/index.md).
+At this moment, GitLab.com has "social-network"-like capabilities that may not fit well into a more isolated organization model. Removing those features, however, poses some challenges:
+
+1. How will existing `gitlab-org` contributors contribute to the namespace?
+1. How do we move existing top-level namespaces into the new model (effectively breaking their social features)?
+
+We should evaluate if the SMB and mid market segment is interested in these features, or if not having them is acceptable in most cases.
+
## High-level architecture problems to solve
A number of technical issues need to be resolved to implement Pods (in no particular order). This section will be expanded.
-1. How are users of an organization routed to the correct Pod containing their workspace?
+1. How are users of an organization routed to the correct Pod?
1. How do users authenticate?
1. How are Pods rebalanced?
1. How are Pods provisioned?
1. How can Pods implement disaster recovery capabilities?
-## Iteration 1
+## Iteration plan
+
+We can't ship the entire Pods architecture in one go - it is too large. Instead, we are adopting an iteration plan that provides value along the way.
+
+1. Introduce organizations
+1. Migrate existing top-level namespaces to organizations
+1. Create new organizations on `pod_0`
+1. Migrate existing organizations from `pod_0` to `pod_n`
+1. Add additional Pod capabilities (DR, Regions)
+
+### Iteration 0: Introduce organizations
+
+In the first iteration, we introduce the concept of an organization
+as a way to group top-level namespaces together. Support for organizations **does not require any Pods work** but having them will make all subsequent iterations of Pods simpler. This is mainly because we can group top-level namespaces for a single organization onto a Pod. Within an organization all interactions work as normal but we eliminate any cross-organizational interactions except in well-defined cases (for example, forking).
+
+This means that we don't have a large number of cross-pod interactions.
+
+Introducing organizations allows GitLab to move towards a multi-tenant system that is similar to Discord's with a single user account but many different "servers" - our organizations - that allow users to switch context. This model harmonizes the UX across self-managed and our SaaS Platforms and is a good fit for Pods.
+
+Organizations solve the following problems:
+
+1. We can group top-level namespaces by organization. It is very similar to the initial concept of "instance groups". For example these two top-level namespaces would belong to the organization `GitLab`:
+ 1. `https://gitlab.com/gitlab-org/`
+ 1. `https://gitlab.com/gitlab-com/`
+1. We can isolate organizations from each other. Top-level namespaces of the same organization can interact within organizations but are not allowed to interact with other namespaces in other organizations. This is useful for customers because it means an organization provides clear boundaries - similar to a self-managed instance. This means we don't have to aggregate user dashboards across everything and can locally scope them to organizations.
+1. We don't need to define hierarchies inside an organization. It is a container that could be filled with whatever hierarchy / entity set makes sense (workspaces, top-level namespaces etc.)
+1. Self-managed instances would set a default organization.
+1. Organizations can control user-profiles in a central way. This could be achieved by having an organization specific user-profile. Such a profile makes it possible for the organization administrators to control the user role in a company, enforce user emails, or show a graphical indicator of a user being part of the organization. An example would be a "GitLab Employee stamp" on comments.
+
+![Move to Organizations](iteration0-organizations-introduction.png)
+
+#### Why would customers opt-in to Organizations?
+
+By introducing organizations and Pods we can improve the reliability, performance and availability of our SaaS Platforms.
+
+The first iteration of organizations would also have some benefits by providing more isolation. A simple example would be that `@` mentions could be scoped to an organization.
+
+Future iterations would create additional value but are beyond the scope of this blueprint.
+
+Organizations will likely be required in the future as well.
+
+#### Initial user experience
+
+1. We create a default `GitLab.com public` organization and assign all public top-level namespaces to it. This allows existing users to access all the data on GitLab.com, exactly as they do now.
+1. Any user wanting to opt in to the benefits of organizations will need to set a single default organization. Any attempt by these users to load a global page like `/dashboard` will redirect to `/-/organizations/<DEFAULT_ORGANIZATION>/dashboard`.
+1. New users that opted in to organizations will only ever see data that is related to a single organization. Upon login, data is shown for the default organization. It will be clear to the user how they can switch to a different organization. Users can still navigate to the `GitLab.com` organization but they won't see TODOs from their new organizations in any such views. Instead they'd need to navigate directly to `/organizations/my-company/-/dashboard`.
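
The redirect described in the list above can be sketched as a pure function. This is a minimal sketch, not the proposed implementation: the `GLOBAL_PAGES` set and the `resolve_path` function are hypothetical, and the target path follows the `/-/organizations/<DEFAULT_ORGANIZATION>/dashboard` form used in the second item.

```python
from typing import Optional

# Hypothetical set of global pages that become organization-scoped.
GLOBAL_PAGES = {"/dashboard"}


def resolve_path(path: str, default_organization: Optional[str]) -> str:
    """Return the path a request for `path` should end up on.

    Users who have not opted in (no default organization) keep today's
    behavior. Opted-in users are sent to the organization-scoped page.
    """
    if default_organization is None:
        return path
    if path in GLOBAL_PAGES:
        return f"/-/organizations/{default_organization}{path}"
    return path
```

In this sketch, a user whose default organization is `my-company` and who requests `/dashboard` is sent to `/-/organizations/my-company/dashboard`, while any path outside `GLOBAL_PAGES` is returned unchanged.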
+
+### Migrating to Organizations
+
+Existing customers could also opt-in to migrate their existing top-level paid namespaces to become part of an organization. In most cases this will be a 1-to-1 mapping. But in some cases it may allow a customer to move multiple top-level namespaces into one organization (for example GitLab).
+
+Migrating to Organizations would be optional. We could even recruit a few beta testers early on to see if this works for them. GitLab itself could dogfood organizations and we'd surface a lot of issues restricting interactions with other namespaces.
+
+## Iteration 1 - Introduce Pod US 0
+
+### GitLab.com as Pod US0
+
+GitLab.com will be treated as the first pod, `Pod US 0`. It will be unique and much larger than newly created pods. All existing top-level namespaces and organizations will remain on `Pod US 0` in the first iteration.
+
+### Users are globally available
+
+Users are globally available and the same for all pods. This means that user data needs to be handled separately, for example via decomposition, see [!95941](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/95941).
+
+### Pod groundwork
+
+In this iteration, we'll lay all the groundwork to support a second Pod for new organizations. This will be transparent to customers.
+
+## Iteration 2 - Introduce Pod US 1
+
+### Add new organizations to Pod US 1
+
+After we are ready to support a second Pod, newly created organizations are located by default on `Pod US 1`. The user experience for organizations is already well established.
+
+### Migrate existing organizations from Pod US 0 to Pod US 1
-Ultimately, a Pods architecture should offer the same user experience as self-managed and GitLab dedicated. However, at this moment GitLab.com has many more "social-network"-like capabilities that will be difficult to implement with a Pods architecture. We should evaluate if the SMB and mid market segment is interested in these features, or if not having them is acceptable in most cases.
+We know that we'll have to move organizations from `Pod US 0` to other pods to reduce its size and ultimately retire the existing GitLab.com architecture.
-The first iteration of Pods will still contain some limitations that would break cross-workspace workflows. This means it may only be acceptable for new customers, or for existing customers that are briefed.
+By introducing organizations early, we should be able to draw strong "boundaries" across organizations and support migrating existing organizations to a new Pod.
-Limitations are:
+The first organization to migrate is likely going to be GitLab itself - if we can dogfood this, we are likely going to be successful with other organizations as well.
-- An organization can create only a single workspace.
-- Workspaces are isolated from each other. This means cross-workspace workflows are broken.
+## Iteration 3 - Introduce Regions
-## Iteration 2
+We can now leverage the Pods architecture to introduce Regions.
-Based on user research, we may want to change certain features to work across namespaces to allow organizations to interact with each other in specific circumstances. We may also allow organizations to have more than one workspace. This is particularly relevant for organizations with sub-divisions, or multi-national organizations that want to have workspaces in different regions.
+## Iteration 4 - Introduce cross-organizational interactions as needed
-Additional features:
+Based on user research, we may want to change certain features to work across organizations. Examples include:
-- Specific features allow for cross-workspace interactions, for example forking, search.
-- An organization can own multiple workspaces on different Pods.
+- Specific features allow for cross-organization interactions, for example forking, search.
### Links
diff --git a/doc/architecture/blueprints/pods/iteration0-organizations-introduction.png b/doc/architecture/blueprints/pods/iteration0-organizations-introduction.png
new file mode 100644
index 00000000000..5f5cad7b169
--- /dev/null
+++ b/doc/architecture/blueprints/pods/iteration0-organizations-introduction.png
Binary files differ
diff --git a/doc/architecture/blueprints/pods/term-cluster.png b/doc/architecture/blueprints/pods/term-cluster.png
new file mode 100644
index 00000000000..f52e31b52ad
--- /dev/null
+++ b/doc/architecture/blueprints/pods/term-cluster.png
Binary files differ
diff --git a/doc/architecture/blueprints/pods/term-organization.png b/doc/architecture/blueprints/pods/term-organization.png
new file mode 100644
index 00000000000..f605adb124d
--- /dev/null
+++ b/doc/architecture/blueprints/pods/term-organization.png
Binary files differ
diff --git a/doc/architecture/blueprints/pods/term-pod.png b/doc/architecture/blueprints/pods/term-pod.png
new file mode 100644
index 00000000000..d8f79df2f29
--- /dev/null
+++ b/doc/architecture/blueprints/pods/term-pod.png
Binary files differ
diff --git a/doc/architecture/blueprints/pods/term-top-level-namespace.png b/doc/architecture/blueprints/pods/term-top-level-namespace.png
new file mode 100644
index 00000000000..c1cd317d878
--- /dev/null
+++ b/doc/architecture/blueprints/pods/term-top-level-namespace.png
Binary files differ
diff --git a/doc/architecture/blueprints/rate_limiting/index.md b/doc/architecture/blueprints/rate_limiting/index.md
index 692cef4b11d..2ed66f22b53 100644
--- a/doc/architecture/blueprints/rate_limiting/index.md
+++ b/doc/architecture/blueprints/rate_limiting/index.md
@@ -65,7 +65,7 @@ Inc._
- There is no way to automatically notify a user when they are approaching thresholds.
- There is no single way to change limits for a namespace / project / user / customer.
- There is no single way to monitor limits through real-time metrics.
-- There is no framework for hierarchical limit configuration (instance / namespace / sub-group / project).
+- There is no framework for hierarchical limit configuration (instance / namespace / subgroup / project).
- We allow disabling rate-limiting for some marquee SaaS customers, but this
increases a risk for those same customers. We should instead be able to set
higher limits.
@@ -357,7 +357,7 @@ hierarchy. Choosing a proper solution will require a thoughtful research.
1. Build application limits API in a way that it can be easily extracted to a separate service.
1. Build application limits definition in a way that is independent from the Rails application.
1. Build tooling that produce consistent behavior and results across programming languages.
-1. Build the new framework in a way that we can extend to allow self-managed admins to customize limits.
+1. Build the new framework in a way that we can extend to allow self-managed administrators to customize limits.
1. Maintain consistent features and behavior across SaaS and self-managed codebase.
1. Be mindful about a cognitive load added by the hierarchical limits, aim to reduce it.
@@ -388,7 +388,7 @@ Proposal:
| Author | Hayley Swimelar |
| Engineering Leader | Sam Goldstein |
| Product Manager | |
-| Architecture Evolution Coach | |
+| Architecture Evolution Coach | Andrew Newdigate |
| Recommender | |
| Recommender | |
| Recommender | |
diff --git a/doc/architecture/blueprints/runner_scaling/index.md b/doc/architecture/blueprints/runner_scaling/index.md
index 8f7062a1148..415884449ed 100644
--- a/doc/architecture/blueprints/runner_scaling/index.md
+++ b/doc/architecture/blueprints/runner_scaling/index.md
@@ -43,10 +43,10 @@ to be able to keep using this and ship fixes and updates needed for our use case
and the documentation for it has been removed from the official page. This
means that the original reason to use Docker Machine is no longer valid too.
-To keep supporting our customers and the wider community we need to design a
-new mechanism for GitLab Runner auto-scaling. It not only needs to support
-auto-scaling, but it also needs to do that in the way to enable us to build on
-top of it to improve efficiency, reliability and availability.
+To keep supporting our customers and the wider community and to improve our SaaS runners
+maintenance we need to design a new mechanism for GitLab Runner auto-scaling. It not only
+needs to support auto-scaling, but it also needs to do that in the way to enable us to
+build on top of it to improve efficiency, reliability and availability.
We call this new mechanism the "next GitLab Runner Scaling architecture".
@@ -62,6 +62,66 @@ subject to change or delay. The development, release and timing of any
products, features, or functionality remain at the sole discretion of GitLab
Inc._
+## Continuing building on Docker Machine
+
+At this moment one of our core products - GitLab Runner - and one of its most
+important features - ability to auto-scale job execution environments - depends
+on an external product that is abandoned.
+
+The Docker Machine project itself is also hard to maintain. Its design is starting to
+show its age, which makes it hard to add new features and fixes. Its huge
+codebase, combined with a lack of internal knowledge about it, makes it
+hard for our maintainers to support and properly handle incoming feature
+requests and community contributions.
+
+Docker Machine and its 20+ integrated drivers for cloud and virtualization
+providers also create another set of problems, such as:
+
+- Each cloud/virtualization environment brings features that come and go
+ and we would need to maintain support for them (add new features, fix
+ bugs).
+
+- We basically need to become experts in each virtualization/cloud
+ provider to properly support integration with their API.
+
+- Every single provider that Docker Machine integrates with has its
+ bugs, security releases, vulnerabilities - to maintain the project properly
+ we would need to be on top of all of that and handle updates whenever
+ they are needed.
+
+Another problem is the fact that Docker Machine, from its beginnings, was
+focused on managing Linux based instances only. Although Docker eventually
+got official and native integration on Windows, Docker Machine never
+followed this step. Nor is it designed to make such integration easy.
+
+There is also no support for MacOS. This one is obvious - Docker Machine is a
+tool to maintain hosts for Docker Engine and there is no native Docker Engine
+for MacOS. And by native we mean MacOS containers executed within the MacOS
+operating system. The Docker for MacOS product is not native support - it's just
+tooling and a virtualized Linux instance installed with it that makes it
+easier to develop **Linux containers** on MacOS development instances.
+
+This means that only one of our three officially supported platforms -
+Linux, Windows and MacOS - has fully-featured support for CI/CD
+auto-scaling. For Windows there is a possibility to use Kubernetes (which in
+some cases has limitations) and maybe with a lot of effort we could bring
+support for Windows into Docker Machine. But for MacOS, there is no
+auto-scaling solution provided natively by GitLab Runner.
+
+This is a huge limitation for our users and a frequently requested feature.
+It's also a limitation for our SaaS runners offering. We've managed to
+create some sort of auto-scaling for our SaaS Windows and SaaS MacOS runners
+by hacking around the Custom executor. But experience from the past three years shows
+that it's not the best way of doing this. And yet, after this time, Windows
+and MacOS runners autoscaling still lacks a lot of the performance and feature support
+that we have with our SaaS Linux runners.
+
+To keep supporting our customers and the wider community and to improve our
+SaaS runners maintenance we need to design a new mechanism for GitLab Runner
+auto-scaling. It not only needs to support auto-scaling, but it also needs to
+do that in the way to enable us to build on top of it to improve efficiency,
+reliability and availability.
+
## Proposal
Currently, GitLab Runner auto-scaling can be configured in a few ways. Some
@@ -94,7 +154,7 @@ data that can be shared between job runs.
Because there is no viable replacement and we might be unable to support all
cloud providers that Docker Machine used to support, the key design requirement
is to make it really simple and easy for the wider community to write a custom
-GitLab auto-scaling plugin, whatever cloud provider they might be using. We
+GitLab plugin for whatever cloud provider they might be using. We
want to design a simple abstraction that users will be able to build on top, as
will we to support existing workflows on GitLab.com.
@@ -129,12 +189,11 @@ the need of rebuilding GitLab Runner whenever it happens.
### 💡 Write a solid documentation about how to build your own plugin
-It is important to show users how to build an auto-scaling plugin, so that they
+It is important to show users how to build a plugin, so that they
can implement support for their own cloud infrastructure.
-Building new plugins should be simple, and with the support of great
-documentation it should not require advanced skills, like understanding how
-gRPC works. We want to design the plugin system in a way that the entry barrier
+Building new plugins should be simple and supported with great
+documentation. We want to design the plugin system in a way that the entry barrier
for contributing new plugins is very low.
### 💡 Build a PoC to run multiple builds on a single machine
@@ -171,7 +230,128 @@ configures the Docker daemon there to allow external authenticated requests. It
stores credentials to such ephemeral Docker environments on disk. Once a
machine has been provisioned and made available for GitLab Runner Manager to
run builds, it is using one of the existing executors to run a user-provided
-script. In auto-scaling, this is typically done using Docker executor.
+script. In auto-scaling, this is typically done using the Docker executor.
+
+### Separation of concerns
+
+There are several concerns represented in the current architecture. They are
+coupled in the current implementation so we will break them out here to consider
+them each separately.
+
+- **Virtual Machine (VM) shape**. The underlying provider of a VM requires configuration to
+ know what kind of machine to create, for example cores, memory, and failure
+ domain. This information is very provider specific.
+- **VM lifecycle management**. Multiple machines will be created and a
+ system must keep track of which machines belong to this executor. Typically
+ a cloud provider will have a way to manage a set of homogenous machines.
+ E.g. GCE Instance Group. The basic operations are increase, decrease and
+ usually delete a specific machine.
+- **VM autoscaling**. In addition to low-level lifecycle management,
+ job-aware capacity decisions must be made to the set of machines to provide
+ capacity when it is needed but not maintain excess capacity for cost reasons.
+- **Job to VM mapping (routing)**. Currently the system assigns only one job to a
+ given machine. A machine may be reused based on the specific executor
+ configuration.
+- **In-VM job execution**. Within each VM a job must be driven through
+ various pre-defined stages and results and trace information returned
+ to the Runner system. These details are highly dependent on the VM
+ architecture and operating system as well as Executor type.
+
+The current architecture has several points of coupling between concerns.
+Coupling reduces opportunities for abstraction (e.g. community supported
+plugins) and increases complexity, making the code harder to understand,
+test, maintain and extend.
+
+A primary design decision will be which concerns to externalize to the plugin
+and which should remain with the runner system. The current implementation
+has several abstractions internally which could be used as cut points for a
+new abstraction.
+
+For example the [`Build`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/build.go#L125)
+type uses the [`GetExecutorProvider`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/executor.go#L171)
+function to get an executor provider based on a dispatching executor string.
+Various executor types register with the system by being imported and calling
+[`RegisterExecutorProvider`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/executor.go#L154)
+during initialization. Here the abstractions are the [`ExecutorProvider`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/executor.go#L80)
+and [`Executor`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/executor.go#L59)
+interfaces.
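
The registration pattern described above can be sketched as follows. This is a minimal illustration, not the GitLab Runner source: the `Executor` and `ExecutorProvider` interfaces here are heavily simplified stand-ins, and the `shellExecutor` and `shellProvider` types are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Executor and ExecutorProvider are simplified, hypothetical stand-ins for
// the interfaces of the same name; the real ones carry more methods.
type Executor interface {
	Run(script string) error
}

type ExecutorProvider interface {
	Create() Executor
}

// providers maps a dispatching executor string to its provider.
var providers = map[string]ExecutorProvider{}

// RegisterExecutorProvider is called by an executor package while it
// initializes. Registering the same name twice is a programming error.
func RegisterExecutorProvider(name string, p ExecutorProvider) {
	if _, exists := providers[name]; exists {
		panic("executor provider already registered: " + name)
	}
	providers[name] = p
}

// GetExecutorProvider looks a provider up by its executor string.
func GetExecutorProvider(name string) (ExecutorProvider, error) {
	p, ok := providers[name]
	if !ok {
		return nil, errors.New("unknown executor: " + name)
	}
	return p, nil
}

// shellExecutor is a toy executor that only prints the script it was given.
type shellExecutor struct{}

func (shellExecutor) Run(script string) error {
	fmt.Println("running:", script)
	return nil
}

type shellProvider struct{}

func (shellProvider) Create() Executor { return shellExecutor{} }

// init mimics registration as a side effect of importing the package.
func init() { RegisterExecutorProvider("shell", shellProvider{}) }

func main() {
	p, err := GetExecutorProvider("shell")
	if err != nil {
		fmt.Println(err)
		return
	}
	_ = p.Create().Run("echo hello")
}
```

The build only ever sees the two interfaces, which is why they are a natural candidate for a cut point.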
+
+Within the `docker+autoscaling` executor the [`machineExecutor`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/executors/docker/machine/machine.go#L19)
+type has a [`Machine`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/helpers/docker/machine.go#L7)
+interface which it uses to acquire a VM during the common [`Prepare`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/executors/docker/machine/machine.go#L71)
+phase. This abstraction primarily creates, accesses and deletes VMs.
+
+There is no current abstraction for the VM autoscaling logic. It is tightly
+coupled with the VM lifecycle and job routing logic. Creating idle capacity
+happens as a side-effect of calling [`Acquire`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/executors/docker/machine/provider.go#L449) on the `machineProvider` while binding a job to a VM.
+
+There is also no current abstraction for in-VM job execution. VM-specific
+commands are generated by the Runner Manager using the [`GenerateShellScript`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/build.go#L336)
+function and [injected](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/build.go#L373)
+into the VM as the manager drives the job execution stages.
+
+### Design principles
+
+Our goal is to design a GitLab Runner plugin system interface that is flexible
+and simple for the wider community to consume. As we cannot build plugins for
+all cloud platforms, we want to ensure a low entry barrier for anyone who needs
+to develop a plugin. We want to allow everyone to contribute.
+
+To achieve this goal, we will follow a few critical design principles. These
+principles will guide our development process for the new plugin system
+abstraction.
+
+#### General high-level principles
+
+- Design the new auto-scaling architecture aiming for having more choices and
+ flexibility in the future, instead of imposing new constraints.
+- Design the new auto-scaling architecture to experiment with running multiple
+ jobs in parallel, on a single machine.
+- Design the new provisioning architecture to replace Docker Machine in a way
+ that the wider community can easily build on top of the new abstractions.
+- The new auto-scaling method should become a core component of the GitLab Runner product so that
+ we can simplify maintenance, use the same tooling, test configuration and Go language
+ setup as we do in our other main products.
+- It should support multiple job execution environments - not only Docker containers
+ on Linux operating system.
+
+ The best design would be to bring auto-scaling as a feature wrapped around
+ our current executors like Docker or Shell.
+
+#### Principles for the new plugin system
+
+- Make the entry barrier for writing a new plugin low.
+- Developing a new plugin should be simple and require only basic knowledge of
+ a programming language and a cloud provider's API.
+- Strive for a balance between the plugin system's simplicity and flexibility.
+ These are not mutually exclusive.
+- Abstract away as many technical details as possible but do not hide them completely.
+- Build an abstraction that serves our community well but allows us to ship it quickly.
+- Invest in a flexible solution, avoid one-way-door decisions, foster iteration.
+- When in doubt, err on the side of making things simpler for the wider community.
+- Limit coupling between concerns to make the system simpler and more extensible.
+- Concerns should live on one side of the plug or the other--not both, which
+ duplicates effort and increases coupling.
+
+#### The most important technical details
+
+- Favor gRPC communication between a plugin and GitLab Runner.
+- Make it possible to version communication interface and support many versions.
+- Make Go a primary language for writing plugins but accept other languages too.
+- Autoscaling mechanism should be fully owned by GitLab.
+
+ Cloud provider autoscalers don't know which VM to delete when scaling down so
+ they make sub-optimal decisions. Rather than teaching all autoscalers about GitLab
+ jobs, we prefer to have one, GitLab-owned autoscaler (not in the plugin).
+
+ It will also ensure that we can shape the future of the mechanism and make decisions
+ that fit our needs and requirements.
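
To illustrate why a job-aware autoscaler can make better scale-down decisions than a generic cloud autoscaler, here is a minimal sketch. The `VM` type and the `PickForRemoval` function are hypothetical names invented for this example; the point is only that knowing which machines run jobs lets the autoscaler delete idle ones first.

```go
package main

import (
	"fmt"
	"sort"
)

// VM is a hypothetical record a job-aware autoscaler keeps per machine.
type VM struct {
	ID         string
	ActiveJobs int
}

// PickForRemoval returns up to n IDs of machines that are safe to delete.
// Only machines without active jobs are candidates, so scaling down never
// interrupts a running job.
func PickForRemoval(vms []VM, n int) []string {
	var idle []string
	for _, vm := range vms {
		if vm.ActiveJobs == 0 {
			idle = append(idle, vm.ID)
		}
	}
	sort.Strings(idle) // deterministic order for the example
	if n < 0 {
		n = 0
	}
	if n < len(idle) {
		idle = idle[:n]
	}
	return idle
}

func main() {
	fleet := []VM{
		{ID: "a", ActiveJobs: 1},
		{ID: "b"},
		{ID: "c"},
		{ID: "d", ActiveJobs: 2},
	}
	// Asking to remove three machines yields only the two idle ones.
	fmt.Println(PickForRemoval(fleet, 3)) // prints [b c]
}
```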
+
+## Plugin boundary proposals
+
+The following are proposals for where to draw the plugin boundary. We will evaluate
+these proposals and others by the design principles and technical constraints
+listed above.
### Custom provider
@@ -204,43 +384,33 @@ document, define requirements and score the solution accordingly. This will
allow us to choose a solution that will work best for us and the wider
community.
-### Design principles
-
-Our goal is to design a GitLab Runner plugin system interface that is flexible
-and simple for the wider community to consume. As we cannot build plugins for
-all cloud platforms, we want to ensure a low entry barrier for anyone who needs
-to develop a plugin. We want to allow everyone to contribute.
+This proposal places VM lifecycle and autoscaling concerns as well as job to
+VM mapping (routing) into the plugin. The build need only ask for a VM and
+it will get one with all aspects of lifecycle and routing already accounted
+for by the plugin.
-To achieve this goal, we will follow a few critical design principles. These
-principles will guide our development process for the new plugin system
-abstraction.
+Rationale: [Description of the Custom Executor Provider proposal](https://gitlab.com/gitlab-org/gitlab-runner/-/issues/28848#note_823321515)
-#### General high-level principles
+### Fleeting VM provider
-1. Design the new auto-scaling architecture aiming for having more choices and
- flexibility in the future, instead of imposing new constraints.
-1. Design the new auto-scaling architecture to experiment with running multiple
- jobs in parallel, on a single machine.
-1. Design the new provisioning architecture to replace Docker Machine in a way
- that the wider community can easily build on top of the new abstractions.
+We can introduce a simpler version of the `Machine` abstraction in the
+form of a "Fleeting" interface. Fleeting provides a low-level interface to
+a homogenous VM group which allows increasing and decreasing the set size
+as well as consuming a VM from within the set.
-#### Principles for the new plugin system
+Plugins for cloud providers and other VM sources are implemented via the
+HashiCorp go-plugin library. This is in practice gRPC over STDIN/STDOUT
+but other wire protocols can also be used.
-1. Make the entry barrier for writing a new plugin low.
-1. Developing a new plugin should be simple and require only basic knowledge of
- a programming language and a cloud provider's API.
-1. Strive for a balance between the plugin system's simplicity and flexibility.
- These are not mutually exclusive.
-1. Abstract away as many technical details as possible but do not hide them completely.
-1. Build an abstraction that serves our community well but allows us to ship it quickly.
-1. Invest in a flexible solution, avoid one-way-door decisions, foster iteration.
-1. When in doubts err on the side of making things more simple for the wider community.
+In order to make use of the new interface, the autoscaling logic is pulled
+out of the Docker Executor and placed into a new Taskscaler library.
-#### The most important technical details
+This places the concerns of VM lifecycle, VM shape and job routing within
+the plugin. It also places the concern of VM autoscaling into a separate
+component so it can be used by multiple Runner Executors (not just `docker+autoscaling`).
-1. Favor gRPC communication between a plugin and GitLab Runner.
-1. Make it possible to version communication interface and support many versions.
-1. Make Go a primary language for writing plugins but accept other languages too.
+Rationale: [Description of the InstanceGroup / Fleeting proposal](https://gitlab.com/gitlab-org/gitlab-runner/-/issues/28848#note_823430883)
+POC: [Merge request](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/3315)
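
As a rough illustration of the low-level shape described in this proposal, the sketch below defines a hypothetical `InstanceGroup` interface with an in-memory implementation. The interface name, its methods and the `memGroup` type are assumptions made for this example and are not taken from the Fleeting proof of concept. Consuming a VM (connecting to it and running a job) is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// InstanceGroup is a hypothetical, simplified view of a homogeneous VM
// group: it can grow, shrink by removing specific machines, and list what
// it currently holds.
type InstanceGroup interface {
	Increase(n int) error
	Decrease(ids []string) error
	Instances() []string
}

// memGroup is an in-memory stand-in for a cloud provider plugin.
type memGroup struct {
	next int
	ids  []string
}

// Increase adds n new machines with generated IDs.
func (g *memGroup) Increase(n int) error {
	if n < 0 {
		return errors.New("n must not be negative")
	}
	for i := 0; i < n; i++ {
		g.next++
		g.ids = append(g.ids, fmt.Sprintf("vm-%d", g.next))
	}
	return nil
}

// Decrease removes the named machines; unknown IDs are ignored.
func (g *memGroup) Decrease(ids []string) error {
	remove := map[string]bool{}
	for _, id := range ids {
		remove[id] = true
	}
	kept := g.ids[:0]
	for _, id := range g.ids {
		if !remove[id] {
			kept = append(kept, id)
		}
	}
	g.ids = kept
	return nil
}

// Instances returns a copy so callers cannot mutate the group's state.
func (g *memGroup) Instances() []string {
	return append([]string(nil), g.ids...)
}

func main() {
	var g InstanceGroup = &memGroup{}
	_ = g.Increase(3)
	_ = g.Decrease([]string{"vm-2"})
	fmt.Println(g.Instances()) // prints [vm-1 vm-3]
}
```

Because the caller names the machines to remove, the scale-down decision stays outside the plugin, which matches the split between the plugin and a separate autoscaling component described above.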
## Status
@@ -252,26 +422,26 @@ Proposal:
<!-- vale gitlab.Spelling = NO -->
-| Role | Who
-|------------------------------|------------------------------------------|
-| Authors | Grzegorz Bizon, Tomasz Maczukin |
-| Architecture Evolution Coach | Kamil Trzciński |
-| Engineering Leader | Elliot Rushton, Cheryl Li |
-| Product Manager | Darren Eastman, Jackie Porter |
-| Domain Expert / Runner | Arran Walker |
+| Role | Who |
+|------------------------------|-------------------------------------------------|
+| Authors | Grzegorz Bizon, Tomasz Maczukin, Joseph Burnett |
+| Architecture Evolution Coach | Kamil Trzciński |
+| Engineering Leader | Elliot Rushton, Cheryl Li |
+| Product Manager | Darren Eastman, Jackie Porter |
+| Domain Expert / Runner | Arran Walker |
DRIs:
-| Role | Who
-|------------------------------|------------------------|
-| Leadership | Elliot Rushton |
-| Product | Darren Eastman |
-| Engineering | Tomasz Maczukin |
+| Role | Who |
+|-------------|-----------------|
+| Leadership | Elliot Rushton |
+| Product | Darren Eastman |
+| Engineering | Tomasz Maczukin |
Domain experts:
-| Area | Who
-|------------------------------|------------------------|
-| Domain Expert / Runner | Arran Walker |
+| Area | Who |
+|------------------------|--------------|
+| Domain Expert / Runner | Arran Walker |
<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/work_items/index.md b/doc/architecture/blueprints/work_items/index.md
new file mode 100644
index 00000000000..42864e7112e
--- /dev/null
+++ b/doc/architecture/blueprints/work_items/index.md
@@ -0,0 +1,130 @@
+---
+stage: Plan
+group: Project Management
+comments: false
+description: 'Work Items'
+---
+
+# Work Items
+
+DISCLAIMER:
+This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
+
+This document is a work-in-progress. Some aspects are not documented, though we expect to add them in the future.
+
+## Summary
+
+Work Items is a new architecture created to support the various types of built and planned entities throughout the product, such as issues, requirements, and incidents. It will make these types easy to extend and customize while sharing the same core functionality.
+
+## Terminology
+
+We use the following terms to describe components and properties of the Work items architecture.
+
+### Work Item
+
+Base type for issue, requirement, test case, incident and task (this list is planned to extend in the future). Different work items have the same set of base properties but their [widgets](#work-item-widgets) list is different.
+
+### Work Item types
+
+A set of predefined types for different categories of work items. Currently, the available types are:
+
+- Issue
+- Incident
+- Test case
+- Requirement
+- Task
+
+#### Work Item properties
+
+Every Work Item type has the following common properties:
+
+- `id` - a unique Work Item global identifier;
+- `iid` - internal ID of the Work Item, relative to the parent workspace (currently workspace can only be a project);
+- Work Item type;
+- properties related to Work Item modification time: `createdAt`, `updatedAt`, `closedAt`;
+- title string;
+- Work Item confidentiality state;
+- Work Item state (can be open or closed);
+- lock version, incremented each time the work item is updated;
+- permissions for the current user on the resource;
+- a list of [Work Item widgets](#work-item-widgets).
+
+### Work Item widgets
+
+All Work Item types share the same pool of predefined widgets and are customized by which widgets are active on a specific type. The list of widgets for any certain Work Item type is currently predefined and is not customizable. However, in the future we plan to allow users to create new Work Item types and define a set of widgets for them.
+
+### Work Item widget types (updating)
+
+- assignees
+- description
+- hierarchy
+- iteration
+- labels
+- start and due date
+- verification status
+- weight
+
+### Work Item view
+
+The new frontend view that renders Work Items of any type using global Work Item `id` as an identifier.
+
+### Task
+
+Task is a special Work Item type. Tasks can be added to issues as child items and can be displayed in the modal on the issue view.
+
+## Motivation
+
+The main goal of Work Items is to enhance the planning toolset to become the most popular collaboration tool for knowledge workers in any industry.
+
+- Puts all like-items (issues, incidents, epics, test cases, etc.) on a standard platform to simplify maintenance and increase consistency in experience.
+- Enables first-class support of common planning concepts to lower complexity and allow users to plan without learning GitLab-specific nuances.
+
+## Goals
+
+### Scalability
+
+Currently, different entities like issues, epics, and merge requests share many similar features, but these features are implemented separately for every entity type. This makes implementing new features or refactoring existing ones problematic: for example, if we plan to add a new feature to issues and incidents, we would need to implement it separately on the issue and incident types. With work items, any new feature is implemented via widgets for all existing types, which makes the architecture more scalable.
+
+### Flexibility
+
+With the existing implementation, we have a rigid structure for issuables, merge requests, epics, and so on. This structure is defined on both backend and frontend, so any change requires a coordinated effort. Also, it would be very hard to make this structure customizable for the user without introducing a set of flags to enable/disable any existing feature. The Work Item architecture allows the frontend to display Work Item widgets in a flexible way: whatever is present in Work Item widgets will be rendered on the page. This allows us to make changes fast and makes the structure far more flexible. For example, if we want to stop displaying labels on the Incident page, we remove the labels widget from the Incident Work Item type on the backend. Also, in the future this will allow users to define the set of widgets they want to see on custom Work Item types.
+
+### A consistent experience
+
+As much as we try to have consistent behavior for similar features on different entities, we still have differences in the implementation. For example, updating labels on a merge request via the GraphQL API can be done with the dedicated `setMergeRequestLabels` mutation, while for the issue we call the more coarse-grained `updateIssue`. This provides an inconsistent experience for both frontend and external API users. As a result, epics, issues, requirements, and others all have similar but just subtle enough differences in common interactions that the user needs to hold a complicated mental model of how they each behave.
+
+The Work Item architecture is designed to make all the features consistent across all the types by implementing them as Work Item widgets.
+
+## High-level architecture problems to solve
+
+- how can we bypass groups and projects consolidation to migrate epics to Work Item type;
+- dealing with parent-child relationships between different Work Item types (epic > issue > task) and between items of the same type (issue > issue);
+- [implementing custom Work Item types and custom widgets](https://gitlab.com/gitlab-org/gitlab/-/issues/335110).
+
+### Links
+
+- [Work items initiative epic](https://gitlab.com/groups/gitlab-org/-/epics/6033)
+- [Tasks roadmap](https://gitlab.com/groups/gitlab-org/-/epics/7103?_gl=1*zqatx*_ga*NzUyOTc3NTc1LjE2NjEzNDcwMDQ.*_ga_ENFH3X7M5Y*MTY2MjU0MDQ0MC43LjEuMTY2MjU0MDc2MC4wLjAuMA..)
+- [Work Item "Vision" Prototype](https://gitlab.com/gitlab-org/gitlab/-/issues/368607)
+- [Work Item Discussions](https://gitlab.com/groups/gitlab-org/-/epics/7060)
+
+### Who
+
+| Role | Who |
+|------------------------------|-----------------------------|
+| Author | Natalia Tepluhina |
+| Architecture Evolution Coach | Kamil Trzciński |
+| Engineering Leader | TBD |
+| Product Manager | Gabe Weaver |
+| Domain Expert / Frontend | Natalia Tepluhina |
+| Domain Expert / Backend | Heinrich Lee Yu |
+| Domain Expert / Backend | Jan Provaznik |
+| Domain Expert / Backend | Mario Celi |
+
+DRIs:
+
+| Role | Who |
+|------------------------------|------------------------|
+| Leadership | TBD |
+| Product | Gabe Weaver |
+| Engineering | TBD |