gitlab.com/gitlab-org/gitlab-foss.git

Diffstat (limited to 'doc/architecture')
 doc/architecture/blueprints/ci_scale/index.md                             |  69
 doc/architecture/blueprints/cloud_native_gitlab_pages/index.md            |   2
 doc/architecture/blueprints/consolidating_groups_and_projects/index.md    |   2
 doc/architecture/blueprints/container_registry_metadata_database/index.md |   4
 doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md   |   8
 doc/architecture/blueprints/object_storage/index.md                       | 220
 6 files changed, 263 insertions(+), 42 deletions(-)
diff --git a/doc/architecture/blueprints/ci_scale/index.md b/doc/architecture/blueprints/ci_scale/index.md
index 3e9fbc534d5..092f8a7119a 100644
--- a/doc/architecture/blueprints/ci_scale/index.md
+++ b/doc/architecture/blueprints/ci_scale/index.md
@@ -5,7 +5,7 @@ comments: false
description: 'Improve scalability of GitLab CI/CD'
---
-# Next CI/CD scale target: 20M builds per day by 2024
+# CI/CD Scaling
## Summary
@@ -20,13 +20,8 @@ store all the builds in PostgreSQL in `ci_builds` table, and because we are
creating more than [2 million builds each day on GitLab.com](https://docs.google.com/spreadsheets/d/17ZdTWQMnTHWbyERlvj1GA7qhw_uIfCoI5Zfrrsh95zU),
we are reaching database limits that are slowing our development velocity down.
-On February 1st, 2021, a billionth CI/CD job was created and the number of
-builds is growing exponentially. We will run out of the available primary keys
-for builds before December 2021 unless we improve the database model used to
-store CI/CD data.
-
-We expect to see 20M builds created daily on GitLab.com in the first half of
-2024.
+On February 1st, 2021, the number of CI/CD builds created on GitLab.com
+surpassed 1 billion, and the number of builds continues to grow exponentially.
![CI builds cumulative with forecast](ci_builds_cumulative_forecast.png)
@@ -60,8 +55,8 @@ that have the same problem.
The primary keys problem will be tackled by our Database Team.
-Status: As of October 2021 the primary keys in CI tables have been migrated to
-big integers.
+**Status**: As of October 2021 the primary keys in CI tables have been migrated
+to big integers.
### The table is too large
@@ -84,6 +79,14 @@ seem fine in the development environment may not work on GitLab.com. The
difference in the dataset size between the environments makes it difficult to
predict the performance of even the most simple queries.
+Team members and the wider community are struggling to contribute to the
+Verify area, because we have restricted further extension of the `ci_builds`
+table. Our static analysis tools prevent adding more columns to this
+table, and adding new queries is unpredictable because of the size of the
+dataset and the number of queries executed against the table. This
+significantly hinders development velocity and contributes to incidents in
+the production environment.
+
We also expect significant, exponential growth in the upcoming years.
One of the forecasts done using [Facebook's
@@ -94,6 +97,10 @@ sustain in upcoming years.
![CI builds daily forecast](ci_builds_daily_forecast.png)
+**Status**: As of October 2021 we reduced the growth rate of the `ci_builds`
+table by writing build options and variables to the `ci_builds_metadata`
+table. We plan to ship further improvements that will be described in a
+separate blueprint.
+
### Queuing mechanisms are using the large table
Because of how large the table is, mechanisms that we use to build queues of
@@ -114,8 +121,8 @@ table that will accelerate SQL queries used to build
queues](https://gitlab.com/gitlab-org/gitlab/-/issues/322766) and we want to
explore them.
-Status: the new architecture [has been implemented on GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/5909#note_680407908).
-
+**Status**: As of October 2021 the new architecture [has been implemented on
+GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/5909#note_680407908).
+
The following epic tracks making it generally available: [Make the new pending
builds architecture generally available](
https://gitlab.com/groups/gitlab-org/-/epics/6954).
@@ -136,17 +143,8 @@ columns, tables, partitions or database shards.
Effort to improve background migrations will be owned by our Database Team.
-Status: In progress.
-
-### Development velocity is negatively affected
-
-Team members and the wider community members are struggling to contribute the
-Verify area, because we restricted the possibility of extending `ci_builds`
-even further. Our static analysis tools prevent adding more columns to this
-table. Adding new queries is unpredictable because of the size of the dataset
-and the amount of queries executed using the table. This significantly hinders
-the development velocity and contributes to incidents on the production
-environment.
+**Status**: In progress. We plan to ship further improvements that will be
+described in a separate architectural blueprint.
## Proposal
@@ -157,32 +155,34 @@ First, we want to focus on things that are urgently needed right now. We need
to fix the primary keys overflow risk and unblock other teams that are working on
database partitioning and sharding.
-We want to improve situation around bottlenecks that are known already, like
-queuing mechanisms using the large table and things that are holding other
-teams back.
+We want to improve known bottlenecks, like the builds queuing mechanism that
+uses the large table, and other things that are holding other teams back.
Extending CI/CD metrics is important to get a better sense of how the system
performs and what growth we should expect. This will make it easier for us
to identify bottlenecks and perform more advanced capacity planning.
-As we work on first iterations we expect our Database Sharding team and
-Database Scalability Working Group to make progress on patterns we will be able
-to use to partition the large CI/CD dataset. We consider the strong time-decay
-effect, related to the diminishing importance of pipelines with time, as an
-opportunity we might want to seize.
+The next step is to better understand how we can leverage the strong
+time-decay characteristic of CI/CD data. This might help us partition the
+CI/CD dataset to reduce the size of CI/CD database tables.
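+
+To make the idea concrete, here is a minimal sketch of what time-decay
+partitioning could look like, expressed as raw SQL executed from Go for
+illustration. The table and column names are hypothetical; the actual
+strategy will be defined in the follow-up blueprint:
+
+```go
+package main
+
+import (
+	"database/sql"
+	"log"
+	"os"
+
+	_ "github.com/lib/pq" // PostgreSQL driver
+)
+
+func main() {
+	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer db.Close()
+
+	// Range-partition builds by creation time, so that old, rarely read
+	// partitions can later be archived or dropped cheaply.
+	stmts := []string{
+		`CREATE TABLE ci_builds_partitioned (
+			id bigint NOT NULL,
+			status text,
+			created_at timestamptz NOT NULL,
+			PRIMARY KEY (id, created_at)
+		) PARTITION BY RANGE (created_at)`,
+		`CREATE TABLE ci_builds_2021_q4 PARTITION OF ci_builds_partitioned
+			FOR VALUES FROM ('2021-10-01') TO ('2022-01-01')`,
+	}
+	for _, stmt := range stmts {
+		if _, err := db.Exec(stmt); err != nil {
+			log.Fatal(err)
+		}
+	}
+}
+```
+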
## Iterations
Work required to achieve our next CI/CD scaling target is tracked in the
-[GitLab CI/CD 20M builds per day scaling
-target](https://gitlab.com/groups/gitlab-org/-/epics/5745) epic.
+[CI/CD Scaling](https://gitlab.com/groups/gitlab-org/-/epics/5745) epic.
+
+1. ✓ Migrate primary keys to big integers on GitLab.com.
+1. ✓ Implement the new architecture of builds queuing on GitLab.com.
+1. Make the new builds queuing architecture generally available.
+1. Partition CI/CD data using time-decay pattern.
## Status
|             |              |
|-------------|--------------|
| Created at  | 21.01.2021   |
| Approved at | 26.04.2021   |
-| Updated at  | 28.10.2021   |
+| Updated at  | 06.12.2021   |
Status: In progress.
@@ -215,6 +215,7 @@ Domain experts:
| Area                         | Who
|------------------------------|------------------------|
| Domain Expert / Verify       | Fabio Pitino           |
+| Domain Expert / Verify       | Marius Bobin           |
| Domain Expert / Database     | Jose Finotto           |
| Domain Expert / PostgreSQL   | Nikolay Samokhvalov    |
diff --git a/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md b/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
index 60ddfe8ce02..e545e8844ec 100644
--- a/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
+++ b/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
@@ -28,7 +28,7 @@ might be useful to understand why it is important, and what is the roadmap.
## How GitLab Pages Works
GitLab Pages is a daemon designed to serve static content, written in
-[Go](https://golang.org/).
+[Go](https://go.dev/).
Initially, GitLab Pages has been designed to store static content on a local
shared block storage (NFS) in a hierarchical group > project directory
diff --git a/doc/architecture/blueprints/consolidating_groups_and_projects/index.md b/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
index 53357220755..345160dc77f 100644
--- a/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
+++ b/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
@@ -133,7 +133,7 @@ The initial iteration will provide a framework to house features under `Namespac
1. **Conceptual model**: What are the current and future state conceptual models of these features ([see object modeling for designers](https://hpadkisson.medium.com/object-modeling-for-designers-an-introduction-7871bdcf8baf))? These should be documented in Pajamas (example: [Merge Requests](https://design.gitlab.com/objects/merge-request)).
1. **Merge conflicts**: What inconsistencies are there across project, group, and admin levels? How might these be addressed? For an example of how we rationalized this for labels, please see [this issue](https://gitlab.com/gitlab-org/gitlab/-/issues/338820).
-1. **Inheritence & information flow**: How is information inherited across our container hierarchy currently? How might this be impacted if complying with the new [inheritence behavior](https://gitlab.com/gitlab-org/gitlab/-/issues/343316) framework?
+1. **Inheritance & information flow**: How is information inherited across our container hierarchy currently? How might this be impacted if complying with the new [inheritance behavior](https://gitlab.com/gitlab-org/gitlab/-/issues/343316) framework?
1. **Settings**: Where can settings for this feature be found currently? How will these be impacted by `Namespaces`?
1. **Access**: Who can access this feature and is that impacted by the new container structure? Are there any role or privacy considerations?
1. **Tier**: Is there any tier functionality that is differentiated by projects and groups?
diff --git a/doc/architecture/blueprints/container_registry_metadata_database/index.md b/doc/architecture/blueprints/container_registry_metadata_database/index.md
index 7bbaefb8e1e..a38a8727dc4 100644
--- a/doc/architecture/blueprints/container_registry_metadata_database/index.md
+++ b/doc/architecture/blueprints/container_registry_metadata_database/index.md
@@ -18,7 +18,7 @@ For GitLab.com and for GitLab customers, the Container Registry is a critical co
## Current Architecture
-The Container Registry is a single [Go](https://golang.org/) application. Its only dependency is the storage backend on which images and metadata are stored.
+The Container Registry is a single [Go](https://go.dev/) application. Its only dependency is the storage backend on which images and metadata are stored.
```mermaid
graph LR
@@ -146,7 +146,7 @@ The interaction between the registry and its clients, including GitLab Rails and
### Database
-Following the GitLab [Go standards and style guidelines](../../../development/go_guide), no ORM is used to manage the database, only the [`database/sql`](https://golang.org/pkg/database/sql/) package from the Go standard library, a PostgreSQL driver ([`lib/pq`](https://pkg.go.dev/github.com/lib/pq?tab=doc)) and raw SQL queries, over a TCP connection pool.
+Following the GitLab [Go standards and style guidelines](../../../development/go_guide), no ORM is used to manage the database, only the [`database/sql`](https://pkg.go.dev/database/sql) package from the Go standard library, a PostgreSQL driver ([`lib/pq`](https://pkg.go.dev/github.com/lib/pq?tab=doc)) and raw SQL queries, over a TCP connection pool.
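+
+For illustration, a minimal sketch of this approach (the table and column
+names are hypothetical, not the registry's actual schema):
+
+```go
+package main
+
+import (
+	"database/sql"
+	"fmt"
+	"log"
+	"os"
+	"time"
+
+	_ "github.com/lib/pq" // registers the "postgres" driver
+)
+
+func main() {
+	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer db.Close()
+
+	// database/sql manages the TCP connection pool itself.
+	db.SetMaxOpenConns(10)
+	db.SetConnMaxLifetime(30 * time.Minute)
+
+	// Raw SQL instead of an ORM, per the Go guidelines.
+	var count int
+	err = db.QueryRow(
+		`SELECT count(*) FROM manifests WHERE repository_id = $1`, 42,
+	).Scan(&count)
+	if err != nil {
+		log.Fatal(err)
+	}
+	fmt.Println("manifests:", count)
+}
+```
+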
The design and development of the registry database adhere to the GitLab [database guidelines](../../../development/database/). Being a Go application, the required tooling to support the database will have to be developed, such as for running database migrations.
diff --git a/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md b/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
index fb71707c146..754988487de 100644
--- a/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
+++ b/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
@@ -9,7 +9,7 @@ description: 'GitLab to Kubernetes communication'
# GitLab to Kubernetes communication **(FREE)**
The goal of this document is to define how GitLab can communicate with Kubernetes
-and in-cluster services through the GitLab Kubernetes Agent.
+and in-cluster services through the GitLab Agent.
## Challenges
@@ -48,7 +48,7 @@ are stored on the GitLab side and this is yet another security concern for our c
For more discussion on these issues, read
[issue #212810](https://gitlab.com/gitlab-org/gitlab/-/issues/212810).
-## GitLab Kubernetes Agent epic
+## GitLab Agent epic
To address these challenges and provide some new features, the Configure group
is building an active in-cluster component that inverts the
@@ -62,12 +62,12 @@ The customer does not need to provide any credentials to GitLab, and
is in full control of what permissions the agent has.
For more information, visit the
-[GitLab Kubernetes Agent repository](https://gitlab.com/gitlab-org/cluster-integration/gitlab-agent) or
+[GitLab Agent repository](https://gitlab.com/gitlab-org/cluster-integration/gitlab-agent) or
[the epic](https://gitlab.com/groups/gitlab-org/-/epics/3329).
### Request routing
-Agents connect to the server-side component called GitLab Kubernetes Agent Server
+Agents connect to the server-side component called GitLab Agent Server
(`gitlab-kas`) and keep an open connection that waits for commands. The
difficulty with the approach is in routing requests from GitLab to the correct agent.
Each cluster may contain multiple logical agents, and each may be running as multiple
diff --git a/doc/architecture/blueprints/object_storage/index.md b/doc/architecture/blueprints/object_storage/index.md
new file mode 100644
index 00000000000..a79374d60bd
--- /dev/null
+++ b/doc/architecture/blueprints/object_storage/index.md
@@ -0,0 +1,220 @@
+---
+stage: none
+group: unassigned
+comments: false
+description: 'Object storage: direct_upload consolidation - architecture blueprint.'
+---
+
+# Object storage: `direct_upload` consolidation
+
+## Abstract
+
+GitLab stores three classes of user data: database records, Git
+repositories, and user-uploaded files (which are referred to as
+file storage throughout the blueprint).
+
+The user and contributor experience for our file
+storage has room for significant improvement:
+
+- The initial GitLab setup experience requires the creation and configuration
+  of 13 buckets, instead of just one.
+- Features using file storage require contributors to think about both local
+ storage and object storage, which leads to friction and
+ complexity. This often results in broken features and security issues.
+- Contributors who work on file storage often also have to write code
+ for Workhorse, Omnibus, and cloud native GitLab (CNG).
+
+## Problem definition
+
+Object storage is a fundamental component of GitLab, providing the
+underlying implementation for shared, distributed, highly-available
+(HA) file storage.
+
+Over time, we have built support for object storage across the
+application, solving specific problems in a [multitude of
+iterations](https://about.gitlab.com/company/team/structure/working-groups/object-storage/#company-efforts-on-uploads). This
+has led to increased complexity across the board, from development
+(new features and bug fixes) to installation:
+
+- New GitLab installations require the creation and configuration of
+ several object storage buckets instead of just one, as each group of
+ features requires its own. This has an impact on the installation
+ experience and new feature adoption, and takes us further away from
+ boring solutions.
+- The release of cloud native GitLab required the removal of NFS
+  shared storage and the development of direct upload, a feature that
+  was expanded, milestone after milestone, to several types of uploads,
+  but never enabled globally.
+- Today, GitLab supports both local storage and object storage. Local
+  storage only works on single-box installations or with an NFS, which
+  [we no longer recommend](../../../administration/nfs.md) to our
+  users and is no longer in use on GitLab.com.
+- Understanding all the moving parts and the flow is extremely
+  complicated: CarrierWave, Fog, and the Go S3/Azure SDKs are all in
+  use at once, which complicates testing as well.
+- Fog and CarrierWave are not maintained to the level of the native
+ SDKs (for example, AWS S3 SDK), so we have to maintain or monkey
+ patch those tools to support requested customer features
+ (for example, [issue #242245](https://gitlab.com/gitlab-org/gitlab/-/issues/242245))
+ that would normally be "free".
+- In many cases, we copy around object storage files needlessly
+ (for example, [issue #285597](https://gitlab.com/gitlab-org/gitlab/-/issues/285597)).
+ Large files (LFS, packages, and so on) are slow to finalize or don't work
+ at all as a result.
+
+## Improvements over the current situation
+
+The following is a brief description of the main directions we can take to
+remove the pain points affecting our object storage implementation.
+
+This is also available as [a YouTube
+video](https://youtu.be/X9V_w8hsM8E) recorded for the [Object Storage
+Working
+Group](https://about.gitlab.com/company/team/structure/working-groups/object-storage/).
+
+### Simplify GitLab architecture by shipping MinIO
+
+In the beginning, object storage support was a Premium feature, not
+part of our CE distribution. Because of that, we had to support both
+local storage and object storage.
+
+With local storage, there is the assumption of a shared storage
+between components. This can be achieved by having a single-box
+installation, without HA, or with an NFS, which [we no longer
+recommend](../../../administration/nfs.md).
+
+We have a testing gap on object storage: it requires Workhorse
+and MinIO, which are not present in our pipelines, so too much is
+replaced by mock implementations. Furthermore, the presence of a
+shared disk, both in CI and in local development, often hides broken
+implementations until we deploy to an HA environment.
+
+Shipping MinIO as part of the product will reduce the differences
+between a cloud and a local installation, standardizing our file
+storage on a single technology.
+
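+As an illustration, the same S3-compatible client code can then target a
+local MinIO in development and a cloud object store in production; only the
+endpoint and credentials change. A sketch with hypothetical endpoint,
+credentials, bucket, and paths:
+
+```go
+package main
+
+import (
+	"context"
+	"log"
+
+	"github.com/minio/minio-go/v7"
+	"github.com/minio/minio-go/v7/pkg/credentials"
+)
+
+func main() {
+	// Point this at MinIO locally or at any S3-compatible endpoint in
+	// production; the calling code stays identical.
+	client, err := minio.New("localhost:9000", &minio.Options{
+		Creds:  credentials.NewStaticV4("minioadmin", "minioadmin", ""),
+		Secure: false,
+	})
+	if err != nil {
+		log.Fatal(err)
+	}
+
+	_, err = client.FPutObject(context.Background(), "uploads",
+		"artifacts/1/job.log", "/tmp/job.log", minio.PutObjectOptions{})
+	if err != nil {
+		log.Fatal(err)
+	}
+}
+```
+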
+The removal of local disk operations will reduce the complexity of
+development as well as mitigate several security attack vectors, as
+we will no longer write user-provided data to local storage.
+
+It will also reduce human error, as we will always run local object
+storage in development mode, and any local disk access should
+raise a red flag during merge request review.
+
+This effort is described in [this epic](https://gitlab.com/groups/gitlab-org/-/epics/6099).
+
+### Enable direct upload by default on every upload
+
+Because every group of features requires its own bucket, we don't have
+direct upload enabled everywhere. Contributing a new upload requires
+coding it in both Ruby on Rails and Go.
+
+Implementing a new feature that does not yet have a dedicated bucket
+requires the developer to also create a merge request in Omnibus
+and CNG, as well as coordinate with SREs to configure the new bucket
+for our own environments.
+
+This also slows down feature adoption, because our users need to
+reconfigure GitLab and prepare a new bucket in their
+infrastructure. It also makes the initial installation more complex,
+feature after feature.
+
+Implementing direct upload by default, with a
+[consolidated object storage configuration](../../../administration/object_storage.md#consolidated-object-storage-configuration),
+will reduce the number of merge requests needed to ship a new feature
+from four to only one. It will also remove the need for SRE
+intervention, as the bucket will always be the same.
+
+This will simplify our development and review processes, as well as
+the GitLab configuration file. And every user will immediately have
+access to new features without infrastructure chores.
+
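+To illustrate the mechanism, direct upload means the client-facing
+component (Workhorse) streams the request body straight to object storage
+through a pre-signed URL issued by Rails, so user data never touches the
+local disk. A minimal sketch, with a hypothetical URL and file path:
+
+```go
+package main
+
+import (
+	"fmt"
+	"log"
+	"net/http"
+	"os"
+)
+
+func main() {
+	f, err := os.Open("/tmp/artifact.zip")
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer f.Close()
+
+	stat, err := f.Stat()
+	if err != nil {
+		log.Fatal(err)
+	}
+
+	// PUT the file directly to the pre-signed object storage URL.
+	req, err := http.NewRequest(http.MethodPut, os.Getenv("PRESIGNED_URL"), f)
+	if err != nil {
+		log.Fatal(err)
+	}
+	req.ContentLength = stat.Size()
+
+	resp, err := http.DefaultClient.Do(req)
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer resp.Body.Close()
+	fmt.Println("upload status:", resp.Status)
+}
+```
+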
+### Simplify object storage code
+
+Our implementation is built on top of a 3rd-party framework where
+every object storage client is a 3rd-party library. Unfortunately, some
+of them are unmaintained. [We have customers who cannot push 5GB Git
+LFS objects](https://gitlab.com/gitlab-org/gitlab/-/issues/216442),
+but with such a vital feature implemented in 3rd-party libraries we
+are slowed down in fixing it, and we also rely on external maintainers
+to merge and release fixes.
+
+Before the introduction of direct upload, using the
+[CarrierWave](https://github.com/carrierwaveuploader/carrierwave)
+library, _"a gem that provides a simple and extremely flexible way to
+upload files from Ruby applications"_, was the boring solution.
+However, this is no longer our use case, as we upload files from
+Workhorse, and we had to [patch CarrierWave's
+internals](https://gitlab.com/gitlab-org/gitlab/-/issues/285597#note_452696638)
+to support direct upload.
+
+A brief proposal covering CarrierWave removal and a new streamlined
+internal upload API is described
+[in this issue comment](https://gitlab.com/gitlab-org/gitlab/-/issues/213288#note_325358026).
+
+Ideally, we wouldn't need to duplicate object storage clients in Go
+and Ruby. By removing CarrierWave, we can make use of the officially
+supported native clients when the provider S3 compatibility level is
+not sufficient.
+
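+For example, with the official AWS SDK for Go (a sketch; the bucket, key,
+and file path are hypothetical), multipart handling for large files such as
+5GB LFS objects comes with the client:
+
+```go
+package main
+
+import (
+	"log"
+	"os"
+
+	"github.com/aws/aws-sdk-go/aws"
+	"github.com/aws/aws-sdk-go/aws/session"
+	"github.com/aws/aws-sdk-go/service/s3/s3manager"
+)
+
+func main() {
+	sess := session.Must(session.NewSession(&aws.Config{
+		Region: aws.String("us-east-1"),
+	}))
+
+	f, err := os.Open("/tmp/lfs-object.bin")
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer f.Close()
+
+	// s3manager switches to multipart upload for large bodies, exactly
+	// the case that is painful to support through unmaintained layers.
+	uploader := s3manager.NewUploader(sess)
+	_, err = uploader.Upload(&s3manager.UploadInput{
+		Bucket: aws.String("gitlab-uploads"),
+		Key:    aws.String("lfs/objects/abc123"),
+		Body:   f,
+	})
+	if err != nil {
+		log.Fatal(err)
+	}
+}
+```
+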
+## Iterations
+
+In this section we list some possible iterations. This is not
+intended to be the final roadmap, but a conversation starter for the
+Object Storage Working Group.
+
+1. Create a new catchall bucket and a unified internal API for
+ authorization without CarrierWave.
+1. Ship MinIO with Omnibus (CNG images already include it).
+1. Expand GitLab-QA to cover all the supported configurations.
+1. Deprecate local disk access.
+1. Deprecate configurations with multiple buckets.
+1. Implement a bucket-to-bucket migration.
+1. Migrate the current CarrierWave uploads to the new implementation.
+1. On the next major release: Remove support for local disk access and
+ configurations with multiple buckets.
+
+### Benefits of the current iteration plan
+
+The current plan is designed to provide tangible benefits from the
+first step.
+
+With the introduction of the catchall bucket, every upload currently
+not subject to direct upload will get its benefits, and new features
+could be shipped with a single merge request.
+
+Shipping MinIO with Omnibus will allow us to default new installations
+to object storage, and Omnibus could take care of creating
+buckets. This will simplify HA installation outside of Kubernetes.
+
+Then we can migrate each CarrierWave uploader to the new
+implementation, up to the point where a GitLab installation
+requires only one bucket.
+
+## Additional reading materials
+
+- [Uploads development documentation: The problem description](../../../development/uploads.md#the-problem-description).
+- [Speed up the monolith, building a smart reverse proxy in Go](https://archive.fosdem.org/2020/schedule/event/speedupmonolith/): a presentation explaining a bit of Workhorse history and the challenges we faced in releasing the first cloud-native installation.
+- [Object Storage improvements epic](https://gitlab.com/groups/gitlab-org/-/epics/483).
+- We are moving to the GraphQL API, but [we do not support direct upload](https://gitlab.com/gitlab-org/gitlab/-/issues/280819).
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role                           | Who                     |
+|--------------------------------|-------------------------|
+| Author                         | Alessio Caiazza         |
+| Architecture Evolution Coach   | Gerardo Lopez-Fernandez |
+| Engineering Leader             | Marin Jankovski         |
+| Domain Expert / Object storage | Stan Hu                 |
+| Domain Expert / Security       | Joern Schneeweisz       |
+
+DRIs:
+
+The DRI for this blueprint is the [Object Storage Working
+Group](https://about.gitlab.com/company/team/structure/working-groups/object-storage/).
+
+<!-- vale gitlab.Spelling = YES -->