gitlab.com/gitlab-org/gitlab-foss.git
Diffstat (limited to 'doc/architecture/blueprints/cells/index.md')
-rw-r--r-- doc/architecture/blueprints/cells/index.md 360
1 files changed, 360 insertions, 0 deletions
diff --git a/doc/architecture/blueprints/cells/index.md b/doc/architecture/blueprints/cells/index.md
new file mode 100644
index 00000000000..9938875adb6
--- /dev/null
+++ b/doc/architecture/blueprints/cells/index.md
@@ -0,0 +1,360 @@
+---
+status: accepted
+creation-date: "2022-09-07"
+authors: [ "@ayufan", "@fzimmer", "@DylanGriffith" ]
+coach: "@ayufan"
+approvers: [ "@fzimmer" ]
+owning-stage: "~devops::enablement"
+participating-stages: []
+---
+
+<!-- vale gitlab.FutureTense = NO -->
+
+# Cells
+
+This document is a work-in-progress and represents a very early state of the Cells design. Significant aspects are not documented, though we expect to add them in the future.
+
+Cells is a new architecture for our Software as a Service platform. This architecture is horizontally-scalable, resilient, and provides a more consistent user experience. It may also provide additional features in the future, such as data residency control (regions) and federated features.
+
+For more information about Cells, see also:
+
+- [Glossary](glossary.md)
+- [Goals](goals.md)
+- [Cross-section impact](impact.md)
+
+## Work streams
+
+We can't ship the entire Cells architecture in one go - it is too large.
+Instead, we are defining key work streams required by the project.
+
+Not all objectives need to be fulfilled to reach production readiness.
+We expect that some objectives will not be completed by General Availability (GA),
+but that the completed work will be enough to run Cells in production.
+
+### 1. Data access layer
+
+Before Cells can run in production, we need to prepare the codebase to accept the Cells architecture.
+This preparation involves:
+
+- Allowing data sharing between Cells.
+- Updating the tooling for discovering cross-Cell data traversal.
+- Defining code practices for cross-Cell data traversal.
+- Analyzing the data model to define the data affinity.
+
+Under this objective the following steps are expected:
+
+1. **Allow sharing cluster-wide data with a database-level data access layer.**
+
+ Cells can connect to a database containing shared data. For example:
+ application settings, users, or routing information.
+
+1. **Evaluate the efficiency of database-level access vs. API-oriented access layer.**
+
+ Reconsider the consequences of database-level data access for data migration, resiliency of updates and of interconnected systems when we share only a subset of data.
+
+1. **Cluster-unique identifiers**
+
+ Every object has a unique identifier that can be used to access data across the cluster. The IDs for allocated projects, issues and any other objects are cluster-unique.
+
+1. **Cluster-wide deletions**
+
+ If entities deleted in Cell 2 are cross-referenced, they are properly deleted or nullified across the cluster. We will likely reuse the existing [loose foreign keys](../../../development/database/loose_foreign_keys.md) and extend them with cross-Cell data removal.
+
+1. **Data access layer**
+
+ Ensure that a stable, versioned data access layer that allows sharing cluster-wide data is implemented.
+
+1. **Database migration**
+
+ Ensure that migrations can be run independently between Cells, and that we safely handle migrations of shared data in a way that does not impact other Cells.
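
As an illustration only (not part of the blueprint), one way to obtain cluster-unique identifiers is for each Cell to draw IDs from disjoint ranges leased from a cluster-wide allocator. The blueprint does not specify a mechanism; `ClusterRangeAllocator`, `CellSequence`, and the range size below are hypothetical names and values chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ClusterRangeAllocator:
    """Hypothetical cluster-wide service that leases disjoint ID ranges."""

    range_size: int = 1000
    next_start: int = 1

    def lease(self) -> Tuple[int, int]:
        # Hand out the half-open range [start, end) and advance the cursor,
        # so that no two leases ever overlap.
        start = self.next_start
        self.next_start += self.range_size
        return start, self.next_start


@dataclass
class CellSequence:
    """Per-Cell sequence that draws IDs from ranges leased from the cluster."""

    allocator: ClusterRangeAllocator
    current: int = 0
    limit: int = 0

    def next_id(self) -> int:
        # Lease a fresh range when the local one is exhausted (or never leased).
        if self.current >= self.limit:
            self.current, self.limit = self.allocator.lease()
        value = self.current
        self.current += 1
        return value


if __name__ == "__main__":
    allocator = ClusterRangeAllocator(range_size=3)
    cell_1 = CellSequence(allocator)
    cell_2 = CellSequence(allocator)
    print([cell_1.next_id() for _ in range(5)])
    print([cell_2.next_id() for _ in range(5)])
```

In this sketch a Cell only contacts the allocator once per range, so ID generation stays mostly Cell-local while the IDs remain unique across the cluster.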
+
+### 2. Essential workflows
+
+To make Cells viable, we need to define and support
+essential workflows before we can consider Cells
+to be of Beta quality. Essential workflows are meant
+to cover the majority of application functionality
+that makes the product mostly usable, but with some caveats.
+
+The current approach is to define workflows from top to bottom.
+The order defines the presumed priority of the items.
+This list is not exhaustive, because we expect
+other teams to help fix their workflows after
+the initial phase, in which we fix the fundamental ones.
+
+To consider a project ready for the Beta phase, it is expected
+that all features defined below are supported by Cells.
+In the cases listed below, the workflows define a set of tables
+to be properly attributed to the feature. In some cases,
+a table with an ambiguous usage has to be broken down.
+For example: `uploads` are used to store user avatars,
+as well as uploaded attachments for comments. It would be expected
+that `uploads` is split into `uploads` (describing group/project-level attachments)
+and `global_uploads` (describing, for example, user avatars).
+
+Except for the initial 2-3 quarters, this work is highly parallel.
+It would be expected that **group::tenant scale** would help other
+teams to fix their feature set to work with Cells. The first 2-3 quarters
+would be required to define a general split of data and build required tooling.
+
+1. **Instance-wide settings are shared across cluster.**
+
+ For the most part, the Admin Area section is shared across a cluster.
+
+1. **User accounts are shared across cluster.**
+
+ The purpose is to make `users` cluster-wide.
+
+1. **User can create group.**
+
+ The purpose is to perform a targeted decomposition of `users` and `namespaces`, because the `namespaces` will be stored locally in the Cell.
+
+1. **User can create project.**
+
+ The purpose is to perform a targeted decomposition of `users` and `projects`, because the `projects` will be stored locally in the Cell.
+
+1. **User can change profile avatar that is shared in cluster.**
+
+ The purpose is to fix global uploads that are shared in the cluster.
+
+1. **User can push to Git repository.**
+
+ The purpose is to ensure that essential joins from the projects table are properly attributed to be
+ Cell-local, and as a result the essential Git workflow is supported.
+
+1. **User can run CI pipeline.**
+
+ The purpose is that `ci_pipelines` (like `ci_stages`, `ci_builds`, `ci_job_artifacts`) and adjacent tables are properly attributed to be Cell-local.
+
+1. **User can create issue, merge request, and merge it after it is green.**
+
+ The purpose is to ensure that `issues` and `merge_requests` are properly attributed to be Cell-local.
+
+1. **User can manage group and project members.**
+
+ The `members` table is properly attributed to be either `Cell-local` or `cluster-wide`.
+
+1. **User can manage instance-wide runners.**
+
+ The purpose is to scope all CI Runners to be Cell-local. Instance-wide runners in fact become Cell-local runners. The expectation is to provide a user interface to view and manage all runners per Cell, instead of per cluster.
+
+1. **User is part of organization and can only see information from the organization.**
+
+ The purpose is to have many organizations per Cell, but never have a single organization spanning across many Cells. This is required to ensure that information shown within an organization is isolated, and does not require fetching information from other Cells.
+
+### 3. Additional workflows
+
+Some of these additional workflows might need to be supported, depending on the group decision.
+This list is not an exhaustive account of the work that needs to be done.
+
+1. **User can use all group-level features.**
+1. **User can use all project-level features.**
+1. **User can share groups with other groups in an organization.**
+1. **User can create system webhook.**
+1. **User can upload and manage packages.**
+1. **User can manage security detection features.**
+1. **User can manage Kubernetes integration.**
+1. TBD
+
+### 4. Routing layer
+
+The routing layer is meant to offer a consistent user experience where all Cells are presented
+under a single domain (for example, `gitlab.com`), instead of
+having to navigate to separate domains.
+
+The user will be able to use `https://gitlab.com` to access Cell-enabled GitLab. Depending
+on the URL accessed, the request will be transparently proxied to the correct Cell that can serve this particular
+information. For example:
+
+- All requests going to `https://gitlab.com/users/sign_in` are randomly distributed to all Cells.
+- All requests going to `https://gitlab.com/gitlab-org/gitlab/-/tree/master` are always directed to Cell 5, for example.
+- All requests going to `https://gitlab.com/my-username/my-project` are always directed to Cell 1.
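
As an illustration only (not part of the blueprint), the three examples above can be sketched as a lookup keyed on the top-level namespace of the URL path. The Cell names, the `NAMESPACE_TO_CELL` table, the `UNBOUND_PREFIXES` tuple, and the fallback for unknown namespaces are assumptions made for this sketch; the blueprint leaves the real classification protocol to the routing work stream.

```python
import random
from typing import Dict, List, Optional
from urllib.parse import urlsplit

# Hypothetical Cells and routing table, mirroring the examples in the text.
CELLS: List[str] = ["cell-1", "cell-2", "cell-5"]
NAMESPACE_TO_CELL: Dict[str, str] = {
    "gitlab-org": "cell-5",
    "my-username": "cell-1",
}
# Paths that are not tied to any Cell and can be served by all of them.
UNBOUND_PREFIXES = ("/users/sign_in",)


def pick_cell(url: str, rng: Optional[random.Random] = None) -> str:
    """Return the Cell that should serve the given URL."""
    rng = rng or random.Random()
    path = urlsplit(url).path
    if path.startswith(UNBOUND_PREFIXES):
        # Requests such as sign-in are distributed randomly across Cells.
        return rng.choice(CELLS)
    top_level = path.strip("/").split("/", 1)[0]
    cell = NAMESPACE_TO_CELL.get(top_level)
    if cell is None:
        # Assumption: an unknown namespace goes to any Cell, which would
        # then classify the request on behalf of the router.
        return rng.choice(CELLS)
    return cell


if __name__ == "__main__":
    print(pick_cell("https://gitlab.com/gitlab-org/gitlab/-/tree/master"))
    print(pick_cell("https://gitlab.com/my-username/my-project"))
    print(pick_cell("https://gitlab.com/users/sign_in"))
```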
+
+1. **Technology.**
+
+ We decide what technology the routing service is written in.
+ The choice depends on the best performing language, and on how and where
+ the routing layer is expected to be deployed. If the service must be
+ multi-cloud, it might have to be deployed to a CDN provider.
+ In that case, the service needs to be written using a technology compatible with the CDN provider.
+
+1. **Cell discovery.**
+
+ The routing service needs to be able to discover and monitor the health of all Cells.
+
+1. **Router endpoints classification.**
+
+ The stateless routing service will fetch and cache information about endpoints
+ from one of the Cells. We need to implement a protocol that will allow us to
+ accurately describe the incoming request (its fingerprint), so it can be classified
+ by one of the Cells, and the results of that can be cached. We also need to implement
+ a mechanism for negative cache and cache eviction.
+
+1. **GraphQL and other ambiguous endpoints.**
+
+ Most endpoints have a unique sharding key: the organization, which directly
+ or indirectly (via a group or project) can be used to classify endpoints.
+ Some endpoints are ambiguous in their usage (they don't encode the sharding key),
+ or the sharding key is stored deep in the payload. In these cases, we need to decide how to handle endpoints like `/api/graphql`.
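
As an illustration only (not part of the blueprint), the classification cache described under "Router endpoints classification" could look roughly like the sketch below: positive results are cached for a longer time, negative results (no Cell owns the fingerprint) for a shorter time, and entries can be evicted explicitly. The class name, the TTL values, and the `classify` callback standing in for a request to a Cell are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# classify(fingerprint) returns a Cell name, or None when no Cell owns it.
Classifier = Callable[[str], Optional[str]]


class ClassificationCache:
    """Caches request classifications, including negative results."""

    def __init__(
        self,
        classify: Classifier,
        ttl: float = 60.0,
        negative_ttl: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._classify = classify
        self._ttl = ttl
        self._negative_ttl = negative_ttl
        self._clock = clock
        # fingerprint -> (cell or None, expiry timestamp)
        self._entries: Dict[str, Tuple[Optional[str], float]] = {}

    def lookup(self, fingerprint: str) -> Optional[str]:
        now = self._clock()
        entry = self._entries.get(fingerprint)
        if entry is not None and entry[1] > now:
            return entry[0]  # Fresh hit, positive or negative.
        # Miss or expired entry: ask a Cell to classify the request.
        cell = self._classify(fingerprint)
        ttl = self._ttl if cell is not None else self._negative_ttl
        self._entries[fingerprint] = (cell, now + ttl)
        return cell

    def evict(self, fingerprint: str) -> None:
        # Explicit eviction, for example after routing information changes.
        self._entries.pop(fingerprint, None)
```

Keeping negative entries short-lived limits repeated classification of unknown paths without hiding a newly created group or project for long.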
+
+### 5. Cell deployment
+
+We will run many Cells. To manage them more easily, we need to have consistent
+deployment procedures for Cells, including a way to deploy, manage, migrate,
+and monitor.
+
+We are very likely to use tooling made for [GitLab Dedicated](https://about.gitlab.com/dedicated/)
+with its control planes.
+
+1. **Extend GitLab Dedicated to support GCP.**
+1. TBD
+
+### 6. Migration
+
+When we reach production and are able to store new organizations on new Cells, we need
+to be able to divide big Cells into many smaller ones.
+
+1. **Use GitLab Geo to clone Cells.**
+
+ The purpose is to use GitLab Geo to clone Cells.
+
+1. **Split Cells by cloning them.**
+
+ Once a Cell is cloned, we change the routing information for organizations.
+ Each organization will encode a `cell_id`. Updating the `cell_id` automatically
+ makes the given Cell authoritative for handling the traffic of the given organization.
+
+1. **Delete redundant data from previous Cells.**
+
+ Since the organization is now stored on many Cells, once we change `cell_id`
+ we will have to remove data from all other Cells based on `organization_id`.
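
As an illustration only (not part of the blueprint), the three migration steps above can be modelled with in-memory stand-ins: a dictionary of rows per Cell and a routing table that maps `organization_id` to `cell_id`. The function names and data shapes are hypothetical, and the copy in `clone_cell` merely stands in for replication with GitLab Geo.

```python
from typing import Dict, List

Row = Dict[str, object]


def clone_cell(cells: Dict[int, List[Row]], source: int, target: int) -> None:
    # Step 1: clone all rows of the source Cell into the target Cell.
    cells[target] = [dict(row) for row in cells[source]]


def reassign(routing: Dict[int, int], organization_id: int, target: int) -> None:
    # Step 2: updating cell_id makes the target Cell authoritative.
    routing[organization_id] = target


def delete_redundant(cells: Dict[int, List[Row]], routing: Dict[int, int]) -> None:
    # Step 3: on every Cell, keep only rows of organizations it is
    # authoritative for, selecting rows by organization_id.
    for cell_id, rows in cells.items():
        rows[:] = [row for row in rows if routing[row["organization_id"]] == cell_id]


if __name__ == "__main__":
    cells = {
        1: [
            {"organization_id": 10, "name": "project-a"},
            {"organization_id": 20, "name": "project-b"},
        ]
    }
    routing = {10: 1, 20: 1}
    clone_cell(cells, source=1, target=2)
    reassign(routing, organization_id=20, target=2)
    delete_redundant(cells, routing)
    print(cells)
```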
+
+## Availability of the feature
+
+We are following the [Support for Experiment, Beta, and Generally Available features](../../../policy/alpha-beta-support.md).
+
+### 1. Experiment
+
+Expectations:
+
+- We can deploy a Cell on staging or another testing environment by using a separate domain (for example, `cell2.staging.gitlab.com`)
+ using [Cell deployment](#5-cell-deployment) tooling.
+- User can create organization, group and project, and run some of the [essential workflows](#2-essential-workflows).
+- We do not expect to be able to run a router to serve all requests under a single domain.
+- We expect loss of data stored on additional Cells.
+- We expect to tear down and create many new Cells to validate tooling.
+
+### 2. Beta
+
+Expectations:
+
+- We can run many Cells under a single domain (for example, `staging.gitlab.com`).
+- All features defined in [essential workflows](#2-essential-workflows) are supported.
+- Not all aspects of [Routing layer](#4-routing-layer) are finalized.
+- We expect additional Cells to be stable with minimal data loss.
+
+### 3. GA
+
+Expectations:
+
+- We can run many Cells under a single domain (for example, `staging.gitlab.com`).
+- All features defined in [essential workflows](#2-essential-workflows) are supported.
+- All features of [routing layer](#4-routing-layer) are supported.
+- Most of [additional workflows](#3-additional-workflows) are supported.
+- We don't expect to support any of [migration](#6-migration) aspects.
+
+### 4. Post GA
+
+Expectations:
+
+- We support all [additional workflows](#3-additional-workflows).
+- We can [migrate](#6-migration) existing organizations onto new Cells.
+
+## Iteration plan
+
+The delivered iterations will focus on solving particular steps of a given
+key work stream.
+
+We expect the initial iterations to be
+rather slow, because they require substantially more
+changes to prepare the codebase for the data split.
+
+One iteration describes one quarter's worth of work.
+
+1. Iteration 1 - FY24Q1
+
+ - Data access layer: Allow sharing cluster-wide data with a database-level data access layer.
+ - Essential workflows: Initial Admin Area settings are shared across cluster.
+
+1. Iteration 2 - FY24Q2
+
+ - Essential workflows: User accounts are shared across cluster.
+ - Essential workflows: User can create group.
+
+1. Iteration 3 - FY24Q3
+
+ - Essential workflows: User can create project.
+ - Essential workflows: User can push to Git repository.
+ - Cell deployment: Extend GitLab Dedicated to support GCP
+ - Routing: Technology.
+
+1. Iteration 4 - FY24Q4
+
+ - Essential workflows: User can run CI pipeline.
+ - Essential workflows: User can create issue, merge request, and merge it after it is green.
+ - Data access layer: Evaluate the efficiency of database-level access vs. API-oriented access layer
+ - Data access layer: Cluster-unique identifiers.
+ - Routing: Cell discovery.
+ - Routing: Router endpoints classification.
+
+1. Iteration 5 - FY25Q1
+
+ - TBD
+
+## Technical Proposals
+
+The Cells architecture has long-lasting implications for data processing, location, scalability, and the GitLab architecture.
+This section links all different technical proposals that are being evaluated.
+
+- [Stateless Router That Uses a Cache to Pick Cell and Is Redirected When Wrong Cell Is Reached](proposal-stateless-router-with-buffering-requests.md)
+
+- [Stateless Router That Uses a Cache to Pick Cell and pre-flight `/api/v4/cells/learn`](proposal-stateless-router-with-routes-learning.md)
+
+## Impacted features
+
+The Cells architecture will impact many features, requiring some of them to be rewritten or changed significantly.
+This is the list of known affected features with the proposed solutions.
+
+- [Cells: Git Access](cells-feature-git-access.md)
+- [Cells: Data Migration](cells-feature-data-migration.md)
+- [Cells: Database Sequences](cells-feature-database-sequences.md)
+- [Cells: GraphQL](cells-feature-graphql.md)
+- [Cells: Organizations](cells-feature-organizations.md)
+- [Cells: Router Endpoints Classification](cells-feature-router-endpoints-classification.md)
+- [Cells: Schema changes (Postgres and Elasticsearch migrations)](cells-feature-schema-changes.md)
+- [Cells: Backups](cells-feature-backups.md)
+- [Cells: Global Search](cells-feature-global-search.md)
+- [Cells: CI Runners](cells-feature-ci-runners.md)
+- [Cells: Admin Area](cells-feature-admin-area.md)
+- [Cells: Secrets](cells-feature-secrets.md)
+- [Cells: Container Registry](cells-feature-container-registry.md)
+- [Cells: Contributions: Forks](cells-feature-contributions-forks.md)
+- [Cells: Personal Namespaces](cells-feature-personal-namespaces.md)
+- [Cells: Dashboard: Projects, Todos, Issues, Merge Requests, Activity, ...](cells-feature-dashboard.md)
+- [Cells: Snippets](cells-feature-snippets.md)
+- [Cells: Uploads](cells-feature-uploads.md)
+- [Cells: GitLab Pages](cells-feature-gitlab-pages.md)
+- [Cells: Agent for Kubernetes](cells-feature-agent-for-kubernetes.md)
+
+## Decision log
+
+- 2022-03-15: Google Cloud as the cloud service. For details, see [issue 396641](https://gitlab.com/gitlab-org/gitlab/-/issues/396641#note_1314932272).
+
+## Links
+
+- [Internal Pods presentation](https://docs.google.com/presentation/d/1x1uIiN8FR9fhL7pzFh9juHOVcSxEY7d2_q4uiKKGD44/edit#slide=id.ge7acbdc97a_0_155)
+- [Internal link to all diagrams](https://drive.google.com/file/d/13NHzbTrmhUM-z_Bf0RjatUEGw5jWHSLt/view?usp=sharing)
+- [Cells Epic](https://gitlab.com/groups/gitlab-org/-/epics/7582)
+- [Database Group investigation](https://about.gitlab.com/handbook/engineering/development/enablement/data_stores/database/doc/root-namespace-sharding.html)
+- [Shopify Pods architecture](https://shopify.engineering/a-pods-architecture-to-allow-shopify-to-scale)
+- [Opstrace architecture](https://gitlab.com/gitlab-org/opstrace/opstrace/-/blob/main/docs/architecture/overview.md)