1 files changed, 196 insertions, 0 deletions
diff --git a/doc/architecture/blueprints/cells/routing-service.md b/doc/architecture/blueprints/cells/routing-service.md
new file mode 100644
index 00000000000..9efdbdf3f91
--- /dev/null
+++ b/doc/architecture/blueprints/cells/routing-service.md
@@ -0,0 +1,196 @@
+---
+stage: core platform
+group: Tenant Scale
+description: 'Cells: Routing Service'
+---
+
+# Cells: Routing Service
+
+This document describes design goals and architecture of Routing Service
+used by Cells. To better understand where the Routing Service fits
+into architecture take a look at [Deployment Architecture](deployment-architecture.md).
+
+## Goals
+
+The routing layer is meant to offer a consistent user experience where all Cells are presented under a single domain (for example, `gitlab.com`), instead of having to navigate to separate domains.
+
+The user will be able to use `https://gitlab.com` to access Cell-enabled GitLab.
+Depending on the URL access, it will be transparently proxied to the correct Cell that can serve this particular information.
+For example:
+
+- All requests going to `https://gitlab.com/users/sign_in` are randomly distributed to all Cells.
+- All requests going to `https://gitlab.com/gitlab-org/gitlab/-/tree/master` are always directed to Cell 5, for example.
+- All requests going to `https://gitlab.com/my-username/my-project` are always directed to Cell 1.
+
+1. **Technology.**
+
+    We decide what technology the routing service is written in.
+    The choice is dependent on the best performing language, and the expected way and place of deployment of the routing layer.
+    If it is required to make the service multi-cloud it might be required to deploy it to the CDN provider.
+    Then the service needs to be written using a technology compatible with the CDN provider.
+
+1. **Cell discovery.**
+
+    The routing service needs to be able to discover and monitor the health of all Cells.
+
+1. **User can use single domain to interact with many Cells.**
+
+    The routing service will intelligently route all requests to Cells based on the resource being
+    accessed versus the Cell containing the data.
+
+1. **Router endpoints classification.**
+
+    The stateless routing service will fetch and cache information about endpoints from one of the Cells.
+    We need to implement a protocol that will allow us to accurately describe the incoming request (its fingerprint), so it can be classified by one of the Cells, and the results of that can be cached.
+    We also need to implement a mechanism for negative cache and cache eviction.
+
+1. **GraphQL and other ambiguous endpoints.**
+
+    Most endpoints have a unique sharding key: the Organization, which directly or indirectly (via a Group or Project) can be used to classify endpoints.
+    Some endpoints are ambiguous in their usage (they don't encode the sharding key), or the sharding key is stored deep in the payload.
+    In these cases, we need to decide how to handle endpoints like `/api/graphql`.
+
+1. **Small.**
+
+    The Routing Service is configuration-driven and rules-driven, and does not implement any business logic.
+    The maximum size of the project source code in initial phase is 1_000 lines without tests.
+    The reason for the hard limit is to make the Routing Service to not have any special logic,
+    and could be rewritten into any technology in a matter of a few days.
+
+## Requirements
+
+| Requirement   | Description                                                       | Priority |
+|---------------|-------------------------------------------------------------------|----------|
+| Discovery     | needs to be able to discover and monitor the health of all Cells. | high     |
+| Security      | only authorized cells can be routed to                            | high     |
+| Single domain | e.g. GitLab.com                                                   | high     |
+| Caching       | can cache routing information for performance                     | high     |
+| [50 ms of increased latency](#low-latency) |                                       | high     |
+| Path-based    | can make routing decision based on path                           | high     |
+| Complexity    | the routing service should be configuration-driven and small      | high     |
+| Stateless     | does not need database, Cells provide all routing information     | medium   |
+| Secrets-based | can make routing decision based on secret (e.g. JWT)              | medium   |
+| Observability | can use existing observability tooling                            | low      |
+| Self-managed  | can be eventually used by [self-managed](goals.md#self-managed)   | low      |
+| Regional      | can route requests to different [regions](goals.md#regions)       | low       |
+
+### Low Latency
+
+The target latency for routing service **should be less than 50 _ms_**.
+
+Looking at the `urgency: high` request we don't have a lot of headroom on the p50.
+Adding an extra 50 _ms_ allows us to still be in or SLO on the p95 level.
+
+There is 3 primary entry points for the application; [`web`](https://gitlab.com/gitlab-com/runbooks/-/blob/5d8248314b343bef15a4c021ac33978525f809e3/services/service-catalog.yml#L492-537), [`api`](https://gitlab.com/gitlab-com/runbooks/-/blob/5d8248314b343bef15a4c021ac33978525f809e3/services/service-catalog.yml#L18-62), and [`git`](https://gitlab.com/gitlab-com/runbooks/-/blob/5d8248314b343bef15a4c021ac33978525f809e3/services/service-catalog.yml#L589-638).
+Each service is assigned a Service Level Indicator (SLI) based on latency using the [apdex](https://www.apdex.org/wp-content/uploads/2020/09/ApdexTechnicalSpecificationV11_000.pdf) standard.
+The corresponding Service Level Objectives (SLOs) for these SLIs require low latencies for large amount of requests.
+It's crucial to ensure that the addition of the routing layer in front of these services does not impact the SLIs.
+The routing layer is a proxy for these services, and we lack a comprehensive SLI monitoring system for the entire request flow (including components like the Edge network and Load Balancers) we use the SLIs for `web`, `git`, and `api` as a target.
+
+The main SLI we use is the [rails requests](../../../development/application_slis/rails_request.md).
+It has multiple `satisfied` targets (apdex) depending on the [request urgency](../../../development/application_slis/rails_request.md#how-to-adjust-the-urgency):
+
+| Urgency    | Duration in ms |
+|------------|----------------|
+| `:high`    | 250 _ms_       |
+| `:medium`  | 500 _ms_       |
+| `:default` | 1000 _ms_      |
+| `:low`     | 5000 _ms_      |
+
+#### Analysis
+
+The way we calculate the headroom we have is by using the following:
+
+```math
+\mathrm{Headroom}\ {ms} = \mathrm{Satisfied}\ {ms} - \mathrm{Duration}\ {ms}
+```
+
+**`web`**:
+
+| Target Duration | Percentile | Headroom  |
+|-----------------|------------|-----------|
+| 5000 _ms_       | p99        | 4000 _ms_ |
+| 5000 _ms_       | p95        | 4500 _ms_ |
+| 5000 _ms_       | p90        | 4600 _ms_ |
+| 5000 _ms_       | p50        | 4900 _ms_ |
+| 1000 _ms_       | p99        | 500 _ms_  |
+| 1000 _ms_       | p95        | 740 _ms_  |
+| 1000 _ms_       | p90        | 840 _ms_  |
+| 1000 _ms_       | p50        | 900 _ms_  |
+| 500 _ms_        | p99        | 0 _ms_    |
+| 500 _ms_        | p95        | 60 _ms_   |
+| 500 _ms_        | p90        | 100 _ms_  |
+| 500 _ms_        | p50        | 400 _ms_  |
+| 250 _ms_        | p99        | 140 _ms_  |
+| 250 _ms_        | p95        | 170 _ms_  |
+| 250 _ms_        | p90        | 180 _ms_  |
+| 250 _ms_        | p50        | 200 _ms_  |
+
+_Analysis was done in <https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1667993089>_
+
+**`api`**:
+
+| Target Duration | Percentile | Headroom  |
+|-----------------|------------|-----------|
+| 5000 _ms_       | p99        | 3500 _ms_ |
+| 5000 _ms_       | p95        | 4300 _ms_ |
+| 5000 _ms_       | p90        | 4600 _ms_ |
+| 5000 _ms_       | p50        | 4900 _ms_ |
+| 1000 _ms_       | p99        | 440 _ms_  |
+| 1000 _ms_       | p95        | 750 _ms_  |
+| 1000 _ms_       | p90        | 830 _ms_  |
+| 1000 _ms_       | p50        | 950 _ms_  |
+| 500 _ms_        | p99        | 450 _ms_  |
+| 500 _ms_        | p95        | 480 _ms_  |
+| 500 _ms_        | p90        | 490 _ms_  |
+| 500 _ms_        | p50        | 490 _ms_  |
+| 250 _ms_        | p99        | 90 _ms_   |
+| 250 _ms_        | p95        | 170 _ms_  |
+| 250 _ms_        | p90        | 210 _ms_  |
+| 250 _ms_        | p50        | 230 _ms_  |
+
+_Analysis was done in <https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1669995479>_
+
+**`git`**:
+
+| Target Duration | Percentile | Headroom  |
+|-----------------|------------|-----------|
+| 5000 _ms_       | p99        | 3760 _ms_ |
+| 5000 _ms_       | p95        | 4280 _ms_ |
+| 5000 _ms_       | p90        | 4430 _ms_ |
+| 5000 _ms_       | p50        | 4900 _ms_ |
+| 1000 _ms_       | p99        | 500 _ms_  |
+| 1000 _ms_       | p95        | 750 _ms_  |
+| 1000 _ms_       | p90        | 800 _ms_  |
+| 1000 _ms_       | p50        | 900 _ms_  |
+| 500 _ms_        | p99        | 280 _ms_  |
+| 500 _ms_        | p95        | 370 _ms_  |
+| 500 _ms_        | p90        | 400 _ms_  |
+| 500 _ms_        | p50        | 430 _ms_  |
+| 250 _ms_        | p99        | 200 _ms_  |
+| 250 _ms_        | p95        | 230 _ms_  |
+| 250 _ms_        | p90        | 240 _ms_  |
+| 250 _ms_        | p50        | 240 _ms_  |
+
+_Analysis was done in <https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1671385680>_
+
+## Non-Goals
+
+Not yet defined.
+
+## Proposal
+
+TBD
+
+## Technology
+
+TBD
+
+## Alternatives
+
+TBD
+
+## Links
+
+- [Cells - Routing: Technology](https://gitlab.com/groups/gitlab-org/-/epics/11002)
+- [Classify endpoints](https://gitlab.com/gitlab-org/gitlab/-/issues/430330)