Diffstat (limited to 'doc/architecture/blueprints/cloud_connector/index.md')
-rw-r--r-- | doc/architecture/blueprints/cloud_connector/index.md | 274 |
1 files changed, 274 insertions, 0 deletions
diff --git a/doc/architecture/blueprints/cloud_connector/index.md b/doc/architecture/blueprints/cloud_connector/index.md
new file mode 100644
index 00000000000..840e17a438a
--- /dev/null
+++ b/doc/architecture/blueprints/cloud_connector/index.md
@@ -0,0 +1,274 @@
+---
+status: proposed
+creation-date: "2023-09-28"
+authors: [ "@mkaeppler" ]
+coach: "@ayufan"
+approvers: [ "@rogerwoo", "@pjphillips" ]
+owning-stage: "~devops::data stores"
+participating-stages: ["~devops::fulfillment", "~devops::ai-powered"]
+---
+
+# Cloud Connector gateway service
+
+## Summary
+
+This design doc proposes a new GitLab-hosted edge service for our
+[Cloud Connector product offering](https://gitlab.com/groups/gitlab-org/-/epics/308), which would act as a public
+gateway into all features offered under the Cloud Connector umbrella.
+
+## Motivation
+
+We currently serve only AI related features to Cloud Connector customers, and our
+[current architecture](../../../development/cloud_connector/code_suggestions_for_sm.md)
+is a direct reflection of that.
+Both SaaS and Self-managed/Dedicated GitLab instances (SM hereafter) talk to the [AI gateway](../ai_gateway/index.md)
+directly, which also implements an `Access Layer` to verify that a given request is allowed
+to access the respective AI feature endpoint. The mechanism through which this verification happens
+for SM instances is detailed in [the CustomersDot architecture documentation](https://gitlab.com/gitlab-org/customers-gitlab-com/-/blob/main/doc/architecture/add_ons/code_suggestions/authorization_for_self_managed.md).
+
+This approach has served us well because it:
+
+- Required minimal changes from an architectural standpoint to allow SM users to consume AI features hosted by us.
+- Caused minimal friction with ongoing development on SaaS.
+- Reduced time to market.
+
+It is clear that the AI gateway alone does not sufficiently abstract over a wider variety of features, as by definition it is designed to serve AI features only.
+Adding non-AI features to the Cloud Connector offering would leave us with
+three choices:
+
+1. Evolve the AI gateway into something that also hosts non-AI features.
+1. Expose new Cloud Connector offerings by creating new publicly available services next to the AI gateway.
+1. Introduce a new Cloud Connector public gateway service (CC gateway hereafter) that fronts all feature gateways.
+   Feature gateways would become privately routed instead. This approach follows the North/South traffic pattern established
+   by the AI gateway.
+
+Option 3 is the primary focus of this blueprint. We briefly explore options 1 and 2 in [Alternative solutions](#alternative-solutions).
+
+### Goals
+
+Introducing a dedicated edge service for Cloud Connector serves the following goals:
+
+- **Provide a single access point for customers.** We found that customers are not keen on configuring their web proxies and firewalls
+  to allow outbound traffic to an ever-growing list of GitLab-hosted services. While we investigated ways to
+  [minimize the amount of configuration](https://gitlab.com/gitlab-org/gitlab/-/issues/424780) required,
+  a satisfying solution has yet to be found. Ideally, we would have _one host only_ that is configured and contacted by a GitLab instance.
+- **Reduce risk surface.** With a single entry point facing the public internet, we reduce the attack surface exposed to
+  malicious users and the necessity to guard each internal service individually from potential abuse. In the face of security issues
+  with a particular milestone release, we could guard against this in the single CC gateway service rather than in
+  feature gateways individually, improving the pace at which we can respond to security incidents.
+- **Provide CC specific telemetry.** User telemetry was added hastily for current Cloud Connector features and was originally
+  designed for SaaS, which is directly hooked up to Snowplow; that is not true for SM instances.
+  In order to track usage telemetry specific to CC use cases, it could be valuable to have a dedicated place to collect it that can be connected
+  to GitLab-internal data pipelines.
+- **Reduce duplication of efforts.** Certain tasks such as instance authorization and "clearing requests" against CustomersDot
+  that currently live in the AI gateway would have to be duplicated in other services without a central gateway.
+- **Improve control over rate limits.** With all requests going to a single AI gateway currently, be it from SM or SaaS, rate
+  limiting gets more complicated because we need to inspect request metadata to understand where a request originated from.
+  Moreover, having a dedicated service would allow us, if desired, to implement application-level request budgets, something
+  we do not currently support.
+- **Independently scalable.** For reasons of fault tolerance and scalability, it is beneficial to have all SM traffic go
+  through a separate service. For example, if an excess of unexpected requests arrives from SM instances due to a bug
+  in a milestone release, this traffic could be absorbed at the CC gateway level without cascading downstream, thus leaving
+  SaaS users unaffected.
+
+### Non-goals
+
+- **We are not proposing to build a new feature service.** We consider Cloud Connector to run orthogonal to the
+  various stage groups' efforts that build end user features. We would not want actual end user feature development
+  to happen in this service, but rather provide a vehicle through which these features can be delivered in a consistent manner
+  across all deployments (SaaS, SM and Dedicated).
+- **We are not changing the existing mechanism by which we authenticate instances and verify permissions.** We intend to keep
+  the current mechanism in place that emits access tokens from CustomersDot that are subsequently verified in
+  other systems using public key cryptographic checks.
+  We may move some of the code around that currently implements this, however.
+
+## Proposal
+
+We propose to make two major changes to the current architecture:
+
+1. Build and deploy a new Cloud Connector edge service that acts as a gateway into all features included
+   in our Cloud Connector product offering.
+1. Make the AI gateway a GitLab-internal service so it does not face the public internet anymore. The new
+   edge service will front the AI gateway instead.
+
+At a high level, the new architecture would look as follows:
+
+```plantuml
+@startuml
+node "GitLab Inc. infrastructure" {
+  package "Private services" {
+    [AI gateway] as AI
+    [Other feature gateway] as OF
+  }
+
+  package "Public services" {
+    [GitLab (SaaS)] as SAAS
+    [Cloud Connector gateway] as CC #yellow
+    [Customers Portal] as CDOT
+  }
+}
+
+node "Customer/Dedicated infrastructure" {
+  [GitLab] as SM
+  [Sidekiq] as SK
+}
+
+SAAS --> CC : " access CC feature"
+CC --> AI: " access AI feature"
+CC --> OF: " access non-AI feature"
+CC -> SAAS : "fetch JWKS"
+
+SM --> CC : "access CC feature"
+SK -> CDOT : "sync CC access token"
+CC -> CDOT : "fetch JWKS"
+
+@enduml
+```
+
+## Design and implementation details
+
+### CC gateway roles & responsibilities
+
+The new service would be made available at `cloud.gitlab.com` and act as a "smart router".
+It will have the following responsibilities:
+
+1. **Request handling.** The service will make decisions about whether a particular request is handled
+   in the service itself or forwarded to a downstream service. For example, a request to `/ai/code_suggestions/completions`
+   could be handled by forwarding this request to an appropriate endpoint in the AI gateway unchanged, while a request
+   to `/-/metrics` could be handled by the service itself.
+   As mentioned in [non-goals](#non-goals), the latter would not
+   include domain logic as it pertains to an end user feature, but rather cross-cutting logic such as telemetry, or
+   code that is necessary to make an existing feature implementation accessible to end users.
+
+   When handling requests, the service should be unopinionated about which protocol is used, to the extent possible.
+   One reason for injecting custom logic could be to set additional HTTP header fields. A design principle should be
+   to not require CC service deployments if a downstream service merely changes request payload or endpoint definitions. However,
+   supporting more protocols on top of HTTP may require adding support in the CC service itself.
+1. **Authentication/authorization.** The service will be the first point of contact for authenticating clients and verifying
+   they are authorized to use a particular CC feature. This will include fetching and caching public keys served from GitLab SaaS
+   and CustomersDot to decode JWT access tokens sent by GitLab instances, including matching token scopes to feature endpoints
+   to ensure an instance is eligible to consume this feature. This functionality will largely be lifted out of the AI gateway
+   where it currently lives. To maintain a ZeroTrust environment, the service will implement a more lightweight auth/z protocol
+   with internal services downstream that merely performs general authenticity checks but forgoes billing- and
+   permission-related scoping checks. What this protocol will look like is to be decided, and might be further explored in
+   [Discussion: Standardized Authentication and Authorization between internal services and GitLab Rails](https://gitlab.com/gitlab-org/gitlab/-/issues/421983).
+1. **Organization-level rate limits.** It is to be decided if this is needed, but there could be value in having application-level rate limits
+   and/or "pressure relief valves" that operate at the customer/organization level rather than the network level, the latter of which
+   Cloudflare already affords us. These controls would most likely be managed by the Cloud Connector team, not SREs or
+   infra engineers. We should also be careful to not simply extend the existing rate limiting configuration that is mainly concerned with GitLab SaaS.
+1. **Recording telemetry.** In any case where telemetry is specific to Cloud Connector feature usage or would result in
+   duplication of efforts when tracked further down the stack (for example, counting unique users), it should be recorded here instead.
+   To record usage/business telemetry, the service will talk directly to Snowplow. For operational telemetry, it will provide
+   a Prometheus metrics endpoint. We may decide to also route Service Ping telemetry through the CC service because this
+   currently goes to [`version-gitlab-com`](https://gitlab.com/gitlab-services/version-gitlab-com/).
+
+### Implementation choices
+
+We suggest using one of the following language stacks:
+
+1. **Go.** There is substantial organizational knowledge in writing and running
+   Go systems at GitLab, and it is a great systems language that gives us efficient ways to handle requests where
+   they merely need to be forwarded (request proxying) and a powerful concurrency mechanism through goroutines. This makes the
+   service easier to scale and cheaper to run than Ruby or Python, which scale largely at the process level due to their use
+   of Global Interpreter Locks, and use inefficient memory models, especially with regard to byte stream handling and manipulation.
+   A drawback of Go is that resource requirements such as memory use are less predictable because Go is a garbage collected language.
+1. **Rust.** We are starting to build up knowledge in Rust at GitLab. Like Go, it is a great systems language that is
+   also starting to see wider adoption in the Ruby ecosystem to write CRuby extensions. A major benefit is more predictable
+   resource consumption because it is not garbage collected and allows for finer control of memory use.
+   It is also very fast; we found that the Rust implementation for `prometheus-client-mmap` outperformed the original
+   extension written in C.
+
+## Alternative solutions
+
+### Cloudflare Worker
+
+One promising alternative to writing and deploying a service from scratch is to use
+[Cloudflare Workers](https://developers.cloudflare.com/workers/),
+a serverless solution for deploying application code that:
+
+- Is auto-scaled through Cloudflare's service infrastructure.
+- Supports any language that compiles to WebAssembly, including Rust.
+- Supports various options for [cloud storage](https://developers.cloudflare.com/workers/learning/storage-options/)
+  including a [key-value store](https://developers.cloudflare.com/kv/) we can use to cache data.
+- Supports a wide range of [network protocols](https://developers.cloudflare.com/workers/learning/protocols/)
+  including WebSockets.
+
+We are exploring this option in issue [#427726](https://gitlab.com/gitlab-org/gitlab/-/issues/427726).
+
+### Per-feature public gateways
+
+This approach would be a direct extrapolation of what we're doing now. Because we only host AI features for
+Cloud Connector at the moment, we have a single publicly routed gateway that acts as the entry point for
+Cloud Connector features and implements all the necessary auth/z and telemetry logic.
+
+Were we to introduce any non-AI features, each of these would receive its own gateway service, all
+publicly routed and accessed by GitLab instances through individual host names.
+For example:
+
+- `ai.gitlab.com`: Services AI features for GitLab instances
+- `cicd.gitlab.com`: Services CI/CD features for GitLab instances
+- `foo.gitlab.com`: Services foo features for GitLab instances
+
+A benefit of this approach is that in the absence of an additional layer of indirection, latency
+may be improved.
+
+A major question is how shared concerns are handled because duplicating auth/z, telemetry, rate limits
+etc. across all such services may mean re-inventing the wheel for different language stacks (the AI gateway was
+written in Python; a non-AI feature gateway will most likely be written in Ruby or Go, which are far more popular
+at GitLab).
+
+One solution to this could be to extract shared concerns into libraries, although these, too, would have to be
+made available in different languages. This is what we do with `labkit` (we have 3 versions already, for Go, Ruby and Python),
+which creates organizational challenges because we are already struggling as an organization to properly allocate
+people to maintaining foundational libraries, which is often handled on a best-effort, crowd-sourced basis.
+
+Another solution could be to extract services that handle some of these concerns. One pattern I have seen used
+with multiple edge services is for them to contact a single auth/z service that maps user identity and clears permissions
+before handling the actual request, thus reducing code duplication between feature services.
+
+Other drawbacks of this approach:
+
+- Increases the risk surface with the number of feature domains we pull into Cloud Connector because we need to deploy
+  and secure these services on the public internet.
+- Higher coupling of GitLab to feature services. Where and how a particular feature is made
+  available is an implementation detail.
+  By coupling GitLab to specific network endpoints like `ai.gitlab.com`
+  we reduce our flexibility to shuffle around both our service architecture and how we map technology to features
+  and customer plans/tiers because some customers stay on older GitLab
+  versions for a very long time. This would necessitate putting special routing/DNS rules in place to address any
+  larger changes we make to this topology.
+- Higher config overhead for customers. Because they may have to configure web proxies and firewalls, they need to
+  permit-list every single host/IP-range we expose this way.
+
+### Envoy
+
+[Envoy](https://www.envoyproxy.io/docs/envoy/v1.27.0/) is a Layer 7 proxy and communication bus that allows
+us to overlay a service mesh to solve cross-cutting
+problems with multi-service access such as service discovery and rate limiting. Envoy runs as a process sidecar
+to the actual application service it manages traffic for.
+A single LB could be deployed as Ingress to this service mesh so we can reach it at `cloud.gitlab.com`.
+
+A benefit of this approach would be that we can use an off-the-shelf solution to solve common networking
+and scaling problems.
+
+A major drawback of this approach is that it leaves no room to run custom application code, which would be necessary
+to validate access tokens or implement request budgets at the customer or organization level. In this solution,
+these functions would have to be factored out into libraries or other shared services instead, so it shares
+other drawbacks with the [per-feature public gateways alternative](#per-feature-public-gateways).
+
+### Evolving the AI gateway into a CC gateway
+
+This was the original idea behind the first iteration of the [AI gateway](../ai_gateway/index.md) architecture,
+which defined the AI gateway as a "prospective GitLab Plus" service (GitLab Plus was the WIP name for
+Cloud Connector).
+
+This is our least favorite option for several reasons:
+
+- Low code cohesion.
+  This would lead us to build another mini-monolith with wildly unrelated responsibilities
+  that would span various feature domains (AI, CI/CD, secrets management, observability etc.) and require teams
+  to coordinate when contributing to this service, introducing friction.
+- Written in Python. We chose Python for the AI gateway because it seemed a sensible choice, considering the AI
+  landscape has a Python bias. However, Python is almost non-existent at GitLab outside of this space, and most
+  of our engineers are Ruby or Go developers, with years of expertise built up in these stacks. We would either
+  have to rewrite the AI gateway in Ruby or Go to make it more broadly accessible, or invest heavily in Python
+  training and hiring as an organization.
+  Furthermore, Python has poor scaling characteristics because, like CRuby, it uses a Global Interpreter Lock and
+  therefore primarily scales through processes, not threads.
+- Ownership. The AI gateway is currently owned by the AI framework team. This would not make sense if we evolved this into a CC gateway, which should be owned by the Cloud Connector group instead.
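
For readers who want something concrete, the "smart router" behaviour described under "CC gateway roles & responsibilities" (decide per request whether to handle it locally or forward it, after matching token scopes to feature endpoints) could be sketched roughly as follows in Go, one of the two stacks the blueprint suggests. This is a minimal illustration, not part of the blueprint: the `Route` type, the `decide` function, the `code_suggestions` scope name and the `ai-gateway.internal` host are all hypothetical, and the scopes are assumed to come from a JWT that has already been verified against the fetched JWKS.

```go
package main

import (
	"fmt"
	"strings"
)

// Route describes how the gateway treats one path prefix.
// The type and its fields are hypothetical, for illustration only.
type Route struct {
	Prefix   string // path prefix to match
	Upstream string // private backend to forward to; empty means "handle locally"
	Scope    string // token scope required for this route; empty means none
}

// routes is an illustrative table, not a proposed configuration.
// The two paths are the examples used in the blueprint text; the
// upstream host and the scope name are assumptions.
var routes = []Route{
	{Prefix: "/ai/", Upstream: "http://ai-gateway.internal", Scope: "code_suggestions"},
	{Prefix: "/-/metrics"},
}

// match returns the route with the longest prefix that matches path.
func match(path string) (Route, bool) {
	var best Route
	found := false
	for _, r := range routes {
		if strings.HasPrefix(path, r.Prefix) && (!found || len(r.Prefix) > len(best.Prefix)) {
			best, found = r, true
		}
	}
	return best, found
}

// hasScope reports whether want is among the scopes of an already verified token.
func hasScope(scopes []string, want string) bool {
	for _, s := range scopes {
		if s == want {
			return true
		}
	}
	return false
}

// decide returns "not_found", "forbidden", "local" or "forward:<upstream>"
// for a request path and the scopes carried by its access token.
func decide(path string, scopes []string) string {
	r, ok := match(path)
	if !ok {
		return "not_found"
	}
	if r.Scope != "" && !hasScope(scopes, r.Scope) {
		return "forbidden"
	}
	if r.Upstream == "" {
		return "local"
	}
	return "forward:" + r.Upstream
}

func main() {
	fmt.Println(decide("/ai/code_suggestions/completions", []string{"code_suggestions"}))
	fmt.Println(decide("/-/metrics", nil))
}
```

Keeping the routing decision a pure function of path and scopes is one way to honor the stated design principle that downstream payload changes should not require a CC gateway deployment: only the route table would change when a new feature gateway is added.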