Diffstat (limited to 'doc/architecture/blueprints/rate_limiting/index.md')
-rw-r--r-- | doc/architecture/blueprints/rate_limiting/index.md | 411 |
1 files changed, 411 insertions, 0 deletions
diff --git a/doc/architecture/blueprints/rate_limiting/index.md b/doc/architecture/blueprints/rate_limiting/index.md
new file mode 100644
index 00000000000..692cef4b11d
--- /dev/null
+++ b/doc/architecture/blueprints/rate_limiting/index.md
@@ -0,0 +1,411 @@
+---
+stage: none
+group: unassigned
+comments: false
+description: 'Next Rate Limiting Architecture'
+---
+
+# Next Rate Limiting Architecture
+
+## Summary
+
+Introducing reasonable application limits is a very important step in any SaaS
+platform scaling strategy. The more users a SaaS platform has, the more
+important it is to introduce sensible rate limiting and policy enforcement
+that will help to achieve availability goals, reduce the problem of noisy
+neighbours for users and ensure that they can keep using a platform
+successfully.
+
+This is especially true for GitLab.com. Our goal is to have a reasonable and
+transparent strategy for enforcing application limits, which will become a
+definition of responsible usage, to help us with keeping our availability and
+user satisfaction at a desired level.
+
+We've been introducing various application limits for many years already, but
+we've never had a consistent strategy for doing it. What we want to build now is
+a consistent framework used by engineers and product managers, across the entire
+application stack, to define, expose and enforce limits and policies.
+
+Lack of consistency in defining limits, and not being able to expose them to our
+users, support engineers and satellite services, has a negative impact on our
+productivity, makes it difficult to introduce new limits and eventually
+prevents us from enforcing responsible usage on all layers of our application
+stack.
+
+This blueprint has been written to consolidate our limits and to describe the
+vision of our next rate limiting and policy enforcement architecture.
+
+_Disclaimer: The following contains information related to upcoming products,
+features, and functionality._
+
+_It is important to note that the information presented is for informational
+purposes only. Please do not rely on this information for purchasing or
+planning purposes._
+
+_As with all projects, the items mentioned in this document and linked pages are
+subject to change or delay. The development, release and timing of any
+products, features, or functionality remain at the sole discretion of GitLab
+Inc._
+
+## Goals
+
+**Implement the next architecture for rate limiting and policy definition.**
+
+## Challenges
+
+- We have many ways to define application limits, in many different places.
+- It is difficult to understand what limits have been applied to a request.
+- It is difficult to introduce new limits, and even more difficult to define policies.
+- Finding what limits are defined requires performing a codebase audit.
+- We don't have a good way to expose limits to satellite services like Registry.
+- We enforce a number of different policies via opaque external systems
+  (Pipeline Validation Service, Bouncer, Watchtower, Cloudflare, HAProxy).
+- There is no standardized way to define policies in a way consistent with defining limits.
+- It is difficult to understand when a user is approaching a limit threshold.
+- There is no way to automatically notify a user when they are approaching thresholds.
+- There is no single way to change limits for a namespace / project / user / customer.
+- There is no single way to monitor limits through real-time metrics.
+- There is no framework for hierarchical limit configuration (instance / namespace / sub-group / project).
+- We allow disabling rate-limiting for some marquee SaaS customers, but this
+  increases the risk for those same customers. We should instead be able to set
+  higher limits.
+
+## Opportunity
+
+We want to build a new framework, making it easier to define limits, quotas and
+policies, and to enforce / adjust them in a controlled way, through robust
+monitoring capabilities.
+
+<!-- markdownlint-disable MD029 -->
+
+1. Build a framework to define and enforce limits in GitLab Rails.
+2. Build an API to consume limits in satellite services and expose them to users.
+3. Extract parts of this framework into a dedicated GitLab Limits Service.
+
+<!-- markdownlint-enable MD029 -->
+
+The most important opportunity here is consolidation happening on multiple
+levels:
+
+1. Consolidate on the application limits tooling used in GitLab Rails.
+1. Consolidate on the process of adding and managing application limits.
+1. Consolidate on the behavior of hierarchical cascade of limits and overrides.
+1. Consolidate on the application limits tooling used across the entire application stack.
+1. Consolidate on the policy enforcement tooling used across the entire company.
+
+Once we do that we will unlock another opportunity: to ship the new framework /
+tooling as a GitLab feature to unlock these consolidation benefits for our
+users, customers and the wider community.
+
+### Limits, quotas and policies
+
+This document aims to describe our technical vision for building the next rate
+limiting architecture for GitLab.com. We refer to this architectural evolution
+as "the next rate limiting architecture", but this is a mental shortcut,
+because we actually want to build a better framework that will make it easier
+for us to manage not only rate limits, but also quotas and policies.
+
+Below you can find a short definition of what we understand by a limit, by a
+quota and by a policy.
+
+- **Limit:** A constraint on application usage, typically used to mitigate
+  risks to performance, stability, and security.
+  - _Example:_ API calls per second for a given IP address
+  - _Example:_ `git clone` events per minute for a given user
+  - _Example:_ maximum artifact upload size of 1GB
+- **Quota:** A global constraint on application usage that is aggregated across an
+  entire namespace over the duration of its billing cycle.
+  - _Example:_ 400 CI/CD minutes per namespace per month
+  - _Example:_ 10GB transfer per namespace per month
+- **Policy:** A representation of business logic that is decoupled from application
+  code. Decoupled policy definitions allow logic to be shared across multiple services
+  and/or "hot-loaded" at runtime without releasing a new version of the application.
+  - _Example:_ decode and verify a JWT, determine whether the user has access to the
+    given resource based on the JWT's scopes and claims
+  - _Example:_ deny access based on group-level constraints
+    (such as IP allowlist, SSO, and 2FA) across all services
+
+Technically, all of these are limits, because rate limiting is still
+"limiting", quota is usually a business limit, and policy limits what you can
+do with the application to enforce specific rules. By referring to a "limit" in
+this document we mean a limit that is defined to protect business, availability
+and security.
+
+### Framework to define and enforce limits
+
+First we want to build a new framework that will allow us to define and enforce
+application limits, in the GitLab Rails project context, in a more consistent
+and established way. In order to do that, we will need to build a new
+abstraction that will tell engineers how to define a limit in a structured way
+(presumably using YAML or CUE format) and then how to consume the limit in the
+application itself.
+
+We already have many limits defined in the application, and we can use them to
+triangulate a reasonable abstraction that will consolidate how we
+define, use and enforce limits.
+
+We envision building a simple Ruby library here (we can add it to LabKit) that
+will make it trivial for engineers to check if a certain limit has been
+exceeded or not.
+
+```yaml
+name: my_limit_name
+actors: user
+context: project, group, pipeline
+type: rate / second
+group: pipeline::execution
+limits:
+  warn: 2B / day
+  soft: 100k / s
+  hard: 500k / s
+```
+
+```ruby
+Gitlab::Limits::RateThreshold.enforce(:my_limit_name) do |threshold|
+  actor = current_user
+  context = current_project
+
+  threshold.available do |limit|
+    # ...
+  end
+
+  threshold.approaching do |limit|
+    # ...
+  end
+
+  threshold.exceeded do |limit|
+    # ...
+  end
+end
+```
+
+In the example above, when `my_limit_name` is defined in YAML, engineers will
+be able to check the current state and execute the appropriate code block
+depending on past usage / resource consumption.
+
+Things we want to build and support by default:
+
+1. Comprehensive dashboards showing how often limits are being hit.
+1. Notifications about the risk of hitting limits.
+1. Automation checking if limits definitions are being enforced properly.
+1. Different types of limits - time bound / number per resource etc.
+1. A panel that makes it easy to override limits per plan / namespace.
+1. Logging that will expose limits applied in Kibana.
+1. An automatically generated documentation page describing all the limits.
+
+### API to expose limits and policies
+
+Once we have established a consistent way to define application limits we
+can build a few API endpoints that will allow us to expose them to our users,
+customers and other satellite services that may want to consume them.
+
+Users will be able to ask the API about the limits / thresholds that have been
+set for them, how often they are hitting them, and what impact those might have
+on their business. This kind of transparency can help them with communicating
+their needs to the customer success team at GitLab, and we will be able to
+communicate how responsible usage is defined at a given moment.
+
+Because of how the GitLab architecture has been built, the GitLab Rails
+application, in most cases, behaves as a central enterprise service bus (ESB)
+and there are a few satellite services communicating with it. Services like
+Container Registry, GitLab Runners, Gitaly, Workhorse, KAS could use the API to
+receive a set of application limits they are supposed to enforce. This will
+still allow us to define all of them in a single place.
+
+We should, however, avoid a possible harmful feedback loop that would put
+additional strain on the Rails application when there is a sudden increase in
+usage happening. This might be a big customer starting a new automation that
+traverses our API or a Denial of Service attack. In such cases, the additional
+traffic will reach GitLab Rails and subsequently also other satellite services.
+Then the satellite services may need to consult Rails again to obtain new
+instructions / policies around rate limiting the increased traffic. This can
+put additional strain on the Rails application and eventually degrade performance
+even more. In order to avoid this problem, we should extract the API endpoints
+to a separate service (see the section below) if the request rate to those
+endpoints depends on the volume of incoming traffic. Alternatively we can keep
+those endpoints in Rails if the increased traffic will not translate into
+an increase in request rate or resource consumption on these API
+endpoints on the Rails side.
+
+#### Decoupled Limits Service
+
+At some point we may decide that it is time to extract a stateful backend
+responsible for storing metadata around limits, all the counters and state
+required, and exposing an API, out of Rails.
+
+It is impossible to make a decision about extracting such a decoupled limits
+service yet, because we will need to ship more proof-of-concept work, and
+concrete iterations to inform us better about when and how we should do that. We
+will depend on the Architecture Evolution practice to guide us towards either
+extracting Decoupled Limits Service or not doing that at all.
+
+As we evolve this blueprint, we will document our findings and insights about
+what this service should look like, in this section of the document.
+
+### GitLab Policy Service
+
+_Disclaimer_: Extracting a GitLab Policy Service might be out of scope
+of the current workstream organized around implementing this blueprint.
+
+Not all limits can be easily described in YAML. There are some more complex
+policies that require a bit more sophisticated approach and a declarative
+programming language used to enforce them. One example of such a language might be
+[Rego](https://www.openpolicyagent.org/docs/latest/policy-language/).
+It is a standardized way to define policies in
+[OPA - Open Policy Agent](https://www.openpolicyagent.org/). At GitLab we are
+already using OPA in some departments. We envision the need for additional
+consolidation, not only to consolidate on the tooling we are using internally at
+GitLab, but also to transform the Next Rate Limiting Architecture into
+something we can make a part of the product itself.
+
+Today, we already have a policy service we are using to decide whether a
+pipeline can be created or not. There are many policies defined in
+[Pipeline Validation Service](https://gitlab.com/gitlab-org/modelops/anti-abuse/pipeline-validation-service).
+There is a significant opportunity here in transforming Pipeline Validation
+Service into a general purpose GitLab Policy Service / GitLab Policy Agent that
+will be well integrated into the GitLab product itself.
+
+Generalizing Pipeline Validation Service into GitLab Policy Service can bring a
+few interesting benefits:
+
+1. Consolidate on our tooling across the company to improve efficiency.
+1. Integrate our GitLab Rails limits framework to resolve policies using the policy service.
+1. Avoid struggling to define complex policies in YAML and hacking their evaluation in Ruby.
+1. Build a policy for limiting GraphQL queries using query execution cost estimation.
+1. Make it easier to resolve policies that do not need "hierarchical limits" structure.
+1. Make GitLab Policy Service part of the product and integrate it into the single application.
+
+We envision GitLab Policy Service being the place to define policies that do
+not require knowing anything about the hierarchical structure of the limits.
+There are limits that do not need this, like IP address allowlists, spam
+checks, configuration validation etc.
+
+We defined "Policy" as a stateless, functional-style limit. It takes input
+arguments and evaluates to either true or false. It should not require a global
+counter or any other volatile global state to get evaluated. It may still
+require globally defined rules / configuration, but this state is not
+volatile in the same way that a rate limiting counter, or the megabytes
+consumed that are used to evaluate a quota limit, may be.
+
+#### Policies used internally and externally
+
+The GitLab Policy Service might be used in two different ways:
+
+1. Rails limits framework will use it as a source of policies enforced internally.
+1. The policy service feature will be used as a backend to store policies defined by users.
+
+These are two slightly different use-cases: the first one is about using
+internally-defined policies to ensure the stability / availability of a GitLab
+instance (GitLab.com or self-managed instance). The second use-case is about
+making GitLab Policy Service a feature that users will be able to build on top
+of.
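
The definition of a policy given earlier in this section (a stateless function of its inputs and static configuration that evaluates to true or false) can be illustrated with a minimal Ruby sketch. The `IpAllowlistPolicy` module and its `allowed?` method are hypothetical names chosen for illustration; they are not part of GitLab or of this blueprint. Only Ruby's standard `ipaddr` library is assumed.

```ruby
require 'ipaddr'

# Hypothetical policy, for illustration only: not an existing GitLab API.
# It depends solely on its arguments, so it needs no counters and no
# volatile global state.
module IpAllowlistPolicy
  # remote_ip      - String, the address of the incoming request
  # allowed_ranges - Array of CIDR strings (static configuration)
  # Returns true when the address falls inside any allowed range.
  def self.allowed?(remote_ip, allowed_ranges)
    ip = IPAddr.new(remote_ip)
    allowed_ranges.any? { |range| IPAddr.new(range).include?(ip) }
  end
end
```

Because the result depends only on the arguments, a check like `IpAllowlistPolicy.allowed?("10.1.2.3", ["10.0.0.0/8"])` could in principle be evaluated by any service holding the same configuration, which is the property that separates a policy from a rate limit or a quota in this document.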
+
+Both use-cases are valid but we will need to make a technical decision about how
+to separate them. Even if we decide to implement them both in a single service,
+we will need to draw a strong boundary between the two.
+
+The same principle might apply to the Decoupled Limits Service described in one of
+the sections of this document above.
+
+#### The two limits / policy services
+
+It is possible that GitLab Policy Service and Decoupled Limits Service can
+actually be the same thing. It, however, depends on the implementation details
+that we can't predict yet, and the decision about merging these services
+together will need to be informed by subsequent iterations' feedback.
+
+## Hierarchical limits
+
+The GitLab application aggregates users, projects, groups and namespaces in a
+hierarchical way. This hierarchical structure has been designed to make it
+easier to manage permissions, streamline workflows, and allow users and
+customers to store related projects, repositories, and other artifacts,
+together.
+
+It is important to design the new rate limiting framework in a way that it
+is built on top of this hierarchical structure and engineers, customers, SREs and
+other stakeholders can understand how limits are being applied, enforced and
+overridden within the hierarchy of namespaces, groups and projects.
+
+We want to reduce the cognitive load required to understand how limits are
+being managed within the existing permissions structure. We might need to build
+a simple and easy-to-understand formula for how our application decides which
+limits and thresholds to apply for a given request and a given actor:
+
+> GitLab will read default limits for every operation, all overrides configured
+> and will choose a limit with the highest precedence configured. A limit
+> precedence needs to be explicitly configured for every override, a default
+> limit has precedence 100.
+
+One way in which we can simplify limits management in general is to:
+
+1. Have default limits / thresholds defined in YAML files with a default precedence 100.
+1. Allow limits to be overridden through the API, store overrides in the database.
+1. Every limit / threshold override needs to have an integer precedence value provided.
+1. Build an API that will take an actor and expose limits applicable for it.
+1. Build a dashboard showing actors with non-standard limits / overrides.
+1. Build observability around this showing in Kibana when non-standard limits are being used.
+
+The points above represent an idea to use a precedence score (or Z-Index for
+limits), but there may be better solutions, like just defining a direction of
+overrides - a limit defined lower in the hierarchy might always override a
+limit defined higher in the hierarchy. Choosing a proper solution will require
+thoughtful research.
+
+## Principles
+
+1. Try to avoid building the rate limiting framework in a tightly coupled way.
+1. Build the application limits API in a way that it can be easily extracted to a separate service.
+1. Build application limits definition in a way that is independent from the Rails application.
+1. Build tooling that produces consistent behavior and results across programming languages.
+1. Build the new framework in a way that we can extend to allow self-managed admins to customize limits.
+1. Maintain consistent features and behavior across SaaS and self-managed codebase.
+1. Be mindful of the cognitive load added by the hierarchical limits, and aim to reduce it.
+
+## Status
+
+Request For Comments.
+
+## Timeline
+
+- 2022-04-27: [Rate Limit Architecture Working Group](https://about.gitlab.com/company/team/structure/working-groups/rate-limit-architecture/) started.
+- 2022-06-07: Working Group members [started submitting technical proposals](https://gitlab.com/gitlab-org/gitlab/-/issues/364524) for the next rate limiting architecture.
+- 2022-06-15: We started [scoring proposals](https://docs.google.com/spreadsheets/d/1DFHU1kSdTnpydwM5P2RK8NhVBNWgEHvzT72eOhB8F9E) submitted by Working Group members.
+- 2022-07-06: A fourth, [consolidated proposal](https://gitlab.com/gitlab-org/gitlab/-/issues/364524#note_1017640650), has been submitted.
+- 2022-07-12: Started working on the design document following [Architecture Evolution Workflow](https://about.gitlab.com/handbook/engineering/architecture/workflow/).
+- 2022-09-08: The initial version of the blueprint has been merged.
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role                         | Who                     |
+|------------------------------|-------------------------|
+| Author                       | Grzegorz Bizon          |
+| Author                       | Fabio Pitino            |
+| Author                       | Marshall Cottrell       |
+| Author                       | Hayley Swimelar         |
+| Engineering Leader           | Sam Goldstein           |
+| Product Manager              |                         |
+| Architecture Evolution Coach |                         |
+| Recommender                  |                         |
+| Recommender                  |                         |
+| Recommender                  |                         |
+| Recommender                  |                         |
+
+DRIs:
+
+| Role                         | Who                    |
+|------------------------------|------------------------|
+| Leadership                   |                        |
+| Product                      |                        |
+| Engineering                  |                        |
+
+Domain experts:
+
+| Area                         | Who                    |
+|------------------------------|------------------------|
+|                              |                        |
+
+<!-- vale gitlab.Spelling = YES -->
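
As an illustration of the precedence rule quoted in the "Hierarchical limits" section of the blueprint, the following is a minimal Ruby sketch, assuming the overrides for an actor have already been loaded. `LimitRule`, `DEFAULT_PRECEDENCE` and `effective_limit` are hypothetical names and not an existing GitLab API. Only the default precedence of 100 comes from the blueprint; any thresholds used with it are illustrative.

```ruby
# Hypothetical value object, for illustration only.
LimitRule = Struct.new(:name, :threshold, :precedence, keyword_init: true)

# The blueprint states that a default limit has precedence 100.
DEFAULT_PRECEDENCE = 100

# Picks the rule with the highest precedence among the default limit and
# all configured overrides, as described in the quoted formula.
def effective_limit(default, overrides)
  ([default] + overrides).max_by(&:precedence)
end

default_rule = LimitRule.new(
  name: :my_limit_name, threshold: 100, precedence: DEFAULT_PRECEDENCE
)
```

Under this sketch, an override stored with precedence 200 would win over `default_rule`, while an override stored with precedence 10 would be ignored in favour of the default. The alternative mentioned in the blueprint, a fixed direction of overrides through the hierarchy, would replace the precedence comparison with a walk from project up to instance.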