1 files changed, 411 insertions, 0 deletions
diff --git a/doc/architecture/blueprints/rate_limiting/index.md b/doc/architecture/blueprints/rate_limiting/index.md
new file mode 100644
index 00000000000..692cef4b11d
--- /dev/null
+++ b/doc/architecture/blueprints/rate_limiting/index.md
@@ -0,0 +1,411 @@
+---
+stage: none
+group: unassigned
+comments: false
+description: 'Next Rate Limiting Architecture'
+---
+
+# Next Rate Limiting Architecture
+
+## Summary
+
+Introducing reasonable application limits is an important step in any SaaS
+platform scaling strategy. The more users a SaaS platform has, the more
+important it is to introduce sensible rate limiting and policy enforcement
+that help achieve availability goals, reduce the problem of noisy
+neighbours for users, and ensure that they can keep using the platform
+successfully.
+
+This is especially true for GitLab.com. Our goal is to have a reasonable and
+transparent strategy for enforcing application limits, which will become a
+definition of responsible usage, to help us keep our availability and
+user satisfaction at a desired level.
+
+We've been introducing various application limits for many years already, but
+we've never had a consistent strategy for doing it. What we want to build now is
+a consistent framework used by engineers and product managers, across the entire
+application stack, to define, expose and enforce limits and policies.
+
+The lack of consistency in defining limits, and our inability to expose them to
+our users, support engineers and satellite services, has a negative impact on
+our productivity, makes it difficult to introduce new limits, and eventually
+prevents us from enforcing responsible usage on all layers of our application
+stack.
+
+This blueprint has been written to consolidate our limits and to describe the
+vision of our next rate limiting and policy enforcement architecture.
+
+_Disclaimer: The following contains information related to upcoming products,
+features, and functionality._
+
+_It is important to note that the information presented is for informational
+purposes only. Please do not rely on this information for purchasing or
+planning purposes._
+
+_As with all projects, the items mentioned in this document and linked pages are
+subject to change or delay. The development, release and timing of any
+products, features, or functionality remain at the sole discretion of GitLab
+Inc._
+
+## Goals
+
+**Implement the next architecture for rate limiting and policy definition.**
+
+## Challenges
+
+- We have many ways to define application limits, in many different places.
+- It is difficult to understand what limits have been applied to a request.
+- It is difficult to introduce new limits, and even more difficult to define policies.
+- Finding what limits are defined requires performing a codebase audit.
+- We don't have a good way to expose limits to satellite services like Registry.
+- We enforce a number of different policies via opaque external systems
+  (Pipeline Validation Service, Bouncer, Watchtower, Cloudflare, HAProxy).
+- There is no standardized way to define policies in a way consistent with defining limits.
+- It is difficult to understand when a user is approaching a limit threshold.
+- There is no way to automatically notify a user when they are approaching thresholds.
+- There is no single way to change limits for a namespace / project / user / customer.
+- There is no single way to monitor limits through real-time metrics.
+- There is no framework for hierarchical limit configuration (instance / namespace / sub-group / project).
+- We allow disabling rate-limiting for some marquee SaaS customers, but this
+  increases the risk for those same customers. We should instead be able to set
+  higher limits.
+
+## Opportunity
+
+We want to build a new framework, making it easier to define limits, quotas and
+policies, and to enforce / adjust them in a controlled way, through robust
+monitoring capabilities.
+
+<!-- markdownlint-disable MD029 -->
+
+1. Build a framework to define and enforce limits in GitLab Rails.
+2. Build an API to consume limits in satellite services and expose them to users.
+3. Extract parts of this framework into a dedicated GitLab Limits Service.
+
+<!-- markdownlint-enable MD029 -->
+
+The most important opportunity here is consolidation happening on multiple
+levels:
+
+1. Consolidate on the application limits tooling used in GitLab Rails.
+1. Consolidate on the process of adding and managing application limits.
+1. Consolidate on the behavior of hierarchical cascade of limits and overrides.
+1. Consolidate on the application limits tooling used across the entire application stack.
+1. Consolidate on the policy enforcement tooling used across the entire company.
+
+Once we do that we will unlock another opportunity: to ship the new framework /
+tooling as a GitLab feature, extending these consolidation benefits to our
+users, customers and the wider community.
+
+### Limits, quotas and policies
+
+This document aims to describe our technical vision for building the next rate
+limiting architecture for GitLab.com. We refer to this architectural evolution
+as "the next rate limiting architecture", but this is a mental shortcut,
+because we actually want to build a better framework that will make it easier
+for us to manage not only rate limits, but also quotas and policies.
+
+Below you can find a short definition of what we understand by a limit, by a
+quota and by a policy.
+
+- **Limit:** A constraint on application usage, typically used to mitigate
+  risks to performance, stability, and security.
+  - _Example:_ API calls per second for a given IP address
+  - _Example:_ `git clone` events per minute for a given user
+  - _Example:_ maximum artifact upload size of 1GB
+- **Quota:** A global constraint on application usage that is aggregated across an
+  entire namespace over the duration of their billing cycle.
+  - _Example:_ 400 CI/CD minutes per namespace per month
+  - _Example:_ 10GB transfer per namespace per month
+- **Policy:** A representation of business logic that is decoupled from application
+  code. Decoupled policy definitions allow logic to be shared across multiple services
+  and/or "hot-loaded" at runtime without releasing a new version of the application.
+  - _Example:_ decode and verify a JWT, determine whether the user has access to the
+    given resource based on the JWT's scopes and claims
+  - _Example:_ deny access based on group-level constraints
+    (such as IP allowlist, SSO, and 2FA) across all services
+
+Technically, all of these are limits, because rate limiting is still
+"limiting", quota is usually a business limit, and policy limits what you can
+do with the application to enforce specific rules. By referring to a "limit" in
+this document we mean a limit that is defined to protect business, availability
+and security.
+
+### Framework to define and enforce limits
+
+First we want to build a new framework that will allow us to define and enforce
+application limits, in the GitLab Rails project context, in a more consistent
+and established way. In order to do that, we will need to build a new
+abstraction that will tell engineers how to define a limit in a structured way
+(presumably using YAML or Cue format) and then how to consume the limit in the
+application itself.
+
+We already have many limits defined in the application. We can use them to
+triangulate a reasonable abstraction that will consolidate how we
+define, use and enforce limits.
+
+We envision building a simple Ruby library here (we can add it to LabKit) that
+will make it trivial for engineers to check if a certain limit has been
+exceeded or not.
+
+```yaml
+name: my_limit_name
+actors: user
+context: project, group, pipeline
+type: rate / second
+group: pipeline::execution
+limits:
+  warn: 2B / day
+  soft: 100k / s
+  hard: 500k / s
+```
+
+```ruby
+Gitlab::Limits::RateThreshold.enforce(:my_limit_name) do |threshold|
+  actor   = current_user
+  context = current_project
+
+  threshold.available do |limit|
+    # ...
+  end
+
+  threshold.approaching do |limit|
+    # ...
+  end
+
+  threshold.exceeded do |limit|
+    # ...
+  end
+end
+```
+
+In the example above, when `my_limit_name` is defined in YAML, engineers will
+be able to check the current state and execute the appropriate code block
+depending on the past usage / resource consumption.
+
+Things we want to build and support by default:
+
+1. Comprehensive dashboards showing how often limits are being hit.
+1. Notifications about the risk of hitting limits.
+1. Automation checking if limits definitions are being enforced properly.
+1. Different types of limits - time bound / number per resource etc.
+1. A panel that makes it easy to override limits per plan / namespace.
+1. Logging that will expose limits applied in Kibana.
+1. An automatically generated documentation page describing all the limits.
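+
+As a hedged illustration of the "different types of limits" item above, the
+sketch below reuses the example definition format from this section to express
+a size limit and a monthly quota, both taken from the examples in "Limits,
+quotas and policies". The limit names, the `type` values, the `actors` /
+`context` / `group` values, and the `warn` threshold are assumptions made for
+illustration only. This blueprint does not define a schema yet.
+
+```yaml
+# Hypothetical definitions: every name and value here is illustrative.
+name: artifact_upload_size
+actors: user
+context: project
+type: size                  # assumed type for a "number per resource" limit
+group: artifacts::upload    # assumed group name
+limits:
+  hard: 1GB
+---
+name: ci_minutes_per_month
+actors: namespace           # assumed: the quota is aggregated per namespace
+context: namespace
+type: quota / month         # assumed type for a billing-cycle quota
+group: pipeline::execution
+limits:
+  warn: 300                 # illustrative warning threshold
+  hard: 400
+```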
+
+### API to expose limits and policies
+
+Once we have established a consistent way to define application limits we
+can build a few API endpoints that will allow us to expose them to our users,
+customers and other satellite services that may want to consume them.
+
+Users will be able to ask the API about the limits / thresholds that have been
+set for them, how often they are hitting them, and what impact those might have
+on their business. This kind of transparency can help them communicate
+their needs to the customer success team at GitLab, and we will be able to
+communicate how responsible usage is defined at a given moment.
+
+Because of how the GitLab architecture has been built, the GitLab Rails
+application, in most cases, behaves as a central enterprise service bus (ESB),
+and there are a few satellite services communicating with it. Services like
+Container Registry, GitLab Runners, Gitaly, Workhorse, KAS could use the API to
+receive the set of application limits they are supposed to enforce. This will
+still allow us to define all of them in a single place.
+
+We should, however, avoid a possible self-reinforcing feedback loop that would
+put additional strain on the Rails application when there is a sudden increase
+in usage. This might be a big customer starting a new automation that
+traverses our API, or a Denial of Service attack. In such cases, the additional
+traffic will reach GitLab Rails and subsequently also other satellite services.
+The satellite services may then need to consult Rails again to obtain new
+instructions / policies around rate limiting the increased traffic. This can
+put additional strain on the Rails application and eventually degrade
+performance even more. To avoid this problem, we should extract the API
+endpoints to a separate service (see the section below) if the request rate to
+those endpoints depends on the volume of incoming traffic. Alternatively, we can
+keep those endpoints in Rails if increased traffic does not translate into an
+increased request rate or increased resource consumption on these API
+endpoints on the Rails side.
+
+#### Decoupled Limits Service
+
+At some point we may decide that it is time to extract, out of Rails, a
+stateful backend responsible for storing metadata around limits and all the
+counters and state required, and for exposing an API.
+
+It is impossible to make a decision about extracting such a decoupled limits
+service yet, because we will need to ship more proof-of-concept work, and
+concrete iterations to inform us better about when and how we should do that. We
+will depend on the Architecture Evolution practice to guide us towards either
+extracting a Decoupled Limits Service or not doing that at all.
+
+As we evolve this blueprint, we will document our findings and insights about
+what this service should look like, in this section of the document.
+
+### GitLab Policy Service
+
+_Disclaimer_: Extracting a GitLab Policy Service might be out of scope
+of the current workstream organized around implementing this blueprint.
+
+Not all limits can be easily described in YAML. Some more complex policies
+require a more sophisticated approach and a declarative programming language
+to enforce them. One example of such a language might be
+[Rego](https://www.openpolicyagent.org/docs/latest/policy-language/).
+It is a standardized way to define policies in
+[OPA - Open Policy Agent](https://www.openpolicyagent.org/). At GitLab we are
+already using OPA in some departments. We envision the need for additional
+consolidation, not only on the tooling we are using internally at
+GitLab, but also to transform the Next Rate Limiting Architecture into
+something we can make a part of the product itself.
+
+Today, we already have a policy service we are using to decide whether a
+pipeline can be created or not. There are many policies defined in
+[Pipeline Validation Service](https://gitlab.com/gitlab-org/modelops/anti-abuse/pipeline-validation-service).
+There is a significant opportunity here in transforming Pipeline Validation
+Service into a general purpose GitLab Policy Service / GitLab Policy Agent that
+will be well integrated into the GitLab product itself.
+
+Generalizing Pipeline Validation Service into GitLab Policy Service can bring a
+few interesting benefits:
+
+1. Consolidate on our tooling across the company to improve efficiency.
+1. Integrate our GitLab Rails limits framework to resolve policies using the policy service.
+1. Do not struggle to define complex policies in YAML and hack evaluating them in Ruby.
+1. Build a policy for GraphQL queries limiting using query execution cost estimation.
+1. Make it easier to resolve policies that do not need "hierarchical limits" structure.
+1. Make GitLab Policy Service part of the product and integrate it into the single application.
+
+We envision GitLab Policy Service as the place to define policies that do
+not require knowing anything about the hierarchical structure of the limits.
+There are limits that do not need this, like IP allowlists, spam
+checks, configuration validation etc.
+
+We defined "Policy" as a stateless, functional-style limit. It takes input
+arguments and evaluates to either true or false. It should not require a global
+counter or any other volatile global state to get evaluated. It may still
+require globally defined rules / configuration, but this state is not
+volatile in the same way as a rate limiting counter, or the megabytes consumed
+when evaluating a quota limit.
+
+#### Policies used internally and externally
+
+The GitLab Policy Service might be used in two different ways:
+
+1. Rails limits framework will use it as a source of policies enforced internally.
+1. The policy service feature will be used as a backend to store policies defined by users.
+
+These are two slightly different use-cases: the first one is about using
+internally-defined policies to ensure the stability / availability of a GitLab
+instance (GitLab.com or self-managed instance). The second use-case is about
+making GitLab Policy Service a feature that users will be able to build on top
+of.
+
+Both use-cases are valid but we will need to make a technical decision about how
+to separate them. Even if we decide to implement them both in a single service,
+we will need to draw a strong boundary between the two.
+
+The same principle might apply to the Decoupled Limits Service described in one of
+the sections of this document above.
+
+#### The two limits / policy services
+
+It is possible that GitLab Policy Service and Decoupled Limits Service can
+actually be the same thing. It, however, depends on the implementation details
+that we can't predict yet, and the decision about merging these services
+together will need to be informed by subsequent iterations' feedback.
+
+## Hierarchical limits
+
+The GitLab application aggregates users, projects, groups and namespaces in a
+hierarchical way. This hierarchical structure has been designed to make it
+easier to manage permissions, streamline workflows, and allow users and
+customers to store related projects, repositories, and other artifacts,
+together.
+
+It is important to design the new rate limiting framework so that it is
+built on top of this hierarchical structure, and so that engineers, customers,
+SREs and other stakeholders can understand how limits are being applied,
+enforced and overridden within the hierarchy of namespaces, groups and projects.
+
+We want to reduce the cognitive load required to understand how limits are
+being managed within the existing permissions structure. We might need to build
+a simple and easy-to-understand formula for how our application decides which
+limits and thresholds to apply for a given request and a given actor:
+
+> GitLab will read the default limits for every operation and all configured
+> overrides, and will choose the limit with the highest configured precedence.
+> A limit precedence needs to be explicitly configured for every override; a
+> default limit has precedence 100.
+
+One way in which we can simplify limits management in general is to:
+
+1. Have default limits / thresholds defined in YAML files with a default precedence 100.
+1. Allow limits to be overridden through the API, store overrides in the database.
+1. Every limit / threshold override needs to have an integer precedence value provided.
+1. Build an API that will take an actor and expose limits applicable for it.
+1. Build a dashboard showing actors with non-standard limits / overrides.
+1. Build observability around this, showing in Kibana when non-standard limits are being used.
+
+The points above represent an idea to use precedence score (or Z-Index for
+limits), but there may be better solutions, like just defining a direction of
+overrides - a lower limit might always override a limit defined higher in the
+hierarchy. Choosing a proper solution will require thoughtful research.
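+
+To make the precedence idea concrete, here is a minimal sketch of a default
+limit with two overrides. The `precedence`, `overrides`, `scope`, and `path`
+keys, and the example paths, are hypothetical. The list above proposes storing
+overrides in the database, so they appear inline here only to keep the
+illustration in one place.
+
+```yaml
+# Hypothetical sketch, not an agreed schema.
+name: my_limit_name
+precedence: 100             # default precedence, as in the formula above
+limits:
+  hard: 500k / s
+overrides:
+  - scope: namespace
+    path: example-group     # hypothetical namespace
+    precedence: 200
+    limits:
+      hard: 1M / s
+  - scope: project
+    path: example-group/example-project   # hypothetical project
+    precedence: 300
+    limits:
+      hard: 250k / s
+```
+
+Under the highest-precedence rule, and assuming a namespace override applies to
+every project within it, a request in `example-group/example-project` would get
+the 250k / s hard limit (precedence 300), other projects in `example-group`
+would get 1M / s (precedence 200), and everything else would keep the
+500k / s default.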
+
+## Principles
+
+1. Try to avoid building rate limiting framework in a tightly coupled way.
+1. Build application limits API in a way that it can be easily extracted to a separate service.
+1. Build application limits definition in a way that is independent from the Rails application.
+1. Build tooling that produces consistent behavior and results across programming languages.
+1. Build the new framework in a way that we can extend to allow self-managed admins to customize limits.
+1. Maintain consistent features and behavior across SaaS and self-managed codebase.
+1. Be mindful of the cognitive load added by the hierarchical limits, aim to reduce it.
+
+## Status
+
+Request For Comments.
+
+## Timeline
+
+- 2022-04-27: [Rate Limit Architecture Working Group](https://about.gitlab.com/company/team/structure/working-groups/rate-limit-architecture/) started.
+- 2022-06-07: Working Group members [started submitting technical proposals](https://gitlab.com/gitlab-org/gitlab/-/issues/364524) for the next rate limiting architecture.
+- 2022-06-15: We started [scoring proposals](https://docs.google.com/spreadsheets/d/1DFHU1kSdTnpydwM5P2RK8NhVBNWgEHvzT72eOhB8F9E) submitted by Working Group members.
+- 2022-07-06: A fourth, [consolidated proposal](https://gitlab.com/gitlab-org/gitlab/-/issues/364524#note_1017640650), has been submitted.
+- 2022-07-12: Started working on the design document following [Architecture Evolution Workflow](https://about.gitlab.com/handbook/engineering/architecture/workflow/).
+- 2022-09-08: The initial version of the blueprint has been merged.
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role                         | Who                     |
+|------------------------------|-------------------------|
+| Author                       | Grzegorz Bizon          |
+| Author                       | Fabio Pitino            |
+| Author                       | Marshall Cottrell       |
+| Author                       | Hayley Swimelar         |
+| Engineering Leader           | Sam Goldstein           |
+| Product Manager              |                         |
+| Architecture Evolution Coach |                         |
+| Recommender                  |                         |
+| Recommender                  |                         |
+| Recommender                  |                         |
+| Recommender                  |                         |
+
+DRIs:
+
+| Role                         | Who                    |
+|------------------------------|------------------------|
+| Leadership                   |                        |
+| Product                      |                        |
+| Engineering                  |                        |
+
+Domain experts:
+
+| Area                         | Who                    |
+|------------------------------|------------------------|
+|                              |                        |
+
+<!-- vale gitlab.Spelling = YES -->