rfc: Resource limits through cgroups

To circulate knowledge and organize my own mind, I've written down a fairly high-level document on why cgroups might be a good idea for Gitaly. Capabilities to limit the blast radius of resource hoarding, be it intentional or unintentional, are rather limited in Gitaly, and we might need alternatives.
author: Zeger-Jan van de Weg <git@zjvandeweg.nl> 2020-09-25 15:25:43 +0300
committer: Zeger-Jan van de Weg <git@zjvandeweg.nl> 2020-09-25 15:37:02 +0300
commit: e0ae490c769009b17d652f4383c6e405fad8daba (patch)
tree: 6f088bf94209e3a9dc267079a2358dbebfc5cc32
parent: a4ef89fca31c8e55b41ed1417387c78b1d03e6ee (diff)
2 files changed, 228 insertions, 0 deletions
diff --git a/doc/README.md b/doc/README.md
index 694c13de9..e56a4a0bb 100644
--- a/doc/README.md
+++ b/doc/README.md
@@ -40,4 +40,5 @@ For configuration please read [praefects configuration documentation](doc/config
 #### RFCs
 
 - [Praefect Queue storage](rfcs/praefect-queue-storage.md)
+- [Resource limits through cgroups](rfcs/resource_limit_through_cgroups.md)
 - [Snapshot storage](rfcs/snapshot-storage.md)
diff --git a/doc/rfcs/resource_limit_through_cgroups.md b/doc/rfcs/resource_limit_through_cgroups.md
new file mode 100644
index 000000000..999723d1f
--- /dev/null
+++ b/doc/rfcs/resource_limit_through_cgroups.md
@@ -0,0 +1,227 @@
+# RFC: Resource limits through control groups
+
+## Problem statement
+
+Gitaly is designed and operated with a large number of repositories co-located
+on the same system. Gitaly has no control over the number of repositories it
+hosts, and little control over the number of requests to be handled concurrently.
+These properties lead to competition for resources, and Gitaly currently relies
+on the operating system to manage these competing interests.
+
+While Gitaly, and Git, manage to perform well within the constraints of their
+host system, there have been a number of incidents where a small subset of
+requests obtained the vast majority of resources, requiring the operating system
+to free up resources. Meanwhile, end users experienced a degraded service.
+
+When a Gitaly host runs out of memory, the operating system will first use swap,
+then free file-mapped resources, but ultimately it will have to use the OOM
+Killer, which kills processes more or less at random until enough memory is
+available again. Gitaly does not control this process, and cannot hint to the
+OOM Killer which processes might be worth killing.
+
+These situations are most common when `git-upload-pack(1)` invokes
+`git-pack-objects(1)` due to a client executing `git-fetch-pack(1)`, that is
+a fetch or clone. While GitLab and Gitaly try to be prepared for these requests,
+there are both pathological cases and inherent costs to such requests, which
+strain resources and degrade performance for all concurrent requests handled by
+Gitaly. This is also known as the [noisy neighbour][wiki-neighbour] problem.
+
+For this document the scope is limited to memory+swap usage, as well as CPU
+utilization. Which entity the limits are applied to is discussed later.
+
+[wiki-neighbour]: https://en.wikipedia.org/wiki/Cloud_computing_issues#Performance_interference_and_noisy_neighbors
+
+## Potential solution: control groups
+
+Linux kernels since 2.6.24 expose an API commonly referred to as
+[cgroups][cgroup-man7]. It allows resource distribution by creating a group
+and adding resource constraints to this group, for example a limited share of
+CPU capacity, and a maximum of available memory and/or swap. Processes are then
+added to this group, and from that point on the limits apply to them. When a
+process spawns child processes, these limits are inherited.
+
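As a rough illustration of the mechanism described above, the V1 interface is driven through files: a group is a directory, limits are written to control files inside it, and processes join by having their PID written to `cgroup.procs`. The sketch below assumes that layout. The helper names, the `gitaly` and `example` directory names, the mount path in the comment, and the PID are hypothetical, and a scratch directory stands in for the real cgroup mount so the code runs without privileges.

```python
import tempfile
from pathlib import Path
from typing import Mapping


def create_subgroup(controller_root: Path, name: str,
                    settings: Mapping[str, int]) -> Path:
    """Create a subgroup directory and write one value per control file."""
    group = controller_root / name
    group.mkdir(parents=True, exist_ok=True)
    for control_file, value in settings.items():
        (group / control_file).write_text(str(value))
    return group


def add_process(group: Path, pid: int) -> None:
    """Move a process into the group by writing its PID to cgroup.procs."""
    (group / "cgroup.procs").write_text(str(pid))


if __name__ == "__main__":
    # A scratch directory stands in for a cgroup mount such as
    # /sys/fs/cgroup/memory/gitaly (hypothetical path).
    with tempfile.TemporaryDirectory() as scratch:
        memory_root = Path(scratch) / "memory" / "gitaly"
        group = create_subgroup(
            memory_root, "example",
            {"memory.limit_in_bytes": 512 * 1024 * 1024},
        )
        add_process(group, 4242)  # made-up PID
        print(sorted(p.name for p in group.iterdir()))
```

On a real V1 system the memory and CPU controllers usually live in separate hierarchies, which is why the sketch takes a controller root rather than writing `memory.limit_in_bytes` and `cpu.shares` into one directory.
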
+Currently GitLab.com executes Gitaly, and adds it to its own cgroup, ensuring
+that at no point in time all resources are consumed and auxiliary processes
+continue to run. This didn't require any application changes, and is the choice
+of an operator. This RFC proposes to expand on this with sub-control groups.
+Subgroups again inherit the limits of their parent group, and are interesting
+because they can be used to divide those limits further among newly created
+groups. Application changes are needed to create and maintain subgroups, so that
+concurrent requests are assigned to the correct subgroup. Essentially, the
+application has to create, maintain, and delete buckets, analyze incoming
+requests, and assign them to these buckets. To limit the scope of the first
+iterations of changes and of this document, maintaining subgroups by varying
+their resource allocation at runtime is considered a problem for another day.
+
+Processes are the most prominent entity that can be assigned to subgroups. Given
+`git(1)` is the main method to query a repository, and `git-upload-pack(1)` the
+main culprit of high memory and CPU consumption, these should be constrained to
+prevent virtually unbounded hoarding of resources.
+
+[cgroup-man7]: https://www.man7.org/linux/man-pages/man7/cgroups.7.html
+
+### Assigning subgroups
+
+Creating and deleting subgroups is cheap, and while there are obviously limits
+to the number of subgroups one could create, these are so high that for the
+sake of discussion they are considered unlimited.
+
+The goal of these subgroups is to create a guaranteed upper bound on the
+resources each subgroup can obtain, which in turn should create a safe
+expectation of resources for its peers, and fairness.
+
+#### Per process
+
+One naive solution would be to put each process Gitaly spawns into its own
+subgroup. A group gets created when executing a subprocess, and each subgroup
+has the same limits. Let's consider this grouping for both memory and CPU.
+
+In the case of a limit on the maximum number of allocated bytes this strategy is
+viable. Even when the sum of the maximum allocatable bytes of all subgroups
+exceeds the parent group's maximum, this works in cases where the parent's limit
+vastly exceeds each subgroup's limit. When the parent group limit is exceeded,
+out of memory events will be sent to subgroups, much like the OOM Killer works.
+The key difference is that each process had the same upper bound of memory, so
+in this situation it's not one process that hoards the vast majority of
+resources. This decreases the impact of the problem this RFC aims to resolve.
+
+Considering the CPU, per process subgroups will likely have an adverse effect.
+CPU limitations are based on the notion that one doesn't have a maximum of
+operations or time, but that the CPU has shares, that is, a relative amount of
+available CPU. CPU has to be divided based on shares because, as opposed to
+memory, the resource has a different capacity throughout its operating time.
+Consider the case where after peak load the CPU is automatically down-tuned to
+reduce the energy consumed. And while it's possible to create new shares of CPU,
+new shares would dilute the current pool and as such not resolve the noisy
+neighbour problem. While it could be argued that dilution will always happen
+during peak load, the impact should be contained, which this solution
+doesn't provide.
+
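The dilution argument can be made concrete with a little arithmetic. This is a simplified model that assumes every group is busy at the same time and that each per process subgroup gets the same number of shares; 1024 is the V1 default for `cpu.shares`, and the function and group names are made up for illustration.

```python
from fractions import Fraction

DEFAULT_CPU_SHARES = 1024  # cgroup v1 default value of cpu.shares


def cpu_fraction(shares_by_group, groups):
    """Fraction of CPU time `groups` receive when every group is busy."""
    total = sum(shares_by_group.values())
    return Fraction(sum(shares_by_group[g] for g in groups), total)


def per_process_groups(noisy_processes):
    """One quiet process plus `noisy_processes` others, each in its own group."""
    shares = {"quiet": DEFAULT_CPU_SHARES}
    for i in range(noisy_processes):
        shares[f"noisy-{i}"] = DEFAULT_CPU_SHARES
    return shares


if __name__ == "__main__":
    for n in (1, 3, 7):
        print(n, cpu_fraction(per_process_groups(n), ["quiet"]))
```

With 1, 3, and 7 noisy processes the quiet process is left with 1/2, 1/4, and 1/8 of the CPU under contention: whoever spawns the most processes still obtains the most CPU, which is the noisy neighbour problem again.
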
+#### Per repository
+
+Nearly all operations Gitaly performs are scoped to repositories. And while Gitaly
+itself doesn't know what repositories are currently being stored without crawling
+the `git-data` directory, it's told in each RPC what the path is of each
+repository. On top of that, RPCs are annotated if these are
+[repository scoped RPCs](../../proto/lint.proto).
+
+Per repository subgroups still run into the issues described for per process
+subgrouping. But given N processes are serving R repositories, it holds that
+R <= N, so the effect is dampened. To extend this, for larger installations
+in particular, the number of users that can and will interact with a repository
+varies per repository. That means that repositories can act as a heuristic for
+users.
+
+The caveat is that some RPCs operate on multiple repositories, in which case
+one of these repositories has to be chosen to determine the subgroup.
+
+Note also that subgroups other than per process subgroups incur a slightly
+higher cost in complexity, as well as runtime costs, because one can only
+delete a subgroup once no processes, including zombies, are members of it.
+
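One way the per repository assignment could look is sketched below. It is only an illustration: hashing the repository path into a subgroup name and using the first repository for RPCs that touch several are assumptions, not choices made in this RFC, and both function names are hypothetical.

```python
import hashlib


def subgroup_name(repo_path: str) -> str:
    """Derive a stable, filesystem-safe subgroup name from a repository path."""
    digest = hashlib.sha256(repo_path.encode("utf-8")).hexdigest()
    return f"repo-{digest[:16]}"


def subgroup_for_request(repo_paths):
    """Pick the subgroup for an RPC; multi-repository RPCs use the first path."""
    if not repo_paths:
        raise ValueError("request is not scoped to a repository")
    return subgroup_name(repo_paths[0])
```

Because the name depends only on the path, every process serving the same repository lands in the same subgroup without Gitaly having to keep a lookup table.
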
+#### Per user
+
+Like repositories, users are unknown to Gitaly. But unlike repositories, there's
+currently no fast lookup possible to determine who the end user is for each RPC
+request. A client could provide this information, but usually it won't as
+there's currently no need for it, as Gitaly doesn't handle either authentication
+or authorization. Requiring this data would involve a lot of changes to Gitaly,
+and have cascading changes for each of Gitaly's clients. Furthermore, there are
+operations that aren't triggered by users, but by GitLab. Considering these two
+arguments, user based subgroups have limited viability without a much broader
+redesign of GitLab.
+
+### Subgroup resource allocation
+
+TODO: Currently thinking about a first iteration where each subgroup gets equal
+CPU shares and an equal hard memory maximum. This limits initial complexity and
+allows for iterations and experiments later.
+
+## Benefits of control groups
+
+Cgroups have existed since Linux kernel version 2.6.24, meaning they are well
+over 10 years old. While in those 10 years there have been additions and
+iterations, cgroups are considered mature and boring technology. There's
+widespread industry adoption and support in all major Linux distributions,
+usually through `systemd`. Cgroups perform well in a large number of companies,
+and allow for example cloud providers to prove they meet SLAs.
+
+Additionally, using kernel managed resource management removes the need for
+complexity in the Gitaly code base, or at least minimizes it. It could then be
+argued that the kernel will always be a natural fit for resource management,
+while it's hard to argue that Gitaly should reimplement logic already provided.
+
+While this doesn't resolve a class of bugs, it might be fair to state that a
+class of user-facing issues is resolved with one feature.
+
+## Risks and downsides
+
+### Linux only
+
+`CGROUPS(7)` are available on Linux kernels only. After 2020-11-22 all supported
+Linux distributions will have V1 support for cgroups. GitLab officially only
+supports Linux distributions, though the application is known to also run on
+macOS (many GitLab team members run macOS), as well as FreeBSD.
+
+These platforms currently enjoy a near on-par experience, while supporting
+`cgroups` will create first and second class experiences. This split also runs
+through the Gitaly team itself: it currently means that half of the Gitaly team
+members cannot contribute through the GDK, but need additional tooling.
+
+### Tight coupling with distribution methods
+
+Creating a `cgroup` requires elevated privileges. To then manage it, the
+cgroup needs to be owned by the same user that Gitaly runs as, usually `git`.
+This can best be achieved at the time when packages are installed, thus
+requiring more coupling between Gitaly and for example Omnibus-GitLab.
+
+### V2 roll out
+
+The cgroup API previously discussed is the V1 API, but a newer API, V2, has been
+implemented too. Currently all major platforms only support V1 without administrator
+changes. The two versions have different interfaces, and are incompatible, hence
+the major version bump. This might create a situation where some platforms
+default to V2 in the future and Gitaly needs to add support for it, while
+maintaining V1 support.
+
+### Gitaly bugs might impact users more
+
+When new behaviours are rolled out with increased memory consumption, there's
+currently no effect on users, as the nodes have plenty of resources for
+day-to-day operations. In effect, over provisioning buys time to remove the
+overhead it absorbs. It's equally fair to flip this on its head and reason that
+bugs are sometimes not found because they are absorbed by over provisioning.
+
+However, with cgroups such a change could cause a process to run into its memory
+limit and trigger an OOM event, meaning the user action won't be completed.
+
+## Alternatives
+
+### Better GitLab wide rate limiting
+
+Gitaly is by no means the front door for clients. Each request is first handled
+by either gitlab-workhorse, or gitlab-shell. These provide opportunities to rate
+limit with higher granularity than Gitaly has. While an option, it solves only
+part of what cgroups could solve, and would be orthogonal to cgroups.
+
+### Improve Git/LibGit2
+
+Using and tuning the usage of Git is what Gitaly developers are trained to do,
+and are comfortable with. Making Git perform better for Gitaly administrators as
+well as clients is being pursued regardless of the adoption of cgroups. However,
+this will never provide fine-grained control over resource usage, and resolving
+issues will mostly be a reactive action, while there's an interest in prevention
+of the noisy neighbours to meet a service level as expected by GitLab users.
+
+### Out of scope
+
+The Ruby sidecar for Gitaly, Gitaly-Ruby, has soft limits applied to the workers,
+usually around 300MB. These workers are managed by a
+[supervisor](../internal/supervisor), which can remove workers from the load balancer
+if these are consuming too much memory. Each such process is then killed after
+60 seconds to reclaim the memory, while allowing the currently handled requests to
+finish.
+
+Each of these workers could be added to their own subgroup to limit their memory
+too. Though there's considerable investments being made