Add latest changes from gitlab-org/gitlab@15-5-stable-ee (v15.5.0-rc42)

author: GitLab Bot <gitlab-bot@gitlab.com> 2022-10-20 12:40:42 +0300
committer: GitLab Bot <gitlab-bot@gitlab.com> 2022-10-20 12:40:42 +0300
commit: ee664acb356f8123f4f6b00b73c1e1cf0866c7fb (patch)
tree: f8479f94a28f66654c6a4f6fb99bad6b4e86a40e /doc/architecture/blueprints/runner_scaling
parent: 62f7d5c5b69180e82ae8196b7b429eeffc8e7b4f (diff)
1 files changed, 225 insertions, 55 deletions
diff --git a/doc/architecture/blueprints/runner_scaling/index.md b/doc/architecture/blueprints/runner_scaling/index.md
index 8f7062a1148..415884449ed 100644
--- a/doc/architecture/blueprints/runner_scaling/index.md
+++ b/doc/architecture/blueprints/runner_scaling/index.md
@@ -43,10 +43,10 @@ to be able to keep using this and ship fixes and updates needed for our use case
 and the documentation for it has been removed from the official page. This
 means that the original reason to use Docker Machine is no longer valid too.
 
-To keep supporting our customers and the wider community we need to design a
-new mechanism for GitLab Runner auto-scaling. It not only needs to support
-auto-scaling, but it also needs to do that in the way to enable us to build on
-top of it to improve efficiency, reliability and availability.
+To keep supporting our customers and the wider community, and to improve maintenance of
+our SaaS runners, we need to design a new mechanism for GitLab Runner auto-scaling. It
+must not only support auto-scaling, but also do so in a way that lets us build on top
+of it to improve efficiency, reliability and availability.
 
 We call this new mechanism the "next GitLab Runner Scaling architecture".
 
@@ -62,6 +62,66 @@ subject to change or delay. The development, release and timing of any
 products, features, or functionality remain at the sole discretion of GitLab
 Inc._
 
+## Continuing building on Docker Machine
+
+At this moment one of our core products - GitLab Runner - and one of its most
+important features - the ability to auto-scale job execution environments -
+depend on an external product that is abandoned.
+
+The Docker Machine project itself is also hard to maintain. Its design is
+starting to show its age, which makes it hard to add new features and fixes.
+Its huge codebase, combined with a lack of internal knowledge about it, makes
+it hard for our maintainers to support it and to properly handle incoming
+feature requests and community contributions.
+
+Docker Machine and its 20+ integrated drivers for cloud and virtualization
+providers also create another set of problems, such as:
+
+- Each cloud/virtualization environment brings features that come and go
+  and we would need to maintain support for them (add new features, fix
+  bugs).
+
+- We basically need to become experts in each virtualization/cloud
+  provider to properly support integration with their API.
+
+- Every single provider that Docker Machine integrates with has its
+  bugs, security releases, vulnerabilities - to maintain the project properly
+  we would need to be on top of all of that and handle updates whenever
+  they are needed.
+
+Another problem is the fact that Docker Machine, from its beginnings, was
+focused on managing Linux based instances only. Although Docker eventually
+gained official and native integration on Windows, Docker Machine never
+followed this step. Nor is it designed to make such integration easy.
+
+There is also no support for MacOS. This one is obvious - Docker Machine is a
+tool to maintain hosts for Docker Engine and there is no native Docker Engine
+for MacOS. By native we mean MacOS containers executed within the MacOS
+operating system. The Docker for MacOS product is not native support - it's just
+tooling and a virtualized Linux instance installed with it that make it
+easier to develop **Linux containers** on MacOS development instances.
+
+This means that only one of our three officially supported platforms -
+Linux, Windows and MacOS - has fully-featured support for CI/CD
+auto-scaling. For Windows it is possible to use Kubernetes (which in
+some cases has limitations) and maybe with a lot of effort we could bring
+support for Windows into Docker Machine. But for MacOS, there is no
+auto-scaling solution provided natively by GitLab Runner.
+
+This is a huge limitation for our users, and removing it is a frequently requested feature.
+It's also a limitation for our SaaS runners offering. We've managed to
+create some sort of auto-scaling for our SaaS Windows and SaaS MacOS runners
+by hacking around the Custom executor. But experience from the past three years
+shows that it's not the best way of doing this. Even after this time, Windows
+and MacOS runner autoscaling lacks much of the performance and feature support
+that we have with our SaaS Linux runners.
+
+To keep supporting our customers and the wider community, and to improve
+maintenance of our SaaS runners, we need to design a new mechanism for GitLab
+Runner auto-scaling. It must not only support auto-scaling, but also do so in
+a way that lets us build on top of it to improve efficiency, reliability and
+availability.
+
 ## Proposal
 
 Currently, GitLab Runner auto-scaling can be configured in a few ways. Some
@@ -94,7 +154,7 @@ data that can be shared between job runs.
 Because there is no viable replacement and we might be unable to support all
 cloud providers that Docker Machine used to support, the key design requirement
 is to make it really simple and easy for the wider community to write a custom
-GitLab auto-scaling plugin, whatever cloud provider they might be using. We
+GitLab plugin for whatever cloud provider they might be using. We
 want to design a simple abstraction that users will be able to build on top, as
 will we to support existing workflows on GitLab.com.
 
@@ -129,12 +189,11 @@ the need of rebuilding GitLab Runner whenever it happens.
 
 ### 💡 Write a solid documentation about how to build your own plugin
 
-It is important to show users how to build an auto-scaling plugin, so that they
+It is important to show users how to build a plugin, so that they
 can implement support for their own cloud infrastructure.
 
-Building new plugins should be simple, and with the support of great
-documentation it should not require advanced skills, like understanding how
-gRPC works. We want to design the plugin system in a way that the entry barrier
+Building new plugins should be simple and supported with great
+documentation. We want to design the plugin system in a way that the entry barrier
 for contributing new plugins is very low.
 
 ### 💡 Build a PoC to run multiple builds on a single machine
@@ -171,7 +230,128 @@ configures the Docker daemon there to allow external authenticated requests. It
 stores credentials to such ephemeral Docker environments on disk. Once a
 machine has been provisioned and made available for GitLab Runner Manager to
 run builds, it is using one of the existing executors to run a user-provided
-script. In auto-scaling, this is typically done using Docker executor.
+script. In auto-scaling, this is typically done using the Docker executor.
+
+### Separation of concerns
+
+There are several concerns represented in the current architecture. They are
+coupled in the current implementation so we will break them out here to consider
+them each separately.
+
+- **Virtual Machine (VM) shape**. The underlying provider of a VM requires configuration to
+  know what kind of machine to create, for example cores, memory, failure domain,
+  and so on. This information is very provider-specific.
+- **VM lifecycle management**. Multiple machines will be created and a
+  system must keep track of which machines belong to this executor. Typically
+  a cloud provider will have a way to manage a set of homogeneous machines,
+  for example a GCE Instance Group. The basic operations are increase, decrease and
+  usually delete a specific machine.
+- **VM autoscaling**. In addition to low-level lifecycle management,
+  job-aware capacity decisions must be made about the set of machines to provide
+  capacity when it is needed without maintaining excess capacity, for cost reasons.
+- **Job to VM mapping (routing)**. Currently the system assigns only one job to a
+  given machine. A machine may be reused based on the specific executor
+  configuration.
+- **In-VM job execution**. Within each VM a job must be driven through
+  various pre-defined stages and results and trace information returned
+  to the Runner system. These details are highly dependent on the VM
+  architecture and operating system as well as Executor type.
+
+The current architecture has several points of coupling between concerns.
+Coupling reduces opportunities for abstraction (e.g. community supported
+plugins) and increases complexity, making the code harder to understand,
+test, maintain and extend.
+
+A primary design decision will be which concerns to externalize to the plugin
+and which should remain with the runner system. The current implementation
+has several abstractions internally which could be used as cut points for a
+new abstraction.
+
+For example the [`Build`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/build.go#L125)
+type uses the [`GetExecutorProvider`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/executor.go#L171)
+function to get an executor provider based on a dispatching executor string.
+Various executor types register with the system by being imported and calling
+[`RegisterExecutorProvider`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/executor.go#L154)
+during initialization. Here the abstractions are the [`ExecutorProvider`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/executor.go#L80)
+and [`Executor`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/executor.go#L59)
+interfaces.
+
+Within the `docker+autoscaling` executor the [`machineExecutor`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/executors/docker/machine/machine.go#L19)
+type has a [`Machine`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/helpers/docker/machine.go#L7)
+interface which it uses to acquire a VM during the common [`Prepare`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/executors/docker/machine/machine.go#L71)
+phase. This abstraction primarily creates, accesses and deletes VMs.
+
+There is no current abstraction for the VM autoscaling logic. It is tightly
+coupled with the VM lifecycle and job routing logic. Creating idle capacity
+happens as a side-effect of calling [`Acquire`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/executors/docker/machine/provider.go#L449) on the `machineProvider` while binding a job to a VM.
+
+There is also no current abstraction for in-VM job execution. VM-specific
+commands are generated by the Runner Manager using the [`GenerateShellScript`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/build.go#L336)
+function and [injected](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/build.go#L373)
+into the VM as the manager drives the job execution stages.
+
+### Design principles
+
+Our goal is to design a GitLab Runner plugin system interface that is flexible
+and simple for the wider community to consume. As we cannot build plugins for
+all cloud platforms, we want to ensure a low entry barrier for anyone who needs
+to develop a plugin. We want to allow everyone to contribute.
+
+To achieve this goal, we will follow a few critical design principles. These
+principles will guide our development process for the new plugin system
+abstraction.
+
+#### General high-level principles
+
+- Design the new auto-scaling architecture aiming for having more choices and
+  flexibility in the future, instead of imposing new constraints.
+- Design the new auto-scaling architecture to experiment with running multiple
+  jobs in parallel, on a single machine.
+- Design the new provisioning architecture to replace Docker Machine in a way
+  that the wider community can easily build on top of the new abstractions.
+- The new auto-scaling method should become a core component of the GitLab Runner product so that
+  we can simplify maintenance, use the same tooling, test configuration and Go language
+  setup as we do in our other main products.
+- It should support multiple job execution environments - not only Docker containers
+  on the Linux operating system.
+
+    The best design would be to bring auto-scaling as a feature wrapped around
+    our current executors like Docker or Shell.
+
+#### Principles for the new plugin system
+
+- Make the entry barrier for writing a new plugin low.
+- Developing a new plugin should be simple and require only basic knowledge of
+  a programming language and a cloud provider's API.
+- Strive for a balance between the plugin system's simplicity and flexibility.
+  These are not mutually exclusive.
+- Abstract away as many technical details as possible but do not hide them completely.
+- Build an abstraction that serves our community well but allows us to ship it quickly.
+- Invest in a flexible solution, avoid one-way-door decisions, foster iteration.
+- When in doubt, err on the side of making things simpler for the wider community.
+- Limit coupling between concerns to make the system simpler and more extensible.
+- Concerns should live on one side of the plug or the other--not both, which
+  duplicates effort and increases coupling.
+
+#### The most important technical details
+
+- Favor gRPC communication between a plugin and GitLab Runner.
+- Make it possible to version the communication interface and support many versions.
+- Make Go a primary language for writing plugins but accept other languages too.
+- The autoscaling mechanism should be fully owned by GitLab.
+
+    Cloud provider autoscalers don't know which VM to delete when scaling down so
+    they make sub-optimal decisions. Rather than teaching all autoscalers about GitLab
+    jobs, we prefer to have one, GitLab-owned autoscaler (not in the plugin).
+
+    It will also ensure that we can shape the future of the mechanism and make decisions
+    that fit our needs and requirements.
+
+## Plugin boundary proposals
+
+The following are proposals for where to draw the plugin boundary. We will evaluate
+these proposals and others by the design principles and technical constraints
+listed above.
 
 ### Custom provider
 
@@ -204,43 +384,33 @@ document, define requirements and score the solution accordingly. This will
 allow us to choose a solution that will work best for us and the wider
 community.
 
-### Design principles
-
-Our goal is to design a GitLab Runner plugin system interface that is flexible
-and simple for the wider community to consume. As we cannot build plugins for
-all cloud platforms, we want to ensure a low entry barrier for anyone who needs
-to develop a plugin. We want to allow everyone to contribute.
+This proposal places VM lifecycle and autoscaling concerns as well as job to
+VM mapping (routing) into the plugin. The build need only ask for a VM and
+it will get one with all aspects of lifecycle and routing already accounted
+for by the plugin.
 
-To achieve this goal, we will follow a few critical design principles. These
-principles will guide our development process for the new plugin system
-abstraction.
+Rationale: [Description of the Custom Executor Provider proposal](https://gitlab.com/gitlab-org/gitlab-runner/-/issues/28848#note_823321515)
 
-#### General high-level principles
+### Fleeting VM provider
 
-1. Design the new auto-scaling architecture aiming for having more choices and
-   flexibility in the future, instead of imposing new constraints.
-1. Design the new auto-scaling architecture to experiment with running multiple
-   jobs in parallel, on a single machine.
-1. Design the new provisioning architecture to replace Docker Machine in a way
-   that the wider community can easily build on top of the new abstractions.
+We can introduce a simpler version of the `Machine` abstraction in the
+form of a "Fleeting" interface. Fleeting provides a low-level interface to
+a homogeneous VM group which allows increasing and decreasing the set size
+as well as consuming a VM from within the set.
 
-#### Principles for the new plugin system
+Plugins for cloud providers and other VM sources are implemented via the
+HashiCorp go-plugin library. This is in practice gRPC over STDIN/STDOUT
+but other wire protocols can also be used.
 
-1. Make the entry barrier for writing a new plugin low.
-1. Developing a new plugin should be simple and require only basic knowledge of
-   a programming language and a cloud provider's API.
-1. Strive for a balance between the plugin system's simplicity and flexibility.
-   These are not mutually exclusive.
-1. Abstract away as many technical details as possible but do not hide them completely.
-1. Build an abstraction that serves our community well but allows us to ship it quickly.
-1. Invest in a flexible solution, avoid one-way-door decisions, foster iteration.
-1. When in doubts err on the side of making things more simple for the wider community.
+In order to make use of the new interface, the autoscaling logic is pulled
+out of the Docker Executor and placed into a new Taskscaler library.
 
-#### The most important technical details
+This places the concerns of VM lifecycle, VM shape and job routing within
+the plugin. It also places the concern of VM autoscaling into a separate
+component so it can be used by multiple Runner Executors (not just `docker+autoscaling`).
 
-1. Favor gRPC communication between a plugin and GitLab Runner.
-1. Make it possible to version communication interface and support many versions.
-1. Make Go a primary language for writing plugins but accept other languages too.
+Rationale: [Description of the InstanceGroup / Fleeting proposal](https://gitlab.com/gitlab-org/gitlab-runner/-/issues/28848#note_823430883)
+POC: [Merge request](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/3315)
 
 ## Status
 
@@ -252,26 +422,26 @@ Proposal:
 
 <!-- vale gitlab.Spelling = NO -->
 
-| Role                         | Who
-|------------------------------|------------------------------------------|
-| Authors                      | Grzegorz Bizon, Tomasz Maczukin          |
-| Architecture Evolution Coach | Kamil Trzciński                          |
-| Engineering Leader           | Elliot Rushton, Cheryl Li                |
-| Product Manager              | Darren Eastman, Jackie Porter            |
-| Domain Expert / Runner       | Arran Walker                             |
+| Role                         | Who                                             |
+|------------------------------|-------------------------------------------------|
+| Authors                      | Grzegorz Bizon, Tomasz Maczukin, Joseph Burnett |
+| Architecture Evolution Coach | Kamil Trzciński                                 |
+| Engineering Leader           | Elliot Rushton, Cheryl Li                       |
+| Product Manager              | Darren Eastman, Jackie Porter                   |
+| Domain Expert / Runner       | Arran Walker                                    |
 
 DRIs:
 
-| Role                         | Who
-|------------------------------|------------------------|
-| Leadership                   | Elliot Rushton         |
-| Product                      | Darren Eastman         |
-| Engineering                  | Tomasz Maczukin        |
+| Role        | Who             |
+|-------------|-----------------|
+| Leadership  | Elliot Rushton  |
+| Product     | Darren Eastman  |
+| Engineering | Tomasz Maczukin |
 
 Domain experts:
 
-| Area                         | Who
-|------------------------------|------------------------|
-| Domain Expert / Runner       | Arran Walker           |
+| Area                   | Who          |
+|------------------------|--------------|
+| Domain Expert / Runner | Arran Walker |
 
 <!-- vale gitlab.Spelling = YES -->