---
stage: none
group: unassigned
comments: false
description: 'Next Runner Auto-scaling Architecture'
---

# Next Runner Auto-scaling Architecture

## Summary

GitLab Runner is a core component of GitLab CI/CD. It makes it possible to run
CI/CD jobs in a reliable and concurrent environment. It was initially
introduced by Kamil Trzciński in early 2015 to replace a Ruby version of the
same service. GitLab Runner written in Go turned out to be easier for the
wider community to use, and it was more efficient and reliable than the
previous, Ruby-based, version.

In February 2016 Kamil Trzciński [implemented an auto-scaling feature](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/53)
to leverage cloud infrastructure to run many CI/CD jobs in parallel. This
feature has become a foundation supporting CI/CD adoption on GitLab.com over
the years, where we now run around 4 million builds per day at peak.

During the initial implementation a decision was made to use Docker Machine:

> Is easy to use. Is well documented. Is well supported and constantly
> extended. It supports almost any cloud provider or virtualization
> infrastructure. We need minimal amount of changes to support Docker Machine:
> machine enumeration and inspection. We don't need to implement any "cloud
> specific" features.

This design choice was crucial to GitLab Runner's success. Since then, the
auto-scaling feature has been used by many users and customers and has enabled
rapid growth of CI/CD adoption on GitLab.com.

We cannot, however, continue using Docker Machine. Work on that project
[was paused in July 2018](https://github.com/docker/machine/issues/4537) and there
has been no development since then (except for some important security
fixes). In 2018, after Docker Machine entered "maintenance mode", we decided
to create [our own fork](https://gitlab.com/gitlab-org/ci-cd/docker-machine)
so that we could keep using it and ship the fixes and updates needed for our
use case.
[On September 26th, 2021 the project was archived](https://github.com/docker/docker.github.io/commit/2dc8b49dcbe85686cc7230e17aff8e9944cb47a5)
and its documentation was removed from the official page. This means that the
original reason to use Docker Machine is no longer valid either.

To keep supporting our customers and the wider community we need to design a
new mechanism for GitLab Runner auto-scaling. It not only needs to support
auto-scaling, but it also needs to do so in a way that enables us to build on
top of it to improve efficiency, reliability, and availability.

We call this new mechanism the "next GitLab Runner Scaling architecture".

_Disclaimer: The following contains information related to upcoming products,
features, and functionality._

_It is important to note that the information presented is for informational
purposes only. Please do not rely on this information for purchasing or
planning purposes._

_As with all projects, the items mentioned in this document and linked pages are
subject to change or delay. The development, release and timing of any
products, features, or functionality remain at the sole discretion of GitLab
Inc._

## Proposal

Currently, GitLab Runner auto-scaling can be configured in a few ways. Some
customers are successfully using an auto-scaled environment in Kubernetes. We
know that a custom and unofficial GitLab Runner version has been built to make
auto-scaling on Kubernetes more reliable. We recognize the importance of having
a really good Kubernetes solution for running multiple jobs in parallel, but
refinements in this area are out of scope for this architectural initiative.

We want to focus on resolving problems with Docker Machine and replacing this
mechanism with a reliable and flexible alternative. We might be unable to
build a drop-in replacement for Docker Machine, as there are presumably many
reasons why it has been deprecated. It is very difficult to maintain
compatibility with so many cloud providers, and it seems that Docker Machine
has been deprecated in favor of Docker Desktop, which is not a viable
replacement for us.
[This issue](https://github.com/docker/roadmap/issues/245) contains a discussion
about how people are using Docker Machine right now, and it seems that GitLab
CI is one of the most frequent reasons for people to keep using Docker Machine.

There is also an opportunity in being able to optionally run multiple jobs in
a single, larger virtual machine. We can't do that today, but we know that
this could significantly improve efficiency. We might want to build a new
architecture that makes this easier and allows us to test how efficient it is
with PoCs. Running multiple jobs on a single machine can also make it possible
to reuse what we call a "sticky context" - a space for build artifacts / user
data that can be shared between job runs.

### 💡 Design a simple abstraction that users will be able to build on top of

Because there is no viable replacement and we might be unable to support all
cloud providers that Docker Machine used to support, the key design
requirement is to make it really simple and easy for the wider community to
write a custom GitLab Runner auto-scaling plugin, whatever cloud provider they
might be using. We want to design a simple abstraction that users will be able
to build on top of, as will we, to support existing workflows on GitLab.com.

The designed mechanism should abstract what the Docker Machine executor has
been doing: providing a way to create an external Docker environment that is
ready to execute jobs, by provisioning this environment and returning the
credentials required to perform these operations.
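
As an illustration only, such an abstraction could be expressed as a small Go
interface. The names below (`Provider`, `Environment`) are hypothetical and do
not represent a committed API:

```go
package provider

import "context"

// Environment describes an ephemeral Docker environment created by a
// provider, together with the credentials Runner needs to reach it.
// All names in this sketch are hypothetical.
type Environment struct {
	ID          string // provider-specific identifier of the VM or instance
	DockerHost  string // for example "tcp://10.0.0.5:2376"
	TLSCertPath string // path to client certificates for the Docker daemon
}

// Provider is a hypothetical abstraction of what the Docker Machine executor
// does today: create an environment that can run jobs, hand back the
// credentials required to use it, and clean it up afterwards.
type Provider interface {
	// Provision creates a new environment and returns the details needed
	// to connect to its Docker daemon.
	Provision(ctx context.Context, name string) (*Environment, error)

	// Remove deletes a previously provisioned environment.
	Remove(ctx context.Context, id string) error
}
```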

The new plugin system should be available for all major platforms: Linux,
Windows, macOS.

### 💡 Migrate existing Docker Machine solution to a plugin

Once we design and implement the new abstraction, we should be able to migrate
existing Docker Machine mechanisms to a plugin. This will make it possible for
users and customers to immediately start using the new architecture, but still
keep their existing workflows and configuration for Docker Machine. This will
give everyone time to migrate to the new architecture before we drop support
for the legacy auto-scaling entirely.

### 💡 Build plugins for AWS, Google Cloud Platform and Azure

Although we might be unable to add support for all the cloud providers that
Docker Machine used to support, it seems to be important to provide
GitLab-maintained plugins for the major cloud providers like AWS, Google Cloud
Platform and Azure.

We should build them, presumably in separate repositories, in a way that makes
them easy to contribute to, fork, and modify for the specific needs that
members of the wider community might have. It should also be easy to install
a new plugin without having to rebuild GitLab Runner.

### 💡 Write solid documentation about how to build your own plugin

It is important to show users how to build an auto-scaling plugin, so that they
can implement support for their own cloud infrastructure.

Building new plugins should be simple, and with the support of great
documentation it should not require advanced skills, like understanding how
gRPC works. We want to design the plugin system in a way that the entry barrier
for contributing new plugins is very low.

### 💡 Build a PoC to run multiple builds on a single machine

We want to better understand what kind of efficiency gains running multiple
jobs on a single machine can bring. It is difficult to predict, so ideally we
should build a PoC that will help us to better understand what we can expect
from this.

To run this experiment we will most likely need to build an experimental
plugin that not only allows us to schedule running multiple builds on a single
machine, but also has a comprehensive set of metrics built into it, to make it
easier to understand how it performs.
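
For example, such a plugin could expose Prometheus-style metrics describing
how builds are packed onto machines. The sketch below is purely illustrative;
the metric name and labels are made up and not part of any existing plugin:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// jobsPerMachine is a hypothetical metric an experimental plugin could expose
// to show how many builds currently share a single machine.
var jobsPerMachine = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "experimental_plugin_jobs_running",
		Help: "Number of jobs currently running on a given machine.",
	},
	[]string{"machine"},
)

func main() {
	prometheus.MustRegister(jobsPerMachine)

	// The plugin would call Inc()/Dec() whenever a job starts or finishes
	// on a machine; here we only simulate a single running job.
	jobsPerMachine.WithLabelValues("machine-1").Inc()

	// Expose the metrics so they can be scraped and compared between
	// one-job-per-machine and multiple-jobs-per-machine setups.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9402", nil))
}
```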

## Details

What the abstraction for the custom provider will look like exactly is
something that we will need to prototype, PoC, and decide on in a
data-informed way. There are a few proposals that we should describe in
detail, develop requirements for, PoC, and score. We will choose the solution
that best supports our goals.

In order to describe the proposals we first need to better explain what part
of GitLab Runner needs to be abstracted away. To make these concepts easier to
grasp, let's take a look at the current auto-scaling architecture and sequence
diagram.

![GitLab Runner Autoscaling Overview](gitlab-autoscaling-overview.png)

In the diagram above we see that currently a GitLab Runner Manager runs on a
machine that has access to a cloud provider's API. It uses Docker Machine to
provision new virtual machines with Docker Engine installed, and it configures
the Docker daemon there to allow external authenticated requests. It stores
the credentials to such ephemeral Docker environments on disk. Once a machine
has been provisioned and made available for the GitLab Runner Manager to run
builds, it uses one of the existing executors to run a user-provided script.
In auto-scaling, this is typically done using the Docker executor.
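
A heavily simplified sketch of what the Manager effectively does today is
shown below. It shells out to the Docker Machine CLI to create a machine and
to read back the connection details for its Docker daemon; the driver and
machine name are arbitrary examples, and the real Runner code is considerably
more involved:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	name := "runner-autoscale-example-1"

	// Provision a new virtual machine with Docker Engine installed.
	create := exec.Command("docker-machine", "create", "--driver", "google", name)
	if out, err := create.CombinedOutput(); err != nil {
		log.Fatalf("create failed: %v\n%s", err, out)
	}

	// Ask Docker Machine for the flags (daemon address, TLS certificates)
	// required to talk to the Docker daemon on that machine. Runner keeps
	// this connection information, and the certificates, on disk.
	config, err := exec.Command("docker-machine", "config", name).Output()
	if err != nil {
		log.Fatalf("config failed: %v", err)
	}
	fmt.Printf("docker connection flags:\n%s", config)

	// Remove the machine when it is no longer needed.
	remove := exec.Command("docker-machine", "rm", "-y", name)
	if out, err := remove.CombinedOutput(); err != nil {
		log.Fatalf("rm failed: %v\n%s", err, out)
	}
}
```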

### Custom provider

In order to reduce the scope of work, we only want to introduce the new
abstraction layer in one place.

A few years ago we introduced the [Custom Executor](https://docs.gitlab.com/runner/executors/custom.html)
feature in GitLab Runner. It allows users to design custom build execution
methods. The custom executor driver can be implemented in any way - from a
simple shell script to a dedicated binary - and is then used by Runner
through os/exec system calls.
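
For illustration, the sketch below shows the general idea of driving such a
driver through `os/exec`: one process invocation per stage. It is a
simplification; the driver name, the stage names, and the way arguments are
passed are examples, not the exact Custom Executor contract:

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

// runStage invokes the driver once for a given stage, streaming its output.
func runStage(driver string, args ...string) {
	cmd := exec.Command(driver, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("stage %v failed: %v", args, err)
	}
}

func main() {
	// The driver could be a shell script or a compiled binary.
	driver := "./my-custom-driver"

	runStage(driver, "prepare")                // set up the build environment
	runStage(driver, "run", "/path/to/script") // execute the job script
	runStage(driver, "cleanup")                // tear the environment down
}
```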

Thanks to the custom executor abstraction there is no longer a need to implement
new executors internally in Runner. Users who have specific needs can implement
their own drivers and don't need to wait for us to make their work part of the
"official" GitLab Runner. As each driver is a separate project, it also makes
it easier to create communities around them, where interested people can
collaborate on improvements and bug fixes.

We want to design the new Custom Provider to replicate the success of the
Custom Executor. It will make it easier for users to build their own ways to
provide a context and an environment in which a build will be executed by one
of the Custom Executors.

There are multiple ways to implement a custom provider abstraction. We could
use raw Go plugins, HashiCorp's go-plugin library, an HTTP interface, or a
gRPC-based facade service. There are many options, and we want to choose the
most suitable one. In order to do that, we will describe the solutions in a
separate document, define requirements, and score the solutions accordingly.
This will allow us to choose a solution that will work best for us and the
wider community.
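
To make the trade-offs concrete, the sketch below shows how the simplest of
these options - raw Go plugins from the standard library - could be consumed
by Runner. The `Provider` interface, symbol name, and file name are
hypothetical. Note that the standard library `plugin` package is not available
on Windows, which is exactly the kind of constraint the scoring exercise needs
to capture:

```go
package main

import (
	"context"
	"log"
	"plugin"
)

// Provider is a hypothetical, simplified provider interface.
type Provider interface {
	Provision(ctx context.Context, name string) (dockerHost string, err error)
	Remove(ctx context.Context, name string) error
}

func main() {
	// Load a provider built with `go build -buildmode=plugin`. The plugin
	// package only supports Linux, FreeBSD, and macOS.
	p, err := plugin.Open("./example-provider.so")
	if err != nil {
		log.Fatal(err)
	}

	// Look up an exported variable whose type implements Provider.
	sym, err := p.Lookup("Provider")
	if err != nil {
		log.Fatal(err)
	}

	prov, ok := sym.(Provider)
	if !ok {
		log.Fatal("plugin does not implement the Provider interface")
	}

	host, err := prov.Provision(context.Background(), "runner-vm-1")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("provisioned Docker host: %s", host)
}
```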

### Design principles

Our goal is to design a GitLab Runner plugin system interface that is flexible
and simple for the wider community to consume. As we cannot build plugins for
all cloud platforms, we want to ensure a low entry barrier for anyone who needs
to develop a plugin. We want to allow everyone to contribute.

To achieve this goal, we will follow a few critical design principles. These
principles will guide our development process for the new plugin system
abstraction.

#### General high-level principles

1. Design the new auto-scaling architecture aiming for having more choices and
   flexibility in the future, instead of imposing new constraints.
1. Design the new auto-scaling architecture to experiment with running multiple
   jobs in parallel, on a single machine.
1. Design the new provisioning architecture to replace Docker Machine in a way
   that the wider community can easily build on top of the new abstractions.

#### Principles for the new plugin system

1. Make the entry barrier for writing a new plugin low.
1. Developing a new plugin should be simple and require only basic knowledge of
   a programming language and a cloud provider's API.
1. Strive for a balance between the plugin system's simplicity and flexibility.
   These are not mutually exclusive.
1. Abstract away as many technical details as possible but do not hide them completely.
1. Build an abstraction that serves our community well but allows us to ship it quickly.
1. Invest in a flexible solution, avoid one-way-door decisions, foster iteration.
1. When in doubt, err on the side of making things simpler for the wider community.

#### The most important technical details

1. Favor gRPC communication between a plugin and GitLab Runner.
1. Make it possible to version the communication interface and support
   multiple versions (see the sketch after this list).
1. Make Go the primary language for writing plugins, but accept other
   languages too.
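
As a small illustration of the versioning principle above, the handshake could
let Runner advertise the interface versions it supports and let the plugin
pick the highest one it also implements. This is only a sketch of the idea in
plain Go, not a definition of the actual protocol, which would likely run over
gRPC:

```go
package main

import "fmt"

// negotiateVersion picks the highest interface version supported by both
// sides. Both version lists here are hypothetical.
func negotiateVersion(runnerSupports, pluginSupports []int) (int, error) {
	supported := map[int]bool{}
	for _, v := range pluginSupports {
		supported[v] = true
	}

	best := -1
	for _, v := range runnerSupports {
		if supported[v] && v > best {
			best = v
		}
	}
	if best == -1 {
		return 0, fmt.Errorf("no common interface version")
	}
	return best, nil
}

func main() {
	v, err := negotiateVersion([]int{1, 2, 3}, []int{2, 3})
	if err != nil {
		panic(err)
	}
	fmt.Println("using provider interface version", v) // prints 3
}
```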

## Status

Status: RFC.

## Who

Proposal:

<!-- vale gitlab.Spelling = NO -->

| Role                         | Who
|------------------------------|------------------------------------------|
| Authors                      | Grzegorz Bizon, Tomasz Maczukin          |
| Architecture Evolution Coach | Kamil Trzciński                          |
| Engineering Leader           | Elliot Rushton, Cheryl Li                |
| Product Manager              | Darren Eastman, Jackie Porter            |
| Domain Expert / Runner       | Arran Walker                             |

DRIs:

| Role                         | Who
|------------------------------|------------------------|
| Leadership                   | Elliot Rushton         |
| Product                      | Darren Eastman         |
| Engineering                  | Tomasz Maczukin        |

Domain experts:

| Area                         | Who
|------------------------------|------------------------|
| Domain Expert / Runner       | Arran Walker           |

<!-- vale gitlab.Spelling = YES -->