---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Uploads guide: Why GitLab uses custom upload logic

This page is for developers trying to better understand the history behind GitLab uploads and the
technical challenges associated with uploads.

## Problem description

GitLab and [GitLab Workhorse](https://gitlab.com/gitlab-org/gitlab-workhorse) use special rules for handling file uploads,
because in an ordinary Rails application file uploads can become expensive as files grow in size.
Rails often trades performance for a better developer experience, and its handling of
`multipart/form-data` uploads is one example. In any Rack server, Rails applications included, when such a request arrives at the application server,
several things happen:

1. A [Rack middleware](https://github.com/rack/rack/blob/main/lib/rack/multipart.rb) intercepts the request and parses the request body.
1. The middleware writes each file in the multipart request to a temporary directory on disk.
1. A `params` hash is constructed with entries pointing to the respective files on disk.
1. A Rails controller acts on the file contents.

While this is convenient for developers, it is costly for the Ruby server process to buffer large files to disk.
Because of Ruby's [global interpreter lock](https://en.wikipedia.org/wiki/Global_interpreter_lock),
only one thread in a given Ruby process can run on the CPU at a time. CPU
time spent buffering uploads is therefore not available to other worker threads serving user requests.
Buffering files to disk also means spending more time in I/O routines and mode switches, which are expensive operations.

The following diagram shows how GitLab handled such a request prior to putting optimizations in place.

```mermaid
graph TB
    subgraph "load balancers"
      LB(Proxy)
    end

    subgraph "Shared storage"
       nfs(NFS)
    end

    subgraph "redis cluster"
       r(persisted redis)
    end
    LB-- 1 -->Workhorse

    subgraph "web or API fleet"
      Workhorse-- 2 -->rails
    end
    rails-- "3 (write files)" -->nfs
    rails-- "4 (schedule a job)" -->r

    subgraph sidekiq
      s(sidekiq)
    end
    s-- "5 (fetch a job)" -->r
    s-- "6 (read files)" -->nfs
```

We went through two major iterations of our uploads architecture to address these problems:

1. [Moving disk buffering to Workhorse.](#moving-disk-buffering-to-workhorse)
1. [Uploading to Object Storage from Workhorse.](#moving-to-object-storage-and-direct-uploads)

### Moving disk buffering to Workhorse

To address the performance issues resulting from buffering files in Ruby, we moved this logic to Workhorse,
the reverse proxy that fronts the GitLab Rails application.
Workhorse is written in Go, and is much better at dealing with stream processing and I/O than Rails.

There are two parts to this implementation:

1. In Workhorse, a request handler detects `multipart/form-data` content in an incoming user request.
   If such a request is detected, Workhorse hijacks the request body before forwarding it to Rails.
   Workhorse writes all files to disk, rewrites the multipart form fields to point to the new locations, signs the
   request, then forwards it to Rails.
1. In Rails, a [custom multipart Rack middleware](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/middleware/multipart.rb)
   identifies any signed multipart requests coming from Workhorse and prepares the `params` hash Rails
   would expect, now pointing to the files cached by Workhorse. This makes it a drop-in replacement for `Rack::Multipart`.

The diagram below shows how GitLab handles such a request today:

```mermaid
graph TB
    subgraph "load balancers"
      LB(HAProxy)
    end

    subgraph "Shared storage"
       nfs(NFS)
    end

    subgraph "redis cluster"
       r(persisted redis)
    end
    LB-- 1 -->Workhorse

    subgraph "web or API fleet"
      Workhorse-- "3 (without files)" -->rails
    end
    Workhorse -- "2 (write files)" -->nfs
    rails-- "4 (schedule a job)" -->r

    subgraph sidekiq
      s(sidekiq)
    end
    s-- "5 (fetch a job)" -->r
    s-- "6 (read files)" -->nfs
```

While this "one-size-fits-all" solution greatly improves performance for multipart uploads without compromising
developer ergonomics, it severely limits GitLab [availability](#availability-challenges)
and [scalability](#scalability-challenges).

#### Availability challenges

Moving file buffering to Workhorse addresses the immediate performance problems caused by Ruby's poor fit for
handling large file uploads. However, this solution still relies on attached storage,
whether ordinary hard drives or network-attached storage like NFS.
NFS is a [single point of failure](https://en.wikipedia.org/wiki/Single_point_of_failure), and is unsuitable for
deploying GitLab in highly available, cloud native environments.

#### Scalability challenges

NFS is not a part of cloud native installations, such as those running in Kubernetes.
In Kubernetes, machine boundaries translate to pods, and without network-attached storage, disk-buffered uploads
must be written directly to the pod's file system.

Using disk buffering presents us with a scalability challenge here. If Workhorse can only
write files to a pod's private file system, then these files are inaccessible outside of this particular pod.
With disk buffering, a Rails controller accepts a file upload and enqueues a Sidekiq
background job to upload it. Therefore, Sidekiq requires access to these files.
However, in a cloud native environment all Sidekiq instances run on separate pods, so they
cannot access files buffered to disk on a web server pod.

Therefore, all features that involve Sidekiq uploading disk-buffered files severely limit the scalability of GitLab.

## Moving to object storage and direct uploads

To address these availability and scalability problems, we added support for uploading files directly
from Workhorse to a given destination instead of buffering them to disk. While it remains possible to upload to local or network-attached storage
this way, you should use a highly available
[object store](https://en.wikipedia.org/wiki/Object_storage),
such as AWS S3, Google Cloud Storage, or Azure Blob Storage, for scalability reasons.

With direct uploads, Workhorse does not buffer files to disk. Instead, it first authorizes the request with
the Rails application to find out where to upload it, then streams the file directly to its ultimate destination.

To learn more about how disk buffering and direct uploads are implemented, see:

- [How uploads work technically](implementation.md)
- [Adding new uploads](working_with_uploads.md)