Welcome to mirror list, hosted at ThFree Co, Russian Federation.

gitlab.com/gitlab-org/gitlab-foss.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
Diffstat (limited to 'doc/development/uploads/implementation.md')
-rw-r--r--doc/development/uploads/implementation.md190
1 files changed, 190 insertions, 0 deletions
diff --git a/doc/development/uploads/implementation.md b/doc/development/uploads/implementation.md
new file mode 100644
index 00000000000..13a875cd1af
--- /dev/null
+++ b/doc/development/uploads/implementation.md
@@ -0,0 +1,190 @@
+---
+stage: none
+group: unassigned
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Uploads guide: How uploads work technically
+
+This page is for developers trying to better understand what kinds of uploads exist in GitLab and how they are implemented.
+
+## Kinds of uploads and how to choose between them
+
+We can identify three major use-cases for an upload:
+
+1. **storage:** if we are uploading for storing a file (like artifacts, packages, or discussion attachments). In this case [direct upload](#direct-upload) is the proper level as it's the less resource-intensive operation. Additional information can be found on [File Storage in GitLab](../file_storage.md).
+1. **in-controller/synchronous processing:** if we allow processing **small files** synchronously, using [disk buffered upload](#disk-buffered-upload) may speed up development.
+1. **Sidekiq/asynchronous processing:** Asynchronous processing must implement [direct upload](#direct-upload), the reason being that it's the only way to support Cloud Native deployments without a shared NFS.
+
+Selecting the proper acceleration is a tradeoff between speed of development and operational costs.
+
+For more details about currently broken feature see [epic &1802](https://gitlab.com/groups/gitlab-org/-/epics/1802).
+
+### Handling repository uploads
+
+Some features involves Git repository uploads without using a regular Git client.
+Some examples are uploading a repository file from the web interface and [design management](../../user/project/issues/design_management.md).
+
+Those uploads requires the rails controller to act as a Git client in lieu of the user.
+Those operation falls into _in-controller/synchronous processing_ category, but we have no warranties on the file size.
+
+In case of a LFS upload, the file pointer is committed synchronously, but file upload to object storage is performed asynchronously with Sidekiq.
+
+## Upload encodings
+
+By upload encoding we mean how the file is included within the incoming request.
+
+We have three kinds of file encoding in our uploads:
+
+1. <i class="fa fa-check-circle"></i> **multipart**: `multipart/form-data` is the most common, a file is encoded as a part of a multipart encoded request.
+1. <i class="fa fa-check-circle"></i> **body**: some APIs uploads files as the whole request body.
+1. <i class="fa fa-times-circle"></i> **JSON**: some JSON APIs upload files as base64-encoded strings. This requires a change to GitLab Workhorse,
+ which is tracked [in this issue](https://gitlab.com/gitlab-org/gitlab/-/issues/325068).
+
+## Uploading technologies
+
+By uploading technologies we mean how all the involved services interact with each other.
+
+GitLab supports 3 kinds of uploading technologies, here follows a brief description with a sequence diagram for each one. Diagrams are not meant to be exhaustive.
+
+### Rack Multipart upload
+
+This is the default kind of upload, and it's the most expensive in terms of resources.
+
+In this case, Workhorse is unaware of files being uploaded and acts as a regular proxy.
+
+When a multipart request reaches the rails application, `Rack::Multipart` leaves behind temporary files in `/tmp` and uses valuable Ruby process time to copy files around.
+
+```mermaid
+sequenceDiagram
+ participant c as Client
+ participant w as Workhorse
+ participant r as Rails
+
+ activate c
+ c ->>+w: POST /some/url/upload
+ w->>+r: POST /some/url/upload
+
+ r->>r: save the incoming file on /tmp
+ r->>r: read the file for processing
+
+ r-->>-c: request result
+ deactivate c
+ deactivate w
+```
+
+### Disk buffered upload
+
+This kind of upload avoids wasting resources caused by handling upload writes to `/tmp` in rails.
+
+This optimization is not active by default on REST API requests.
+
+When enabled, Workhorse looks for files in multipart MIME requests, uploading
+any it finds to a temporary file on shared storage. The MIME data in the request
+is replaced with the path to the corresponding file before it is forwarded to
+Rails.
+
+To prevent abuse of this feature, Workhorse signs the modified request with a
+special header, stating which entries it modified. Rails ignores any
+unsigned path entries.
+
+```mermaid
+sequenceDiagram
+ participant c as Client
+ participant w as Workhorse
+ participant r as Rails
+ participant s as NFS
+
+ activate c
+ c ->>+w: POST /some/url/upload
+
+ w->>+s: save the incoming file on a temporary location
+ s-->>-w: request result
+
+ w->>+r: POST /some/url/upload
+ Note over w,r: file was replaced with its location<br>and other metadata
+
+ opt requires async processing
+ r->>+redis: schedule a job
+ redis-->>-r: job is scheduled
+ end
+
+ r-->>-c: request result
+ deactivate c
+ w->>-w: cleanup
+
+ opt requires async processing
+ activate sidekiq
+ sidekiq->>+redis: fetch a job
+ redis-->>-sidekiq: job
+
+ sidekiq->>+s: read file
+ s-->>-sidekiq: file
+
+ sidekiq->>sidekiq: process file
+
+ deactivate sidekiq
+ end
+```
+
+### Direct upload
+
+This is the more advanced acceleration technique we have in place.
+
+Workhorse asks Rails for temporary pre-signed object storage URLs and directly uploads to object storage.
+
+In this setup, an extra Rails route must be implemented in order to handle authorization. Examples of this can be found in:
+
+- [`Projects::LfsStorageController`](https://gitlab.com/gitlab-org/gitlab/-/blob/cc723071ad337573e0360a879cbf99bc4fb7adb9/app/controllers/projects/lfs_storage_controller.rb)
+ and [its routes](https://gitlab.com/gitlab-org/gitlab/-/blob/cc723071ad337573e0360a879cbf99bc4fb7adb9/config/routes/git_http.rb#L31-32).
+- [API endpoints for uploading packages](../packages.md#file-uploads).
+
+Direct upload falls back to _disk buffered upload_ when `direct_upload` is disabled inside the [object storage setting](../../administration/uploads.md#object-storage-settings).
+The answer to the `/authorize` call contains only a file system path.
+
+```mermaid
+sequenceDiagram
+ participant c as Client
+ participant w as Workhorse
+ participant r as Rails
+ participant os as Object Storage
+
+ activate c
+ c ->>+w: POST /some/url/upload
+
+ w ->>+r: POST /some/url/upload/authorize
+ Note over w,r: this request has an empty body
+ r-->>-w: presigned OS URL
+
+ w->>+os: PUT file
+ Note over w,os: file is stored on a temporary location. Rails select the destination
+ os-->>-w: request result
+
+ w->>+r: POST /some/url/upload
+ Note over w,r: file was replaced with its location<br>and other metadata
+
+ r->>+os: move object to final destination
+ os-->>-r: request result
+
+ opt requires async processing
+ r->>+redis: schedule a job
+ redis-->>-r: job is scheduled
+ end
+
+ r-->>-c: request result
+ deactivate c
+ w->>-w: cleanup
+
+ opt requires async processing
+ activate sidekiq
+ sidekiq->>+redis: fetch a job
+ redis-->>-sidekiq: job
+
+ sidekiq->>+os: get object
+ os-->>-sidekiq: file
+
+ sidekiq->>sidekiq: process file
+
+ deactivate sidekiq
+ end
+```