Welcome to mirror list, hosted at ThFree Co, Russian Federation.

design_diskcache.md « doc - gitlab.com/gitlab-org/gitaly.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
blob: ccfb6b8fe92d52eb9489f96c2ab9daa1df5553be (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
# Disk Cache Design

Gitaly utilizes a disk-based cache for efficiently serving some RPC responses
(at time of writing, only the `SmartHTTPService.InfoRefUploadPack` RPC). This
cache is intended to be used for serving large responses not suitable for a RAM
based cache.

## Cache Invalidation

The mechanisms that enable the invalidation of the disk cache for a repo depend
on special annotations made to the Gitaly gRPC methods. Each method that has
scope "repository" and is operation type "mutator" will cause the specified
repository to be invalidated. For more information on the annotation system,
see the Gitaly protobuf definition [contributing guide].

[contributing guide]: https://gitlab.com/gitlab-org/gitaly/tree/4c27a7f71ba1d91edbc9d321919620887d6a30d3/proto#rpc-annotations

## Repository State

For every repository using the disk cache, a special set of files is maintained
to indicate which cached responses are still valid. These files are stored
in a dedicated **state directory** for each repository:

	${STATE_DIR} = ${STORAGE_PATH}/+gitaly/state/${REPO_RELATIVE_PATH}

Before a mutating RPC handler is invoked, a gRPC middleware creates a "lease"
file in the state directory that signifies a mutating operation is in-flight.
These lease files reside at the following path:

	${STATE_DIR}/pending/${RANDOM_FILENAME}

Upon the completion of the mutating RPC, the lease file will be removed and
the "latest" file will be updated with a random value to reflect the new
"state" of the repository.

	${STATE_DIR}/latest

The contents of latest are used along with several other values to form an
aggregate key that addresses a specific request for a specific repository at a
specific repository state:

```
                               ─────┐
                                    │
      latest         (file contents)│
      RPC request    (digest)       │     ┌──────┐
      Gitaly version (string)       ├─────│SHA256│─────▶ Cache key
      RPC Method     (string)       │     └──────┘
      Feature flags  (string)       │
                                    │
                               ─────┘
```

## Cache State Machine

The repository state files are used to determine whether the repository is in
a deterministic state (i.e. no mutating RPCs in-flight) and how to find the
valid cached responses for the current repository state. The state machine
diagram follows:

```mermaid
graph TD;
    A[Are there lease files?]-->|Yes|B;
    A-->|No|C;
    B[Are any lease files stale?]-->|Yes|D;
    B-->|No|E;
    C[Does non-stale latest file exist?]-->|Yes|F;
    C-->|No|G;
    D[Remove stale lease files]-->A;
    E[Mutator RPC In-Flight: Cache state indeterministic]
    F[No mutator RPCs In-Flight: Cache state deterministic]
    G[Create/Truncate latest file]-->F

    classDef nonfinal fill:#ccf,stroke-width;
    classDef final fill:#f9f,stroke-dasharray: 5, 5;

    class A,B,C,D,G nonfinal;
    class E,F final;
```

**Note:** There are momentary race conditions where an RPC may become in flight
between the time the lease files are checked and the latest file is inspected,
but this is allowed by the cache design in order to avoid distributed locking.
This means that a stale cached response might be served momentarily, but this
slight delay in fresh responses is a small tradeoff necessary to keep the cache
lockless. The lockless quality is highly desired since Gitaly is often operated on NFS
mounts where file locks are not advisable.

## Cached Responses

When the repository is determined to be in a deterministic state (i.e. no
in-flight mutator RPCs), it is safe to cache responses and retrieve cached
responses. The aggregate key digest is used to form a hexadecimal path to the
cached response in this format:

	${STORAGE_PATH}/+gitaly/cache/${DIGEST:0:2}/${DIGEST:2}

**Note:** The first two characters of the digest are used as a subdirectory to
allow the random distribution of the digest algorithm (SHA256) to evenly
distribute the response files. This way, the digest files are evenly
distributed across 256 folders.

## File Cleanup

Since the disk cache introduces a number of new filesystem constructs, both
state files and cached responses, there needs to be a way to clean up these
files when the normal processes are not adequate.

Gitaly runs background workers that periodically remove stale (>1 hour old)
state files and cached responses. Additionally, Gitaly will remove the cached
responses on program start to guard against any chance that the cache
invalidator was not working in a previous run.

## Considerations

- Note: this feature is available by default in **Omnibus GitLab 12.10.0** and
  above
- The cache will use extra disk on the Gitaly storage locations. This should be
  actively monitored. [Node exporter] is recommended for tracking resource
  usage.
- There may be initial latency spikes when enabling this feature for large/busy
  GitLab instances until the cache is warmed up. On a busy site like gitlab.com,
  this may last as long as several seconds to a minute.

The following Prometheus queries (adapted from [GitLab's dashboards])
will give you insight into the performance and behavior of the cache:

- [Cache invalidation behavior]
    - `sum(rate(gitaly_cacheinvalidator_optype_total[1m])) by (type)`
    - Shows the Gitaly RPC types (mutator or accessor). The cache benefits from
      Gitaly requests that are more often accessors than mutators.
- [Cache Throughput Bytes]
    - `sum(rate(gitaly_diskcache_bytes_fetched_total[1m]))`
    - `sum(rate(gitaly_diskcache_bytes_stored_total[1m]))`
    - Shows the cache's throughput at the byte level. Ideally, the throughput
      should correlate to the cache invalidation behavior.
- [Cache Effectiveness]
    - `(sum(rate(gitaly_diskcache_requests_total[1m])) - sum(rate(gitaly_diskcache_miss_total[1m]))) / sum(rate(gitaly_diskcache_requests_total[1m]))`
    - Shows how often the cache is invoked for a hit vs a miss. A value close to
      100% is desirable.
- [Cache Errors]
    - `sum(rate(gitaly_diskcache_errors_total[1m])) by (error)`
    - Shows edge case errors experienced by the cache. The following errors can
      be ignored:
        - `ErrMissingLeaseFile`
        - `ErrPendingExists`

[GitLab's dashboards]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1
[Cache invalidation behavior]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1&fullscreen&panelId=2
[Cache Throughput Bytes]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1&fullscreen&panelId=6
[Cache Effectiveness]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1&fullscreen&panelId=8
[Cache Errors]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1&fullscreen&panelId=12
[Node exporter]: https://docs.gitlab.com/ee/administration/monitoring/prometheus/node_exporter.html
[storage location]: https://docs.gitlab.com/ee/administration/repository_storage_paths.html