# Disk Cache Design Gitaly utilizes a disk-based cache for efficiently serving some RPC responses (at time of writing, only the `SmartHTTPService.InfoRefUploadPack` RPC). This cache is intended to be used for serving large responses not suitable for a RAM based cache. ## Cache Invalidation The mechanisms that enable the invalidation of the disk cache for a repo depend on special annotations made to the Gitaly gRPC methods. Each method that has scope "repository" and is operation type "mutator" will cause the specified repository to be invalidated. For more information on the annotation system, see the Gitaly protobuf definition [contributing guide]. [contributing guide]: https://gitlab.com/gitlab-org/gitaly/tree/4c27a7f71ba1d91edbc9d321919620887d6a30d3/proto#rpc-annotations ## Repository State For every repository using the disk cache, a special set of files is maintained to indicate which cached responses are still valid. These files are stored in a dedicated **state directory** for each repository: ${STATE_DIR} = ${STORAGE_PATH}/+gitaly/state/${REPO_RELATIVE_PATH} Before a mutating RPC handler is invoked, a gRPC middleware creates a "lease" file in the state directory that signifies a mutating operation is in-flight. These lease files reside at the following path: ${STATE_DIR}/pending/${RANDOM_FILENAME} Upon the completion of the mutating RPC, the lease file will be removed and the "latest" file will be updated with a random value to reflect the new "state" of the repository. ${STATE_DIR}/latest The contents of latest are used along with several other values to form an aggregate key that addresses a specific request for a specific repository at a specific repository state: ``` ─────┐ │ latest (file contents)│ RPC request (digest) │ ┌──────┐ Gitaly version (string) ├─────│SHA256│─────▶ Cache key RPC Method (string) │ └──────┘ │ ─────┘ ``` ## Cache State Machine The repository state files are used to determine whether the repository is in a deterministic state (i.e. no mutating RPCs in-flight) and how to find the valid cached responses for the current repository state. The state machine diagram follows: ```mermaid graph TD; A[Are there lease files?]-->|Yes|B; A-->|No|C; B[Are any lease files stale?]-->|Yes|D; B-->|No|E; C[Does non-stale latest file exist?]-->|Yes|F; C-->|No|G; D[Remove stale lease files]-->A; E[Mutator RPC In-Flight: Cache state indeterministic] F[No mutator RPCs In-Flight: Cache state deterministic] G[Create/Truncate latest file]-->F classDef nonfinal fill:#ccf,stroke-width; classDef final fill:#f9f,stroke-dasharray: 5, 5; class A,B,C,D,G nonfinal; class E,F final; ``` **Note:** There are momentary race conditions where an RPC may become in flight between the time the lease files are checked and the latest file is inspected, but this is allowed by the cache design in order to avoid distributed locking. This means that a stale cached response might be served momentarily, but this slight delay in fresh responses is a small tradeoff necessary to keep the cache lockless. The lockless quality is highly desired since Gitaly is often operated on NFS mounts where file locks are not advisable. ## Cached Responses When the repository is determined to be in a deterministic state (i.e. no in-flight mutator RPCs), it is safe to cache responses and retrieve cached responses. The aggregate key digest is used to form a hexadecimal path to the cached response in this format: ${STORAGE_PATH}/+gitaly/cache/${DIGEST:0:2}/${DIGEST:2} **Note:** The first two characters of the digest are used as a subdirectory to allow the random distribution of the digest algorithm (SHA256) to evenly distribute the response files. This way, the digest files are evenly distributed across 256 folders. ## File Cleanup Since the disk cache introduces a number of new filesystem constructs, both state files and cached responses, there needs to be a way to clean up these files when the normal processes are not adequate. Gitaly runs background workers that periodically remove stale (>1 hour old) state files and cached responses. Additionally, Gitaly will remove the cached responses on program start to guard against any chance that the cache invalidator was not working in a previous run. ## Enabling and Observing The actual caching of info ref advertisements is guarded by a feature flag. Before enabling on a production system, ensure you understand the following requirements and risks: - **All Gitaly servers must be v1.71.0 or higher** - Note: this version is available in **Omnibus GitLab 12.5.0** and above - In order for the cache entries to be properly invalidated, all Gitaly nodes serving a [storage location] must support the same cache invalidation feature found in v1.71.0+. Custom Gitaly deployments with mixed versions may serve stale info ref advertisements. - The cache will use extra disk on the Gitaly storage locations. This should be actively monitored. [Node exporter] is recommended for tracking resource usage. - There may be initial latency spikes when enabling this feature for large/busy GitLab instances until the cache is warmed up. On a busy site like gitlab.com, this may last as long as several seconds to a minute. This flag can be enabled in one of two ways: - HTTP API via curl command: `curl --data "value=true" --header "PRIVATE-TOKEN: " https://gitlab.example.com/api/v4/features/gitaly_inforef_uploadpack_cache` - Rails console command: `Feature.enable(:gitaly_inforef_uploadpack_cache)` Once enabled, the following Prometheus queries (adapted from [GitLab's dashboards]) will give you insight into the performance and behavior of the cache: - [Cache invalidation behavior] - `sum(rate(gitaly_cacheinvalidator_optype_total[1m])) by (type)` - Shows the Gitaly RPC types (mutator or accessor). The cache benefits from Gitaly requests that are more often accessors than mutators. - [Cache Throughput Bytes] - `sum(rate(gitaly_diskcache_bytes_fetched_total[1m]))` - `sum(rate(gitaly_diskcache_bytes_stored_total[1m]))` - Shows the cache's throughput at the byte level. Ideally, the throughput should correlate to the cache invalidation behavior. - [Cache Effectiveness] - `(sum(rate(gitaly_diskcache_requests_total[1m])) - sum(rate(gitaly_diskcache_miss_total[1m]))) / sum(rate(gitaly_diskcache_requests_total[1m]))` - Shows how often the cache is invoked for a hit vs a miss. A value close to 100% is desirable. - [Cache Errors] - `sum(rate(gitaly_diskcache_errors_total[1m])) by (error)` - Shows edge case errors experienced by the cache. The following errors can be ignored: - `ErrMissingLeaseFile` - `ErrPendingExists` [GitLab's dashboards]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1 [Cache invalidation behavior]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1&fullscreen&panelId=2 [Cache Throughput Bytes]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1&fullscreen&panelId=6 [Cache Effectiveness]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1&fullscreen&panelId=8 [Cache Errors]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1&fullscreen&panelId=12 [Node exporter]: https://docs.gitlab.com/ee/administration/monitoring/prometheus/node_exporter.html [storage location]: https://docs.gitlab.com/ee/administration/repository_storage_paths.html