Diffstat (limited to 'doc/development/database/clickhouse')
-rw-r--r--  doc/development/database/clickhouse/index.md                     62
-rw-r--r--  doc/development/database/clickhouse/merge_request_analytics.md  355
-rw-r--r--  doc/development/database/clickhouse/tiered_storage.md           138
3 files changed, 555 insertions, 0 deletions
diff --git a/doc/development/database/clickhouse/index.md b/doc/development/database/clickhouse/index.md
index a26bac261fd..032e4f5f6ee 100644
--- a/doc/development/database/clickhouse/index.md
+++ b/doc/development/database/clickhouse/index.md
@@ -83,3 +83,65 @@ Quoting the [documentation](https://clickhouse.com/docs/en/sql-reference/stateme
> If there's some aggregation in the view query, it's applied only to the batch
> of freshly inserted data. Any changes to existing data of the source table
> (like update, delete, drop a partition, etc.) do not change the materialized view.
+
+## Secure and sensible defaults
+
+ClickHouse instances should follow these security recommendations:
+
+### Users
+
+Files: `users.xml` and `config.xml`.
+
+| Topic | Security Requirement | Reason |
+| ----- | -------------------- | ------ |
+| [`user_name/password`](https://clickhouse.com/docs/en/operations/settings/settings-users/#user-namepassword) | Usernames **must not** be blank. Passwords **must** use `password_sha256_hex` and **must not** be blank. | `plaintext` and `password_double_sha1_hex` are insecure. If no username is specified, [`default` is used with no password](https://clickhouse.com/docs/en/operations/settings/settings-users/). |
+| [`access_management`](https://clickhouse.com/docs/en/operations/settings/settings-users/#access_management-user-setting) | Use the server [configuration files](https://clickhouse.com/docs/en/operations/configuration-files) `users.xml` and `config.xml`. Avoid the SQL-driven workflow. | The SQL-driven workflow implies that at least one user has `access_management`, which can be avoided via configuration files. These files are also easier to audit and monitor, considering that ["You can't manage the same access entity by both configuration methods simultaneously."](https://clickhouse.com/docs/en/operations/access-rights/#access-control) |
+| [`user_name/networks`](https://clickhouse.com/docs/en/operations/settings/settings-users/#user-namenetworks) | At least one of `<ip>`, `<host>`, `<host_regexp>` **must** be set. Do not use `<ip>::/0</ip>` to open access for any network. | Network controls. ([Trust cautiously](https://about.gitlab.com/handbook/security/architecture/#trust-cautiously) principle) |
+| [`user_name/profile`](https://clickhouse.com/docs/en/operations/settings/settings-users/#user-nameprofile) | Use profiles to set similar properties across multiple users and set limits (from the user interface). | [Least privilege](https://about.gitlab.com/handbook/security/architecture/#assign-the-least-privilege-possible) principle and limits. |
+| [`user_name/quota`](https://clickhouse.com/docs/en/operations/settings/settings-users/#user-namequota) | Set quotas for users whenever possible. | Limit resource usage over a period of time or track the use of resources. |
+| [`user_name/databases`](https://clickhouse.com/docs/en/operations/settings/settings-users/#user-namedatabases) | Restrict access to data, and avoid users with full access. | [Least privilege](https://about.gitlab.com/handbook/security/architecture/#assign-the-least-privilege-possible) principle. |
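+
+For illustration, a minimal `users.xml` entry following these recommendations
+could look like the sketch below. The user name, network range, and the profile
+and quota names are placeholders, not prescribed values:
+
+```xml
+<clickhouse>
+  <users>
+    <gitlab_analytics> <!-- placeholder user name; must not be blank -->
+      <!-- generate the hash with: echo -n 'your-password' | sha256sum -->
+      <password_sha256_hex>REPLACE_WITH_SHA256_HEX</password_sha256_hex>
+      <networks>
+        <ip>10.0.0.0/8</ip> <!-- restrict to known networks; never ::/0 -->
+      </networks>
+      <profile>readonly</profile> <!-- shared properties and limits -->
+      <quota>default</quota>      <!-- track and limit resource usage -->
+    </gitlab_analytics>
+  </users>
+</clickhouse>
+```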
+
+### Network
+
+File: `config.xml`.
+
+| Topic | Security Requirement | Reason |
+| ----- | -------------------- | ------ |
+| [`mysql_port`](https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server_configuration_parameters-mysql_port) | Disable MySQL access unless strictly necessary:<br/> `<!-- <mysql_port>9004</mysql_port> -->`. | Close unnecessary ports and features exposure. ([Defense in depth](https://about.gitlab.com/handbook/security/architecture/#implement-defense-in-depth) principle) |
+| [`postgresql_port`](https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server_configuration_parameters-postgresql_port) | Disable PostgreSQL access unless strictly necessary:<br/> `<!-- <postgresql_port>9005</postgresql_port> -->` | Close unnecessary ports and features exposure. ([Defense in depth](https://about.gitlab.com/handbook/security/architecture/#implement-defense-in-depth) principle) |
+| [`http_port/https_port`](https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#http-porthttps-port) & [`tcp_port/tcp_port_secure`](https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#http-porthttps-port) | Configure [SSL-TLS](https://clickhouse.com/docs/en/guides/sre/configuring-ssl), and disable the non-SSL ports:<br/>`<!-- <http_port>8123</http_port> -->`<br/>`<!-- <tcp_port>9000</tcp_port> -->`<br/>and enable the secure ports:<br/>`<https_port>8443</https_port>`<br/>`<tcp_port_secure>9440</tcp_port_secure>` | Encrypt data in transit. ([Defense in depth](https://about.gitlab.com/handbook/security/architecture/#implement-defense-in-depth) principle) |
+| [`interserver_http_host`](https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#interserver-http-host) | Disable `interserver_http_host` in favor of `interserver_https_host` (`<interserver_https_port>9010</interserver_https_port>`) if ClickHouse is configured as a cluster. | Encrypt data in transit. ([Defense in depth](https://about.gitlab.com/handbook/security/architecture/#implement-defense-in-depth) principle) |
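+
+Put together, a `config.xml` fragment applying these recommendations could look
+like this sketch (the certificate paths are placeholders):
+
+```xml
+<clickhouse>
+  <!-- plain-text ports disabled -->
+  <!-- <http_port>8123</http_port> -->
+  <!-- <tcp_port>9000</tcp_port> -->
+  <!-- <mysql_port>9004</mysql_port> -->
+  <!-- <postgresql_port>9005</postgresql_port> -->
+
+  <!-- TLS-only endpoints -->
+  <https_port>8443</https_port>
+  <tcp_port_secure>9440</tcp_port_secure>
+
+  <openSSL>
+    <server>
+      <certificateFile>/etc/clickhouse-server/server.crt</certificateFile>
+      <privateKeyFile>/etc/clickhouse-server/server.key</privateKeyFile>
+    </server>
+  </openSSL>
+</clickhouse>
+```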
+
+### Storage
+
+| Topic | Security Requirement | Reason |
+| ----- | -------------------- | ------ |
+| Permissions | ClickHouse runs by default with the `clickhouse` user. Running as `root` is never needed. Use the principle of least privilege for the folders: `/etc/clickhouse-server`, `/var/lib/clickhouse`, `/var/log/clickhouse-server`. These folders must belong to the `clickhouse` user and group, and no other system user must have access to them. | Default passwords, ports and rules are "open doors". ([Fail securely & use secure defaults](https://about.gitlab.com/handbook/security/architecture/#fail-securely--use-secure-defaults) principle) |
+| Encryption | Use an encrypted storage for logs and data if RED data is processed. On Kubernetes, the [StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/) used must be encrypted. | Encrypt data at rest. ([Defense in depth](https://about.gitlab.com/handbook/security/architecture/#implement-defense-in-depth)) |
+
+### Logging
+
+| Topic | Security Requirement | Reason |
+| ----- | -------------------- | ------ |
+| `logger` | `log` and `errorlog` **must** be defined and writable by `clickhouse`. | Make sure logs are stored. |
+| SIEM | If hosted on GitLab.com, the ClickHouse instance or cluster **must** report [logs to our SIEM](https://internal-handbook.gitlab.io/handbook/security/infrastructure_security_logging/tooling/devo/) (internal link). | [GitLab logs critical information system activity](https://about.gitlab.com/handbook/security/audit-logging-policy.html). |
+| Log sensitive data | Query masking rules **must** be used if sensitive data can be logged. See [example masking rules](#example-masking-rules). | [Column level encryption](https://clickhouse.com/docs/en/sql-reference/functions/encryption-functions/) can be used and leak sensitive data (keys) in logs. |
+
+#### Example masking rules
+
+```xml
+<query_masking_rules>
+    <rule>
+        <name>hide SSN</name>
+        <regexp>(^|\D)\d{3}-\d{2}-\d{4}($|\D)</regexp>
+        <replace>000-00-0000</replace>
+    </rule>
+    <rule>
+        <name>hide encrypt/decrypt arguments</name>
+        <regexp>
+            ((?:aes_)?(?:encrypt|decrypt)(?:_mysql)?)\s*\(\s*(?:'(?:\\'|.)+'|.*?)\s*\)
+        </regexp>
+        <replace>\1(???)</replace>
+    </rule>
+</query_masking_rules>
+```
diff --git a/doc/development/database/clickhouse/merge_request_analytics.md b/doc/development/database/clickhouse/merge_request_analytics.md
new file mode 100644
index 00000000000..34da71d6c4c
--- /dev/null
+++ b/doc/development/database/clickhouse/merge_request_analytics.md
@@ -0,0 +1,355 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
+---
+
+# Merge request analytics with ClickHouse
+
+The [merge request analytics feature](../../../user/analytics/merge_request_analytics.md)
+shows statistics about the merged merge requests in the project and also exposes record-level metadata.
+Aggregations include:
+
+- **Average time to merge**: The duration between the creation time and the merge time.
+- **Monthly aggregations**: A chart covering 12 months of merged merge requests.
+
+Under the chart, the user can see the paginated list of merge requests, 12 merge requests per page.
+
+You can filter by:
+
+- Author
+- Assignee
+- Labels
+- Milestone
+- Source branch
+- Target branch
+
+## Current performance problems
+
+- The aggregation queries require specialized indexes, which cost additional
+ disk space (index-only scans).
+- Querying the whole 12 months is slow (statement timeout). Instead, the frontend
+ requests data per month (12 database queries).
+- Even with specialized indexes, making the feature available on the group level
+ would not be feasible due to the large volume of merge requests.
+
+## Example queries
+
+Get the number of merge requests merged in a given month:
+
+```sql
+SELECT COUNT(*)
+FROM "merge_requests"
+INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
+WHERE (NOT EXISTS
+         (SELECT 1
+          FROM "banned_users"
+          WHERE (merge_requests.author_id = banned_users.user_id)))
+  AND "merge_request_metrics"."target_project_id" = 278964
+  AND "merge_request_metrics"."merged_at" >= '2022-12-01 00:00:00'
+  AND "merge_request_metrics"."merged_at" <= '2023-01-01 00:00:00'
+```
+
+The `merge_request_metrics` table was de-normalized (by adding `target_project_id`)
+to improve the first-page load time. The query itself works well for smaller date ranges;
+however, it can time out as the date range increases.
+
+After an extra filter is added, the query becomes more complex because it must also
+filter the `merge_requests` table:
+
+```sql
+SELECT COUNT(*)
+FROM "merge_requests"
+INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
+WHERE (NOT EXISTS
+         (SELECT 1
+          FROM "banned_users"
+          WHERE (merge_requests.author_id = banned_users.user_id)))
+  AND "merge_requests"."author_id" IN
+      (SELECT "users"."id"
+       FROM "users"
+       WHERE (LOWER("users"."username") IN (LOWER('ahegyi'))))
+  AND "merge_request_metrics"."target_project_id" = 278964
+  AND "merge_request_metrics"."merged_at" >= '2022-12-01 00:00:00'
+  AND "merge_request_metrics"."merged_at" <= '2023-01-01 00:00:00'
+```
+
+To calculate mean time to merge, we also query the total time between the
+merge request creation time and merge time.
+
+```sql
+SELECT EXTRACT(epoch
+               FROM SUM(AGE(merge_request_metrics.merged_at, merge_request_metrics.created_at)))
+FROM "merge_requests"
+INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
+WHERE (NOT EXISTS
+         (SELECT 1
+          FROM "banned_users"
+          WHERE (merge_requests.author_id = banned_users.user_id)))
+  AND "merge_requests"."author_id" IN
+      (SELECT "users"."id"
+       FROM "users"
+       WHERE (LOWER("users"."username") IN (LOWER('ahegyi'))))
+  AND "merge_request_metrics"."target_project_id" = 278964
+  AND "merge_request_metrics"."merged_at" >= '2022-08-01 00:00:00'
+  AND "merge_request_metrics"."merged_at" <= '2022-09-01 00:00:00'
+  AND "merge_request_metrics"."merged_at" > "merge_request_metrics"."created_at"
+LIMIT 1
+```
+
+## Store merge request data in ClickHouse
+
+Several other use cases exist for storing and querying merge request data in
+ClickHouse. In this document, we focus on this particular feature.
+
+The core data exists in the `merge_request_metrics` and in the `merge_requests`
+database tables. Some filters require extra tables to be joined:
+
+- `banned_users`: Filter out merge requests created by banned users.
+- `labels`: A merge request can have one or more assigned labels.
+- `assignees`: A merge request can have one or more assignees.
+- `merge_request_metrics`: The `merged_at` column is located in this table.
+
+The `merge_requests` table contains data that can be filtered directly:
+
+- **Author**: via the `author_id` column.
+- **Milestone**: via the `milestone_id` column.
+- **Source branch**.
+- **Target branch**.
+- **Project**: via the `project_id` column.
+
+### Keep ClickHouse data up to date
+
+Replicating or syncing the `merge_requests` table is unfortunately not enough.
+Separate queries to associated tables are required to insert one de-normalized
+`merge_requests` row into the ClickHouse database.
+
+Change detection is non-trivial to implement. A few corners we could cut:
+
+- The feature is available for GitLab Premium and GitLab Ultimate customers.
+ We don't have to sync all the data, but instead sync only the `merge_requests` records
+ which are part of licensed groups.
+- Data changes (often) happen via the `MergeRequest` services, where bumping the
+  `updated_at` timestamp column is mostly consistent. Some sort of incremental
+  synchronization process could be implemented (see the sketch after this list).
+- We only need to query the merged merge requests. After the merge, the record rarely changes.
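+
+A sketch of what the incremental query against PostgreSQL could look like,
+assuming the synchronization job keeps a cursor of the last processed
+`updated_at` value (the `:last_synced_at` bind parameter is hypothetical, not an
+existing GitLab mechanism):
+
+```sql
+SELECT "merge_requests"."id",
+       "merge_requests"."project_id",
+       "merge_requests"."author_id",
+       "merge_requests"."milestone_id",
+       "merge_requests"."source_branch",
+       "merge_requests"."target_branch",
+       "merge_request_metrics"."merged_at",
+       "merge_requests"."created_at",
+       "merge_requests"."updated_at"
+FROM "merge_requests"
+INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
+WHERE "merge_requests"."updated_at" > :last_synced_at
+  AND "merge_request_metrics"."merged_at" IS NOT NULL
+ORDER BY "merge_requests"."updated_at"
+LIMIT 1000
+```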
+
+### Database table structure
+
+The database table structure uses de-normalization to make all required columns
+available in one database table. This eliminates the need for `JOINs`.
+
+```sql
+CREATE TABLE merge_requests
+(
+    `id` UInt64,
+    `project_id` UInt64 DEFAULT 0 NOT NULL,
+    `author_id` UInt64 DEFAULT 0 NOT NULL,
+    `milestone_id` UInt64 DEFAULT 0 NOT NULL,
+    `label_ids` Array(UInt64) DEFAULT [] NOT NULL,
+    `assignee_ids` Array(UInt64) DEFAULT [] NOT NULL,
+    `source_branch` String DEFAULT '' NOT NULL,
+    `target_branch` String DEFAULT '' NOT NULL,
+    `merged_at` DateTime64(6, 'UTC') NOT NULL,
+    `created_at` DateTime64(6, 'UTC') DEFAULT now() NOT NULL,
+    `updated_at` DateTime64(6, 'UTC') DEFAULT now() NOT NULL
+)
+ENGINE = ReplacingMergeTree(updated_at)
+ORDER BY (project_id, merged_at, id);
+```
+
+Similarly to the [activity data example](gitlab_activity_data.md), we use the
+`ReplacingMergeTree` engine. Several columns of the merge request record may change,
+so keeping the table up-to-date is important.
+
+The database table is ordered by the `project_id, merged_at, id` columns. This ordering
+optimizes the table data for our use case: querying the `merged_at` column in a project.
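+
+Note that `ReplacingMergeTree` deduplicates rows only during background merges.
+When a query must not see duplicates, deduplication can be forced at read time
+with the `FINAL` modifier, at some query-time cost:
+
+```sql
+SELECT COUNT(*)
+FROM merge_requests FINAL
+WHERE project_id = 200
+```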
+
+## Rewrite the count query
+
+First, let's generate some data for the table.
+
+```sql
+INSERT INTO merge_requests (id, project_id, author_id, milestone_id, label_ids, merged_at, created_at)
+SELECT id, project_id, author_id, milestone_id, label_ids, merged_at, created_at
+FROM generateRandom('id UInt64, project_id UInt8, author_id UInt8, milestone_id UInt8, label_ids Array(UInt8), merged_at DateTime64(6, \'UTC\'), created_at DateTime64(6, \'UTC\')')
+LIMIT 1000000;
+```
+
+NOTE:
+Some integer data types were cast as `UInt8`, so it is highly probable that they
+have the same values across different rows.
+
+The original count query only aggregated data for one month. With ClickHouse, we can
+attempt to aggregate the data for the whole year.
+
+PostgreSQL-based count query:
+
+```sql
+SELECT COUNT(*)
+FROM "merge_requests"
+INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
+WHERE (NOT EXISTS
+         (SELECT 1
+          FROM "banned_users"
+          WHERE (merge_requests.author_id = banned_users.user_id)))
+  AND "merge_request_metrics"."target_project_id" = 278964
+  AND "merge_request_metrics"."merged_at" >= '2022-12-01 00:00:00'
+  AND "merge_request_metrics"."merged_at" <= '2023-01-01 00:00:00'
+```
+
+ClickHouse query:
+
+```sql
+SELECT
+  toYear(merged_at) AS year,
+  toMonth(merged_at) AS month,
+  COUNT(*)
+FROM merge_requests
+WHERE
+  project_id = 200
+  AND merged_at BETWEEN '2022-01-01 00:00:00'
+  AND '2023-01-01 00:00:00'
+GROUP BY year, month
+```
+
+The query processed significantly fewer rows than the one million we generated.
+The `ORDER BY` clause (primary key) helps the query execution:
+
+```plaintext
+11 rows in set. Elapsed: 0.010 sec.
+Processed 8.19 thousand rows, 131.07 KB (783.45 thousand rows/s., 12.54 MB/s.)
+```
+
+## Rewrite the mean time to merge query
+
+The query calculates the mean time to merge as:
+`duration(created_at, merged_at) / merge_request_count`. The calculation is done in
+the following steps:
+
+1. Request the monthly counts and the monthly duration values.
+1. Sum the counts to get the yearly count.
+1. Sum the durations to get the yearly duration.
+1. Divide the durations by the count.
+
+In ClickHouse, we can calculate the mean time to merge with one query:
+
+```sql
+SELECT
+  SUM(
+    dateDiff('second', created_at, merged_at) / 3600 / 24
+  ) / COUNT(*) AS mean_time_to_merge -- mean_time_to_merge is in days
+FROM merge_requests
+WHERE
+  project_id = 200
+  AND merged_at BETWEEN '2022-01-01 00:00:00'
+  AND '2023-01-01 00:00:00'
+```
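+
+Equivalently, the same value can be expressed with the `avg` aggregate function,
+which may read more naturally:
+
+```sql
+SELECT
+  AVG(dateDiff('second', created_at, merged_at)) / 3600 / 24 AS mean_time_to_merge
+FROM merge_requests
+WHERE
+  project_id = 200
+  AND merged_at BETWEEN '2022-01-01 00:00:00'
+  AND '2023-01-01 00:00:00'
+```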
+
+## Filtering
+
+The database queries above can be used as base queries. You can add more filters.
+For example, filtering for a label and a milestone:
+
+```sql
+SELECT
+  toYear(merged_at) AS year,
+  toMonth(merged_at) AS month,
+  COUNT(*)
+FROM merge_requests
+WHERE
+  project_id = 200
+  AND milestone_id = 15
+  AND has(label_ids, 118) -- array includes 118
+  AND merged_at BETWEEN '2022-01-01 00:00:00'
+  AND '2023-01-01 00:00:00'
+GROUP BY year, month
+```
+
+Optimizing a particular filter is usually done with a database index. This particular
+query reads about 8,000 rows:
+
+```plaintext
+1 row in set. Elapsed: 0.016 sec.
+Processed 8.19 thousand rows, 589.99 KB (505.38 thousand rows/s., 36.40 MB/s.)
+```
+
+Adding an index on `milestone_id`:
+
+```sql
+ALTER TABLE merge_requests
+    ADD INDEX milestone_id_index milestone_id TYPE minmax GRANULARITY 10;
+
+ALTER TABLE merge_requests MATERIALIZE INDEX milestone_id_index;
+```
+
+On the generated data, adding the index didn't improve the performance.
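+
+To check whether a query actually uses the skip index, you can inspect the query
+plan (the exact output shape varies between ClickHouse versions):
+
+```sql
+EXPLAIN indexes = 1
+SELECT COUNT(*)
+FROM merge_requests
+WHERE project_id = 200 AND milestone_id = 15
+```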
+
+### Banned users filter
+
+A recently added feature in GitLab filters out merge requests where the author is
+banned by the admins. The banned users are tracked on the instance level in the
+`banned_users` database table.
+
+#### Idea 1: Enumerate the banned user IDs
+
+This would require no structural changes to the ClickHouse database schema.
+We could query the banned user IDs from PostgreSQL and filter them out at query time.
+
+Get the banned users (in PostgreSQL):
+
+```sql
+SELECT user_id FROM banned_users
+```
+
+In ClickHouse:
+
+```sql
+SELECT
+  toYear(merged_at) AS year,
+  toMonth(merged_at) AS month,
+  COUNT(*)
+FROM merge_requests
+WHERE
+  author_id NOT IN (1, 2, 3, 4) -- banned users
+  AND project_id = 200
+  AND milestone_id = 15
+  AND has(label_ids, 118) -- array includes 118
+  AND merged_at BETWEEN '2022-01-01 00:00:00'
+  AND '2023-01-01 00:00:00'
+GROUP BY year, month
+```
+
+The problem with this approach is that the number of banned users could increase significantly, which would make the query larger and slower.
+
+#### Idea 2: Replicate the `banned_users` table
+
+Assuming that the `banned_users` table doesn't grow to millions of rows, we could
+attempt to periodically sync the whole table to ClickHouse. With this approach,
+a mostly consistent `banned_users` table could be used in the ClickHouse database query:
+
+```sql
+SELECT
+  toYear(merged_at) AS year,
+  toMonth(merged_at) AS month,
+  COUNT(*)
+FROM merge_requests
+WHERE
+  author_id NOT IN (SELECT user_id FROM banned_users)
+  AND project_id = 200
+  AND milestone_id = 15
+  AND has(label_ids, 118) -- array includes 118
+  AND merged_at BETWEEN '2022-01-01 00:00:00' AND '2023-01-01 00:00:00'
+GROUP BY year, month
+```
+
+Alternatively, the `banned_users` table could be stored as a
+[dictionary](https://clickhouse.com/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts)
+to further improve the query performance.
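+
+A sketch of such a dictionary, assuming a PostgreSQL source (the host,
+credentials, and refresh interval are placeholders):
+
+```sql
+CREATE DICTIONARY banned_users_dict
+(
+    user_id UInt64
+)
+PRIMARY KEY user_id
+SOURCE(POSTGRESQL(host 'postgres-host' port 5432 user 'gitlab' password 'secret' db 'gitlabhq_production' table 'banned_users'))
+LIFETIME(MIN 300 MAX 600) -- re-read the table every 5 to 10 minutes
+LAYOUT(HASHED())
+```
+
+The filter could then become `NOT dictHas('banned_users_dict', author_id)`,
+which avoids the subquery entirely.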
+
+#### Idea 3: Alter the feature
+
+For analytical calculations, it might be acceptable to drop this particular filter.
+This approach assumes that including the merge requests of banned users doesn't skew the statistics significantly.
diff --git a/doc/development/database/clickhouse/tiered_storage.md b/doc/development/database/clickhouse/tiered_storage.md
new file mode 100644
index 00000000000..d9026f47e28
--- /dev/null
+++ b/doc/development/database/clickhouse/tiered_storage.md
@@ -0,0 +1,138 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
+---
+
+# Tiered storage in ClickHouse
+
+NOTE:
+The MergeTree table engine in ClickHouse supports tiered storage.
+See the documentation for [Using Multiple Block Devices for Data Storage](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes)
+for details on setup and further explanation.
+
+Quoting from the [MergeTree documentation](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes):
+
+<!-- vale gitlab.Simplicity = NO -->
+
+> MergeTree family table engines can store data on multiple block devices. For example,
+> it can be useful when the data of a certain table are implicitly split into "hot" and "cold".
+> The most recent data is regularly requested but requires only a small amount of space.
+> On the contrary, the fat-tailed historical data is requested rarely.
+
+<!-- vale gitlab.Simplicity = YES -->
+
+When used with remote storage backends such as
+[Amazon S3](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-s3),
+this makes for a very efficient storage scheme. Storage policies can keep data
+on local disks for a period of time and then move it to object storage.
+
+An [example configuration](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes_configure) can look like this:
+
+```xml
+<storage_configuration>
+    <disks>
+        <fast_ssd>
+            <path>/mnt/fast_ssd/clickhouse/</path>
+        </fast_ssd>
+        <gcs>
+            <support_batch_delete>false</support_batch_delete>
+            <type>s3</type>
+            <endpoint>https://storage.googleapis.com/${BUCKET_NAME}/${ROOT_FOLDER}/</endpoint>
+            <access_key_id>${SERVICE_ACCOUNT_HMAC_KEY}</access_key_id>
+            <secret_access_key>${SERVICE_ACCOUNT_HMAC_SECRET}</secret_access_key>
+            <metadata_path>/var/lib/clickhouse/disks/gcs/</metadata_path>
+        </gcs>
+        ...
+    </disks>
+    ...
+    <policies>
+        <move_from_local_disks_to_gcs> <!-- policy name -->
+            <volumes>
+                <hot> <!-- volume name -->
+                    <disk>fast_ssd</disk> <!-- disk name -->
+                </hot>
+                <cold>
+                    <disk>gcs</disk>
+                </cold>
+            </volumes>
+            <move_factor>0.2</move_factor>
+            <!-- The move factor determines when to move data from the hot volume
+                 to the cold one. See the ClickHouse docs for more details. -->
+        </move_from_local_disks_to_gcs>
+    </policies>
+    ...
+</storage_configuration>
+```
+
+In this storage policy, two volumes are defined: `hot` and `cold`. When the free space on the `hot` volume drops below `disk_size * move_factor`, data is moved to Google Cloud Storage (GCS).
+
+If this storage policy is not the default, create tables with the storage policy attached. For example:
+
+```sql
+CREATE TABLE key_value_table (
+    event_date Date,
+    key String,
+    value String
+) ENGINE = MergeTree
+ORDER BY (key)
+PARTITION BY toYYYYMM(event_date)
+SETTINGS storage_policy = 'move_from_local_disks_to_gcs'
+```
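+
+To verify where the parts of the table physically live, you can query the
+`system.parts` table:
+
+```sql
+SELECT name, disk_name, rows
+FROM system.parts
+WHERE table = 'key_value_table' AND active
+```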
+
+NOTE:
+In this storage policy, the move happens implicitly. It is also possible to keep
+_hot_ data on local disks for a fixed period of time and then move it to the _cold_ volume.
+
+This approach is possible with
+[Table TTLs](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree-table-ttl),
+which are also available with the MergeTree table engine.
+
+The ClickHouse documentation describes this feature in detail, with an example of
+[implementing a hot - warm - cold architecture](https://clickhouse.com/docs/en/guides/developer/ttl/#implementing-a-hotwarmcold-architecture).
+
+You can take a similar approach for the example shown above. First, adjust the storage policy:
+
+```xml
+<storage_configuration>
+    ...
+    <policies>
+        <local_disk_and_gcs> <!-- policy name -->
+            <volumes>
+                <hot> <!-- volume name -->
+                    <disk>fast_ssd</disk> <!-- disk name -->
+                </hot>
+                <cold>
+                    <disk>gcs</disk>
+                </cold>
+            </volumes>
+        </local_disk_and_gcs>
+    </policies>
+    ...
+</storage_configuration>
+```
+
+Then create the table as:
+
+```sql
+CREATE TABLE another_key_value_table (
+    event_date Date,
+    key String,
+    value String
+) ENGINE = MergeTree
+ORDER BY (key)
+PARTITION BY toYYYYMM(event_date)
+TTL
+    event_date TO VOLUME 'hot',
+    event_date + INTERVAL 1 YEAR TO VOLUME 'cold'
+SETTINGS storage_policy = 'local_disk_and_gcs';
+```
+
+This creates the table so that data older than 1 year (evaluated against the
+`event_date` column) is moved to GCS. Such a storage policy can be helpful for append-only
+tables (like audit events) where only the most recent data is accessed frequently.
+You can also drop the data altogether after a retention period, which can be a regulatory requirement.
+
+Modifying TTLs is not covered in this guide, but it is possible as well. For
+details, see the ClickHouse documentation on
+[modifying TTL](https://clickhouse.com/docs/en/sql-reference/statements/alter/ttl/#modify-ttl).
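+
+For example, extending the local-disk retention of the table above to two years
+could look like this sketch:
+
+```sql
+ALTER TABLE another_key_value_table
+    MODIFY TTL
+        event_date TO VOLUME 'hot',
+        event_date + INTERVAL 2 YEAR TO VOLUME 'cold';
+```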