Welcome to mirror list, hosted at ThFree Co, Russian Federation.

gitlab.com/gitlab-org/gitlab-foss.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
Diffstat (limited to 'doc/development/database/clickhouse/tiered_storage.md')
-rw-r--r--doc/development/database/clickhouse/tiered_storage.md138
1 files changed, 138 insertions, 0 deletions
diff --git a/doc/development/database/clickhouse/tiered_storage.md b/doc/development/database/clickhouse/tiered_storage.md
new file mode 100644
index 00000000000..d9026f47e28
--- /dev/null
+++ b/doc/development/database/clickhouse/tiered_storage.md
@@ -0,0 +1,138 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
+---
+
+# Tiered Storages in ClickHouse
+
+NOTE:
+The MergeTree table engine in ClickHouse supports tiered storage.
+See the documentation for [Using Multiple Block Devices for Data Storage](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes)
+for details on setup and further explanation.
+
+Quoting from the [MergeTree documentation](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes):
+
+<!-- vale gitlab.Simplicity = NO -->
+
+> MergeTree family table engines can store data on multiple block devices. For example,
+> it can be useful when the data of a certain table are implicitly split into "hot" and "cold".
+> The most recent data is regularly requested but requires only a small amount of space.
+> On the contrary, the fat-tailed historical data is requested rarely.
+
+<!-- vale gitlab.Simplicity = YES -->
+
+When used with remote storage backends such as
+[Amazon S3](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-s3),
+this makes a very efficient storage scheme. It allows for storage policies, which
+allows data to be on local disks for a period of time and then move it to object storage.
+
+An [example configuration](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes_configure) can look like this:
+
+```xml
+<storage_configuration>
+ <disks>
+ <fast_ssd>
+ <path>/mnt/fast_ssd/clickhouse/</path>
+ </fast_ssd>
+ <gcs>
+ <support_batch_delete>false</support_batch_delete>
+ <type>s3</type>
+ <endpoint>https://storage.googleapis.com/${BUCKET_NAME}/${ROOT_FOLDER}/</endpoint>
+ <access_key_id>${SERVICE_ACCOUNT_HMAC_KEY}</access_key_id>
+ <secret_access_key>${SERVICE_ACCOUNT_HMAC_SECRET}</secret_access_key>
+ <metadata_path>/var/lib/clickhouse/disks/gcs/</metadata_path>
+ </gcs>
+ ...
+ </disks>
+ ...
+ <policies>
+
+ <move_from_local_disks_to_gcs> <!-- policy name -->
+ <volumes>
+ <hot> <!-- volume name -->
+ <disk>fast_ssd</disk> <!-- disk name -->
+ </hot>
+ <cold>
+ <disk>gcs</disk>
+ </cold>
+ </volumes>
+ <move_factor>0.2</move_factor>
+ <!-- The move factor determines when to move data from hot volume to cold.
+ See ClickHouse docs for more details. -->
+ </moving_from_ssd_to_hdd>
+ ....
+</storage_configuration>
+```
+
+In this storage policy, two volumes are defined `hot` and `cold`. After the `hot` volume is filled with occupancy of `disk_size * move_factor`, the data is being moved to Google Cloud Storage (GCS).
+
+If this storage policy is not the default, create tables by attaching the storage policies. For example:
+
+```sql
+CREATE TABLE key_value_table (
+ event_date Date,
+ key String,
+ value String,
+) ENGINE = MergeTree
+ORDER BY (key)
+PARTITION BY toYYYYMM(event_date)
+SETTINGS storage_policy = 'move_from_local_disks_to_gcs'
+```
+
+NOTE:
+In this storage policy, the move happens implicitly. It is also possible to keep
+_hot_ data on local disks for a fixed period of time and then move them as _cold_.
+
+This approach is possible with
+[Table TTLs](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree-table-ttl),
+which are also available with MergeTree table engine.
+
+The ClickHouse documentation shows this feature in detail, in the example of
+[implementing a hot - warm - cold architecture](https://clickhouse.com/docs/en/guides/developer/ttl/#implementing-a-hotwarmcold-architecture).
+
+You can take a similar approach for the example shown above. First, adjust the storage policy:
+
+```xml
+<storage_configuration>
+ ...
+ <policies>
+ <local_disk_and_gcs> <!-- policy name -->
+ <volumes>
+ <hot> <!-- volume name -->
+ <disk>fast_ssd</disk> <!-- disk name -->
+ </hot>
+ <cold>
+ <disk>gcs</disk>
+ </cold>
+ </volumes>
+ </local_disk_and_gcs>
+ ....
+</storage_configuration>
+```
+
+Then create the table as:
+
+```sql
+CREATE TABLE another_key_value_table (
+ event_date Date,
+ key String,
+ value String,
+) ENGINE = MergeTree
+ORDER BY (key)
+PARTITION BY toYYYYMM(event_date)
+TTL
+ event_date TO VOLUME 'hot',
+ event_date + INTERVAL 1 YEAR TO VOLUME 'cold'
+SETTINGS storage_policy = 'local_disk_and_gcs';
+```
+
+This creates the table so that data older than 1 year (evaluated against the
+`event_date` column) is moved to GCS. Such a storage policy can be helpful for append-only
+tables (like audit events) where only the most recent data is accessed frequently.
+You can drop the data altogether, which can be a regulatory requirement.
+
+We don't mention modifying TTLs in this guide, but that is possible as well.
+See ClickHouse documentation for
+[modifying TTL](https://clickhouse.com/docs/en/sql-reference/statements/alter/ttl/#modify-ttl)
+for details.