
github.com/zabbix/zabbix.git
author    Yulia Chukina <yulia.chukina@zabbix.com>  2021-04-08 12:42:34 +0300
committer Yulia Chukina <yulia.chukina@zabbix.com>  2021-04-08 12:42:34 +0300
commit    8a68b233d94510f0738d5c8ab5353c45c00580e7 (patch)
tree      b7738653a1c186c5f6a495d2275775d254e14801 /templates
parent    483b4bfa329b84d5946747d2d52383a53ad57fbe (diff)

[ZBXNEXT-6504] added Templates "TiDB by HTTP", "TiDB TiKV by HTTP" and "TiDB PD by HTTP"
Diffstat (limited to 'templates')
-rw-r--r--  templates/db/tidb_http/tidb_pd_http/README.md                          108
-rw-r--r--  templates/db/tidb_http/tidb_pd_http/template_db_tidb_pd_http.yaml      874
-rw-r--r--  templates/db/tidb_http/tidb_tidb_http/README.md                        131
-rw-r--r--  templates/db/tidb_http/tidb_tidb_http/template_db_tidb_tidb_http.yaml  1266
-rw-r--r--  templates/db/tidb_http/tidb_tikv_http/README.md                        112
-rw-r--r--  templates/db/tidb_http/tidb_tikv_http/template_db_tidb_tikv_http.yaml  1005
6 files changed, 3496 insertions, 0 deletions
diff --git a/templates/db/tidb_http/tidb_pd_http/README.md b/templates/db/tidb_http/tidb_pd_http/README.md
new file mode 100644
index 00000000000..33e0a0a245b
--- /dev/null
+++ b/templates/db/tidb_http/tidb_pd_http/README.md
@@ -0,0 +1,108 @@
+
+# TiDB PD by HTTP
+
+## Overview
+
+For Zabbix version: 5.4 and higher
+This template monitors the PD server of a TiDB cluster with Zabbix and works without any external scripts.
+Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.
+
+The template `TiDB PD by HTTP` collects metrics with the HTTP agent from the PD /metrics endpoint and from the monitoring API.
+See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.
+
+
+This template was tested on:
+
+- TiDB cluster, version 4.0.10
+
+## Setup
+
+> See [Zabbix template operation](https://www.zabbix.com/documentation/5.4/manual/config/templates_out_of_the_box/http) for basic instructions.
+
+This template works with the PD server of a TiDB cluster.
+Internal service metrics are collected from the PD /metrics endpoint and from the monitoring API.
+See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.
+Don't forget to change the macros {$PD.URL} and {$PD.PORT}.
+Also, see the Macros section for a list of macros used to set trigger values.
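As a quick sanity check, the two request URLs the template builds from these macros can be composed by hand. A minimal sketch using the macro defaults (the `http://` scheme is an assumption here, since the default `{$PD.URL}` value carries no scheme):

```python
# Compose the endpoints polled by the template's two HTTP agent items,
# using the {$PD.URL} and {$PD.PORT} macro defaults.
pd_url = "localhost"  # {$PD.URL}
pd_port = 2379        # {$PD.PORT}

metrics_url = f"http://{pd_url}:{pd_port}/metrics"          # "PD: Get instance metrics"
status_url = f"http://{pd_url}:{pd_port}/pd/api/v1/status"  # "PD: Get instance status"
print(metrics_url)
print(status_url)
```

Fetching both URLs (e.g. with curl) from the Zabbix server before linking the template confirms that PD is reachable.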
+
+
+## Zabbix configuration
+
+No specific Zabbix configuration is required.
+
+### Macros used
+
+|Name|Description|Default|
+|----|-----------|-------|
+|{$PD.MISS_REGION.MAX.WARN} |<p>Maximum number of missed regions</p> |`100` |
+|{$PD.PORT} |<p>The port of PD server metrics web endpoint</p> |`2379` |
+|{$PD.STORAGE_USAGE.MAX.WARN} |<p>Maximum percentage of cluster space used</p> |`80` |
+|{$PD.URL} |<p>PD server URL</p> |`localhost` |
+
+## Template links
+
+There are no template links in this template.
+
+## Discovery rules
+
+|Name|Description|Type|Key and additional info|
+|----|-----------|----|----|
+|Cluster metrics discovery |<p>Discovery of cluster-specific metrics.</p> |DEPENDENT |pd.cluster.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="pd_cluster_status")]`</p><p>- JAVASCRIPT: `return JSON.stringify(value != "[]" ? [{'{#SINGLETON}': ''}] : []);`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|Region labels discovery |<p>Discovery of region-label-specific metrics.</p> |DEPENDENT |pd.region_labels.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_regions_label_level")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|Region status discovery |<p>Discovery of region-status-specific metrics.</p> |DEPENDENT |pd.region_status.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_regions_status")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p><p>**Overrides:**</p><p>Too many missed regions trigger<br> - {#TYPE} MATCHES_REGEX `miss_peer_region_count`<br> - TRIGGER_PROTOTYPE LIKE `Too many missed regions` - DISCOVER</p><p>Unresponsive peers trigger<br> - {#TYPE} MATCHES_REGEX `down_peer_region_count`<br> - TRIGGER_PROTOTYPE LIKE `There are unresponsive peers` - DISCOVER</p> |
+|Running scheduler discovery |<p>Discovery of scheduler-specific metrics.</p> |DEPENDENT |pd.scheduler.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_status" && @.labels.type == "allow")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|gRPC commands discovery |<p>Discovery of gRPC-command-specific metrics.</p> |DEPENDENT |pd.grpc_command.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "grpc_server_handling_seconds_count")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|Region discovery |<p>Discovery of region-specific metrics.</p> |DEPENDENT |pd.region.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+
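The gRPC commands and Region discovery rules both collapse the PROMETHEUS_TO_JSON output of the master item to one LLD row per unique label value. The template's JavaScript preprocessing step is equivalent to this Python sketch (the sample input is hypothetical):

```python
import json

# Hypothetical PROMETHEUS_TO_JSON output for grpc_server_handling_seconds_count
value = json.dumps([
    {"name": "grpc_server_handling_seconds_count", "labels": {"grpc_method": "Tso"}, "value": "42"},
    {"name": "grpc_server_handling_seconds_count", "labels": {"grpc_method": "Tso"}, "value": "7"},
    {"name": "grpc_server_handling_seconds_count", "labels": {"grpc_method": "StoreHeartbeat"}, "value": "3"},
])

# Same dedup-by-label logic as the template's JAVASCRIPT step:
# emit each grpc_method label once as an LLD macro row.
seen, result = set(), []
for item in json.loads(value):
    method = item["labels"]["grpc_method"]
    if method not in seen:
        seen.add(method)
        result.append({"{#GRPC_METHOD}": method})

print(json.dumps(result))
# -> [{"{#GRPC_METHOD}": "Tso"}, {"{#GRPC_METHOD}": "StoreHeartbeat"}]
```

The Region discovery rule applies the same pattern to the `address` label, emitting `{#STORE_ADDRESS}` rows.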
+## Items collected
+
+|Group|Name|Description|Type|Key and additional info|
+|-----|----|-----------|----|---------------------|
+|PD instance |PD: Status |<p>Status of PD instance.</p> |DEPENDENT |pd.status<p>**Preprocessing**:</p><p>- JSONPATH: `$.status`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 1`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|PD instance |PD: GRPC Commands total, rate |<p>The rate at which gRPC commands are completed.</p> |DEPENDENT |pd.grpc_command.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "grpc_server_handling_seconds_count")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND |
+|PD instance |PD: Version |<p>Version of the PD instance.</p> |DEPENDENT |pd.version<p>**Preprocessing**:</p><p>- JSONPATH: `$.version`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `3h`</p> |
+|PD instance |PD: Uptime |<p>The runtime of each PD instance.</p> |DEPENDENT |pd.uptime<p>**Preprocessing**:</p><p>- JSONPATH: `$.start_timestamp`</p><p>- JAVASCRIPT: `//use boottime to calculate uptime return (Math.floor(Date.now()/1000)-Number(value)); `</p> |
+|PD instance |PD: GRPC Commands: {#GRPC_METHOD}, rate |<p>The rate per command type at which gRPC commands are completed.</p> |DEPENDENT |pd.grpc_command.rate[{#GRPC_METHOD}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "grpc_server_handling_seconds_count" && @.labels.grpc_method == "{#GRPC_METHOD}")].value.first()`</p><p>- CHANGE_PER_SECOND |
+|TiDB cluster |TiDB cluster: Offline stores |<p>-</p> |DEPENDENT |pd.cluster_status.store_offline[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_offline_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|TiDB cluster |TiDB cluster: Tombstone stores |<p>The count of tombstone stores.</p> |DEPENDENT |pd.cluster_status.store_tombstone[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_tombstone_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|TiDB cluster |TiDB cluster: Down stores |<p>The count of down stores.</p> |DEPENDENT |pd.cluster_status.store_down[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_down_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|TiDB cluster |TiDB cluster: Lowspace stores |<p>The count of low space stores.</p> |DEPENDENT |pd.cluster_status.store_low_space[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_low_space_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|TiDB cluster |TiDB cluster: Unhealth stores |<p>The count of unhealthy stores.</p> |DEPENDENT |pd.cluster_status.store_unhealth[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_unhealth_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|TiDB cluster |TiDB cluster: Disconnect stores |<p>The count of disconnected stores.</p> |DEPENDENT |pd.cluster_status.store_disconnected[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_disconnected_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|TiDB cluster |TiDB cluster: Normal stores |<p>The count of healthy storage instances.</p> |DEPENDENT |pd.cluster_status.store_up[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_up_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|TiDB cluster |TiDB cluster: Storage capacity |<p>The total storage capacity for this TiDB cluster.</p> |DEPENDENT |pd.cluster_status.storage_capacity[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "storage_capacity")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|TiDB cluster |TiDB cluster: Storage size |<p>The storage size that is currently used by the TiDB cluster.</p> |DEPENDENT |pd.cluster_status.storage_size[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "storage_size")].value.first()`</p> |
+|TiDB cluster |TiDB cluster: Number of regions |<p>The total count of cluster Regions.</p> |DEPENDENT |pd.cluster_status.leader_count[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "leader_count")].value.first()`</p> |
+|TiDB cluster |TiDB cluster: Current peer count |<p>The current count of all cluster peers.</p> |DEPENDENT |pd.cluster_status.region_count[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "region_count")].value.first()`</p> |
+|TiDB cluster |TiDB cluster: Regions label: {#TYPE} |<p>The number of Regions in different label levels.</p> |DEPENDENT |pd.region_labels[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_regions_label_level" && @.labels.type == "{#TYPE}")].value.first()`</p> |
+|TiDB cluster |TiDB cluster: Regions status: {#TYPE} |<p>The health status of Regions indicated via the count of unusual Regions including pending peers, down peers, extra peers, offline peers, missing peers, learner peers and incorrect namespaces.</p> |DEPENDENT |pd.region_status[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_regions_status" && @.labels.type == "{#TYPE}")].value.first()`</p> |
+|TiDB cluster |TiDB cluster: Scheduler status: {#KIND} |<p>The currently running schedulers.</p> |DEPENDENT |pd.scheduler[{#KIND}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_status" && @.labels.type == "allow" && @.labels.kind == "{#KIND}")].value.first()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p> |
+|TiDB cluster |PD: Region heartbeat: active, rate |<p>The count of heartbeats with the ok status per second.</p> |DEPENDENT |pd.region_heartbeat.ok.rate[{#STORE_ADDRESS}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.status == "ok" && @.labels.type == "report" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND |
+|TiDB cluster |PD: Region heartbeat: error, rate |<p>The count of heartbeats with the error status per second.</p> |DEPENDENT |pd.region_heartbeat.error.rate[{#STORE_ADDRESS}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.status == "err" && @.labels.type == "report" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND |
+|TiDB cluster |PD: Region heartbeat: total, rate |<p>The count of heartbeats reported to PD per instance per second.</p> |DEPENDENT |pd.region_heartbeat.rate[{#STORE_ADDRESS}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.type == "report" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND |
+|TiDB cluster |PD: Region schedule push: error, rate | |DEPENDENT |pd.region_heartbeat.push.err.rate[{#STORE_ADDRESS}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.type == "push" && @.labels.address == "{#STORE_ADDRESS}" && @.labels.status == "err" )].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND |
+|TiDB cluster |PD: Region schedule push: ok, rate | |DEPENDENT |pd.region_heartbeat.push.err.rate[{#STORE_ADDRESS}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.type == "push" && @.labels.address == "{#STORE_ADDRESS}" && @.labels.status == "ok" )].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND |
+|TiDB cluster |PD: Region schedule push: total, rate | |DEPENDENT |pd.region_heartbeat.push.err.rate[{#STORE_ADDRESS}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.type == "push" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND |
+|Zabbix_raw_items |PD: Get instance metrics |<p>Get TiDB PD instance metrics.</p> |HTTP_AGENT |pd.get_metrics<p>**Preprocessing**:</p><p>- CHECK_NOT_SUPPORTED<p>- PROMETHEUS_TO_JSON |
+|Zabbix_raw_items |PD: Get instance status |<p>Get TiDB PD instance status info.</p> |HTTP_AGENT |pd.get_status<p>**Preprocessing**:</p><p>- CHECK_NOT_SUPPORTED |
+
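Most items above select from the PROMETHEUS_TO_JSON output of the master item `pd.get_metrics` with a JSONPath filter. A minimal sketch of that pipeline in Python (the Prometheus sample lines are hypothetical, and Zabbix's real PROMETHEUS_TO_JSON step also carries help/type metadata):

```python
import json

# Hypothetical scrape of the PD /metrics endpoint
raw = """\
pd_cluster_status{type="storage_capacity"} 2.0e+10
pd_cluster_status{type="storage_size"} 5.0e+9
grpc_server_handling_seconds_count{grpc_method="Tso"} 120
grpc_server_handling_seconds_count{grpc_method="StoreHeartbeat"} 30
"""

# Rough equivalent of the PROMETHEUS_TO_JSON step
# (assumes every sample line carries a label set).
metrics = []
for line in raw.splitlines():
    name_part, value = line.rsplit(" ", 1)
    name, _, labels_part = name_part.partition("{")
    labels = dict(kv.split("=") for kv in labels_part.rstrip("}").split(","))
    labels = {k: v.strip('"') for k, v in labels.items()}
    metrics.append({"name": name, "labels": labels, "value": value})

# JSONPATH: $[?(@.name == "grpc_server_handling_seconds_count")].value.sum()
total = sum(float(m["value"]) for m in metrics
            if m["name"] == "grpc_server_handling_seconds_count")
print(total)  # 150.0
```

The `pd.grpc_command.rate` item then applies CHANGE_PER_SECOND to this sum to turn the counter into a rate.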
+## Triggers
+
+|Name|Description|Expression|Severity|Dependencies and additional info|
+|----|-----------|----|----|----|
+|PD: Instance is not responding |<p>-</p> |`{TEMPLATE_NAME:pd.status.last()}=0` |AVERAGE | |
+|PD: Version has changed (new version: {ITEM.VALUE}) |<p>PD version has changed. Ack to close.</p> |`{TEMPLATE_NAME:pd.version.diff()}=1 and {TEMPLATE_NAME:pd.version.strlen()}>0` |INFO |<p>Manual close: YES</p> |
+|PD: has been restarted (uptime < 10m) |<p>Uptime is less than 10 minutes</p> |`{TEMPLATE_NAME:pd.uptime.last()}<10m` |INFO |<p>Manual close: YES</p> |
+|TiDB cluster: There are offline TiKV nodes |<p>PD has not received a TiKV heartbeat for a long time.</p> |`{TEMPLATE_NAME:pd.cluster_status.store_down[{#SINGLETON}].last()}>0` |AVERAGE | |
+|TiDB cluster: There are low space TiKV nodes |<p>Indicates that there is insufficient space on the TiKV node.</p> |`{TEMPLATE_NAME:pd.cluster_status.store_low_space[{#SINGLETON}].last()}>0` |AVERAGE | |
+|TiDB cluster: There are disconnected TiKV nodes |<p>PD does not receive a TiKV heartbeat within 20 seconds. Normally a TiKV heartbeat comes in every 10 seconds.</p> |`{TEMPLATE_NAME:pd.cluster_status.store_disconnected[{#SINGLETON}].last()}>0` |WARNING | |
+|TiDB cluster: Current storage usage is too high (over {$PD.STORAGE_USAGE.MAX.WARN}% for 5m) |<p>Over {$PD.STORAGE_USAGE.MAX.WARN}% of the cluster space is occupied.</p> |`{TEMPLATE_NAME:pd.cluster_status.storage_size[{#SINGLETON}].min(5m)}/{TiDB PD by HTTP:pd.cluster_status.storage_capacity[{#SINGLETON}].last()}*100>{$PD.STORAGE_USAGE.MAX.WARN}` |WARNING | |
+|TiDB cluster: Too many missed regions (over {$PD.MISS_REGION.MAX.WARN} in 5m) |<p>The number of Region replicas is smaller than the value of max-replicas. When a TiKV machine is down and its downtime exceeds max-down-time, it usually leads to missing replicas for some Regions during a period of time. When a TiKV node is made offline, it might result in a small number of Regions with missing replicas.</p> |`{TEMPLATE_NAME:pd.region_status[{#TYPE}].min(5m)}>{$PD.MISS_REGION.MAX.WARN}` |WARNING | |
+|TiDB cluster: There are unresponsive peers |<p>The number of Regions with an unresponsive peer reported by the Raft leader.</p> |`{TEMPLATE_NAME:pd.region_status[{#TYPE}].min(5m)}>0` |WARNING | |
+
+## Feedback
+
+Please report any issues with the template at https://support.zabbix.com
+
+You can also provide feedback, discuss the template, or ask for help at [ZABBIX forums](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback).
+
diff --git a/templates/db/tidb_http/tidb_pd_http/template_db_tidb_pd_http.yaml b/templates/db/tidb_http/tidb_pd_http/template_db_tidb_pd_http.yaml
new file mode 100644
index 00000000000..e53fccf2695
--- /dev/null
+++ b/templates/db/tidb_http/tidb_pd_http/template_db_tidb_pd_http.yaml
@@ -0,0 +1,874 @@
+zabbix_export:
+ version: '5.4'
+ date: '2021-04-08T09:02:39Z'
+ groups:
+ -
+ name: Templates/Databases
+ templates:
+ -
+ template: 'TiDB PD by HTTP'
+ name: 'TiDB PD by HTTP'
+ description: |
+ This template monitors the PD server of a TiDB cluster with Zabbix and works without any external scripts.
+ Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.
+ Don't forget to change the macros {$PD.URL}, {$PD.PORT}.
+
+ The template `TiDB PD by HTTP` collects metrics with the HTTP agent from the PD /metrics endpoint and from the monitoring API.
+
+ You can discuss this template or leave feedback on our forum https://www.zabbix.com/forum/zabbix-suggestions-and-feedback
+
+ Template tooling version used: 0.38
+ groups:
+ -
+ name: Templates/Databases
+ applications:
+ -
+ name: 'PD instance'
+ -
+ name: 'TiDB cluster'
+ -
+ name: 'Zabbix raw items'
+ items:
+ -
+ name: 'PD: Get instance metrics'
+ type: HTTP_AGENT
+ key: pd.get_metrics
+ history: '0'
+ trends: '0'
+ value_type: TEXT
+ description: 'Get TiDB PD instance metrics.'
+ applications:
+ -
+ name: 'Zabbix raw items'
+ preprocessing:
+ -
+ type: CHECK_NOT_SUPPORTED
+ parameters:
+ - ''
+ -
+ type: PROMETHEUS_TO_JSON
+ parameters:
+ - ''
+ url: '{$PD.URL}:{$PD.PORT}/metrics'
+ -
+ name: 'PD: Get instance status'
+ type: HTTP_AGENT
+ key: pd.get_status
+ history: '0'
+ trends: '0'
+ value_type: TEXT
+ description: 'Get TiDB PD instance status info.'
+ applications:
+ -
+ name: 'Zabbix raw items'
+ preprocessing:
+ -
+ type: CHECK_NOT_SUPPORTED
+ parameters:
+ - ''
+ error_handler: CUSTOM_VALUE
+ error_handler_params: '{"status": "0"}'
+ url: '{$PD.URL}:{$PD.PORT}/pd/api/v1/status'
+ -
+ name: 'PD: GRPC Commands total, rate'
+ type: DEPENDENT
+ key: pd.grpc_command.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The rate at which gRPC commands are completed.'
+ applications:
+ -
+ name: 'PD instance'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "grpc_server_handling_seconds_count")].value.sum()'
+ error_handler: DISCARD_VALUE
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: pd.get_metrics
+ -
+ name: 'PD: Status'
+ type: DEPENDENT
+ key: pd.status
+ delay: '0'
+ history: 7d
+ trends: '0'
+ value_type: CHAR
+ description: 'Status of PD instance.'
+ applications:
+ -
+ name: 'PD instance'
+ valuemap:
+ name: 'Service state'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - $.status
+ error_handler: CUSTOM_VALUE
+ error_handler_params: '1'
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ master_item:
+ key: pd.get_status
+ triggers:
+ -
+ expression: '{last()}=0'
+ name: 'PD: Instance is not responding'
+ priority: AVERAGE
+ -
+ name: 'PD: Uptime'
+ type: DEPENDENT
+ key: pd.uptime
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: uptime
+ description: 'The runtime of each PD instance.'
+ applications:
+ -
+ name: 'PD instance'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - $.start_timestamp
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ //use boottime to calculate uptime
+ return (Math.floor(Date.now()/1000)-Number(value));
+ master_item:
+ key: pd.get_status
+ triggers:
+ -
+ expression: '{last()}<10m'
+ name: 'PD: has been restarted (uptime < 10m)'
+ priority: INFO
+ description: 'Uptime is less than 10 minutes'
+ manual_close: 'YES'
+ -
+ name: 'PD: Version'
+ type: DEPENDENT
+ key: pd.version
+ delay: '0'
+ history: 7d
+ trends: '0'
+ value_type: CHAR
+ description: 'Version of the PD instance.'
+ applications:
+ -
+ name: 'PD instance'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - $.version
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 3h
+ master_item:
+ key: pd.get_status
+ triggers:
+ -
+ expression: '{diff()}=1 and {strlen()}>0'
+ name: 'PD: Version has changed (new version: {ITEM.VALUE})'
+ priority: INFO
+ description: 'PD version has changed. Ack to close.'
+ manual_close: 'YES'
+ discovery_rules:
+ -
+ name: 'Cluster metrics discovery'
+ type: DEPENDENT
+ key: pd.cluster.discovery
+ delay: '0'
+ description: 'Discovery of cluster-specific metrics.'
+ item_prototypes:
+ -
+ name: 'TiDB cluster: Number of regions'
+ type: DEPENDENT
+ key: 'pd.cluster_status.leader_count[{#SINGLETON}]'
+ delay: '0'
+ history: 7d
+ description: 'The total count of cluster Regions.'
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_cluster_status" && @.labels.type == "leader_count")].value.first()'
+ master_item:
+ key: pd.get_metrics
+ -
+ name: 'TiDB cluster: Current peer count'
+ type: DEPENDENT
+ key: 'pd.cluster_status.region_count[{#SINGLETON}]'
+ delay: '0'
+ history: 7d
+ description: 'The current count of all cluster peers.'
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_cluster_status" && @.labels.type == "region_count")].value.first()'
+ master_item:
+ key: pd.get_metrics
+ -
+ name: 'TiDB cluster: Storage capacity'
+ type: DEPENDENT
+ key: 'pd.cluster_status.storage_capacity[{#SINGLETON}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: B
+ description: 'The total storage capacity for this TiDB cluster.'
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_cluster_status" && @.labels.type == "storage_capacity")].value.first()'
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ master_item:
+ key: pd.get_metrics
+ -
+ name: 'TiDB cluster: Storage size'
+ type: DEPENDENT
+ key: 'pd.cluster_status.storage_size[{#SINGLETON}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: B
+ description: 'The storage size that is currently used by the TiDB cluster.'
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_cluster_status" && @.labels.type == "storage_size")].value.first()'
+ master_item:
+ key: pd.get_metrics
+ -
+ name: 'TiDB cluster: Disconnect stores'
+ type: DEPENDENT
+ key: 'pd.cluster_status.store_disconnected[{#SINGLETON}]'
+ delay: '0'
+ history: 7d
+ description: 'The count of disconnected stores.'
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_disconnected_count")].value.first()'
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ master_item:
+ key: pd.get_metrics
+ trigger_prototypes:
+ -
+ expression: '{last()}>0'
+ name: 'TiDB cluster: There are disconnected TiKV nodes'
+ priority: WARNING
+ description: 'PD does not receive a TiKV heartbeat within 20 seconds. Normally a TiKV heartbeat comes in every 10 seconds.'
+ -
+ name: 'TiDB cluster: Down stores'
+ type: DEPENDENT
+ key: 'pd.cluster_status.store_down[{#SINGLETON}]'
+ delay: '0'
+ history: 7d
+ description: 'The count of down stores.'
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_down_count")].value.first()'
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ master_item:
+ key: pd.get_metrics
+ trigger_prototypes:
+ -
+ expression: '{last()}>0'
+ name: 'TiDB cluster: There are offline TiKV nodes'
+ priority: AVERAGE
+ description: 'PD has not received a TiKV heartbeat for a long time.'
+ -
+ name: 'TiDB cluster: Lowspace stores'
+ type: DEPENDENT
+ key: 'pd.cluster_status.store_low_space[{#SINGLETON}]'
+ delay: '0'
+ history: 7d
+ description: 'The count of low space stores.'
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_low_space_count")].value.first()'
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ master_item:
+ key: pd.get_metrics
+ trigger_prototypes:
+ -
+ expression: '{last()}>0'
+ name: 'TiDB cluster: There are low space TiKV nodes'
+ priority: AVERAGE
+ description: 'Indicates that there is insufficient space on the TiKV node.'
+ -
+ name: 'TiDB cluster: Offline stores'
+ type: DEPENDENT
+ key: 'pd.cluster_status.store_offline[{#SINGLETON}]'
+ delay: '0'
+ history: 7d
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_offline_count")].value.first()'
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ master_item:
+ key: pd.get_metrics
+ -
+ name: 'TiDB cluster: Tombstone stores'
+ type: DEPENDENT
+ key: 'pd.cluster_status.store_tombstone[{#SINGLETON}]'
+ delay: '0'
+ history: 7d
+ description: 'The count of tombstone stores.'
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_tombstone_count")].value.first()'
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ master_item:
+ key: pd.get_metrics
+ -
+ name: 'TiDB cluster: Unhealth stores'
+ type: DEPENDENT
+ key: 'pd.cluster_status.store_unhealth[{#SINGLETON}]'
+ delay: '0'
+ history: 7d
+ description: 'The count of unhealthy stores.'
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_unhealth_count")].value.first()'
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ master_item:
+ key: pd.get_metrics
+ -
+ name: 'TiDB cluster: Normal stores'
+ type: DEPENDENT
+ key: 'pd.cluster_status.store_up[{#SINGLETON}]'
+ delay: '0'
+ history: 7d
+ description: 'The count of healthy storage instances.'
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_up_count")].value.first()'
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ master_item:
+ key: pd.get_metrics
+ trigger_prototypes:
+ -
+ expression: '{TiDB PD by HTTP:pd.cluster_status.storage_size[{#SINGLETON}].min(5m)}/{TiDB PD by HTTP:pd.cluster_status.storage_capacity[{#SINGLETON}].last()}*100>{$PD.STORAGE_USAGE.MAX.WARN}'
+ name: 'TiDB cluster: Current storage usage is too high (over {$PD.STORAGE_USAGE.MAX.WARN}% for 5m)'
+ priority: WARNING
+ description: 'Over {$PD.STORAGE_USAGE.MAX.WARN}% of the cluster space is occupied.'
+ graph_prototypes:
+ -
+ name: 'TiDB cluster: Storage Usage[{#SINGLETON}]'
+ graph_items:
+ -
+ color: 1A7C11
+ item:
+ host: 'TiDB PD by HTTP'
+ key: 'pd.cluster_status.storage_size[{#SINGLETON}]'
+ -
+ sortorder: '1'
+ color: 2774A4
+ item:
+ host: 'TiDB PD by HTTP'
+ key: 'pd.cluster_status.storage_capacity[{#SINGLETON}]'
+ master_item:
+ key: pd.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="pd_cluster_status")]'
+ error_handler: CUSTOM_VALUE
+ error_handler_params: '[]'
+ -
+ type: JAVASCRIPT
+ parameters:
+ - 'return JSON.stringify(value != "[]" ? [{''{#SINGLETON}'': ''''}] : []);'
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ -
+ name: 'gRPC commands discovery'
+ type: DEPENDENT
+ key: pd.grpc_command.discovery
+ delay: '0'
+ description: 'Discovery of gRPC-command-specific metrics.'
+ item_prototypes:
+ -
+ name: 'PD: GRPC Commands: {#GRPC_METHOD}, rate'
+ type: DEPENDENT
+ key: 'pd.grpc_command.rate[{#GRPC_METHOD}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The rate per command type at which gRPC commands are completed.'
+ applications:
+ -
+ name: 'PD instance'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "grpc_server_handling_seconds_count" && @.labels.grpc_method == "{#GRPC_METHOD}")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: pd.get_metrics
+ master_item:
+ key: pd.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "grpc_server_handling_seconds_count")]'
+ error_handler: DISCARD_VALUE
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ var lookup = {},
+ result = [];
+
+ JSON.parse(value).forEach(function (item) {
+ var grpc_method = item.labels.grpc_method;
+ if (!(lookup[grpc_method])) {
+ lookup[grpc_method] = 1;
+ result.push({ "{#GRPC_METHOD}": grpc_method });
+ }
+ })
+
+ return JSON.stringify(result);
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ -
+ name: 'Region discovery'
+ type: DEPENDENT
+ key: pd.region.discovery
+ delay: '0'
+ description: 'Discovery of region-specific metrics.'
+ item_prototypes:
+ -
+ name: 'PD: Region heartbeat: error, rate'
+ type: DEPENDENT
+ key: 'pd.region_heartbeat.error.rate[{#STORE_ADDRESS}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The count of heartbeats with the error status per second.'
+ application_prototypes:
+ -
+ name: 'TiDB Store [{#STORE_ADDRESS}]'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.status == "err" && @.labels.type == "report" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()'
+ error_handler: CUSTOM_VALUE
+ error_handler_params: '0'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: pd.get_metrics
+ -
+ name: 'PD: Region heartbeat: active, rate'
+ type: DEPENDENT
+ key: 'pd.region_heartbeat.ok.rate[{#STORE_ADDRESS}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The count of heartbeats with the ok status per second.'
+ application_prototypes:
+ -
+ name: 'TiDB Store [{#STORE_ADDRESS}]'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.status == "ok" && @.labels.type == "report" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()'
+ error_handler: CUSTOM_VALUE
+ error_handler_params: '0'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: pd.get_metrics
+ -
+ name: 'PD: Region schedule push: total, rate'
+ type: DEPENDENT
+ key: 'pd.region_heartbeat.push.err.rate[{#STORE_ADDRESS}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ application_prototypes:
+ -
+ name: 'TiDB Store [{#STORE_ADDRESS}]'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.type == "push" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()'
+ error_handler: CUSTOM_VALUE
+ error_handler_params: '0'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: pd.get_metrics
+ -
+ name: 'PD: Region heartbeat: total, rate'
+ type: DEPENDENT
+ key: 'pd.region_heartbeat.rate[{#STORE_ADDRESS}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The count of heartbeats reported to PD per instance per second.'
+ application_prototypes:
+ -
+ name: 'TiDB Store [{#STORE_ADDRESS}]'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.type == "report" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()'
+ error_handler: CUSTOM_VALUE
+ error_handler_params: '0'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: pd.get_metrics
+ master_item:
+ key: pd.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_scheduler_region_heartbeat")]'
+ error_handler: DISCARD_VALUE
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ var lookup = {},
+ result = [];
+
+ JSON.parse(value).forEach(function (item) {
+ var address = item.labels.address;
+ if (!(lookup[address])) {
+ lookup[address] = 1;
+ result.push({ "{#STORE_ADDRESS}": address });
+ }
+ })
+
+ return JSON.stringify(result);
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ -
+ name: 'Region labels discovery'
+ type: DEPENDENT
+ key: pd.region_labels.discovery
+ delay: '0'
+ description: 'Discovers metrics specific to region labels.'
+ item_prototypes:
+ -
+ name: 'TiDB cluster: Regions label: {#TYPE}'
+ type: DEPENDENT
+ key: 'pd.region_labels[{#TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The number of Regions in different label levels.'
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_regions_label_level" && @.labels.type == "{#TYPE}")].value.first()'
+ master_item:
+ key: pd.get_metrics
+ master_item:
+ key: pd.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_regions_label_level")]'
+ error_handler: DISCARD_VALUE
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ output = JSON.parse(value).map(function(item){
+ return {
+ "{#TYPE}": item.labels.type,
+ }})
+ return JSON.stringify({"data": output})
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ -
+ name: 'Region status discovery'
+ type: DEPENDENT
+ key: pd.region_status.discovery
+ delay: '0'
+ description: 'Discovers metrics specific to region status.'
+ item_prototypes:
+ -
+ name: 'TiDB cluster: Regions status: {#TYPE}'
+ type: DEPENDENT
+ key: 'pd.region_status[{#TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The health status of Regions indicated via the count of unusual Regions including pending peers, down peers, extra peers, offline peers, missing peers, learner peers and incorrect namespaces.'
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_regions_status" && @.labels.type == "{#TYPE}")].value.first()'
+ master_item:
+ key: pd.get_metrics
+ trigger_prototypes:
+ -
+ expression: '{min(5m)}>0'
+ name: 'TiDB cluster: There are unresponsive peers'
+ discover: NO_DISCOVER
+ priority: WARNING
+ description: 'The number of Regions with an unresponsive peer reported by the Raft leader.'
+ -
+ expression: '{min(5m)}>{$PD.MISS_REGION.MAX.WARN}'
+ name: 'TiDB cluster: Too many missed regions (over {$PD.MISS_REGION.MAX.WARN} in 5m)'
+ discover: NO_DISCOVER
+ priority: WARNING
+ description: 'The number of Region replicas is smaller than the value of max-replicas. When a TiKV machine is down and its downtime exceeds max-down-time, it usually leads to missing replicas for some Regions during a period of time. When a TiKV node is made offline, it might result in a small number of Regions with missing replicas.'
+ master_item:
+ key: pd.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_regions_status")]'
+ error_handler: DISCARD_VALUE
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ output = JSON.parse(value).map(function(item){
+ return {
+ "{#TYPE}": item.labels.type,
+ }})
+ return JSON.stringify({"data": output})
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ overrides:
+ -
+ name: 'Too many missed regions trigger'
+ step: '1'
+ filter:
+ conditions:
+ -
+ macro: '{#TYPE}'
+ value: miss_peer_region_count
+ formulaid: A
+ operations:
+ -
+ operationobject: TRIGGER_PROTOTYPE
+ operator: LIKE
+ value: 'Too many missed regions'
+ status: ENABLED
+ discover: DISCOVER
+ -
+ name: 'Unresponsive peers trigger'
+ step: '2'
+ filter:
+ conditions:
+ -
+ macro: '{#TYPE}'
+ value: down_peer_region_count
+ formulaid: A
+ operations:
+ -
+ operationobject: TRIGGER_PROTOTYPE
+ operator: LIKE
+ value: 'There are unresponsive peers'
+ status: ENABLED
+ discover: DISCOVER
+ -
+ name: 'Running scheduler discovery'
+ type: DEPENDENT
+ key: pd.scheduler.discovery
+ delay: '0'
+ description: 'Discovers metrics specific to running schedulers.'
+ item_prototypes:
+ -
+ name: 'TiDB cluster: Scheduler status: {#KIND}'
+ type: DEPENDENT
+ key: 'pd.scheduler[{#KIND}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The current running schedulers.'
+ applications:
+ -
+ name: 'TiDB cluster'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_scheduler_status" && @.labels.type == "allow" && @.labels.kind == "{#KIND}")].value.first()'
+ error_handler: CUSTOM_VALUE
+ error_handler_params: '0'
+ master_item:
+ key: pd.get_metrics
+ master_item:
+ key: pd.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "pd_scheduler_status" && @.labels.type == "allow")]'
+ error_handler: DISCARD_VALUE
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ output = JSON.parse(value).map(function(item){
+ return {
+ "{#KIND}": item.labels.kind,
+ }})
+ return JSON.stringify({"data": output})
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ macros:
+ -
+ macro: '{$PD.MISS_REGION.MAX.WARN}'
+ value: '100'
+ description: 'Maximum number of missed regions'
+ -
+ macro: '{$PD.PORT}'
+ value: '2379'
+ description: 'The port of PD server metrics web endpoint'
+ -
+ macro: '{$PD.STORAGE_USAGE.MAX.WARN}'
+ value: '80'
+ description: 'Maximum percentage of cluster space used'
+ -
+ macro: '{$PD.URL}'
+ value: localhost
+ description: 'PD server URL'
+ valuemaps:
+ -
+ name: 'Service state'
+ mappings:
+ -
+ value: '0'
+ newvalue: Down
+ -
+ value: '1'
+ newvalue: Up
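The store discovery rule in this template deduplicates the `address` label of `pd_scheduler_region_heartbeat` series into one LLD row per TiKV store. A runnable version of that preprocessing script follows; the sample input mimics the Prometheus-to-JSON shape, and the addresses are illustrative, not real cluster output:

```javascript
// Sample of what the JSONPATH step hands to the JAVASCRIPT step
// (hypothetical store addresses for illustration).
var value = JSON.stringify([
    { name: "pd_scheduler_region_heartbeat", labels: { address: "tikv-0:20160", type: "report", status: "ok" }, value: 5 },
    { name: "pd_scheduler_region_heartbeat", labels: { address: "tikv-0:20160", type: "push" }, value: 2 },
    { name: "pd_scheduler_region_heartbeat", labels: { address: "tikv-1:20160", type: "report", status: "ok" }, value: 3 }
]);

// Same dedup logic as the template's discovery script:
// remember each address once, emit one {#STORE_ADDRESS} row per store.
var lookup = {},
    result = [];

JSON.parse(value).forEach(function (item) {
    var address = item.labels.address;
    if (!(lookup[address])) {
        lookup[address] = 1;
        result.push({ "{#STORE_ADDRESS}": address });
    }
});

console.log(JSON.stringify(result));
// [{"{#STORE_ADDRESS}":"tikv-0:20160"},{"{#STORE_ADDRESS}":"tikv-1:20160"}]
```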
diff --git a/templates/db/tidb_http/tidb_tidb_http/README.md b/templates/db/tidb_http/tidb_tidb_http/README.md
new file mode 100644
index 00000000000..f02ed4f39b2
--- /dev/null
+++ b/templates/db/tidb_http/tidb_tidb_http/README.md
@@ -0,0 +1,131 @@
+
+# TiDB by HTTP
+
+## Overview
+
+For Zabbix version: 5.4 and higher
+The template to monitor the TiDB server of a TiDB cluster by Zabbix that works without any external scripts.
+Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.
+
+Template `TiDB by HTTP` collects metrics by HTTP agent from the TiDB /metrics endpoint and from the monitoring API.
+See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.
+
+
+This template was tested on:
+
+- TiDB cluster, version 4.0.10
+
+## Setup
+
+> See [Zabbix template operation](https://www.zabbix.com/documentation/5.4/manual/config/templates_out_of_the_box/http) for basic instructions.
+
+This template works with the TiDB server of a TiDB cluster.
+Internal service metrics are collected from the TiDB /metrics endpoint and from the monitoring API.
+See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.
+Don't forget to change the macros {$TIDB.URL} and {$TIDB.PORT}.
+Also, see the Macros section for a list of macros used to set trigger values.
+
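The items in this template rely on Zabbix's PROMETHEUS_TO_JSON preprocessing, which turns the /metrics exposition into a JSON array of `{name, labels, value}` objects that JSONPath expressions then filter and aggregate. A minimal JavaScript sketch of that aggregation, with illustrative (not real TiDB) metric values:

```javascript
// Hypothetical sample of PROMETHEUS_TO_JSON output for tidb_server_query_total.
var metrics = [
    { name: "tidb_server_query_total", labels: { type: "Query", result: "OK" }, value: 1500 },
    { name: "tidb_server_query_total", labels: { type: "StmtExecute", result: "OK" }, value: 300 },
    { name: "tidb_server_query_total", labels: { type: "Query", result: "Error" }, value: 7 }
];

// Equivalent of the JSONPath
// $[?(@.name == "tidb_server_query_total" && @.labels.result == "OK")].value.sum()
function sumByResult(items, result) {
    return items
        .filter(function (m) { return m.name === "tidb_server_query_total" && m.labels.result === result; })
        .reduce(function (acc, m) { return acc + m.value; }, 0);
}

console.log(sumByResult(metrics, "OK"));    // 1800
console.log(sumByResult(metrics, "Error")); // 7
```

A CHANGE_PER_SECOND step then converts these monotonically growing counters into per-second rates.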
+
+## Zabbix configuration
+
+No specific Zabbix configuration is required.
+
+### Macros used
+
+|Name|Description|Default|
+|----|-----------|-------|
+|{$TIDB.DDL.WAITING.MAX.WARN} |<p>Maximum number of DDL tasks that are waiting</p> |`5` |
+|{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN} |<p>Maximum number of GC-related operations failures</p> |`1` |
+|{$TIDB.HEAP.USAGE.MAX.WARN} |<p>Maximum heap memory used</p> |`10G` |
+|{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN} |<p>Minimum number of keep alive operations</p> |`10` |
+|{$TIDB.OPEN.FDS.MAX.WARN} |<p>Maximum percentage of used file descriptors</p> |`90` |
+|{$TIDB.PORT} |<p>The port of TiDB server metrics web endpoint</p> |`10080` |
+|{$TIDB.REGION_ERROR.MAX.WARN} |<p>Maximum number of region related errors</p> |`50` |
+|{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN} |<p>Maximum number of schema lease errors</p> |`0` |
+|{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN} |<p>Maximum number of load schema errors</p> |`1` |
+|{$TIDB.TIME_JUMP_BACK.MAX.WARN} |<p>Maximum number of times that the operating system rewinds every second</p> |`1` |
+|{$TIDB.URL} |<p>TiDB server URL</p> |`localhost` |
+
+## Template links
+
+There are no template links in this template.
+
+## Discovery rules
+
+|Name|Description|Type|Key and additional info|
+|----|-----------|----|----|
+|QPS metrics discovery |<p>Discovers metrics specific to QPS.</p> |DEPENDENT |tidb.qps.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_server_query_total")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|Statement metrics discovery |<p>Discovers metrics specific to statements.</p> |DEPENDENT |tidb.statement.discover<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_executor_statement_total")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|KV metrics discovery |<p>Discovers metrics specific to KV commands.</p> |DEPENDENT |tidb.kv_ops.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|Lock resolves discovery |<p>Discovers metrics specific to lock resolves.</p> |DEPENDENT |tidb.tikvclient_lock_resolver_action.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_lock_resolver_actions_total")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|KV backoff discovery |<p>Discovers metrics specific to KV backoff.</p> |DEPENDENT |tidb.tikvclient_backoff.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_backoff_total")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|GC action results discovery |<p>Discovers metrics for GC action results.</p> |DEPENDENT |tidb.tikvclient_gc_action.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_gc_action_result")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p><p>**Overrides:**</p><p>Failed GC-related operations trigger<br> - {#TYPE} MATCHES_REGEX `failed`<br> - TRIGGER_PROTOTYPE LIKE `Too many failed GC-related operations` - DISCOVER</p> |
+
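The JAVASCRIPT steps shortened to "Text is too long" above all follow the same pattern: map the filtered Prometheus-to-JSON array to LLD rows keyed by the `{#TYPE}` macro. A runnable sketch (the input rows are illustrative):

```javascript
// Hypothetical JSONPATH output for tidb_executor_statement_total series.
var value = JSON.stringify([
    { name: "tidb_executor_statement_total", labels: { type: "Select" }, value: 42 },
    { name: "tidb_executor_statement_total", labels: { type: "Insert" }, value: 10 }
]);

// Build one discovery row per label value, as the template scripts do.
var output = JSON.parse(value).map(function (item) {
    return { "{#TYPE}": item.labels.type };
});

console.log(JSON.stringify({ data: output }));
// {"data":[{"{#TYPE}":"Select"},{"{#TYPE}":"Insert"}]}
```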
+## Items collected
+
+|Group|Name|Description|Type|Key and additional info|
+|-----|----|-----------|----|---------------------|
+|TiDB node |TiDB: Status |<p>Status of TiDB instance.</p> |DEPENDENT |tidb.status<p>**Preprocessing**:</p><p>- JSONPATH: `$.status`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 1`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|TiDB node |TiDB: Total "error" server query, rate |<p>The number of queries on TiDB instance per second with failure of command execution results.</p> |DEPENDENT |tidb.server_query.error.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tidb_server_query_total" && @.labels.result == "Error")].value.sum()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: Total "ok" server query, rate |<p>The number of queries on TiDB instance per second with success of command execution results.</p> |DEPENDENT |tidb.server_query.ok.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tidb_server_query_total" && @.labels.result == "OK")].value.sum()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: Total server query, rate |<p>The number of queries per second on TiDB instance.</p> |DEPENDENT |tidb.server_query.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tidb_server_query_total")].value.sum()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: SQL statements, rate |<p>The total number of SQL statements executed per second.</p> |DEPENDENT |tidb.statement_total.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_executor_statement_total")].value.sum()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: Failed Query, rate |<p>The number of errors that occur when executing SQL statements per second (such as syntax errors and primary key conflicts).</p> |DEPENDENT |tidb.execute_error.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_server_execute_error_total")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: KV commands, rate |<p>The number of executed KV commands per second.</p> |DEPENDENT |tidb.tikvclient_txn.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count")].value.sum()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: PD TSO commands, rate |<p>The number of TSO commands that TiDB obtains from PD per second.</p> |DEPENDENT |tidb.pd_tso_cmd.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="pd_client_cmd_handle_cmds_duration_seconds_count" && @.labels.type == "tso")].value.first()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: PD TSO requests, rate |<p>The number of TSO requests that TiDB obtains from PD per second.</p> |DEPENDENT |tidb.pd_tso_request.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="pd_client_request_handle_requests_duration_seconds_count" && @.labels.type == "tso")].value.first()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: TiClient region errors, rate |<p>The number of region related errors returned by TiKV per second.</p> |DEPENDENT |tidb.tikvclient_region_err.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_region_err_total")].value.sum()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: Lock resolves, rate |<p>The number of TiDB operations that resolve locks per second. When TiDB's read or write request encounters a lock, it tries to resolve the lock.</p> |DEPENDENT |tidb.tikvclient_lock_resolver_action.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_lock_resolver_actions_total")].value.sum()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: DDL waiting jobs |<p>The number of DDL tasks that are waiting.</p> |DEPENDENT |tidb.ddl_waiting_jobs<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_ddl_waiting_jobs")].value.sum()`</p> |
+|TiDB node |TiDB: Load schema total, rate |<p>The statistics of the schemas that TiDB obtains from TiKV per second.</p> |DEPENDENT |tidb.domain_load_schema.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_domain_load_schema_total")].value.sum()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: Load schema failed, rate |<p>The total number of failures to reload the latest schema information in TiDB per second.</p> |DEPENDENT |tidb.domain_load_schema.failed.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_domain_load_schema_total" && @.labels.type == "failed")].value.first()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: Schema lease "outdate" errors, rate |<p>The number of schema lease errors per second.</p><p>"outdate" errors mean that the schema cannot be updated, which is a more serious error and triggers an alert.</p> |DEPENDENT |tidb.session_schema_lease_error.outdate.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_session_schema_lease_error_total" && @.labels.type == "outdate")].value.first()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: Schema lease "change" errors, rate |<p>The number of schema lease errors per second.</p><p>"change" means that the schema has changed.</p> |DEPENDENT |tidb.session_schema_lease_error.change.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_session_schema_lease_error_total" && @.labels.type == "change")].value.first()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: KV backoff, rate |<p>The number of errors returned by TiKV.</p> |DEPENDENT |tidb.tikvclient_backoff.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_backoff_total")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: Keep alive, rate |<p>The number of times that the metrics are refreshed on TiDB instance per minute.</p> |DEPENDENT |tidb.monitor_keep_alive.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_monitor_keep_alive_total")].value.first()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- SIMPLE_CHANGE |
+|TiDB node |TiDB: Server connections |<p>The connection number of current TiDB instance.</p> |DEPENDENT |tidb.tidb_server_connections<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_server_connections")].value.first()`</p> |
+|TiDB node |TiDB: Heap memory usage |<p>Number of heap bytes that are in use.</p> |DEPENDENT |tidb.heap_bytes<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="go_memstats_heap_inuse_bytes")].value.first()`</p> |
+|TiDB node |TiDB: RSS memory usage |<p>Resident memory size in bytes.</p> |DEPENDENT |tidb.rss_bytes<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="process_resident_memory_bytes")].value.first()`</p> |
+|TiDB node |TiDB: Goroutine count |<p>The number of Goroutines on TiDB instance.</p> |DEPENDENT |tidb.goroutines<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="go_goroutines")].value.first()`</p> |
+|TiDB node |TiDB: Open file descriptors |<p>Number of open file descriptors.</p> |DEPENDENT |tidb.process_open_fds<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="process_open_fds")].value.first()`</p> |
+|TiDB node |TiDB: Open file descriptors, max |<p>Maximum number of open file descriptors.</p> |DEPENDENT |tidb.process_max_fds<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="process_max_fds")].value.first()`</p> |
+|TiDB node |TiDB: CPU |<p>Total user and system CPU usage ratio.</p> |DEPENDENT |tidb.cpu.util<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="process_cpu_seconds_total")].value.first()`</p><p>- CHANGE_PER_SECOND<p>- MULTIPLIER: `100`</p> |
+|TiDB node |TiDB: Uptime |<p>The runtime of each TiDB instance.</p> |DEPENDENT |tidb.uptime<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="process_start_time_seconds")].value.first()`</p><p>- JAVASCRIPT: `//use boottime to calculate uptime return (Math.floor(Date.now()/1000)-Number(value)); `</p> |
+|TiDB node |TiDB: Version |<p>Version of the TiDB instance.</p> |DEPENDENT |tidb.version<p>**Preprocessing**:</p><p>- JSONPATH: `$.version`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `3h`</p> |
+|TiDB node |TiDB: Time jump back, rate |<p>The number of times that the operating system rewinds every second.</p> |DEPENDENT |tidb.monitor_time_jump_back.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_monitor_time_jump_back_total")].value.first()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: Server critical error, rate |<p>The number of critical errors occurred in TiDB per second.</p> |DEPENDENT |tidb.tidb_server_critical_error_total.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_server_critical_error_total")].value.first()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: Server panic, rate |<p>The number of panics occurred in TiDB per second.</p> |DEPENDENT |tidb.tidb_server_panic_total.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_server_panic_total")].value.first()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: Server query "OK": {#TYPE}, rate |<p>The number of queries on TiDB instance per second with success of command execution results.</p> |DEPENDENT |tidb.server_query.ok.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tidb_server_query_total" && @.labels.result == "OK" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: Server query "Error": {#TYPE}, rate |<p>The number of queries on TiDB instance per second with failure of command execution results.</p> |DEPENDENT |tidb.server_query.error.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tidb_server_query_total" && @.labels.result == "Error" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: SQL statements: {#TYPE}, rate |<p>The number of SQL statements executed per second.</p> |DEPENDENT |tidb.statement.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_executor_statement_total" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: KV Commands: {#TYPE}, rate |<p>The number of executed KV commands per second.</p> |DEPENDENT |tidb.tikvclient_txn.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: Lock resolves: {#TYPE}, rate |<p>The number of TiDB operations that resolve locks per second. When TiDB's read or write request encounters a lock, it tries to resolve the lock.</p> |DEPENDENT |tidb.tikvclient_lock_resolver_action.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_lock_resolver_actions_total" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: KV backoff: {#TYPE}, rate |<p>The number of errors returned by TiKV per second.</p> |DEPENDENT |tidb.tikvclient_backoff.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_backoff_total" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND |
+|TiDB node |TiDB: GC action result: {#TYPE}, rate |<p>The number of results of GC-related operations per second.</p> |DEPENDENT |tidb.tikvclient_gc_action.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_gc_action_result" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND |
+|Zabbix_raw_items |TiDB: Get instance metrics |<p>Get TiDB instance metrics.</p> |HTTP_AGENT |tidb.get_metrics<p>**Preprocessing**:</p><p>- CHECK_NOT_SUPPORTED<p>- PROMETHEUS_TO_JSON |
+|Zabbix_raw_items |TiDB: Get instance status |<p>Get TiDB instance status info.</p> |HTTP_AGENT |tidb.get_status<p>**Preprocessing**:</p><p>- CHECK_NOT_SUPPORTED |
+
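The "TiDB: Uptime" item above derives uptime from `process_start_time_seconds` with a one-line JAVASCRIPT preprocessing step; expanded here for clarity, with an injectable clock so the math is testable:

```javascript
// Uptime = now (seconds) - process start time (seconds), as in the
// template's `Math.floor(Date.now()/1000)-Number(value)` script.
function uptimeSeconds(startTimeSeconds, nowMs) {
    // nowMs defaults to the current time, mirroring Date.now() in Zabbix.
    nowMs = (nowMs === undefined) ? Date.now() : nowMs;
    return Math.floor(nowMs / 1000) - Number(startTimeSeconds);
}

// e.g. a process started 600 seconds before "now" (illustrative values)
console.log(uptimeSeconds("1000000", 1000600000)); // 600
```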
+## Triggers
+
+|Name|Description|Expression|Severity|Dependencies and additional info|
+|----|-----------|----|----|----|
+|TiDB: Instance is not responding |<p>-</p> |`{TEMPLATE_NAME:tidb.status.last()}=0` |AVERAGE | |
+|TiDB: Too many region related errors (over {$TIDB.REGION_ERROR.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tidb.tikvclient_region_err.rate.min(5m)}>{$TIDB.REGION_ERROR.MAX.WARN}` |AVERAGE | |
+|TiDB: Too many DDL waiting jobs (over {$TIDB.DDL.WAITING.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tidb.ddl_waiting_jobs.min(5m)}>{$TIDB.DDL.WAITING.MAX.WARN}` |WARNING | |
+|TiDB: Too many schema lease errors (over {$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tidb.domain_load_schema.failed.rate.min(5m)}>{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN}` |AVERAGE | |
+|TiDB: Too many schema lease errors (over {$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN} for 5m) |<p>The latest schema information is not reloaded in TiDB within one lease.</p> |`{TEMPLATE_NAME:tidb.session_schema_lease_error.outdate.rate.min(5m)}>{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN}` |AVERAGE | |
+|TiDB: Too few keep alive operations (less {$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN} for 5m) |<p>Indicates whether the TiDB process still exists. If the number of times for tidb_monitor_keep_alive_total increases less than 10 per minute, the TiDB process might already exit and an alert is triggered.</p> |`{TEMPLATE_NAME:tidb.monitor_keep_alive.rate.max(5m)}<{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN}` |AVERAGE | |
+|TiDB: Heap memory usage is too high (over {$TIDB.HEAP.USAGE.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tidb.heap_bytes.min(5m)}>{$TIDB.HEAP.USAGE.MAX.WARN}` |WARNING | |
+|TiDB: Current number of open files is too high (over {$TIDB.OPEN.FDS.MAX.WARN}% for 5m) |<p>"Heavy file descriptor usage (i.e., near the process’s file descriptor limit) indicates a potential file descriptor exhaustion issue."</p> |`{TEMPLATE_NAME:tidb.process_open_fds.min(5m)}/{TiDB by HTTP:tidb.process_max_fds.last()}*100>{$TIDB.OPEN.FDS.MAX.WARN}` |WARNING | |
+|TiDB: has been restarted (uptime < 10m) |<p>Uptime is less than 10 minutes</p> |`{TEMPLATE_NAME:tidb.uptime.last()}<10m` |INFO |<p>Manual close: YES</p> |
+|TiDB: Version has changed (new version: {ITEM.VALUE}) |<p>TiDB version has changed. Ack to close.</p> |`{TEMPLATE_NAME:tidb.version.diff()}=1 and {TEMPLATE_NAME:tidb.version.strlen()}>0` |INFO |<p>Manual close: YES</p> |
+|TiDB: Too many time jump backs (over {$TIDB.TIME_JUMP_BACK.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tidb.monitor_time_jump_back.rate.min(5m)}>{$TIDB.TIME_JUMP_BACK.MAX.WARN}` |WARNING | |
+|TiDB: There are panicked TiDB threads |<p>When a panic occurs, an alert is triggered. The thread is often recovered, otherwise, TiDB will frequently restart.</p> |`{TEMPLATE_NAME:tidb.tidb_server_panic_total.rate.last()}>0` |AVERAGE | |
+|TiDB: Too many failed GC-related operations (over {$TIDB.GC_ACTIONS.ERRORS.MAX.WARN} in 5m) |<p>-</p> |`{TEMPLATE_NAME:tidb.tikvclient_gc_action.rate[{#TYPE}].min(5m)}>{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN}` |WARNING | |
+
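The open-file-descriptors trigger above compares the ratio of `process_open_fds` to `process_max_fds` against `{$TIDB.OPEN.FDS.MAX.WARN}` (default 90). A sketch of that check:

```javascript
// Fire when open/max * 100 exceeds the warning threshold percentage.
function fdsUsageTooHigh(openFds, maxFds, warnPct) {
    return (openFds / maxFds) * 100 > warnPct;
}

console.log(fdsUsageTooHigh(950, 1024, 90)); // true  (~92.8%)
console.log(fdsUsageTooHigh(700, 1024, 90)); // false (~68.4%)
```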
+## Feedback
+
+Please report any issues with the template at https://support.zabbix.com
+
+You can also provide feedback, discuss the template, or ask for help at [ZABBIX forums](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback).
+
diff --git a/templates/db/tidb_http/tidb_tidb_http/template_db_tidb_tidb_http.yaml b/templates/db/tidb_http/tidb_tidb_http/template_db_tidb_tidb_http.yaml
new file mode 100644
index 00000000000..32fe1a7cc51
--- /dev/null
+++ b/templates/db/tidb_http/tidb_tidb_http/template_db_tidb_tidb_http.yaml
@@ -0,0 +1,1266 @@
+zabbix_export:
+ version: '5.4'
+ date: '2021-04-08T09:02:36Z'
+ groups:
+ -
+ name: Templates/Databases
+ templates:
+ -
+ template: 'TiDB by HTTP'
+ name: 'TiDB by HTTP'
+ description: |
+ The template to monitor the TiDB server of a TiDB cluster by Zabbix that works without any external scripts.
+ Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.
+ Don't forget to change the macros {$TIDB.URL}, {$TIDB.PORT}.
+
+ Template `TiDB by HTTP` collects metrics by HTTP agent from the TiDB /metrics endpoint and from the monitoring API.
+
+ You can discuss this template or leave feedback on our forum https://www.zabbix.com/forum/zabbix-suggestions-and-feedback
+
+ Template tooling version used: 0.38
+ groups:
+ -
+ name: Templates/Databases
+ applications:
+ -
+ name: 'TiDB node'
+ -
+ name: 'Zabbix raw items'
+ items:
+ -
+ name: 'TiDB: CPU'
+ type: DEPENDENT
+ key: tidb.cpu.util
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: '%'
+ description: 'Total user and system CPU usage ratio.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="process_cpu_seconds_total")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ -
+ type: MULTIPLIER
+ parameters:
+ - '100'
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: DDL waiting jobs'
+ type: DEPENDENT
+ key: tidb.ddl_waiting_jobs
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The number of DDL tasks that are waiting.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_ddl_waiting_jobs")].value.sum()'
+ master_item:
+ key: tidb.get_metrics
+ triggers:
+ -
+ expression: '{min(5m)}>{$TIDB.DDL.WAITING.MAX.WARN}'
+ name: 'TiDB: Too many DDL waiting jobs (over {$TIDB.DDL.WAITING.MAX.WARN} for 5m)'
+ priority: WARNING
+ -
+ name: 'TiDB: Load schema failed, rate'
+ type: DEPENDENT
+ key: tidb.domain_load_schema.failed.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The total number of failures to reload the latest schema information in TiDB per second.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_domain_load_schema_total" && @.labels.type == "failed")].value.first()'
+ error_handler: DISCARD_VALUE
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ triggers:
+ -
+ expression: '{min(5m)}>{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN}'
+ name: 'TiDB: Too many schema lease errors (over {$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN} for 5m)'
+ priority: AVERAGE
+ -
+ name: 'TiDB: Load schema total, rate'
+ type: DEPENDENT
+ key: tidb.domain_load_schema.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The statistics of the schemas that TiDB obtains from TiKV per second.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_domain_load_schema_total")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Failed Query, rate'
+ type: DEPENDENT
+ key: tidb.execute_error.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The number of errors that occur when executing SQL statements per second (such as syntax errors and primary key conflicts).'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_server_execute_error_total")].value.sum()'
+ error_handler: DISCARD_VALUE
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Get instance metrics'
+ type: HTTP_AGENT
+ key: tidb.get_metrics
+ history: '0'
+ trends: '0'
+ value_type: TEXT
+ description: 'Get TiDB instance metrics.'
+ applications:
+ -
+ name: 'Zabbix raw items'
+ preprocessing:
+ -
+ type: CHECK_NOT_SUPPORTED
+ parameters:
+ - ''
+ -
+ type: PROMETHEUS_TO_JSON
+ parameters:
+ - ''
+ url: '{$TIDB.URL}:{$TIDB.PORT}/metrics'
+ -
+ name: 'TiDB: Get instance status'
+ type: HTTP_AGENT
+ key: tidb.get_status
+ history: '0'
+ trends: '0'
+ value_type: TEXT
+ description: 'Get TiDB instance status info.'
+ applications:
+ -
+ name: 'Zabbix raw items'
+ preprocessing:
+ -
+ type: CHECK_NOT_SUPPORTED
+ parameters:
+ - ''
+ error_handler: CUSTOM_VALUE
+ error_handler_params: '{"status": "0"}'
+ url: '{$TIDB.URL}:{$TIDB.PORT}/status'
+ -
+ name: 'TiDB: Goroutine count'
+ type: DEPENDENT
+ key: tidb.goroutines
+ delay: '0'
+ history: 7d
+ description: 'The number of Goroutines on TiDB instance.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="go_goroutines")].value.first()'
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Heap memory usage'
+ type: DEPENDENT
+ key: tidb.heap_bytes
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: B
+ description: 'Number of heap bytes that are in use.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="go_memstats_heap_inuse_bytes")].value.first()'
+ master_item:
+ key: tidb.get_metrics
+ triggers:
+ -
+ expression: '{min(5m)}>{$TIDB.HEAP.USAGE.MAX.WARN}'
+ name: 'TiDB: Heap memory usage is too high (over {$TIDB.HEAP.USAGE.MAX.WARN} for 5m)'
+ priority: WARNING
+ -
+ name: 'TiDB: Keep alive, rate'
+ type: DEPENDENT
+ key: tidb.monitor_keep_alive.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+ description: 'The number of times that the metrics are refreshed on TiDB instance per minute.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_monitor_keep_alive_total")].value.first()'
+ error_handler: DISCARD_VALUE
+ -
+ type: SIMPLE_CHANGE
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ triggers:
+ -
+ expression: '{max(5m)}<{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN}'
+ name: 'TiDB: Too few keep alive operations (less {$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN} for 5m)'
+ priority: AVERAGE
+ description: 'Indicates whether the TiDB process still exists. If the number of times for tidb_monitor_keep_alive_total increases less than 10 per minute, the TiDB process might already exit and an alert is triggered.'
+ -
+ name: 'TiDB: Time jump back, rate'
+ type: DEPENDENT
+ key: tidb.monitor_time_jump_back.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+      description: 'The number of times per second that the operating system clock jumps back.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_monitor_time_jump_back_total")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ triggers:
+ -
+ expression: '{min(5m)}>{$TIDB.TIME_JUMP_BACK.MAX.WARN}'
+ name: 'TiDB: Too many time jump backs (over {$TIDB.TIME_JUMP_BACK.MAX.WARN} for 5m)'
+ priority: WARNING
+ -
+ name: 'TiDB: PD TSO commands, rate'
+ type: DEPENDENT
+ key: tidb.pd_tso_cmd.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+ description: 'The number of TSO commands that TiDB obtains from PD per second.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="pd_client_cmd_handle_cmds_duration_seconds_count" && @.labels.type == "tso")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: PD TSO requests, rate'
+ type: DEPENDENT
+ key: tidb.pd_tso_request.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+ description: 'The number of TSO requests that TiDB obtains from PD per second.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="pd_client_request_handle_requests_duration_seconds_count" && @.labels.type == "tso")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Open file descriptors, max'
+ type: DEPENDENT
+ key: tidb.process_max_fds
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'Maximum number of open file descriptors.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="process_max_fds")].value.first()'
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Open file descriptors'
+ type: DEPENDENT
+ key: tidb.process_open_fds
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'Number of open file descriptors.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="process_open_fds")].value.first()'
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: RSS memory usage'
+ type: DEPENDENT
+ key: tidb.rss_bytes
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: B
+ description: 'Resident memory size in bytes.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="process_resident_memory_bytes")].value.first()'
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Total "error" server query, rate'
+ type: DEPENDENT
+ key: tidb.server_query.error.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Qps
+      description: 'The number of queries per second on TiDB instance whose command execution results in failure.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tidb_server_query_total" && @.labels.result == "Error")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Total "ok" server query, rate'
+ type: DEPENDENT
+ key: tidb.server_query.ok.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Qps
+      description: 'The number of queries per second on TiDB instance whose command execution succeeds.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tidb_server_query_total" && @.labels.result == "OK")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Total server query, rate'
+ type: DEPENDENT
+ key: tidb.server_query.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Qps
+ description: 'The number of queries per second on TiDB instance.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tidb_server_query_total")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Schema lease "change" errors, rate'
+ type: DEPENDENT
+ key: tidb.session_schema_lease_error.change.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: |
+ The number of schema lease errors per second.
+        "change" means that the schema has changed.
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+            - '$[?(@.name=="tidb_session_schema_lease_error_total" && @.labels.type == "change")].value.first()'
+ error_handler: DISCARD_VALUE
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+      name: 'TiDB: Schema lease "outdate" errors, rate'
+ type: DEPENDENT
+ key: tidb.session_schema_lease_error.outdate.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: |
+ The number of schema lease errors per second.
+        "outdate" means that the schema cannot be updated, which is a more serious error and triggers an alert.
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+            - '$[?(@.name=="tidb_session_schema_lease_error_total" && @.labels.type == "outdate")].value.first()'
+ error_handler: DISCARD_VALUE
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ triggers:
+ -
+ expression: '{min(5m)}>{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN}'
+ name: 'TiDB: Too many schema lease errors (over {$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN} for 5m)'
+ priority: AVERAGE
+ description: 'The latest schema information is not reloaded in TiDB within one lease.'
+ -
+ name: 'TiDB: SQL statements, rate'
+ type: DEPENDENT
+ key: tidb.statement_total.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The total number of SQL statements executed per second.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_executor_statement_total")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Status'
+ type: DEPENDENT
+ key: tidb.status
+ delay: '0'
+ history: 7d
+ trends: '0'
+ value_type: CHAR
+      description: 'Status of TiDB instance.'
+ applications:
+ -
+ name: 'TiDB node'
+ valuemap:
+ name: 'Service state'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - $.status
+ error_handler: CUSTOM_VALUE
+ error_handler_params: '1'
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ master_item:
+ key: tidb.get_status
+ triggers:
+ -
+ expression: '{last()}=0'
+ name: 'TiDB: Instance is not responding'
+ priority: AVERAGE
+ -
+ name: 'TiDB: Server connections'
+ type: DEPENDENT
+ key: tidb.tidb_server_connections
+ delay: '0'
+ history: 7d
+ description: 'The connection number of current TiDB instance.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_server_connections")].value.first()'
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Server critical error, rate'
+ type: DEPENDENT
+ key: tidb.tidb_server_critical_error_total.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+      description: 'The number of critical errors that occurred in TiDB per second.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_server_critical_error_total")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Server panic, rate'
+ type: DEPENDENT
+ key: tidb.tidb_server_panic_total.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+      description: 'The number of panics that occurred in TiDB per second.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_server_panic_total")].value.first()'
+ error_handler: DISCARD_VALUE
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ triggers:
+ -
+ expression: '{last()}>0'
+ name: 'TiDB: There are panicked TiDB threads'
+ priority: AVERAGE
+ description: 'When a panic occurs, an alert is triggered. The thread is often recovered, otherwise, TiDB will frequently restart.'
+ -
+ name: 'TiDB: KV backoff, rate'
+ type: DEPENDENT
+ key: tidb.tikvclient_backoff.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+      description: 'The number of errors returned by TiKV per second.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_tikvclient_backoff_total")].value.sum()'
+ error_handler: DISCARD_VALUE
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Lock resolves, rate'
+ type: DEPENDENT
+ key: tidb.tikvclient_lock_resolver_action.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+      description: 'The number of TiDB operations that resolve locks per second. When TiDB''s read or write request encounters a lock, it tries to resolve the lock.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_tikvclient_lock_resolver_actions_total")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: TiClient region errors, rate'
+ type: DEPENDENT
+ key: tidb.tikvclient_region_err.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+ description: 'The number of region related errors returned by TiKV per second.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_tikvclient_region_err_total")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ triggers:
+ -
+ expression: '{min(5m)}>{$TIDB.REGION_ERROR.MAX.WARN}'
+ name: 'TiDB: Too many region related errors (over {$TIDB.REGION_ERROR.MAX.WARN} for 5m)'
+ priority: AVERAGE
+ -
+ name: 'TiDB: KV commands, rate'
+ type: DEPENDENT
+ key: tidb.tikvclient_txn.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+ description: 'The number of executed KV commands per second.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Uptime'
+ type: DEPENDENT
+ key: tidb.uptime
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: uptime
+ description: 'The runtime of each TiDB instance.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="process_start_time_seconds")].value.first()'
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ //use boottime to calculate uptime
+ return (Math.floor(Date.now()/1000)-Number(value));
+ master_item:
+ key: tidb.get_metrics
+ triggers:
+ -
+ expression: '{last()}<10m'
+ name: 'TiDB: has been restarted (uptime < 10m)'
+ priority: INFO
+          description: 'Uptime is less than 10 minutes.'
+ manual_close: 'YES'
+ -
+ name: 'TiDB: Version'
+ type: DEPENDENT
+ key: tidb.version
+ delay: '0'
+ history: 7d
+ trends: '0'
+ value_type: CHAR
+ description: 'Version of the TiDB instance.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - $.version
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 3h
+ master_item:
+ key: tidb.get_status
+ triggers:
+ -
+ expression: '{diff()}=1 and {strlen()}>0'
+ name: 'TiDB: Version has changed (new version: {ITEM.VALUE})'
+ priority: INFO
+ description: 'TiDB version has changed. Ack to close.'
+ manual_close: 'YES'
+ discovery_rules:
+ -
+ name: 'KV metrics discovery'
+ type: DEPENDENT
+ key: tidb.kv_ops.discovery
+ delay: '0'
+ description: 'Discovery KV specific metrics.'
+ item_prototypes:
+ -
+ name: 'TiDB: KV Commands: {#TYPE}, rate'
+ type: DEPENDENT
+ key: 'tidb.tikvclient_txn.rate[{#TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+ description: 'The number of executed KV commands per second.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count" && @.labels.type == "{#TYPE}")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ master_item:
+ key: tidb.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count")]'
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ output = JSON.parse(value).map(function(item){
+ return {
+ "{#TYPE}": item.labels.type,
+ }})
+ return JSON.stringify({"data": output})
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ -
+ name: 'QPS metrics discovery'
+ type: DEPENDENT
+ key: tidb.qps.discovery
+ delay: '0'
+ description: 'Discovery QPS specific metrics.'
+ item_prototypes:
+ -
+ name: 'TiDB: Server query "Error": {#TYPE}, rate'
+ type: DEPENDENT
+ key: 'tidb.server_query.error.rate[{#TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Qps
+          description: 'The number of queries per second on TiDB instance whose command execution results in failure.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tidb_server_query_total" && @.labels.result == "Error" && @.labels.type == "{#TYPE}")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ -
+ name: 'TiDB: Server query "OK": {#TYPE}, rate'
+ type: DEPENDENT
+ key: 'tidb.server_query.ok.rate[{#TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Qps
+          description: 'The number of queries per second on TiDB instance whose command execution succeeds.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tidb_server_query_total" && @.labels.result == "OK" && @.labels.type == "{#TYPE}")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ master_item:
+ key: tidb.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_server_query_total")]'
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ var lookup = {},
+ result = [];
+
+ JSON.parse(value).forEach(function (item) {
+ var type = item.labels.type;
+ if (!(lookup[type])) {
+ lookup[type] = 1;
+ result.push({ "{#TYPE}": type });
+ }
+ })
+
+ return JSON.stringify(result);
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ -
+ name: 'Statement metrics discovery'
+ type: DEPENDENT
+ key: tidb.statement.discover
+ delay: '0'
+ description: 'Discovery statement specific metrics.'
+ item_prototypes:
+ -
+ name: 'TiDB: SQL statements: {#TYPE}, rate'
+ type: DEPENDENT
+ key: 'tidb.statement.rate[{#TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'The number of SQL statements executed per second.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_executor_statement_total" && @.labels.type == "{#TYPE}")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ master_item:
+ key: tidb.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_executor_statement_total")]'
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ output = JSON.parse(value).map(function(item){
+ return {
+ "{#TYPE}": item.labels.type,
+ }})
+ return JSON.stringify({"data": output})
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ -
+ name: 'KV backoff discovery'
+ type: DEPENDENT
+ key: tidb.tikvclient_backoff.discovery
+ delay: '0'
+ description: 'Discovery KV backoff specific metrics.'
+ item_prototypes:
+ -
+ name: 'TiDB: KV backoff: {#TYPE}, rate'
+ type: DEPENDENT
+ key: 'tidb.tikvclient_backoff.rate[{#TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+          description: 'The number of backoff retries per second when TiKV returns an error.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_tikvclient_backoff_total" && @.labels.type == "{#TYPE}")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ master_item:
+ key: tidb.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_tikvclient_backoff_total")]'
+ error_handler: DISCARD_VALUE
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ output = JSON.parse(value).map(function(item){
+ return {
+ "{#TYPE}": item.labels.type,
+ }})
+ return JSON.stringify({"data": output})
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ -
+ name: 'GC action results discovery'
+ type: DEPENDENT
+ key: tidb.tikvclient_gc_action.discovery
+ delay: '0'
+ description: 'Discovery GC action results metrics.'
+ item_prototypes:
+ -
+ name: 'TiDB: GC action result: {#TYPE}, rate'
+ type: DEPENDENT
+ key: 'tidb.tikvclient_gc_action.rate[{#TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+ description: 'The number of results of GC-related operations per second.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_tikvclient_gc_action_result" && @.labels.type == "{#TYPE}")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ trigger_prototypes:
+ -
+ expression: '{min(5m)}>{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN}'
+ name: 'TiDB: Too many failed GC-related operations (over {$TIDB.GC_ACTIONS.ERRORS.MAX.WARN} in 5m)'
+ discover: NO_DISCOVER
+ priority: WARNING
+ master_item:
+ key: tidb.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_tikvclient_gc_action_result")]'
+ error_handler: DISCARD_VALUE
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ output = JSON.parse(value).map(function(item){
+ return {
+ "{#TYPE}": item.labels.type,
+ }})
+ return JSON.stringify({"data": output})
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ overrides:
+ -
+ name: 'Failed GC-related operations trigger'
+ step: '1'
+ filter:
+ conditions:
+ -
+ macro: '{#TYPE}'
+ value: failed
+ formulaid: A
+ operations:
+ -
+ operationobject: TRIGGER_PROTOTYPE
+ operator: LIKE
+ value: 'Too many failed GC-related operations'
+ status: ENABLED
+ discover: DISCOVER
+ -
+ name: 'Lock resolves discovery'
+ type: DEPENDENT
+ key: tidb.tikvclient_lock_resolver_action.discovery
+ delay: '0'
+ description: 'Discovery lock resolves specific metrics.'
+ item_prototypes:
+ -
+ name: 'TiDB: Lock resolves: {#TYPE}, rate'
+ type: DEPENDENT
+ key: 'tidb.tikvclient_lock_resolver_action.rate[{#TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+ description: 'The number of TiDB operations that resolve locks per second. When TiDB''s read or write request encounters a lock, it tries to resolve the lock.'
+ applications:
+ -
+ name: 'TiDB node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_tikvclient_lock_resolver_actions_total" && @.labels.type == "{#TYPE}")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tidb.get_metrics
+ master_item:
+ key: tidb.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="tidb_tikvclient_lock_resolver_actions_total")]'
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ output = JSON.parse(value).map(function(item){
+ return {
+ "{#TYPE}": item.labels.type,
+ }})
+ return JSON.stringify({"data": output})
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ macros:
+ -
+ macro: '{$TIDB.DDL.WAITING.MAX.WARN}'
+ value: '5'
+ description: 'Maximum number of DDL tasks that are waiting'
+ -
+ macro: '{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN}'
+ value: '1'
+ description: 'Maximum number of GC-related operations failures'
+ -
+ macro: '{$TIDB.HEAP.USAGE.MAX.WARN}'
+ value: 10G
+ description: 'Maximum heap memory used'
+ -
+ macro: '{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN}'
+ value: '10'
+ description: 'Minimum number of keep alive operations'
+ -
+ macro: '{$TIDB.OPEN.FDS.MAX.WARN}'
+ value: '90'
+ description: 'Maximum percentage of used file descriptors'
+ -
+ macro: '{$TIDB.PORT}'
+ value: '10080'
+ description: 'The port of TiDB server metrics web endpoint'
+ -
+ macro: '{$TIDB.REGION_ERROR.MAX.WARN}'
+ value: '50'
+ description: 'Maximum number of region related errors'
+ -
+ macro: '{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN}'
+ value: '0'
+ description: 'Maximum number of schema lease errors'
+ -
+ macro: '{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN}'
+ value: '1'
+ description: 'Maximum number of load schema errors'
+ -
+ macro: '{$TIDB.TIME_JUMP_BACK.MAX.WARN}'
+ value: '1'
+      description: 'Maximum number of times per second that the operating system clock jumps back'
+ -
+ macro: '{$TIDB.URL}'
+ value: localhost
+ description: 'TiDB server URL'
+ valuemaps:
+ -
+ name: 'Service state'
+ mappings:
+ -
+ value: '0'
+ newvalue: Down
+ -
+ value: '1'
+ newvalue: Up
+ triggers:
+ -
+ expression: '{TiDB by HTTP:tidb.process_open_fds.min(5m)}/{TiDB by HTTP:tidb.process_max_fds.last()}*100>{$TIDB.OPEN.FDS.MAX.WARN}'
+ name: 'TiDB: Current number of open files is too high (over {$TIDB.OPEN.FDS.MAX.WARN}% for 5m)'
+ priority: WARNING
+      description: 'Heavy file descriptor usage (i.e., near the process''s file descriptor limit) indicates a potential file descriptor exhaustion issue.'
+ graphs:
+ -
+ name: 'TiDB: File descriptors'
+ graph_items:
+ -
+ drawtype: GRADIENT_LINE
+ color: 1A7C11
+ item:
+ host: 'TiDB by HTTP'
+ key: tidb.process_open_fds
+ -
+ sortorder: '1'
+ drawtype: BOLD_LINE
+ color: 2774A4
+ item:
+ host: 'TiDB by HTTP'
+ key: tidb.process_max_fds
+ -
+ name: 'TiDB: Memory usage'
+ graph_items:
+ -
+ color: 1A7C11
+ item:
+ host: 'TiDB by HTTP'
+ key: tidb.heap_bytes
+ -
+ sortorder: '1'
+ color: 2774A4
+ item:
+ host: 'TiDB by HTTP'
+ key: tidb.rss_bytes
+ -
+ name: 'TiDB: Server query rate'
+ graph_items:
+ -
+ color: 1A7C11
+ item:
+ host: 'TiDB by HTTP'
+ key: tidb.server_query.rate
+ -
+ sortorder: '1'
+ color: 2774A4
+ item:
+ host: 'TiDB by HTTP'
+ key: tidb.server_query.ok.rate
+ -
+ sortorder: '2'
+ color: F63100
+ item:
+ host: 'TiDB by HTTP'
+ key: tidb.server_query.error.rate
diff --git a/templates/db/tidb_http/tidb_tikv_http/README.md b/templates/db/tidb_http/tidb_tikv_http/README.md
new file mode 100644
index 00000000000..165b1a26f84
--- /dev/null
+++ b/templates/db/tidb_http/tidb_tikv_http/README.md
@@ -0,0 +1,112 @@
+
+# TiDB TiKV by HTTP
+
+## Overview
+
+For Zabbix version: 5.4 and higher
+The template to monitor the TiKV server of a TiDB cluster by Zabbix that works without any external scripts.
+Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.
+
+Template `TiDB TiKV by HTTP` — collects metrics by HTTP agent from TiKV /metrics endpoint.
+
+
+This template was tested on:
+
+- TiDB cluster, version 4.0.10
+
+## Setup
+
+> See [Zabbix template operation](https://www.zabbix.com/documentation/5.4/manual/config/templates_out_of_the_box/http) for basic instructions.
+
+This template works with the TiKV server of a TiDB cluster.
+Internal service metrics are collected from the TiKV /metrics endpoint.
+Don't forget to change the macros {$TIKV.URL} and {$TIKV.PORT}.
+Also, see the Macros section for a list of macros used to set trigger values.
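A quick sanity check of the JSONPath logic used throughout this template: Zabbix converts the Prometheus `/metrics` output into an array of `{name, value, labels}` objects, and each dependent item then filters that array. The Python sketch below mimics this selection; the sample data and the `select()` helper are illustrative assumptions, not part of the template.

```python
import json

# Shape of Zabbix's Prometheus-to-JSON output (sample values are invented):
metrics = json.loads("""
[
  {"name": "tikv_store_size_bytes", "value": "1024", "labels": {"type": "available"}},
  {"name": "tikv_store_size_bytes", "value": "4096", "labels": {"type": "capacity"}},
  {"name": "go_goroutines", "value": "42", "labels": {}}
]
""")

def select(data, name, **labels):
    # Mimics JSONPath filters such as
    # $[?(@.name == "tikv_store_size_bytes" && @.labels.type == "available")].value.first()
    for item in data:
        if item["name"] == name and all(
                item["labels"].get(k) == v for k, v in labels.items()):
            return float(item["value"])
    return None

print(select(metrics, "tikv_store_size_bytes", type="available"))
```

Running this prints the value selected for the `available` store size, which is what the corresponding dependent item would receive.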
+
+
+## Zabbix configuration
+
+No specific Zabbix configuration is required.
+
+### Macros used
+
+|Name|Description|Default|
+|----|-----------|-------|
+|{$TIKV.COPOCESSOR.ERRORS.MAX.WARN} |<p>Maximum number of coprocessor request errors</p> |`1` |
+|{$TIKV.PENDING_COMMANDS.MAX.WARN} |<p>Maximum number of pending commands</p> |`1` |
+|{$TIKV.PENDING_TASKS.MAX.WARN} |<p>Maximum number of tasks currently running by the worker or pending</p> |`1` |
+|{$TIKV.PORT} |<p>The port of TiKV server metrics web endpoint</p> |`20180` |
+|{$TIKV.STORE.ERRORS.MAX.WARN} |<p>Maximum number of failure messages</p> |`1` |
+|{$TIKV.URL} |<p>TiKV server URL</p> |`localhost` |
+
+## Template links
+
+There are no template links in this template.
+
+## Discovery rules
+
+|Name|Description|Type|Key and additional info|
+|----|-----------|----|----|
+|QPS metrics discovery |<p>Discovery QPS metrics.</p> |DEPENDENT |tikv.qps.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_grpc_msg_duration_seconds_count")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|Coprocessor metrics discovery |<p>Discovery coprocessor metrics.</p> |DEPENDENT |tikv.coprocessor.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_request_duration_seconds_count")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|Scheduler metrics discovery |<p>Discovery scheduler metrics.</p> |DEPENDENT |tikv.scheduler.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_stage_total")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
+|Server errors discovery |<p>Discovery server errors metrics.</p> |DEPENDENT |tikv.server_report_failure.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_server_report_failure_msg_total")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p><p>**Overrides:**</p><p>Too many unreachable messages trigger<br> - {#TYPE} MATCHES_REGEX `unreachable`<br> - TRIGGER_PROTOTYPE LIKE `Too many failure messages` - DISCOVER</p> |
+
+## Items collected
+
+|Group|Name|Description|Type|Key and additional info|
+|-----|----|-----------|----|---------------------|
+|TiKV node |TiKV: Store size |<p>The storage size of TiKV instance.</p> |DEPENDENT |tikv.engine_size<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_engine_size_bytes")].value.sum()`</p> |
+|TiKV node |TiKV: Available size |<p>The available capacity of TiKV instance.</p> |DEPENDENT |tikv.store_size.available<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_store_size_bytes" && @.labels.type == "available")].value.first()`</p> |
+|TiKV node |TiKV: Capacity size |<p>The capacity size of TiKV instance.</p> |DEPENDENT |tikv.store_size.capacity<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_store_size_bytes" && @.labels.type == "capacity")].value.first()`</p> |
+|TiKV node |TiKV: Bytes read |<p>The total bytes of read in TiKV instance.</p> |DEPENDENT |tikv.engine_flow_bytes.read<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_engine_flow_bytes" && @.labels.db == "kv" && @.labels.type =~ "bytes_read|iter_bytes_read")].value.sum()`</p> |
+|TiKV node |TiKV: Bytes write |<p>The total bytes of write in TiKV instance.</p> |DEPENDENT |tikv.engine_flow_bytes.write<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_engine_flow_bytes" && @.labels.db == "kv" && @.labels.type == "wal_file_bytes")].value.first()`</p> |
+|TiKV node |TiKV: Storage: commands total, rate |<p>Total number of commands received per second.</p> |DEPENDENT |tikv.storage_command.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_storage_command_total")].value.sum()`</p><p>- CHANGE_PER_SECOND |
+|TiKV node |TiKV: CPU util |<p>The CPU usage ratio on TiKV instance.</p> |DEPENDENT |tikv.cpu.util<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_thread_cpu_seconds_total")].value.sum()`</p><p>- CHANGE_PER_SECOND<p>- MULTIPLIER: `100`</p> |
+|TiKV node |TiKV: RSS memory usage |<p>Resident memory size in bytes.</p> |DEPENDENT |tikv.rss_bytes<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "process_resident_memory_bytes")].value.first()`</p> |
+|TiKV node |TiKV: Regions, count |<p>The number of regions collected in TiKV instance.</p> |DEPENDENT |tikv.region_count<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_raftstore_region_count" && @.labels.type == "region" )].value.first()`</p> |
+|TiKV node |TiKV: Regions, leader |<p>The number of leaders in TiKV instance.</p> |DEPENDENT |tikv.region_leader<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_raftstore_region_count" && @.labels.type == "leader" )].value.first()`</p> |
+|TiKV node |TiKV: Total query, rate |<p>The total QPS in TiKV instance.</p> |DEPENDENT |tikv.grpc_msg.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_grpc_msg_duration_seconds_count")].value.sum()`</p><p>- CHANGE_PER_SECOND |
+|TiKV node |TiKV: Total query errors, rate |<p>The total number of gRPC message handling failure per second.</p> |DEPENDENT |tikv.grpc_msg_fail.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_grpc_msg_fail_total")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND |
+|TiKV node |TiKV: Coprocessor: Errors, rate |<p>Total number of push down request error per second.</p> |DEPENDENT |tikv.coprocessor_request_error.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_request_error")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND |
+|TiKV node |TiKV: Coprocessor: Requests, rate |<p>Total number of coprocessor requests per second.</p> |DEPENDENT |tikv.coprocessor_request.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_request_duration_seconds_count")].value.sum()`</p><p>- CHANGE_PER_SECOND |
+|TiKV node |TiKV: Coprocessor: Scan keys, rate |<p>Total number of scan keys observed per request per second.</p> |DEPENDENT |tikv.coprocessor_scan_keys.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_scan_keys")].value.sum()`</p><p>- CHANGE_PER_SECOND |
+|TiKV node |TiKV: Coprocessor: RocksDB ops, rate |<p>Total number of RocksDB internal operations from PerfContext per second.</p> |DEPENDENT |tikv.coprocessor_rocksdb_perf.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_rocksdb_perf")].value.sum()`</p><p>- CHANGE_PER_SECOND |
+|TiKV node |TiKV: Coprocessor: Response size, rate |<p>The total size of coprocessor response per second.</p> |DEPENDENT |tikv.coprocessor_response_bytes.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_response_bytes")].value.first()`</p><p>- CHANGE_PER_SECOND |
+|TiKV node |TiKV: Scheduler: Pending commands |<p>The total number of pending commands. The scheduler receives commands from clients, executes them against the MVCC layer storage engine.</p> |DEPENDENT |tikv.scheduler_contex<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_contex_total")].value.first()`</p> |
+|TiKV node |TiKV: Scheduler: Busy, rate |<p>The total number of too-busy scheduler events per second.</p> |DEPENDENT |tikv.scheduler_too_busy.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_too_busy_total")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND</p> |
+|TiKV node |TiKV: Scheduler: Commands total, rate |<p>Total number of commands per second.</p> |DEPENDENT |tikv.scheduler_commands.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_stage_total")].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND</p> |
+|TiKV node |TiKV: Scheduler: Low priority commands total, rate |<p>Total count of low priority commands per second.</p> |DEPENDENT |tikv.commands_pri.low.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_commands_pri_total" && @.labels.priority == "low")].value.first()`</p><p>- CHANGE_PER_SECOND</p> |
+|TiKV node |TiKV: Scheduler: Normal priority commands total, rate |<p>Total count of normal priority commands per second.</p> |DEPENDENT |tikv.commands_pri.normal.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_commands_pri_total" && @.labels.priority == "normal")].value.first()`</p><p>- CHANGE_PER_SECOND</p> |
+|TiKV node |TiKV: Scheduler: High priority commands total, rate |<p>Total count of high priority commands per second.</p> |DEPENDENT |tikv.commands_pri.high.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_commands_pri_total" && @.labels.priority == "high")].value.first()`</p><p>- CHANGE_PER_SECOND</p> |
+|TiKV node |TiKV: Snapshot: Pending tasks |<p>The number of tasks currently running or pending in the worker.</p> |DEPENDENT |tikv.scheduler_contex<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_worker_pending_task_total")].value.first()`</p> |
+|TiKV node |TiKV: Snapshot: Sending |<p>The total amount of raftstore snapshot traffic.</p> |DEPENDENT |tikv.snapshot.sending<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_raftstore_snapshot_traffic_total" && @.labels.type == "sending")].value.first()`</p> |
+|TiKV node |TiKV: Snapshot: Receiving |<p>The total amount of raftstore snapshot traffic.</p> |DEPENDENT |tikv.snapshot.receiving<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_raftstore_snapshot_traffic_total" && @.labels.type == "receiving")].value.first()`</p> |
+|TiKV node |TiKV: Snapshot: Applying |<p>The total amount of raftstore snapshot traffic.</p> |DEPENDENT |tikv.snapshot.applying<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_raftstore_snapshot_traffic_total" && @.labels.type == "applying")].value.first()`</p> |
+|TiKV node |TiKV: Uptime |<p>The runtime of the TiKV instance.</p> |DEPENDENT |tikv.uptime<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="process_start_time_seconds")].value.first()`</p><p>- JAVASCRIPT: `return (Math.floor(Date.now()/1000)-Number(value)); //use boottime to calculate uptime`</p> |
+|TiKV node |TiKV: Server: failure messages total, rate |<p>Total number of reporting failure messages per second.</p> |DEPENDENT |tikv.messages.failure.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_server_report_failure_msg_total")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND</p> |
+|TiKV node |TiKV: Query: {#TYPE}, rate |<p>The QPS per command in the TiKV instance.</p> |DEPENDENT |tikv.grpc_msg.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_grpc_msg_duration_seconds_count" && @.labels.type == "{#TYPE}")].value.first()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND</p> |
+|TiKV node |TiKV: Coprocessor: {#REQ_TYPE} errors, rate |<p>Total number of push-down request errors per second.</p> |DEPENDENT |tikv.coprocessor_request_error.rate[{#REQ_TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_request_error" && @.labels.req == "{#REQ_TYPE}")].value.first()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND</p> |
+|TiKV node |TiKV: Coprocessor: {#REQ_TYPE} requests, rate |<p>Total number of coprocessor requests per second.</p> |DEPENDENT |tikv.coprocessor_request.rate[{#REQ_TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_request_duration_seconds_count" && @.labels.req == "{#REQ_TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND</p> |
+|TiKV node |TiKV: Coprocessor: {#REQ_TYPE} scan keys, rate |<p>Total number of scan keys observed per request per second.</p> |DEPENDENT |tikv.coprocessor_scan_keys.rate[{#REQ_TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_scan_keys_count" && @.labels.req == "{#REQ_TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND</p> |
+|TiKV node |TiKV: Coprocessor: {#REQ_TYPE} RocksDB ops, rate |<p>Total number of RocksDB internal operations from PerfContext per second.</p> |DEPENDENT |tikv.coprocessor_rocksdb_perf.rate[{#REQ_TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_rocksdb_perf" && @.labels.req == "{#REQ_TYPE}")].value.sum()`</p><p>- CHANGE_PER_SECOND</p> |
+|TiKV node |TiKV: Scheduler: commands {#STAGE}, rate |<p>Total number of commands on each stage per second.</p> |DEPENDENT |tikv.scheduler_stage.rate[{#STAGE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_stage_total" && @.labels.stage == "{#STAGE}")].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND</p> |
+|TiKV node |TiKV: Store_id {#STORE_ID}: failure messages "{#TYPE}", rate |<p>Total number of reporting failure messages. The metric has two labels: type and store_id. type represents the failure type, and store_id represents the destination peer store id.</p> |DEPENDENT |tikv.messages.failure.rate[{#STORE_ID},{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_server_report_failure_msg_total" && @.labels.store_id == "{#STORE_ID}" && @.labels.type == "{#TYPE}")].value.sum()`</p><p>- CHANGE_PER_SECOND</p> |
+|Zabbix_raw_items |TiKV: Get instance metrics |<p>Get TiKV instance metrics.</p> |HTTP_AGENT |tikv.get_metrics<p>**Preprocessing**:</p><p>- CHECK_NOT_SUPPORTED</p><p>- PROMETHEUS_TO_JSON</p> |
+
+## Triggers
+
+|Name|Description|Expression|Severity|Dependencies and additional info|
+|----|-----------|----|----|----|
+|TiKV: Too many coprocessor request errors (over {$TIKV.COPOCESSOR.ERRORS.MAX.WARN} in 5m) |<p>-</p> |`{TEMPLATE_NAME:tikv.coprocessor_request_error.rate.min(5m)}>{$TIKV.COPOCESSOR.ERRORS.MAX.WARN}` |WARNING | |
+|TiKV: Too many pending commands (over {$TIKV.PENDING_COMMANDS.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tikv.scheduler_contex.min(5m)}>{$TIKV.PENDING_COMMANDS.MAX.WARN}` |AVERAGE | |
+|TiKV: Too many pending tasks (over {$TIKV.PENDING_TASKS.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tikv.scheduler_contex.min(5m)}>{$TIKV.PENDING_TASKS.MAX.WARN}` |AVERAGE | |
+|TiKV: has been restarted (uptime < 10m) |<p>Uptime is less than 10 minutes.</p> |`{TEMPLATE_NAME:tikv.uptime.last()}<10m` |INFO |<p>Manual close: YES</p> |
+|TiKV: Store_id {#STORE_ID}: Too many failure messages "{#TYPE}" (over {$TIKV.STORE.ERRORS.MAX.WARN} in 5m) |<p>Indicates that the remote TiKV cannot be reached.</p> |`{TEMPLATE_NAME:tikv.messages.failure.rate[{#STORE_ID},{#TYPE}].min(5m)}>{$TIKV.STORE.ERRORS.MAX.WARN}` |WARNING | |
+
+## Feedback
+
+Please report any issues with the template at https://support.zabbix.com
+
+You can also provide feedback, discuss the template, or ask for help at [ZABBIX forums](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback).
+
diff --git a/templates/db/tidb_http/tidb_tikv_http/template_db_tidb_tikv_http.yaml b/templates/db/tidb_http/tidb_tikv_http/template_db_tidb_tikv_http.yaml
new file mode 100644
index 00000000000..74ae7c37684
--- /dev/null
+++ b/templates/db/tidb_http/tidb_tikv_http/template_db_tidb_tikv_http.yaml
@@ -0,0 +1,1005 @@
+zabbix_export:
+ version: '5.4'
+ date: '2021-04-08T09:02:42Z'
+ groups:
+ -
+ name: Templates/Databases
+ templates:
+ -
+ template: 'TiDB TiKV by HTTP'
+ name: 'TiDB TiKV by HTTP'
+ description: |
+ The template to monitor TiKV server of TiDB cluster by Zabbix that works without any external scripts.
+ Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.
+ Don't forget to change the macros {$TIKV.URL}, {$TIKV.PORT}.
+
+ Template `TiDB TiKV by HTTP` — collects metrics by HTTP agent from TiKV /metrics endpoint.
+
+ You can discuss this template or leave feedback on our forum https://www.zabbix.com/forum/zabbix-suggestions-and-feedback
+
+ Template tooling version used: 0.38
+ groups:
+ -
+ name: Templates/Databases
+ applications:
+ -
+ name: 'TiKV node'
+ -
+ name: 'Zabbix raw items'
+ items:
+ -
+ name: 'TiKV: Scheduler: High priority commands total, rate'
+ type: DEPENDENT
+ key: tikv.commands_pri.high.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'Total count of high priority commands per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_scheduler_commands_pri_total" && @.labels.priority == "high")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Scheduler: Low priority commands total, rate'
+ type: DEPENDENT
+ key: tikv.commands_pri.low.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'Total count of low priority commands per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_scheduler_commands_pri_total" && @.labels.priority == "low")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Scheduler: Normal priority commands total, rate'
+ type: DEPENDENT
+ key: tikv.commands_pri.normal.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'Total count of normal priority commands per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_scheduler_commands_pri_total" && @.labels.priority == "normal")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Coprocessor: Requests, rate'
+ type: DEPENDENT
+ key: tikv.coprocessor_request.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+ description: 'Total number of coprocessor requests per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_coprocessor_request_duration_seconds_count")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Coprocessor: Errors, rate'
+ type: DEPENDENT
+ key: tikv.coprocessor_request_error.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+      description: 'Total number of push-down request errors per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_coprocessor_request_error")].value.sum()'
+ error_handler: DISCARD_VALUE
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ triggers:
+ -
+ expression: '{min(5m)}>{$TIKV.COPOCESSOR.ERRORS.MAX.WARN}'
+          name: 'TiKV: Too many coprocessor request errors (over {$TIKV.COPOCESSOR.ERRORS.MAX.WARN} in 5m)'
+ priority: WARNING
+ -
+ name: 'TiKV: Coprocessor: RocksDB ops, rate'
+ type: DEPENDENT
+ key: tikv.coprocessor_rocksdb_perf.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+ description: 'Total number of RocksDB internal operations from PerfContext per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_coprocessor_rocksdb_perf")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Coprocessor: Response size, rate'
+ type: DEPENDENT
+ key: tikv.coprocessor_scan_keys.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Bps
+      description: 'The total size of coprocessor responses per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_coprocessor_response_bytes")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: CPU util'
+ type: DEPENDENT
+ key: tikv.cpu.util
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: '%'
+      description: 'The CPU usage ratio on the TiKV instance.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_thread_cpu_seconds_total")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ -
+ type: MULTIPLIER
+ parameters:
+ - '100'
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Bytes read'
+ type: DEPENDENT
+ key: tikv.engine_flow_bytes.read
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Bps
+      description: 'The total bytes read in the TiKV instance.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_engine_flow_bytes" && @.labels.db == "kv" && @.labels.type =~ "bytes_read|iter_bytes_read")].value.sum()'
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Bytes write'
+ type: DEPENDENT
+ key: tikv.engine_flow_bytes.write
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Bps
+      description: 'The total bytes written in the TiKV instance.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_engine_flow_bytes" && @.labels.db == "kv" && @.labels.type == "wal_file_bytes")].value.first()'
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Store size'
+ type: DEPENDENT
+ key: tikv.engine_size
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: B
+      description: 'The storage size of the TiKV instance.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_engine_size_bytes")].value.sum()'
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Get instance metrics'
+ type: HTTP_AGENT
+ key: tikv.get_metrics
+ history: '0'
+ trends: '0'
+ value_type: TEXT
+ description: 'Get TiKV instance metrics.'
+ applications:
+ -
+ name: 'Zabbix raw items'
+ preprocessing:
+ -
+ type: CHECK_NOT_SUPPORTED
+ parameters:
+ - ''
+ -
+ type: PROMETHEUS_TO_JSON
+ parameters:
+ - ''
+ url: '{$TIKV.URL}:{$TIKV.PORT}/metrics'
+ -
+ name: 'TiKV: Total query, rate'
+ type: DEPENDENT
+ key: tikv.grpc_msg.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+      description: 'The total QPS in the TiKV instance.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_grpc_msg_duration_seconds_count")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Total query errors, rate'
+ type: DEPENDENT
+ key: tikv.grpc_msg_fail.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+      description: 'The total number of gRPC message handling failures per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_grpc_msg_fail_total")].value.sum()'
+ error_handler: DISCARD_VALUE
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Server: failure messages total, rate'
+ type: DEPENDENT
+ key: tikv.messages.failure.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'Total number of reporting failure messages per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_server_report_failure_msg_total")].value.sum()'
+ error_handler: DISCARD_VALUE
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Regions, count'
+ type: DEPENDENT
+ key: tikv.region_count
+ delay: '0'
+ history: 7d
+      description: 'The number of regions collected in the TiKV instance.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_raftstore_region_count" && @.labels.type == "region" )].value.first()'
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Regions, leader'
+ type: DEPENDENT
+ key: tikv.region_leader
+ delay: '0'
+ history: 7d
+      description: 'The number of leaders in the TiKV instance.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_raftstore_region_count" && @.labels.type == "leader" )].value.first()'
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: RSS memory usage'
+ type: DEPENDENT
+ key: tikv.rss_bytes
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: B
+ description: 'Resident memory size in bytes.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "process_resident_memory_bytes")].value.first()'
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Scheduler: Commands total, rate'
+ type: DEPENDENT
+ key: tikv.scheduler_commands.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'Total number of commands per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_scheduler_stage_total")].value.sum()'
+ error_handler: CUSTOM_VALUE
+ error_handler_params: '0'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Snapshot: Pending tasks'
+ type: DEPENDENT
+ key: tikv.scheduler_contex
+ delay: '0'
+ history: 7d
+      description: 'The number of tasks currently running or pending in the worker.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_worker_pending_task_total")].value.first()'
+ master_item:
+ key: tikv.get_metrics
+ triggers:
+ -
+ expression: '{min(5m)}>{$TIKV.PENDING_COMMANDS.MAX.WARN}'
+ name: 'TiKV: Too many pending commands (over {$TIKV.PENDING_COMMANDS.MAX.WARN} for 5m)'
+ priority: AVERAGE
+ -
+ expression: '{min(5m)}>{$TIKV.PENDING_TASKS.MAX.WARN}'
+            name: 'TiKV: Too many pending tasks (over {$TIKV.PENDING_TASKS.MAX.WARN} for 5m)'
+ priority: AVERAGE
+ -
+ name: 'TiKV: Scheduler: Busy, rate'
+ type: DEPENDENT
+ key: tikv.scheduler_too_busy.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+      description: 'The total number of too-busy scheduler events per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_scheduler_too_busy_total")].value.sum()'
+ error_handler: DISCARD_VALUE
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Snapshot: Applying'
+ type: DEPENDENT
+ key: tikv.snapshot.applying
+ delay: '0'
+ history: 7d
+ description: 'The total amount of raftstore snapshot traffic.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_raftstore_snapshot_traffic_total" && @.labels.type == "applying")].value.first()'
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Snapshot: Receiving'
+ type: DEPENDENT
+ key: tikv.snapshot.receiving
+ delay: '0'
+ history: 7d
+ description: 'The total amount of raftstore snapshot traffic.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_raftstore_snapshot_traffic_total" && @.labels.type == "receiving")].value.first()'
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Snapshot: Sending'
+ type: DEPENDENT
+ key: tikv.snapshot.sending
+ delay: '0'
+ history: 7d
+ description: 'The total amount of raftstore snapshot traffic.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_raftstore_snapshot_traffic_total" && @.labels.type == "sending")].value.first()'
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Storage: commands total, rate'
+ type: DEPENDENT
+ key: tikv.storage_command.rate
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'Total number of commands received per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_storage_command_total")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Available size'
+ type: DEPENDENT
+ key: tikv.store_size.available
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: B
+      description: 'The available capacity of the TiKV instance.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_store_size_bytes" && @.labels.type == "available")].value.first()'
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Capacity size'
+ type: DEPENDENT
+ key: tikv.store_size.capacity
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: B
+      description: 'The capacity size of the TiKV instance.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_store_size_bytes" && @.labels.type == "capacity")].value.first()'
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Uptime'
+ type: DEPENDENT
+ key: tikv.uptime
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: uptime
+      description: 'The runtime of the TiKV instance.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name=="process_start_time_seconds")].value.first()'
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ //use boottime to calculate uptime
+ return (Math.floor(Date.now()/1000)-Number(value));
+ master_item:
+ key: tikv.get_metrics
+ triggers:
+ -
+ expression: '{last()}<10m'
+ name: 'TiKV: has been restarted (uptime < 10m)'
+ priority: INFO
+          description: 'Uptime is less than 10 minutes.'
+ manual_close: 'YES'
+ discovery_rules:
+ -
+ name: 'Coprocessor metrics discovery'
+ type: DEPENDENT
+ key: tikv.coprocessor.discovery
+ delay: '0'
+      description: 'Discovers coprocessor metrics.'
+ item_prototypes:
+ -
+ name: 'TiKV: Coprocessor: {#REQ_TYPE} requests, rate'
+ type: DEPENDENT
+ key: 'tikv.coprocessor_request.rate[{#REQ_TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+ description: 'Total number of coprocessor requests per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_coprocessor_request_duration_seconds_count" && @.labels.req == "{#REQ_TYPE}")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Coprocessor: {#REQ_TYPE} errors, rate'
+ type: DEPENDENT
+ key: 'tikv.coprocessor_request_error.rate[{#REQ_TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+          description: 'Total number of push-down request errors per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_coprocessor_request_error" && @.labels.req == "{#REQ_TYPE}")].value.first()'
+ error_handler: DISCARD_VALUE
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Coprocessor: {#REQ_TYPE} RocksDB ops, rate'
+ type: DEPENDENT
+ key: 'tikv.coprocessor_rocksdb_perf.rate[{#REQ_TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+ description: 'Total number of RocksDB internal operations from PerfContext per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_coprocessor_rocksdb_perf" && @.labels.req == "{#REQ_TYPE}")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ -
+ name: 'TiKV: Coprocessor: {#REQ_TYPE} scan keys, rate'
+ type: DEPENDENT
+ key: 'tikv.coprocessor_scan_keys.rate[{#REQ_TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+ description: 'Total number of scan keys observed per request per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_coprocessor_scan_keys_count" && @.labels.req == "{#REQ_TYPE}")].value.first()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ master_item:
+ key: tikv.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_coprocessor_request_duration_seconds_count")]'
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ output = JSON.parse(value).map(function(item){
+ return {
+ "{#REQ_TYPE}": item.labels.req,
+ }})
+ return JSON.stringify({"data": output})
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ -
+ name: 'QPS metrics discovery'
+ type: DEPENDENT
+ key: tikv.qps.discovery
+ delay: '0'
+      description: 'Discovers QPS metrics.'
+ item_prototypes:
+ -
+ name: 'TiKV: Query: {#TYPE}, rate'
+ type: DEPENDENT
+ key: 'tikv.grpc_msg.rate[{#TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ units: Ops
+          description: 'The QPS per command in the TiKV instance.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_grpc_msg_duration_seconds_count" && @.labels.type == "{#TYPE}")].value.first()'
+              error_handler: CUSTOM_VALUE
+              error_handler_params: '0'
+            -
+              type: CHANGE_PER_SECOND
+              parameters:
+                - ''
+ master_item:
+ key: tikv.get_metrics
+ master_item:
+ key: tikv.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_grpc_msg_duration_seconds_count")]'
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ output = JSON.parse(value).map(function(item){
+ return {
+ "{#TYPE}": item.labels.type,
+ }})
+ return JSON.stringify({"data": output})
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ -
+ name: 'Scheduler metrics discovery'
+ type: DEPENDENT
+ key: tikv.scheduler.discovery
+ delay: '0'
+      description: 'Discovers scheduler metrics.'
+ item_prototypes:
+ -
+ name: 'TiKV: Scheduler: commands {#STAGE}, rate'
+ type: DEPENDENT
+ key: 'tikv.scheduler_stage.rate[{#STAGE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'Total number of commands on each stage per second.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_scheduler_stage_total" && @.labels.stage == "{#STAGE}")].value.sum()'
+ error_handler: CUSTOM_VALUE
+ error_handler_params: '0'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ master_item:
+ key: tikv.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_scheduler_stage_total")]'
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ var lookup = {},
+ result = [];
+
+ JSON.parse(value).forEach(function (item) {
+ var stage = item.labels.stage;
+ if (!(lookup[stage])) {
+ lookup[stage] = 1;
+ result.push({ "{#STAGE}": stage });
+ }
+ })
+
+ return JSON.stringify(result);
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ -
+ name: 'Server errors discovery'
+ type: DEPENDENT
+ key: tikv.server_report_failure.discovery
+ delay: '0'
+      description: 'Discovers server error metrics.'
+ item_prototypes:
+ -
+ name: 'TiKV: Store_id {#STORE_ID}: failure messages "{#TYPE}", rate'
+ type: DEPENDENT
+ key: 'tikv.messages.failure.rate[{#STORE_ID},{#TYPE}]'
+ delay: '0'
+ history: 7d
+ value_type: FLOAT
+ description: 'Total number of reporting failure messages. The metric has two labels: type and store_id. type represents the failure type, and store_id represents the destination peer store id.'
+ applications:
+ -
+ name: 'TiKV node'
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_server_report_failure_msg_total" && @.labels.store_id == "{#STORE_ID}" && @.labels.type == "{#TYPE}")].value.sum()'
+ -
+ type: CHANGE_PER_SECOND
+ parameters:
+ - ''
+ master_item:
+ key: tikv.get_metrics
+ trigger_prototypes:
+ -
+ expression: '{min(5m)}>{$TIKV.STORE.ERRORS.MAX.WARN}'
+ name: 'TiKV: Store_id {#STORE_ID}: Too many failure messages "{#TYPE}" (over {$TIKV.STORE.ERRORS.MAX.WARN} in 5m)'
+ discover: NO_DISCOVER
+ priority: WARNING
+              description: 'Indicates that the remote TiKV cannot be reached.'
+ master_item:
+ key: tikv.get_metrics
+ preprocessing:
+ -
+ type: JSONPATH
+ parameters:
+ - '$[?(@.name == "tikv_server_report_failure_msg_total")]'
+ error_handler: DISCARD_VALUE
+ -
+ type: JAVASCRIPT
+ parameters:
+ - |
+ output = JSON.parse(value).map(function(item){
+ return {
+ "{#STORE_ID}": item.labels.store_id,
+ "{#TYPE}": item.labels.type,
+
+ }})
+ return JSON.stringify({"data": output})
+ -
+ type: DISCARD_UNCHANGED_HEARTBEAT
+ parameters:
+ - 1h
+ overrides:
+ -
+ name: 'Too many unreachable messages trigger'
+ step: '1'
+ filter:
+ conditions:
+ -
+ macro: '{#TYPE}'
+ value: unreachable
+ formulaid: A
+ operations:
+ -
+ operationobject: TRIGGER_PROTOTYPE
+ operator: LIKE
+ value: 'Too many failure messages'
+ status: ENABLED
+ discover: DISCOVER
+ macros:
+ -
+ macro: '{$TIKV.COPOCESSOR.ERRORS.MAX.WARN}'
+ value: '1'
+ description: 'Maximum number of coprocessor request errors'
+ -
+ macro: '{$TIKV.PENDING_COMMANDS.MAX.WARN}'
+ value: '1'
+ description: 'Maximum number of pending commands'
+ -
+ macro: '{$TIKV.PENDING_TASKS.MAX.WARN}'
+ value: '1'
+ description: 'Maximum number of tasks currently running by the worker or pending'
+ -
+ macro: '{$TIKV.PORT}'
+ value: '20180'
+ description: 'The port of TiKV server metrics web endpoint'
+ -
+ macro: '{$TIKV.STORE.ERRORS.MAX.WARN}'
+ value: '1'
+ description: 'Maximum number of failure messages'
+ -
+ macro: '{$TIKV.URL}'
+ value: localhost
+ description: 'TiKV server URL'
+ graphs:
+ -
+ name: 'TiKV: Scheduler priority commands rate'
+ graph_items:
+ -
+ color: 1A7C11
+ item:
+ host: 'TiDB TiKV by HTTP'
+ key: tikv.commands_pri.normal.rate
+ -
+ sortorder: '1'
+ color: 2774A4
+ item:
+ host: 'TiDB TiKV by HTTP'
+ key: tikv.commands_pri.high.rate
+ -
+ sortorder: '2'
+ color: F63100
+ item:
+ host: 'TiDB TiKV by HTTP'
+ key: tikv.commands_pri.low.rate
+ -
+ name: 'TiKV: Snapshot state count'
+ graph_items:
+ -
+ color: 1A7C11
+ item:
+ host: 'TiDB TiKV by HTTP'
+ key: tikv.snapshot.applying
+ -
+ sortorder: '1'
+ color: 2774A4
+ item:
+ host: 'TiDB TiKV by HTTP'
+ key: tikv.snapshot.receiving
+ -
+ sortorder: '2'
+ color: F63100
+ item:
+ host: 'TiDB TiKV by HTTP'
+ key: tikv.snapshot.sending