|author|Yulia Chukina <yulia.chukina@zabbix.com>|2021-04-08 12:42:34 +0300|
|---|---|---|
|committer|Yulia Chukina <yulia.chukina@zabbix.com>|2021-04-08 12:42:34 +0300|
|commit|8a68b233d94510f0738d5c8ab5353c45c00580e7 (patch)| |
|tree|b7738653a1c186c5f6a495d2275775d254e14801 /templates| |
|parent|483b4bfa329b84d5946747d2d52383a53ad57fbe (diff)| |

[ZBXNEXT-6504] added Templates "TiDB by HTTP", "TiDB TiKV by HTTP" and "TiDB PD by HTTP"
Diffstat (limited to 'templates')
6 files changed, 3496 insertions, 0 deletions
diff --git a/templates/db/tidb_http/tidb_pd_http/README.md b/templates/db/tidb_http/tidb_pd_http/README.md
new file mode 100644
index 00000000000..33e0a0a245b
--- /dev/null
+++ b/templates/db/tidb_http/tidb_pd_http/README.md
@@ -0,0 +1,108 @@

# TiDB PD by HTTP

## Overview

For Zabbix version: 5.4 and higher
The template to monitor the PD server of a TiDB cluster by Zabbix that works without any external scripts.
Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

Template `TiDB PD by HTTP` — collects metrics by HTTP agent from the PD /metrics endpoint and from the monitoring API.
See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.

This template was tested on:

- TiDB cluster, version 4.0.10

## Setup

> See [Zabbix template operation](https://www.zabbix.com/documentation/5.4/manual/config/templates_out_of_the_box/http) for basic instructions.

This template works with the PD server of a TiDB cluster.
Internal service metrics are collected from the PD /metrics endpoint and from the monitoring API.
See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api.
Don't forget to change the macros {$PD.URL} and {$PD.PORT}.
Also, see the Macros section for a list of macros used to set trigger values.

## Zabbix configuration

No specific Zabbix configuration is required.

### Macros used

|Name|Description|Default|
|----|-----------|-------|
|{$PD.MISS_REGION.MAX.WARN} |<p>Maximum number of missed regions</p> |`100` |
|{$PD.PORT} |<p>The port of the PD server metrics web endpoint</p> |`2379` |
|{$PD.STORAGE_USAGE.MAX.WARN} |<p>Maximum percentage of cluster space used</p> |`80` |
|{$PD.URL} |<p>PD server URL</p> |`localhost` |

## Template links

There are no template links in this template.
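To see what the low-level discovery rules below operate on, it helps to replay one of their JavaScript preprocessing steps outside Zabbix. The sketch below is illustrative: the `sample` object mimics the array that Zabbix's PROMETHEUS_TO_JSON step produces from a metric line such as `pd_cluster_status{type="store_up_count"} 3` (the exact field layout is an assumption inferred from the JSONPath filters used in this template), and `clusterDiscovery` reproduces the JAVASCRIPT step of the "Cluster metrics discovery" rule, which emits a single `{#SINGLETON}` row when any `pd_cluster_status` metric was matched and an empty list otherwise.

```javascript
// Approximation of PROMETHEUS_TO_JSON output for one matched metric line
// (field names assumed from the template's JSONPath expressions).
var sample = JSON.stringify([
    { name: 'pd_cluster_status', labels: { type: 'store_up_count' }, value: '3' }
]);

// The "Cluster metrics discovery" JAVASCRIPT preprocessing step: the input
// is the JSON text left over from the preceding JSONPATH step.
function clusterDiscovery(value) {
    return JSON.stringify(value != '[]' ? [{ '{#SINGLETON}': '' }] : []);
}

console.log(clusterDiscovery(sample)); // one discovery row with an empty macro
console.log(clusterDiscovery('[]'));   // nothing matched: discover nothing
```

The `{#SINGLETON}` macro carries no value; the rule exists only so that cluster-wide item prototypes are created once the `pd_cluster_status` metric family appears on the endpoint.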

## Discovery rules

|Name|Description|Type|Key and additional info|
|----|-----------|----|----|
|Cluster metrics discovery |<p>Discovery of cluster specific metrics.</p> |DEPENDENT |pd.cluster.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="pd_cluster_status")]`</p><p>- JAVASCRIPT: `return JSON.stringify(value != "[]" ? [{'{#SINGLETON}': ''}] : []);`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
|Region labels discovery |<p>Discovery of region labels specific metrics.</p> |DEPENDENT |pd.region_labels.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_regions_label_level")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
|Region status discovery |<p>Discovery of region status specific metrics.</p> |DEPENDENT |pd.region_status.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_regions_status")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p><p>**Overrides:**</p><p>Too many missed regions trigger<br> - {#TYPE} MATCHES_REGEX `miss_peer_region_count`<br> - TRIGGER_PROTOTYPE LIKE `Too many missed regions` - DISCOVER</p><p>Unresponsive peers trigger<br> - {#TYPE} MATCHES_REGEX `down_peer_region_count`<br> - TRIGGER_PROTOTYPE LIKE `There are unresponsive peers` - DISCOVER</p> |
|Running scheduler discovery |<p>Discovery of scheduler specific metrics.</p> |DEPENDENT |pd.scheduler.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_status" && @.labels.type == "allow")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
|gRPC commands discovery |<p>Discovery of gRPC commands specific metrics.</p> |DEPENDENT |pd.grpc_command.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "grpc_server_handling_seconds_count")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
|Region discovery |<p>Discovery of region specific metrics.</p> |DEPENDENT |pd.region.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |

## Items collected

|Group|Name|Description|Type|Key and additional info|
|-----|----|-----------|----|---------------------|
|PD instance |PD: Status |<p>Status of PD instance.</p> |DEPENDENT |pd.status<p>**Preprocessing**:</p><p>- JSONPATH: `$.status`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 1`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
|PD instance |PD: GRPC Commands total, rate |<p>The rate at which gRPC commands are completed.</p> |DEPENDENT |pd.grpc_command.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "grpc_server_handling_seconds_count")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND</p> |
|PD instance |PD: Version |<p>Version of the PD instance.</p> |DEPENDENT |pd.version<p>**Preprocessing**:</p><p>- JSONPATH: `$.version`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `3h`</p> |
|PD instance |PD: Uptime |<p>The runtime of each PD instance.</p> |DEPENDENT |pd.uptime<p>**Preprocessing**:</p><p>- JSONPATH: `$.start_timestamp`</p><p>- JAVASCRIPT: `//use boottime to calculate uptime return (Math.floor(Date.now()/1000)-Number(value));`</p> |
|PD instance |PD: GRPC Commands: {#GRPC_METHOD}, rate |<p>The rate per command type at which gRPC commands are completed.</p> |DEPENDENT |pd.grpc_command.rate[{#GRPC_METHOD}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "grpc_server_handling_seconds_count" && @.labels.grpc_method == "{#GRPC_METHOD}")].value.first()`</p><p>- CHANGE_PER_SECOND</p> |
|TiDB cluster |TiDB cluster: Offline stores |<p>-</p> |DEPENDENT |pd.cluster_status.store_offline[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_offline_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
|TiDB cluster |TiDB cluster: Tombstone stores |<p>The count of tombstone stores.</p> |DEPENDENT |pd.cluster_status.store_tombstone[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_tombstone_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
|TiDB cluster |TiDB cluster: Down stores |<p>The count of down stores.</p> |DEPENDENT |pd.cluster_status.store_down[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_down_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
|TiDB cluster |TiDB cluster: Lowspace stores |<p>The count of low space stores.</p> |DEPENDENT |pd.cluster_status.store_low_space[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_low_space_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
|TiDB cluster |TiDB cluster: Unhealth stores |<p>The count of unhealthy stores.</p> |DEPENDENT |pd.cluster_status.store_unhealth[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_unhealth_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
|TiDB cluster |TiDB cluster: Disconnect stores |<p>The count of disconnected stores.</p> |DEPENDENT |pd.cluster_status.store_disconnected[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_disconnected_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
|TiDB cluster |TiDB cluster: Normal stores |<p>The count of healthy storage instances.</p> |DEPENDENT |pd.cluster_status.store_up[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "store_up_count")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
|TiDB cluster |TiDB cluster: Storage capacity |<p>The total storage capacity for this TiDB cluster.</p> |DEPENDENT |pd.cluster_status.storage_capacity[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "storage_capacity")].value.first()`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> |
|TiDB cluster |TiDB cluster: Storage size |<p>The storage size that is currently used by the TiDB cluster.</p> |DEPENDENT |pd.cluster_status.storage_size[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "storage_size")].value.first()`</p> |
|TiDB cluster |TiDB cluster: Number of regions |<p>The total count of cluster Regions.</p> |DEPENDENT |pd.cluster_status.leader_count[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "leader_count")].value.first()`</p> |
|TiDB cluster |TiDB cluster: Current peer count |<p>The current count of all cluster peers.</p> |DEPENDENT |pd.cluster_status.region_count[{#SINGLETON}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_cluster_status" && @.labels.type == "region_count")].value.first()`</p> |
|TiDB cluster |TiDB cluster: Regions label: {#TYPE} |<p>The number of Regions in different label levels.</p> |DEPENDENT |pd.region_labels[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_regions_label_level" && @.labels.type == "{#TYPE}")].value.first()`</p> |
|TiDB cluster |TiDB cluster: Regions status: {#TYPE} |<p>The health status of Regions indicated via the count of unusual Regions including pending peers, down peers, extra peers, offline peers, missing peers, learner peers and incorrect namespaces.</p> |DEPENDENT |pd.region_status[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_regions_status" && @.labels.type == "{#TYPE}")].value.first()`</p> |
|TiDB cluster |TiDB cluster: Scheduler status: {#KIND} |<p>The current running schedulers.</p> |DEPENDENT |pd.scheduler[{#KIND}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_status" && @.labels.type == "allow" && @.labels.kind == "{#KIND}")].value.first()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p> |
|TiDB cluster |PD: Region heartbeat: active, rate |<p>The count of heartbeats with the ok status per second.</p> |DEPENDENT |pd.region_heartbeat.ok.rate[{#STORE_ADDRESS}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.status == "ok" && @.labels.type == "report" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND</p> |
|TiDB cluster |PD: Region heartbeat: error, rate |<p>The count of heartbeats with the error status per second.</p> |DEPENDENT |pd.region_heartbeat.error.rate[{#STORE_ADDRESS}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.status == "err" && @.labels.type == "report" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND</p> |
|TiDB cluster |PD: Region heartbeat: total, rate |<p>The count of heartbeats reported to PD per instance per second.</p> |DEPENDENT |pd.region_heartbeat.rate[{#STORE_ADDRESS}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.type == "report" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND</p> |
|TiDB cluster |PD: Region schedule push: error, rate |<p>-</p> |DEPENDENT |pd.region_heartbeat.push.err.rate[{#STORE_ADDRESS}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.type == "push" && @.labels.address == "{#STORE_ADDRESS}" && @.labels.status == "err" )].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND</p> |
|TiDB cluster |PD: Region schedule push: ok, rate |<p>-</p> |DEPENDENT |pd.region_heartbeat.push.err.rate[{#STORE_ADDRESS}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.type == "push" && @.labels.address == "{#STORE_ADDRESS}" && @.labels.status == "ok" )].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND</p> |
|TiDB cluster |PD: Region schedule push: total, rate |<p>-</p> |DEPENDENT |pd.region_heartbeat.push.err.rate[{#STORE_ADDRESS}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.type == "push" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND</p> |
|Zabbix_raw_items |PD: Get instance metrics |<p>Get TiDB PD instance metrics.</p> |HTTP_AGENT |pd.get_metrics<p>**Preprocessing**:</p><p>- CHECK_NOT_SUPPORTED</p><p>- PROMETHEUS_TO_JSON</p> |
|Zabbix_raw_items |PD: Get instance status |<p>Get TiDB PD instance status info.</p> |HTTP_AGENT |pd.get_status<p>**Preprocessing**:</p><p>- CHECK_NOT_SUPPORTED</p> |

## Triggers

|Name|Description|Expression|Severity|Dependencies and additional info|
|----|-----------|----|----|----|
|PD: Instance is not responding |<p>-</p> |`{TEMPLATE_NAME:pd.status.last()}=0` |AVERAGE | |
|PD: Version has changed (new version: {ITEM.VALUE}) |<p>PD version has changed. Ack to close.</p> |`{TEMPLATE_NAME:pd.version.diff()}=1 and {TEMPLATE_NAME:pd.version.strlen()}>0` |INFO |<p>Manual close: YES</p> |
|PD: has been restarted (uptime < 10m) |<p>Uptime is less than 10 minutes.</p> |`{TEMPLATE_NAME:pd.uptime.last()}<10m` |INFO |<p>Manual close: YES</p> |
|TiDB cluster: There are offline TiKV nodes |<p>PD has not received a TiKV heartbeat for a long time.</p> |`{TEMPLATE_NAME:pd.cluster_status.store_down[{#SINGLETON}].last()}>0` |AVERAGE | |
|TiDB cluster: There are low space TiKV nodes |<p>Indicates that there is no sufficient space on the TiKV node.</p> |`{TEMPLATE_NAME:pd.cluster_status.store_low_space[{#SINGLETON}].last()}>0` |AVERAGE | |
|TiDB cluster: There are disconnected TiKV nodes |<p>PD does not receive a TiKV heartbeat within 20 seconds. Normally a TiKV heartbeat comes in every 10 seconds.</p> |`{TEMPLATE_NAME:pd.cluster_status.store_disconnected[{#SINGLETON}].last()}>0` |WARNING | |
|TiDB cluster: Current storage usage is too high (over {$PD.STORAGE_USAGE.MAX.WARN}% for 5m) |<p>Over {$PD.STORAGE_USAGE.MAX.WARN}% of the cluster space is occupied.</p> |`{TEMPLATE_NAME:pd.cluster_status.storage_size[{#SINGLETON}].min(5m)}/{TiDB PD by HTTP:pd.cluster_status.storage_capacity[{#SINGLETON}].last()}*100>{$PD.STORAGE_USAGE.MAX.WARN}` |WARNING | |
|TiDB cluster: Too many missed regions (over {$PD.MISS_REGION.MAX.WARN} in 5m) |<p>The number of Region replicas is smaller than the value of max-replicas. When a TiKV machine is down and its downtime exceeds max-down-time, it usually leads to missing replicas for some Regions during a period of time. When a TiKV node is made offline, it might result in a small number of Regions with missing replicas.</p> |`{TEMPLATE_NAME:pd.region_status[{#TYPE}].min(5m)}>{$PD.MISS_REGION.MAX.WARN}` |WARNING | |
|TiDB cluster: There are unresponsive peers |<p>The number of Regions with an unresponsive peer reported by the Raft leader.</p> |`{TEMPLATE_NAME:pd.region_status[{#TYPE}].min(5m)}>0` |WARNING | |

## Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help with it at [ZABBIX forums](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback).

diff --git a/templates/db/tidb_http/tidb_pd_http/template_db_tidb_pd_http.yaml b/templates/db/tidb_http/tidb_pd_http/template_db_tidb_pd_http.yaml
new file mode 100644
index 00000000000..e53fccf2695
--- /dev/null
+++ b/templates/db/tidb_http/tidb_pd_http/template_db_tidb_pd_http.yaml
@@ -0,0 +1,874 @@
zabbix_export:
  version: '5.4'
  date: '2021-04-08T09:02:39Z'
  groups:
    -
      name: Templates/Databases
  templates:
    -
      template: 'TiDB PD by HTTP'
      name: 'TiDB PD by HTTP'
      description: |
        The template to monitor PD server of TiDB cluster by Zabbix that works without any external scripts.
        Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.
        Don't forget to change the macros {$PD.URL}, {$PD.PORT}.

        Template `TiDB PD by HTTP` — collects metrics by HTTP agent from PD /metrics endpoint and from monitoring API.

        You can discuss this template or leave feedback on our forum https://www.zabbix.com/forum/zabbix-suggestions-and-feedback

        Template tooling version used: 0.38
      groups:
        -
          name: Templates/Databases
      applications:
        -
          name: 'PD instance'
        -
          name: 'TiDB cluster'
        -
          name: 'Zabbix raw items'
      items:
        -
          name: 'PD: Get instance metrics'
          type: HTTP_AGENT
          key: pd.get_metrics
          history: '0'
          trends: '0'
          value_type: TEXT
          description: 'Get TiDB PD instance metrics.'
          applications:
            -
              name: 'Zabbix raw items'
          preprocessing:
            -
              type: CHECK_NOT_SUPPORTED
              parameters:
                - ''
            -
              type: PROMETHEUS_TO_JSON
              parameters:
                - ''
          url: '{$PD.URL}:{$PD.PORT}/metrics'
        -
          name: 'PD: Get instance status'
          type: HTTP_AGENT
          key: pd.get_status
          history: '0'
          trends: '0'
          value_type: TEXT
          description: 'Get TiDB PD instance status info.'
          applications:
            -
              name: 'Zabbix raw items'
          preprocessing:
            -
              type: CHECK_NOT_SUPPORTED
              parameters:
                - ''
              error_handler: CUSTOM_VALUE
              error_handler_params: '{"status": "0"}'
          url: '{$PD.URL}:{$PD.PORT}/pd/api/v1/status'
        -
          name: 'PD: GRPC Commands total, rate'
          type: DEPENDENT
          key: pd.grpc_command.rate
          delay: '0'
          history: 7d
          value_type: FLOAT
          description: 'The rate at which gRPC commands are completed.'
          applications:
            -
              name: 'PD instance'
          preprocessing:
            -
              type: JSONPATH
              parameters:
                - '$[?(@.name == "grpc_server_handling_seconds_count")].value.sum()'
              error_handler: DISCARD_VALUE
            -
              type: CHANGE_PER_SECOND
              parameters:
                - ''
          master_item:
            key: pd.get_metrics
        -
          name: 'PD: Status'
          type: DEPENDENT
          key: pd.status
          delay: '0'
          history: 7d
          trends: '0'
          value_type: CHAR
          description: 'Status of PD instance.'
          applications:
            -
              name: 'PD instance'
          valuemap:
            name: 'Service state'
          preprocessing:
            -
              type: JSONPATH
              parameters:
                - $.status
              error_handler: CUSTOM_VALUE
              error_handler_params: '1'
            -
              type: DISCARD_UNCHANGED_HEARTBEAT
              parameters:
                - 1h
          master_item:
            key: pd.get_status
          triggers:
            -
              expression: '{last()}=0'
              name: 'PD: Instance is not responding'
              priority: AVERAGE
        -
          name: 'PD: Uptime'
          type: DEPENDENT
          key: pd.uptime
          delay: '0'
          history: 7d
          value_type: FLOAT
          units: uptime
          description: 'The runtime of each PD instance.'
          applications:
            -
              name: 'PD instance'
          preprocessing:
            -
              type: JSONPATH
              parameters:
                - $.start_timestamp
            -
              type: JAVASCRIPT
              parameters:
                - |
                  //use boottime to calculate uptime
                  return (Math.floor(Date.now()/1000)-Number(value));
          master_item:
            key: pd.get_status
          triggers:
            -
              expression: '{last()}<10m'
              name: 'PD: has been restarted (uptime < 10m)'
              priority: INFO
              description: 'Uptime is less than 10 minutes'
              manual_close: 'YES'
        -
          name: 'PD: Version'
          type: DEPENDENT
          key: pd.version
          delay: '0'
          history: 7d
          trends: '0'
          value_type: CHAR
          description: 'Version of the PD instance.'
          applications:
            -
              name: 'PD instance'
          preprocessing:
            -
              type: JSONPATH
              parameters:
                - $.version
            -
              type: DISCARD_UNCHANGED_HEARTBEAT
              parameters:
                - 3h
          master_item:
            key: pd.get_status
          triggers:
            -
              expression: '{diff()}=1 and {strlen()}>0'
              name: 'PD: Version has changed (new version: {ITEM.VALUE})'
              priority: INFO
              description: 'PD version has changed. Ack to close.'
              manual_close: 'YES'
      discovery_rules:
        -
          name: 'Cluster metrics discovery'
          type: DEPENDENT
          key: pd.cluster.discovery
          delay: '0'
          description: 'Discovery cluster specific metrics.'
          item_prototypes:
            -
              name: 'TiDB cluster: Number of regions'
              type: DEPENDENT
              key: 'pd.cluster_status.leader_count[{#SINGLETON}]'
              delay: '0'
              history: 7d
              description: 'The total count of cluster Regions.'
              applications:
                -
                  name: 'TiDB cluster'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_cluster_status" && @.labels.type == "leader_count")].value.first()'
              master_item:
                key: pd.get_metrics
            -
              name: 'TiDB cluster: Current peer count'
              type: DEPENDENT
              key: 'pd.cluster_status.region_count[{#SINGLETON}]'
              delay: '0'
              history: 7d
              description: 'The current count of all cluster peers.'
              applications:
                -
                  name: 'TiDB cluster'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_cluster_status" && @.labels.type == "region_count")].value.first()'
              master_item:
                key: pd.get_metrics
            -
              name: 'TiDB cluster: Storage capacity'
              type: DEPENDENT
              key: 'pd.cluster_status.storage_capacity[{#SINGLETON}]'
              delay: '0'
              history: 7d
              value_type: FLOAT
              units: B
              description: 'The total storage capacity for this TiDB cluster.'
              applications:
                -
                  name: 'TiDB cluster'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_cluster_status" && @.labels.type == "storage_capacity")].value.first()'
                -
                  type: DISCARD_UNCHANGED_HEARTBEAT
                  parameters:
                    - 1h
              master_item:
                key: pd.get_metrics
            -
              name: 'TiDB cluster: Storage size'
              type: DEPENDENT
              key: 'pd.cluster_status.storage_size[{#SINGLETON}]'
              delay: '0'
              history: 7d
              value_type: FLOAT
              units: B
              description: 'The storage size that is currently used by the TiDB cluster.'
              applications:
                -
                  name: 'TiDB cluster'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_cluster_status" && @.labels.type == "storage_size")].value.first()'
              master_item:
                key: pd.get_metrics
            -
              name: 'TiDB cluster: Disconnect stores'
              type: DEPENDENT
              key: 'pd.cluster_status.store_disconnected[{#SINGLETON}]'
              delay: '0'
              history: 7d
              description: 'The count of disconnected stores.'
              applications:
                -
                  name: 'TiDB cluster'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_disconnected_count")].value.first()'
                -
                  type: DISCARD_UNCHANGED_HEARTBEAT
                  parameters:
                    - 1h
              master_item:
                key: pd.get_metrics
              trigger_prototypes:
                -
                  expression: '{last()}>0'
                  name: 'TiDB cluster: There are disconnected TiKV nodes'
                  priority: WARNING
                  description: 'PD does not receive a TiKV heartbeat within 20 seconds. Normally a TiKV heartbeat comes in every 10 seconds.'
            -
              name: 'TiDB cluster: Down stores'
              type: DEPENDENT
              key: 'pd.cluster_status.store_down[{#SINGLETON}]'
              delay: '0'
              history: 7d
              description: 'The count of down stores.'
              applications:
                -
                  name: 'TiDB cluster'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_down_count")].value.first()'
                -
                  type: DISCARD_UNCHANGED_HEARTBEAT
                  parameters:
                    - 1h
              master_item:
                key: pd.get_metrics
              trigger_prototypes:
                -
                  expression: '{last()}>0'
                  name: 'TiDB cluster: There are offline TiKV nodes'
                  priority: AVERAGE
                  description: 'PD has not received a TiKV heartbeat for a long time.'
            -
              name: 'TiDB cluster: Lowspace stores'
              type: DEPENDENT
              key: 'pd.cluster_status.store_low_space[{#SINGLETON}]'
              delay: '0'
              history: 7d
              description: 'The count of low space stores.'
              applications:
                -
                  name: 'TiDB cluster'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_low_space_count")].value.first()'
                -
                  type: DISCARD_UNCHANGED_HEARTBEAT
                  parameters:
                    - 1h
              master_item:
                key: pd.get_metrics
              trigger_prototypes:
                -
                  expression: '{last()}>0'
                  name: 'TiDB cluster: There are low space TiKV nodes'
                  priority: AVERAGE
                  description: 'Indicates that there is no sufficient space on the TiKV node.'
            -
              name: 'TiDB cluster: Offline stores'
              type: DEPENDENT
              key: 'pd.cluster_status.store_offline[{#SINGLETON}]'
              delay: '0'
              history: 7d
              applications:
                -
                  name: 'TiDB cluster'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_offline_count")].value.first()'
                -
                  type: DISCARD_UNCHANGED_HEARTBEAT
                  parameters:
                    - 1h
              master_item:
                key: pd.get_metrics
            -
              name: 'TiDB cluster: Tombstone stores'
              type: DEPENDENT
              key: 'pd.cluster_status.store_tombstone[{#SINGLETON}]'
              delay: '0'
              history: 7d
              description: 'The count of tombstone stores.'
              applications:
                -
                  name: 'TiDB cluster'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_tombstone_count")].value.first()'
                -
                  type: DISCARD_UNCHANGED_HEARTBEAT
                  parameters:
                    - 1h
              master_item:
                key: pd.get_metrics
            -
              name: 'TiDB cluster: Unhealth stores'
              type: DEPENDENT
              key: 'pd.cluster_status.store_unhealth[{#SINGLETON}]'
              delay: '0'
              history: 7d
              description: 'The count of unhealthy stores.'
              applications:
                -
                  name: 'TiDB cluster'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_unhealth_count")].value.first()'
                -
                  type: DISCARD_UNCHANGED_HEARTBEAT
                  parameters:
                    - 1h
              master_item:
                key: pd.get_metrics
            -
              name: 'TiDB cluster: Normal stores'
              type: DEPENDENT
              key: 'pd.cluster_status.store_up[{#SINGLETON}]'
              delay: '0'
              history: 7d
              description: 'The count of healthy storage instances.'
              applications:
                -
                  name: 'TiDB cluster'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_cluster_status" && @.labels.type == "store_up_count")].value.first()'
                -
                  type: DISCARD_UNCHANGED_HEARTBEAT
                  parameters:
                    - 1h
              master_item:
                key: pd.get_metrics
          trigger_prototypes:
            -
              expression: '{TiDB PD by HTTP:pd.cluster_status.storage_size[{#SINGLETON}].min(5m)}/{TiDB PD by HTTP:pd.cluster_status.storage_capacity[{#SINGLETON}].last()}*100>{$PD.STORAGE_USAGE.MAX.WARN}'
              name: 'TiDB cluster: Current storage usage is too high (over {$PD.STORAGE_USAGE.MAX.WARN}% for 5m)'
              priority: WARNING
              description: 'Over {$PD.STORAGE_USAGE.MAX.WARN}% of the cluster space is occupied.'
          graph_prototypes:
            -
              name: 'TiDB cluster: Storage Usage[{#SINGLETON}]'
              graph_items:
                -
                  color: 1A7C11
                  item:
                    host: 'TiDB PD by HTTP'
                    key: 'pd.cluster_status.storage_size[{#SINGLETON}]'
                -
                  sortorder: '1'
                  color: 2774A4
                  item:
                    host: 'TiDB PD by HTTP'
                    key: 'pd.cluster_status.storage_capacity[{#SINGLETON}]'
          master_item:
            key: pd.get_metrics
          preprocessing:
            -
              type: JSONPATH
              parameters:
                - '$[?(@.name=="pd_cluster_status")]'
              error_handler: CUSTOM_VALUE
              error_handler_params: '[]'
            -
              type: JAVASCRIPT
              parameters:
                - 'return JSON.stringify(value != "[]" ? [{''{#SINGLETON}'': ''''}] : []);'
            -
              type: DISCARD_UNCHANGED_HEARTBEAT
              parameters:
                - 1h
        -
          name: 'gRPC commands discovery'
          type: DEPENDENT
          key: pd.grpc_command.discovery
          delay: '0'
          description: 'Discovery grpc commands specific metrics.'
          item_prototypes:
            -
              name: 'PD: GRPC Commands: {#GRPC_METHOD}, rate'
              type: DEPENDENT
              key: 'pd.grpc_command.rate[{#GRPC_METHOD}]'
              delay: '0'
              history: 7d
              value_type: FLOAT
              description: 'The rate per command type at which gRPC commands are completed.'
              applications:
                -
                  name: 'PD instance'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "grpc_server_handling_seconds_count" && @.labels.grpc_method == "{#GRPC_METHOD}")].value.first()'
                -
                  type: CHANGE_PER_SECOND
                  parameters:
                    - ''
              master_item:
                key: pd.get_metrics
          master_item:
            key: pd.get_metrics
          preprocessing:
            -
              type: JSONPATH
              parameters:
                - '$[?(@.name == "grpc_server_handling_seconds_count")]'
              error_handler: DISCARD_VALUE
            -
              type: JAVASCRIPT
              parameters:
                - |
                  var lookup = {},
                      result = [];

                  JSON.parse(value).forEach(function (item) {
                      var grpc_method = item.labels.grpc_method;
                      if (!(lookup[grpc_method])) {
                          lookup[grpc_method] = 1;
                          result.push({ "{#GRPC_METHOD}": grpc_method });
                      }
                  })

                  return JSON.stringify(result);
            -
              type: DISCARD_UNCHANGED_HEARTBEAT
              parameters:
                - 1h
        -
          name: 'Region discovery'
          type: DEPENDENT
          key: pd.region.discovery
          delay: '0'
          description: 'Discovery region specific metrics.'
          item_prototypes:
            -
              name: 'PD: Region heartbeat: error, rate'
              type: DEPENDENT
              key: 'pd.region_heartbeat.error.rate[{#STORE_ADDRESS}]'
              delay: '0'
              history: 7d
              value_type: FLOAT
              description: 'The count of heartbeats with the error status per second.'
              application_prototypes:
                -
                  name: 'TiDB Store [{#STORE_ADDRESS}]'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.status == "err" && @.labels.type == "report" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()'
                  error_handler: CUSTOM_VALUE
                  error_handler_params: '0'
                -
                  type: CHANGE_PER_SECOND
                  parameters:
                    - ''
              master_item:
                key: pd.get_metrics
            -
              name: 'PD: Region heartbeat: active, rate'
              type: DEPENDENT
              key: 'pd.region_heartbeat.ok.rate[{#STORE_ADDRESS}]'
              delay: '0'
              history: 7d
              value_type: FLOAT
              description: 'The count of heartbeats with the ok status per second.'
              application_prototypes:
                -
                  name: 'TiDB Store [{#STORE_ADDRESS}]'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.status == "ok" && @.labels.type == "report" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()'
                  error_handler: CUSTOM_VALUE
                  error_handler_params: '0'
                -
                  type: CHANGE_PER_SECOND
                  parameters:
                    - ''
              master_item:
                key: pd.get_metrics
            -
              name: 'PD: Region schedule push: total, rate'
              type: DEPENDENT
              key: 'pd.region_heartbeat.push.err.rate[{#STORE_ADDRESS}]'
              delay: '0'
              history: 7d
              value_type: FLOAT
              application_prototypes:
                -
                  name: 'TiDB Store [{#STORE_ADDRESS}]'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.type == "push" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()'
                  error_handler: CUSTOM_VALUE
                  error_handler_params: '0'
                -
                  type: CHANGE_PER_SECOND
                  parameters:
                    - ''
              master_item:
                key: pd.get_metrics
            -
              name: 'PD: Region heartbeat: total, rate'
              type: DEPENDENT
              key: 'pd.region_heartbeat.rate[{#STORE_ADDRESS}]'
              delay: '0'
              history: 7d
              value_type: FLOAT
              description: 'The count of heartbeats reported to PD per instance per second.'
              application_prototypes:
                -
                  name: 'TiDB Store [{#STORE_ADDRESS}]'
              preprocessing:
                -
                  type: JSONPATH
                  parameters:
                    - '$[?(@.name == "pd_scheduler_region_heartbeat" && @.labels.type == "report" && @.labels.address == "{#STORE_ADDRESS}")].value.sum()'
                  error_handler: CUSTOM_VALUE
                  error_handler_params: '0'
                -
                  type: CHANGE_PER_SECOND
                  parameters:
                    - ''
              master_item:
                key: pd.get_metrics
          master_item:
            key: pd.get_metrics
          preprocessing:
            -
              type: JSONPATH
              parameters:
                - '$[?(@.name == "pd_scheduler_region_heartbeat")]'
              error_handler: DISCARD_VALUE
            -
              type: JAVASCRIPT
              parameters:
                - |
                  var lookup = {},
                      result = [];

                  JSON.parse(value).forEach(function (item) {
                      var address = item.labels.address;
                      if (!(lookup[address])) {
                          lookup[address] = 1;
                          result.push({ "{#STORE_ADDRESS}": address });
                      }
                  })

                  return JSON.stringify(result);
            -
              type: DISCARD_UNCHANGED_HEARTBEAT
              parameters:
                - 1h
        -
          name: 'Region labels discovery'
          type: DEPENDENT
          key: pd.region_labels.discovery
          delay: '0'
          description: 'Discovery region labels specific metrics.'
          item_prototypes:
            -
              name: 'TiDB cluster: Regions label: {#TYPE}'
              type: DEPENDENT
              key: 'pd.region_labels[{#TYPE}]'
              delay: '0'
              history: 7d
              value_type: FLOAT
              description: 'The number of Regions in different label levels.'
+ applications: + - + name: 'TiDB cluster' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "pd_regions_label_level" && @.labels.type == "{#TYPE}")].value.first()' + master_item: + key: pd.get_metrics + master_item: + key: pd.get_metrics + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "pd_regions_label_level")]' + error_handler: DISCARD_VALUE + - + type: JAVASCRIPT + parameters: + - | + output = JSON.parse(value).map(function(item){ + return { + "{#TYPE}": item.labels.type, + }}) + return JSON.stringify({"data": output}) + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + - + name: 'Region status discovery' + type: DEPENDENT + key: pd.region_status.discovery + delay: '0' + description: 'Discovery region status specific metrics.' + item_prototypes: + - + name: 'TiDB cluster: Regions status: {#TYPE}' + type: DEPENDENT + key: 'pd.region_status[{#TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + description: 'The health status of Regions indicated via the count of unusual Regions including pending peers, down peers, extra peers, offline peers, missing peers, learner peers and incorrect namespaces.' + applications: + - + name: 'TiDB cluster' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "pd_regions_status" && @.labels.type == "{#TYPE}")].value.first()' + master_item: + key: pd.get_metrics + trigger_prototypes: + - + expression: '{min(5m)}>0' + name: 'TiDB cluster: There are unresponsive peers' + discover: NO_DISCOVER + priority: WARNING + description: 'The number of Regions with an unresponsive peer reported by the Raft leader.' + - + expression: '{min(5m)}>{$PD.MISS_REGION.MAX.WARN}' + name: 'TiDB cluster: Too many missed regions (over {$PD.MISS_REGION.MAX.WARN} in 5m)' + discover: NO_DISCOVER + priority: WARNING + description: 'The number of Region replicas is smaller than the value of max-replicas. 
When a TiKV machine is down and its downtime exceeds max-down-time, it usually leads to missing replicas for some Regions during a period of time. When a TiKV node is made offline, it might result in a small number of Regions with missing replicas.' + master_item: + key: pd.get_metrics + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "pd_regions_status")]' + error_handler: DISCARD_VALUE + - + type: JAVASCRIPT + parameters: + - | + output = JSON.parse(value).map(function(item){ + return { + "{#TYPE}": item.labels.type, + }}) + return JSON.stringify({"data": output}) + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + overrides: + - + name: 'Too many missed regions trigger' + step: '1' + filter: + conditions: + - + macro: '{#TYPE}' + value: miss_peer_region_count + formulaid: A + operations: + - + operationobject: TRIGGER_PROTOTYPE + operator: LIKE + value: 'Too many missed regions' + status: ENABLED + discover: DISCOVER + - + name: 'Unresponsive peers trigger' + step: '2' + filter: + conditions: + - + macro: '{#TYPE}' + value: down_peer_region_count + formulaid: A + operations: + - + operationobject: TRIGGER_PROTOTYPE + operator: LIKE + value: 'There are unresponsive peers' + status: ENABLED + discover: DISCOVER + - + name: 'Running scheduler discovery' + type: DEPENDENT + key: pd.scheduler.discovery + delay: '0' + description: 'Discovery scheduler specific metrics.' + item_prototypes: + - + name: 'TiDB cluster: Scheduler status: {#KIND}' + type: DEPENDENT + key: 'pd.scheduler[{#KIND}]' + delay: '0' + history: 7d + value_type: FLOAT + description: 'The current running schedulers.' 
+ applications: + - + name: 'TiDB cluster' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "pd_scheduler_status" && @.labels.type == "allow" && @.labels.kind == "{#KIND}")].value.first()' + error_handler: CUSTOM_VALUE + error_handler_params: '0' + master_item: + key: pd.get_metrics + master_item: + key: pd.get_metrics + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "pd_scheduler_status" && @.labels.type == "allow")]' + error_handler: DISCARD_VALUE + - + type: JAVASCRIPT + parameters: + - | + output = JSON.parse(value).map(function(item){ + return { + "{#KIND}": item.labels.kind, + }}) + return JSON.stringify({"data": output}) + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + macros: + - + macro: '{$PD.MISS_REGION.MAX.WARN}' + value: '100' + description: 'Maximum number of missed regions' + - + macro: '{$PD.PORT}' + value: '2379' + description: 'The port of PD server metrics web endpoint' + - + macro: '{$PD.STORAGE_USAGE.MAX.WARN}' + value: '80' + description: 'Maximum percentage of cluster space used' + - + macro: '{$PD.URL}' + value: localhost + description: 'PD server URL' + valuemaps: + - + name: 'Service state' + mappings: + - + value: '0' + newvalue: Down + - + value: '1' + newvalue: Up diff --git a/templates/db/tidb_http/tidb_tidb_http/README.md b/templates/db/tidb_http/tidb_tidb_http/README.md new file mode 100644 index 00000000000..f02ed4f39b2 --- /dev/null +++ b/templates/db/tidb_http/tidb_tidb_http/README.md @@ -0,0 +1,131 @@ + +# TiDB by HTTP + +## Overview + +For Zabbix version: 5.4 and higher +The template to monitor TiDB server of TiDB cluster by Zabbix that works without any external scripts. +Most of the metrics are collected in one go, thanks to Zabbix bulk data collection. + +Template `TiDB by HTTP` — collects metrics by HTTP agent from TiDB /metrics endpoint and from monitoring API. +See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api. 
+ + +This template was tested on: + +- TiDB cluster, version 4.0.10 + +## Setup + +> See [Zabbix template operation](https://www.zabbix.com/documentation/5.4/manual/config/templates_out_of_the_box/http) for basic instructions. + +This template works with TiDB server of TiDB cluster. +Internal service metrics are collected from TiDB /metrics endpoint and from monitoring API. +See https://docs.pingcap.com/tidb/stable/tidb-monitoring-api. +Don't forget to change the macros {$TIDB.URL}, {$TIDB.PORT}. +Also, see the Macros section for a list of macros used to set trigger values. + + +## Zabbix configuration + +No specific Zabbix configuration is required. + +### Macros used + +|Name|Description|Default| +|----|-----------|-------| +|{$TIDB.DDL.WAITING.MAX.WARN} |<p>Maximum number of DDL tasks that are waiting</p> |`5` | +|{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN} |<p>Maximum number of GC-related operations failures</p> |`1` | +|{$TIDB.HEAP.USAGE.MAX.WARN} |<p>Maximum heap memory used</p> |`10G` | +|{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN} |<p>Minimum number of keep alive operations</p> |`10` | +|{$TIDB.OPEN.FDS.MAX.WARN} |<p>Maximum percentage of used file descriptors</p> |`90` | +|{$TIDB.PORT} |<p>The port of TiDB server metrics web endpoint</p> |`10080` | +|{$TIDB.REGION_ERROR.MAX.WARN} |<p>Maximum number of region related errors</p> |`50` | +|{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN} |<p>Maximum number of schema lease errors</p> |`0` | +|{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN} |<p>Maximum number of load schema errors</p> |`1` | +|{$TIDB.TIME_JUMP_BACK.MAX.WARN} |<p>Maximum number of times that the operating system rewinds every second</p> |`1` | +|{$TIDB.URL} |<p>TiDB server URL</p> |`localhost` | + +## Template links + +There are no template links in this template. 
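
The discovery rules below rely on JavaScript preprocessing steps that are too long to show in the tables. They all follow the same pattern: take the Prometheus-to-JSON output of the `tidb.get_metrics` master item and collapse it into a list of unique label values, emitted as LLD macros. A minimal standalone sketch of that pattern (the sample metrics are illustrative, not captured from a real TiDB node):

```javascript
// Sketch of the LLD preprocessing used by the discovery rules:
// parse Prometheus-to-JSON output and emit one LLD row per unique label value.
// The sample input below is illustrative, not real TiDB output.
var value = JSON.stringify([
    { name: 'tidb_server_query_total', labels: { type: 'Query', result: 'OK' }, value: '42' },
    { name: 'tidb_server_query_total', labels: { type: 'Query', result: 'Error' }, value: '1' },
    { name: 'tidb_server_query_total', labels: { type: 'StmtExecute', result: 'OK' }, value: '7' }
]);

var lookup = {},
    result = [];

// Deduplicate by label value so each {#TYPE} appears once.
JSON.parse(value).forEach(function (item) {
    var type = item.labels.type;
    if (!lookup[type]) {
        lookup[type] = 1;
        result.push({ '{#TYPE}': type });
    }
});

console.log(JSON.stringify({ data: result }));
// -> {"data":[{"{#TYPE}":"Query"},{"{#TYPE}":"StmtExecute"}]}
```

Zabbix LLD accepts both the `{"data": [...]}` wrapper used here and a bare array of rows, which is why some discovery rules in the templates return one form and some the other.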
+ +## Discovery rules + +|Name|Description|Type|Key and additional info| +|----|-----------|----|----| +|QPS metrics discovery |<p>Discovery QPS specific metrics.</p> |DEPENDENT |tidb.qps.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_server_query_total")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> | +|Statement metrics discovery |<p>Discovery statement specific metrics.</p> |DEPENDENT |tidb.statement.discover<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_executor_statement_total")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> | +|KV metrics discovery |<p>Discovery KV specific metrics.</p> |DEPENDENT |tidb.kv_ops.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> | +|Lock resolves discovery |<p>Discovery lock resolves specific metrics.</p> |DEPENDENT |tidb.tikvclient_lock_resolver_action.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_lock_resolver_actions_total")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> | +|KV backoff discovery |<p>Discovery KV backoff specific metrics.</p> |DEPENDENT |tidb.tikvclient_backoff.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_backoff_total")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> | +|GC action results discovery |<p>Discovery GC action results metrics.</p> |DEPENDENT |tidb.tikvclient_gc_action.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_gc_action_result")]`</p><p>- JAVASCRIPT: `Text is too long. 
Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p><p>**Overrides:**</p><p>Failed GC-related operations trigger<br> - {#TYPE} MATCHES_REGEX `failed`<br> - TRIGGER_PROTOTYPE LIKE `Too many failed GC-related operations` - DISCOVER</p> | + +## Items collected + +|Group|Name|Description|Type|Key and additional info| +|-----|----|-----------|----|---------------------| +|TiDB node |TiDB: Status |<p>Status of TiDB instance.</p> |DEPENDENT |tidb.status<p>**Preprocessing**:</p><p>- JSONPATH: `$.status`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 1`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> | +|TiDB node |TiDB: Total "error" server query, rate |<p>The number of queries on TiDB instance per second with failure of command execution results.</p> |DEPENDENT |tidb.server_query.error.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tidb_server_query_total" && @.labels.result == "Error")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: Total "ok" server query, rate |<p>The number of queries on TiDB instance per second with success of command execution results.</p> |DEPENDENT |tidb.server_query.ok.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tidb_server_query_total" && @.labels.result == "OK")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: Total server query, rate |<p>The number of queries per second on TiDB instance.</p> |DEPENDENT |tidb.server_query.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tidb_server_query_total")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: SQL statements, rate |<p>The total number of SQL statements executed per second.</p> |DEPENDENT |tidb.statement_total.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_executor_statement_total")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: Failed Query, rate |<p>The number of errors that occur when executing SQL statements per second (such as syntax errors and primary key conflicts).</p> |DEPENDENT 
|tidb.execute_error.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_server_execute_error_total")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: KV commands, rate |<p>The number of executed KV commands per second.</p> |DEPENDENT |tidb.tikvclient_txn.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: PD TSO commands, rate |<p>The number of TSO commands that TiDB obtains from PD per second.</p> |DEPENDENT |tidb.pd_tso_cmd.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="pd_client_cmd_handle_cmds_duration_seconds_count" && @.labels.type == "tso")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: PD TSO requests, rate |<p>The number of TSO requests that TiDB obtains from PD per second.</p> |DEPENDENT |tidb.pd_tso_request.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="pd_client_request_handle_requests_duration_seconds_count" && @.labels.type == "tso")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: TiClient region errors, rate |<p>The number of region related errors returned by TiKV per second.</p> |DEPENDENT |tidb.tikvclient_region_err.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_region_err_total")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: Lock resolves, rate |<p>The number of TiDB operations that resolve locks per second. When TiDB's read or write request encounters a lock, it tries to resolve the lock.</p> |DEPENDENT |tidb.tikvclient_lock_resolver_action.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_lock_resolver_actions_total")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: DDL waiting jobs |<p>The number of DDL tasks that are waiting.</p> |DEPENDENT |tidb.ddl_waiting_jobs<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_ddl_waiting_jobs")].value.sum()`</p> | +|TiDB node |TiDB: Load schema total, rate |<p>The statistics of the schemas that TiDB obtains from TiKV per second.</p> |DEPENDENT |tidb.domain_load_schema.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_domain_load_schema_total")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: Load schema failed, rate |<p>The total number of failures to reload the latest schema information in TiDB per second.</p> |DEPENDENT |tidb.domain_load_schema.failed.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_domain_load_schema_total" && @.labels.type == "failed")].value.first()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: Schema lease "outdate" errors , rate |<p>The number of schema lease errors per second. </p><p>"outdate" errors means that the schema cannot be updated, which is a more serious error and triggers an alert.</p> |DEPENDENT |tidb.session_schema_lease_error.outdate.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_session_schema_lease_error_total" && @.labels.type == "outdate")].value.first()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: Schema lease "change" errors, rate |<p>The number of schema lease errors per second. 
</p><p>"change" means that the schema has changed</p> |DEPENDENT |tidb.session_schema_lease_error.change.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_session_schema_lease_error_total" && @.labels.type == "change")].value.first()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: KV backoff, rate |<p>The number of errors returned by TiKV.</p> |DEPENDENT |tidb.tikvclient_backoff.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_backoff_total")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: Keep alive, rate |<p>The number of times that the metrics are refreshed on TiDB instance per minute.</p> |DEPENDENT |tidb.monitor_keep_alive.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_monitor_keep_alive_total")].value.first()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- SIMPLE_CHANGE | +|TiDB node |TiDB: Server connections |<p>The connection number of current TiDB instance.</p> |DEPENDENT |tidb.tidb_server_connections<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_server_connections")].value.first()`</p> | +|TiDB node |TiDB: Heap memory usage |<p>Number of heap bytes that are in use.</p> |DEPENDENT |tidb.heap_bytes<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="go_memstats_heap_inuse_bytes")].value.first()`</p> | +|TiDB node |TiDB: RSS memory usage |<p>Resident memory size in bytes.</p> |DEPENDENT |tidb.rss_bytes<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="process_resident_memory_bytes")].value.first()`</p> | +|TiDB node |TiDB: Goroutine count |<p>The number of Goroutines on TiDB instance.</p> |DEPENDENT |tidb.goroutines<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="go_goroutines")].value.first()`</p> | +|TiDB node |TiDB: Open file descriptors |<p>Number of open file descriptors.</p> |DEPENDENT |tidb.process_open_fds<p>**Preprocessing**:</p><p>- JSONPATH: 
`$[?(@.name=="process_open_fds")].value.first()`</p> | +|TiDB node |TiDB: Open file descriptors, max |<p>Maximum number of open file descriptors.</p> |DEPENDENT |tidb.process_max_fds<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="process_max_fds")].value.first()`</p> | +|TiDB node |TiDB: CPU |<p>Total user and system CPU usage ratio.</p> |DEPENDENT |tidb.cpu.util<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="process_cpu_seconds_total")].value.first()`</p><p>- CHANGE_PER_SECOND<p>- MULTIPLIER: `100`</p> | +|TiDB node |TiDB: Uptime |<p>The runtime of each TiDB instance.</p> |DEPENDENT |tidb.uptime<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="process_start_time_seconds")].value.first()`</p><p>- JAVASCRIPT: `//use boottime to calculate uptime return (Math.floor(Date.now()/1000)-Number(value)); `</p> | +|TiDB node |TiDB: Version |<p>Version of the TiDB instance.</p> |DEPENDENT |tidb.version<p>**Preprocessing**:</p><p>- JSONPATH: `$.version`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `3h`</p> | +|TiDB node |TiDB: Time jump back, rate |<p>The number of times that the operating system rewinds every second.</p> |DEPENDENT |tidb.monitor_time_jump_back.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_monitor_time_jump_back_total")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: Server critical error, rate |<p>The number of critical errors occurred in TiDB per second.</p> |DEPENDENT |tidb.tidb_server_critical_error_total.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_server_critical_error_total")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: Server panic, rate |<p>The number of panics occurred in TiDB per second.</p> |DEPENDENT |tidb.tidb_server_panic_total.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_server_panic_total")].value.first()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: Server query "OK": {#TYPE}, rate |<p>The number of queries on 
TiDB instance per second with success of command execution results.</p> |DEPENDENT |tidb.server_query.ok.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tidb_server_query_total" && @.labels.result == "OK" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: Server query "Error": {#TYPE}, rate |<p>The number of queries on TiDB instance per second with failure of command execution results.</p> |DEPENDENT |tidb.server_query.error.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tidb_server_query_total" && @.labels.result == "Error" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: SQL statements: {#TYPE}, rate |<p>The number of SQL statements executed per second.</p> |DEPENDENT |tidb.statement.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_executor_statement_total" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: KV Commands: {#TYPE}, rate |<p>The number of executed KV commands per second.</p> |DEPENDENT |tidb.tikvclient_txn.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: Lock resolves: {#TYPE}, rate |<p>The number of TiDB operations that resolve locks per second. When TiDB's read or write request encounters a lock, it tries to resolve the lock.</p> |DEPENDENT |tidb.tikvclient_lock_resolver_action.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_lock_resolver_actions_total" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: KV backoff: {#TYPE}, rate |<p>The number of errors returned by TiKV per second.</p> |DEPENDENT |tidb.tikvclient_backoff.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_backoff_total" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiDB node |TiDB: GC action result: {#TYPE}, rate |<p>The number of results of GC-related operations per second.</p> |DEPENDENT |tidb.tikvclient_gc_action.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="tidb_tikvclient_gc_action_result" && @.labels.type == "{#TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND | +|Zabbix_raw_items |TiDB: Get instance metrics |<p>Get TiDB instance metrics.</p> |HTTP_AGENT |tidb.get_metrics<p>**Preprocessing**:</p><p>- CHECK_NOT_SUPPORTED<p>- PROMETHEUS_TO_JSON | +|Zabbix_raw_items |TiDB: Get instance status |<p>Get TiDB instance status info.</p> |HTTP_AGENT |tidb.get_status<p>**Preprocessing**:</p><p>- CHECK_NOT_SUPPORTED | + +## Triggers + +|Name|Description|Expression|Severity|Dependencies and additional info| +|----|-----------|----|----|----| +|TiDB: Instance is not responding |<p>-</p> |`{TEMPLATE_NAME:tidb.status.last()}=0` |AVERAGE | | +|TiDB: Too many region related errors (over {$TIDB.REGION_ERROR.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tidb.tikvclient_region_err.rate.min(5m)}>{$TIDB.REGION_ERROR.MAX.WARN}` |AVERAGE | | +|TiDB: Too many DDL waiting jobs (over {$TIDB.DDL.WAITING.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tidb.ddl_waiting_jobs.min(5m)}>{$TIDB.DDL.WAITING.MAX.WARN}` |WARNING | | +|TiDB: Too many schema lease errors (over {$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tidb.domain_load_schema.failed.rate.min(5m)}>{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN}` |AVERAGE | | +|TiDB: Too many schema lease errors (over {$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN} for 5m) |<p>The latest schema information is not reloaded in TiDB within one lease.</p> 
|`{TEMPLATE_NAME:tidb.session_schema_lease_error.outdate.rate.min(5m)}>{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN}` |AVERAGE | | +|TiDB: Too few keep alive operations (less {$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN} for 5m) |<p>Indicates whether the TiDB process still exists. If the number of times for tidb_monitor_keep_alive_total increases less than 10 per minute, the TiDB process might already exit and an alert is triggered.</p> |`{TEMPLATE_NAME:tidb.monitor_keep_alive.rate.max(5m)}<{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN}` |AVERAGE | | +|TiDB: Heap memory usage is too high (over {$TIDB.HEAP.USAGE.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tidb.heap_bytes.min(5m)}>{$TIDB.HEAP.USAGE.MAX.WARN}` |WARNING | | +|TiDB: Current number of open files is too high (over {$TIDB.OPEN.FDS.MAX.WARN}% for 5m) |<p>"Heavy file descriptor usage (i.e., near the process’s file descriptor limit) indicates a potential file descriptor exhaustion issue."</p> |`{TEMPLATE_NAME:tidb.process_open_fds.min(5m)}/{TiDB by HTTP:tidb.process_max_fds.last()}*100>{$TIDB.OPEN.FDS.MAX.WARN}` |WARNING | | +|TiDB: has been restarted (uptime < 10m) |<p>Uptime is less than 10 minutes</p> |`{TEMPLATE_NAME:tidb.uptime.last()}<10m` |INFO |<p>Manual close: YES</p> | +|TiDB: Version has changed (new version: {ITEM.VALUE}) |<p>TiDB version has changed. Ack to close.</p> |`{TEMPLATE_NAME:tidb.version.diff()}=1 and {TEMPLATE_NAME:tidb.version.strlen()}>0` |INFO |<p>Manual close: YES</p> | +|TiDB: Too many time jump backs (over {$TIDB.TIME_JUMP_BACK.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tidb.monitor_time_jump_back.rate.min(5m)}>{$TIDB.TIME_JUMP_BACK.MAX.WARN}` |WARNING | | +|TiDB: There are panicked TiDB threads |<p>When a panic occurs, an alert is triggered. 
The thread is often recovered, otherwise, TiDB will frequently restart.</p> |`{TEMPLATE_NAME:tidb.tidb_server_panic_total.rate.last()}>0` |AVERAGE | | +|TiDB: Too many failed GC-related operations (over {$TIDB.GC_ACTIONS.ERRORS.MAX.WARN} in 5m) |<p>-</p> |`{TEMPLATE_NAME:tidb.tikvclient_gc_action.rate[{#TYPE}].min(5m)}>{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN}` |WARNING | | + +## Feedback + +Please report any issues with the template at https://support.zabbix.com + +You can also provide feedback, discuss the template or ask for help with it at [ZABBIX forums](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback). + diff --git a/templates/db/tidb_http/tidb_tidb_http/template_db_tidb_tidb_http.yaml b/templates/db/tidb_http/tidb_tidb_http/template_db_tidb_tidb_http.yaml new file mode 100644 index 00000000000..32fe1a7cc51 --- /dev/null +++ b/templates/db/tidb_http/tidb_tidb_http/template_db_tidb_tidb_http.yaml @@ -0,0 +1,1266 @@ +zabbix_export: + version: '5.4' + date: '2021-04-08T09:02:36Z' + groups: + - + name: Templates/Databases + templates: + - + template: 'TiDB by HTTP' + name: 'TiDB by HTTP' + description: | + The template to monitor TiDB server of TiDB cluster by Zabbix that works without any external scripts. + Most of the metrics are collected in one go, thanks to Zabbix bulk data collection. + Don't forget to change the macros {$TIDB.URL}, {$TIDB.PORT}. + + Template `TiDB by HTTP` — collects metrics by HTTP agent from TiDB /metrics endpoint and from monitoring API. + + You can discuss this template or leave feedback on our forum https://www.zabbix.com/forum/zabbix-suggestions-and-feedback + + Template tooling version used: 0.38 + groups: + - + name: Templates/Databases + applications: + - + name: 'TiDB node' + - + name: 'Zabbix raw items' + items: + - + name: 'TiDB: CPU' + type: DEPENDENT + key: tidb.cpu.util + delay: '0' + history: 7d + value_type: FLOAT + units: '%' + description: 'Total user and system CPU usage ratio.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="process_cpu_seconds_total")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + - + type: MULTIPLIER + parameters: + - '100' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: DDL waiting jobs' + type: DEPENDENT + key: tidb.ddl_waiting_jobs + delay: '0' + history: 7d + value_type: FLOAT + description: 'The number of DDL tasks that are waiting.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_ddl_waiting_jobs")].value.sum()' + master_item: + key: tidb.get_metrics + triggers: + - + expression: '{min(5m)}>{$TIDB.DDL.WAITING.MAX.WARN}' + name: 'TiDB: Too many DDL waiting jobs (over {$TIDB.DDL.WAITING.MAX.WARN} for 5m)' + priority: WARNING + - + name: 'TiDB: Load schema failed, rate' + type: DEPENDENT + key: tidb.domain_load_schema.failed.rate + delay: '0' + history: 7d + value_type: FLOAT + description: 'The total number of failures to reload the latest schema information in TiDB per second.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_domain_load_schema_total" && @.labels.type == "failed")].value.first()' + error_handler: DISCARD_VALUE + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + triggers: + - + expression: '{min(5m)}>{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN}' + name: 'TiDB: Too many schema lease errors (over {$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN} for 5m)' + priority: AVERAGE + - + name: 'TiDB: Load schema total, rate' + type: DEPENDENT + key: tidb.domain_load_schema.rate + delay: '0' + history: 7d + value_type: FLOAT + description: 'The statistics of the schemas that TiDB obtains from TiKV per second.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_domain_load_schema_total")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Failed Query, rate' + type: DEPENDENT + key: tidb.execute_error.rate + delay: '0' + history: 7d + value_type: FLOAT + description: 'The number of error occurred when executing SQL statements per second (such as syntax errors and primary key conflicts).' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_server_execute_error_total")].value.sum()' + error_handler: DISCARD_VALUE + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Get instance metrics' + type: HTTP_AGENT + key: tidb.get_metrics + history: '0' + trends: '0' + value_type: TEXT + description: 'Get TiDB instance metrics.' + applications: + - + name: 'Zabbix raw items' + preprocessing: + - + type: CHECK_NOT_SUPPORTED + parameters: + - '' + - + type: PROMETHEUS_TO_JSON + parameters: + - '' + url: '{$TIDB.URL}:{$TIDB.PORT}/metrics' + - + name: 'TiDB: Get instance status' + type: HTTP_AGENT + key: tidb.get_status + history: '0' + trends: '0' + value_type: TEXT + description: 'Get TiDB instance status info.' + applications: + - + name: 'Zabbix raw items' + preprocessing: + - + type: CHECK_NOT_SUPPORTED + parameters: + - '' + error_handler: CUSTOM_VALUE + error_handler_params: '{"status": "0"}' + url: '{$TIDB.URL}:{$TIDB.PORT}/status' + - + name: 'TiDB: Goroutine count' + type: DEPENDENT + key: tidb.goroutines + delay: '0' + history: 7d + description: 'The number of Goroutines on TiDB instance.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="go_goroutines")].value.first()' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Heap memory usage' + type: DEPENDENT + key: tidb.heap_bytes + delay: '0' + history: 7d + value_type: FLOAT + units: B + description: 'Number of heap bytes that are in use.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="go_memstats_heap_inuse_bytes")].value.first()' + master_item: + key: tidb.get_metrics + triggers: + - + expression: '{min(5m)}>{$TIDB.HEAP.USAGE.MAX.WARN}' + name: 'TiDB: Heap memory usage is too high (over {$TIDB.HEAP.USAGE.MAX.WARN} for 5m)' + priority: WARNING + - + name: 'TiDB: Keep alive, rate' + type: DEPENDENT + key: tidb.monitor_keep_alive.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The number of times that the metrics are refreshed on TiDB instance per minute.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_monitor_keep_alive_total")].value.first()' + error_handler: DISCARD_VALUE + - + type: SIMPLE_CHANGE + parameters: + - '' + master_item: + key: tidb.get_metrics + triggers: + - + expression: '{max(5m)}<{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN}' + name: 'TiDB: Too few keep alive operations (less {$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN} for 5m)' + priority: AVERAGE + description: 'Indicates whether the TiDB process still exists. If the number of times for tidb_monitor_keep_alive_total increases less than 10 per minute, the TiDB process might already exit and an alert is triggered.' + - + name: 'TiDB: Time jump back, rate' + type: DEPENDENT + key: tidb.monitor_time_jump_back.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The number of times that the operating system rewinds every second.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_monitor_time_jump_back_total")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + triggers: + - + expression: '{min(5m)}>{$TIDB.TIME_JUMP_BACK.MAX.WARN}' + name: 'TiDB: Too many time jump backs (over {$TIDB.TIME_JUMP_BACK.MAX.WARN} for 5m)' + priority: WARNING + - + name: 'TiDB: PD TSO commands, rate' + type: DEPENDENT + key: tidb.pd_tso_cmd.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The number of TSO commands that TiDB obtains from PD per second.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="pd_client_cmd_handle_cmds_duration_seconds_count" && @.labels.type == "tso")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: PD TSO requests, rate' + type: DEPENDENT + key: tidb.pd_tso_request.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The number of TSO requests that TiDB obtains from PD per second.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="pd_client_request_handle_requests_duration_seconds_count" && @.labels.type == "tso")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Open file descriptors, max' + type: DEPENDENT + key: tidb.process_max_fds + delay: '0' + history: 7d + value_type: FLOAT + description: 'Maximum number of open file descriptors.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="process_max_fds")].value.first()' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Open file descriptors' + type: DEPENDENT + key: tidb.process_open_fds + delay: '0' + history: 7d + value_type: FLOAT + description: 'Number of open file descriptors.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="process_open_fds")].value.first()' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: RSS memory usage' + type: DEPENDENT + key: tidb.rss_bytes + delay: '0' + history: 7d + value_type: FLOAT + units: B + description: 'Resident memory size in bytes.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="process_resident_memory_bytes")].value.first()' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Total "error" server query, rate' + type: DEPENDENT + key: tidb.server_query.error.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Qps + description: 'The number of queries on TiDB instance per second with failure of command execution results.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tidb_server_query_total" && @.labels.result == "Error")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Total "ok" server query, rate' + type: DEPENDENT + key: tidb.server_query.ok.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Qps + description: 'The number of queries on TiDB instance per second with success of command execution results.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tidb_server_query_total" && @.labels.result == "OK")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Total server query, rate' + type: DEPENDENT + key: tidb.server_query.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Qps + description: 'The number of queries per second on TiDB instance.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tidb_server_query_total")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Schema lease "change" errors, rate' + type: DEPENDENT + key: tidb.session_schema_lease_error.change.rate + delay: '0' + history: 7d + value_type: FLOAT + description: | + The number of schema lease errors per second. + "change" means that the schema has changed + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_session_schema_lease_error_total && @.labels.type == "change"")].value.first()' + error_handler: DISCARD_VALUE + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Schema lease "outdate" errors , rate' + type: DEPENDENT + key: tidb.session_schema_lease_error.outdate.rate + delay: '0' + history: 7d + value_type: FLOAT + description: | + The number of schema lease errors per second. + "outdate" errors means that the schema cannot be updated, which is a more serious error and triggers an alert. 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_session_schema_lease_error_total && @.labels.type == "outdate"")].value.first()' + error_handler: DISCARD_VALUE + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + triggers: + - + expression: '{min(5m)}>{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN}' + name: 'TiDB: Too many schema lease errors (over {$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN} for 5m)' + priority: AVERAGE + description: 'The latest schema information is not reloaded in TiDB within one lease.' + - + name: 'TiDB: SQL statements, rate' + type: DEPENDENT + key: tidb.statement_total.rate + delay: '0' + history: 7d + value_type: FLOAT + description: 'The total number of SQL statements executed per second.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_executor_statement_total")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Status' + type: DEPENDENT + key: tidb.status + delay: '0' + history: 7d + trends: '0' + value_type: CHAR + description: 'Status of PD instance.' + applications: + - + name: 'TiDB node' + valuemap: + name: 'Service state' + preprocessing: + - + type: JSONPATH + parameters: + - $.status + error_handler: CUSTOM_VALUE + error_handler_params: '1' + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + master_item: + key: tidb.get_status + triggers: + - + expression: '{last()}=0' + name: 'TiDB: Instance is not responding' + priority: AVERAGE + - + name: 'TiDB: Server connections' + type: DEPENDENT + key: tidb.tidb_server_connections + delay: '0' + history: 7d + description: 'The connection number of current TiDB instance.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_server_connections")].value.first()' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Server critical error, rate' + type: DEPENDENT + key: tidb.tidb_server_critical_error_total.rate + delay: '0' + history: 7d + value_type: FLOAT + description: 'The number of critical errors occurred in TiDB per second.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_server_critical_error_total")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Server panic, rate' + type: DEPENDENT + key: tidb.tidb_server_panic_total.rate + delay: '0' + history: 7d + value_type: FLOAT + description: 'The number of panics occurred in TiDB per second.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_server_panic_total")].value.first()' + error_handler: DISCARD_VALUE + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + triggers: + - + expression: '{last()}>0' + name: 'TiDB: There are panicked TiDB threads' + priority: AVERAGE + description: 'When a panic occurs, an alert is triggered. The thread is often recovered, otherwise, TiDB will frequently restart.' + - + name: 'TiDB: KV backoff, rate' + type: DEPENDENT + key: tidb.tikvclient_backoff.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The number of errors returned by TiKV.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_tikvclient_backoff_total")].value.sum()' + error_handler: DISCARD_VALUE + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Lock resolves, rate' + type: DEPENDENT + key: tidb.tikvclient_lock_resolver_action.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The number of DDL tasks that are waiting.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_tikvclient_lock_resolver_actions_total")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: TiClient region errors, rate' + type: DEPENDENT + key: tidb.tikvclient_region_err.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The number of region related errors returned by TiKV per second.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_tikvclient_region_err_total")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + triggers: + - + expression: '{min(5m)}>{$TIDB.REGION_ERROR.MAX.WARN}' + name: 'TiDB: Too many region related errors (over {$TIDB.REGION_ERROR.MAX.WARN} for 5m)' + priority: AVERAGE + - + name: 'TiDB: KV commands, rate' + type: DEPENDENT + key: tidb.tikvclient_txn.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The number of executed KV commands per second.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Uptime' + type: DEPENDENT + key: tidb.uptime + delay: '0' + history: 7d + value_type: FLOAT + units: uptime + description: 'The runtime of each TiDB instance.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="process_start_time_seconds")].value.first()' + - + type: JAVASCRIPT + parameters: + - | + //use boottime to calculate uptime + return (Math.floor(Date.now()/1000)-Number(value)); + master_item: + key: tidb.get_metrics + triggers: + - + expression: '{last()}<10m' + name: 'TiDB: has been restarted (uptime < 10m)' + priority: INFO + description: 'Uptime is less than 10 minutes' + manual_close: 'YES' + - + name: 'TiDB: Version' + type: DEPENDENT + key: tidb.version + delay: '0' + history: 7d + trends: '0' + value_type: CHAR + description: 'Version of the TiDB instance.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - $.version + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 3h + master_item: + key: tidb.get_status + triggers: + - + expression: '{diff()}=1 and {strlen()}>0' + name: 'TiDB: Version has changed (new version: {ITEM.VALUE})' + priority: INFO + description: 'TiDB version has changed. Ack to close.' + manual_close: 'YES' + discovery_rules: + - + name: 'KV metrics discovery' + type: DEPENDENT + key: tidb.kv_ops.discovery + delay: '0' + description: 'Discovery KV specific metrics.' + item_prototypes: + - + name: 'TiDB: KV Commands: {#TYPE}, rate' + type: DEPENDENT + key: 'tidb.tikvclient_txn.rate[{#TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The number of executed KV commands per second.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count" && @.labels.type == "{#TYPE}")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + master_item: + key: tidb.get_metrics + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_tikvclient_txn_cmd_duration_seconds_count")]' + - + type: JAVASCRIPT + parameters: + - | + output = JSON.parse(value).map(function(item){ + return { + "{#TYPE}": item.labels.type, + }}) + return JSON.stringify({"data": output}) + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + - + name: 'QPS metrics discovery' + type: DEPENDENT + key: tidb.qps.discovery + delay: '0' + description: 'Discovery QPS specific metrics.' + item_prototypes: + - + name: 'TiDB: Server query "Error": {#TYPE}, rate' + type: DEPENDENT + key: 'tidb.server_query.error.rate[{#TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + units: Qps + description: 'The number of queries on TiDB instance per second with failure of command execution results.' + applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tidb_server_query_total" && @.labels.result == "Error" && @.labels.type == "{#TYPE}")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + - + name: 'TiDB: Server query "OK": {#TYPE}, rate' + type: DEPENDENT + key: 'tidb.server_query.ok.rate[{#TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + units: Qps + description: 'The number of queries on TiDB instance per second with success of command execution results.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tidb_server_query_total" && @.labels.result == "OK" && @.labels.type == "{#TYPE}")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + master_item: + key: tidb.get_metrics + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_server_query_total")]' + - + type: JAVASCRIPT + parameters: + - | + var lookup = {}, + result = []; + + JSON.parse(value).forEach(function (item) { + var type = item.labels.type; + if (!(lookup[type])) { + lookup[type] = 1; + result.push({ "{#TYPE}": type }); + } + }) + + return JSON.stringify(result); + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + - + name: 'Statement metrics discovery' + type: DEPENDENT + key: tidb.statement.discover + delay: '0' + description: 'Discovery statement specific metrics.' + item_prototypes: + - + name: 'TiDB: SQL statements: {#TYPE}, rate' + type: DEPENDENT + key: 'tidb.statement.rate[{#TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + description: 'The number of SQL statements executed per second.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_executor_statement_total" && @.labels.type == "{#TYPE}")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + master_item: + key: tidb.get_metrics + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_executor_statement_total")]' + - + type: JAVASCRIPT + parameters: + - | + output = JSON.parse(value).map(function(item){ + return { + "{#TYPE}": item.labels.type, + }}) + return JSON.stringify({"data": output}) + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + - + name: 'KV backoff discovery' + type: DEPENDENT + key: tidb.tikvclient_backoff.discovery + delay: '0' + description: 'Discovery KV backoff specific metrics.' + item_prototypes: + - + name: 'TiDB: KV backoff: {#TYPE}, rate' + type: DEPENDENT + key: 'tidb.tikvclient_backoff.rate[{#TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The number of TiDB operations that resolve locks per second. When TiDB''s read or write request encounters a lock, it tries to resolve the lock.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_tikvclient_backoff_total" && @.labels.type == "{#TYPE}")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + master_item: + key: tidb.get_metrics + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_tikvclient_backoff_total")]' + error_handler: DISCARD_VALUE + - + type: JAVASCRIPT + parameters: + - | + output = JSON.parse(value).map(function(item){ + return { + "{#TYPE}": item.labels.type, + }}) + return JSON.stringify({"data": output}) + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + - + name: 'GC action results discovery' + type: DEPENDENT + key: tidb.tikvclient_gc_action.discovery + delay: '0' + description: 'Discovery GC action results metrics.' + item_prototypes: + - + name: 'TiDB: GC action result: {#TYPE}, rate' + type: DEPENDENT + key: 'tidb.tikvclient_gc_action.rate[{#TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The number of results of GC-related operations per second.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_tikvclient_gc_action_result" && @.labels.type == "{#TYPE}")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + trigger_prototypes: + - + expression: '{min(5m)}>{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN}' + name: 'TiDB: Too many failed GC-related operations (over {$TIDB.GC_ACTIONS.ERRORS.MAX.WARN} in 5m)' + discover: NO_DISCOVER + priority: WARNING + master_item: + key: tidb.get_metrics + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_tikvclient_gc_action_result")]' + error_handler: DISCARD_VALUE + - + type: JAVASCRIPT + parameters: + - | + output = JSON.parse(value).map(function(item){ + return { + "{#TYPE}": item.labels.type, + }}) + return JSON.stringify({"data": output}) + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + overrides: + - + name: 'Failed GC-related operations trigger' + step: '1' + filter: + conditions: + - + macro: '{#TYPE}' + value: failed + formulaid: A + operations: + - + operationobject: TRIGGER_PROTOTYPE + operator: LIKE + value: 'Too many failed GC-related operations' + status: ENABLED + discover: DISCOVER + - + name: 'Lock resolves discovery' + type: DEPENDENT + key: tidb.tikvclient_lock_resolver_action.discovery + delay: '0' + description: 'Discovery lock resolves specific metrics.' + item_prototypes: + - + name: 'TiDB: Lock resolves: {#TYPE}, rate' + type: DEPENDENT + key: 'tidb.tikvclient_lock_resolver_action.rate[{#TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The number of TiDB operations that resolve locks per second. When TiDB''s read or write request encounters a lock, it tries to resolve the lock.' 
+ applications: + - + name: 'TiDB node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_tikvclient_lock_resolver_actions_total" && @.labels.type == "{#TYPE}")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tidb.get_metrics + master_item: + key: tidb.get_metrics + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="tidb_tikvclient_lock_resolver_actions_total")]' + - + type: JAVASCRIPT + parameters: + - | + output = JSON.parse(value).map(function(item){ + return { + "{#TYPE}": item.labels.type, + }}) + return JSON.stringify({"data": output}) + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + macros: + - + macro: '{$TIDB.DDL.WAITING.MAX.WARN}' + value: '5' + description: 'Maximum number of DDL tasks that are waiting' + - + macro: '{$TIDB.GC_ACTIONS.ERRORS.MAX.WARN}' + value: '1' + description: 'Maximum number of GC-related operations failures' + - + macro: '{$TIDB.HEAP.USAGE.MAX.WARN}' + value: 10G + description: 'Maximum heap memory used' + - + macro: '{$TIDB.MONITOR_KEEP_ALIVE.MAX.WARN}' + value: '10' + description: 'Minimum number of keep alive operations' + - + macro: '{$TIDB.OPEN.FDS.MAX.WARN}' + value: '90' + description: 'Maximum percentage of used file descriptors' + - + macro: '{$TIDB.PORT}' + value: '10080' + description: 'The port of TiDB server metrics web endpoint' + - + macro: '{$TIDB.REGION_ERROR.MAX.WARN}' + value: '50' + description: 'Maximum number of region related errors' + - + macro: '{$TIDB.SCHEMA_LEASE_ERRORS.MAX.WARN}' + value: '0' + description: 'Maximum number of schema lease errors' + - + macro: '{$TIDB.SCHEMA_LOAD_ERRORS.MAX.WARN}' + value: '1' + description: 'Maximum number of load schema errors' + - + macro: '{$TIDB.TIME_JUMP_BACK.MAX.WARN}' + value: '1' + description: 'Maximum number of times that the operating system rewinds every second' + - + macro: '{$TIDB.URL}' + value: localhost + description: 'TiDB server URL' + valuemaps: + - + name: 
'Service state' + mappings: + - + value: '0' + newvalue: Down + - + value: '1' + newvalue: Up + triggers: + - + expression: '{TiDB by HTTP:tidb.process_open_fds.min(5m)}/{TiDB by HTTP:tidb.process_max_fds.last()}*100>{$TIDB.OPEN.FDS.MAX.WARN}' + name: 'TiDB: Current number of open files is too high (over {$TIDB.OPEN.FDS.MAX.WARN}% for 5m)' + priority: WARNING + description: '"Heavy file descriptor usage (i.e., near the process’s file descriptor limit) indicates a potential file descriptor exhaustion issue."' + graphs: + - + name: 'TiDB: File descriptors' + graph_items: + - + drawtype: GRADIENT_LINE + color: 1A7C11 + item: + host: 'TiDB by HTTP' + key: tidb.process_open_fds + - + sortorder: '1' + drawtype: BOLD_LINE + color: 2774A4 + item: + host: 'TiDB by HTTP' + key: tidb.process_max_fds + - + name: 'TiDB: Memory usage' + graph_items: + - + color: 1A7C11 + item: + host: 'TiDB by HTTP' + key: tidb.heap_bytes + - + sortorder: '1' + color: 2774A4 + item: + host: 'TiDB by HTTP' + key: tidb.rss_bytes + - + name: 'TiDB: Server query rate' + graph_items: + - + color: 1A7C11 + item: + host: 'TiDB by HTTP' + key: tidb.server_query.rate + - + sortorder: '1' + color: 2774A4 + item: + host: 'TiDB by HTTP' + key: tidb.server_query.ok.rate + - + sortorder: '2' + color: F63100 + item: + host: 'TiDB by HTTP' + key: tidb.server_query.error.rate diff --git a/templates/db/tidb_http/tidb_tikv_http/README.md b/templates/db/tidb_http/tidb_tikv_http/README.md new file mode 100644 index 00000000000..165b1a26f84 --- /dev/null +++ b/templates/db/tidb_http/tidb_tikv_http/README.md @@ -0,0 +1,112 @@ + +# TiDB TiKV by HTTP + +## Overview + +For Zabbix version: 5.4 and higher +The template to monitor TiKV server of TiDB cluster by Zabbix that works without any external scripts. +Most of the metrics are collected in one go, thanks to Zabbix bulk data collection. + +Template `TiDB TiKV by HTTP` — collects metrics by HTTP agent from TiKV /metrics endpoint. 
+ + +This template was tested on: + +- TiDB cluster, version 4.0.10 + +## Setup + +> See [Zabbix template operation](https://www.zabbix.com/documentation/5.4/manual/config/templates_out_of_the_box/http) for basic instructions. + +This template works with TiKV server of TiDB cluster. +Internal service metrics are collected from TiKV /metrics endpoint. +Don't forget to change the macros {$TIKV.URL}, {$TIKV.PORT}. +Also, see the Macros section for a list of macros used to set trigger values. + + +## Zabbix configuration + +No specific Zabbix configuration is required. + +### Macros used + +|Name|Description|Default| +|----|-----------|-------| +|{$TIKV.COPOCESSOR.ERRORS.MAX.WARN} |<p>Maximum number of coprocessor request errors</p> |`1` | +|{$TIKV.PENDING_COMMANDS.MAX.WARN} |<p>Maximum number of pending commands</p> |`1` | +|{$TIKV.PENDING_TASKS.MAX.WARN} |<p>Maximum number of tasks currently running by the worker or pending</p> |`1` | +|{$TIKV.PORT} |<p>The port of TiKV server metrics web endpoint</p> |`20180` | +|{$TIKV.STORE.ERRORS.MAX.WARN} |<p>Maximum number of failure messages</p> |`1` | +|{$TIKV.URL} |<p>TiKV server URL</p> |`localhost` | + +## Template links + +There are no template links in this template. + +## Discovery rules + +|Name|Description|Type|Key and additional info| +|----|-----------|----|----| +|QPS metrics discovery |<p>Discovery QPS metrics.</p> |DEPENDENT |tikv.qps.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_grpc_msg_duration_seconds_count")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> | +|Coprocessor metrics discovery |<p>Discovery coprocessor metrics.</p> |DEPENDENT |tikv.coprocessor.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_request_duration_seconds_count")]`</p><p>- JAVASCRIPT: `Text is too long. 
Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> | +|Scheduler metrics discovery |<p>Discovery scheduler metrics.</p> |DEPENDENT |tikv.scheduler.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_stage_total")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p> | +|Server errors discovery |<p>Discovery server errors metrics.</p> |DEPENDENT |tikv.server_report_failure.discovery<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_server_report_failure_msg_total")]`</p><p>- JAVASCRIPT: `Text is too long. Please see the template.`</p><p>- DISCARD_UNCHANGED_HEARTBEAT: `1h`</p><p>**Overrides:**</p><p>Too many unreachable messages trigger<br> - {#TYPE} MATCHES_REGEX `unreachable`<br> - TRIGGER_PROTOTYPE LIKE `Too many failure messages` - DISCOVER</p> | + +## Items collected + +|Group|Name|Description|Type|Key and additional info| +|-----|----|-----------|----|---------------------| +|TiKV node |TiKV: Store size |<p>The storage size of TiKV instance.</p> |DEPENDENT |tikv.engine_size<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_engine_size_bytes")].value.sum()`</p> | +|TiKV node |TiKV: Available size |<p>The available capacity of TiKV instance.</p> |DEPENDENT |tikv.store_size.available<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_store_size_bytes" && @.labels.type == "available")].value.first()`</p> | +|TiKV node |TiKV: Capacity size |<p>The capacity size of TiKV instance.</p> |DEPENDENT |tikv.store_size.capacity<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_store_size_bytes" && @.labels.type == "capacity")].value.first()`</p> | +|TiKV node |TiKV: Bytes read |<p>The total bytes of read in TiKV instance.</p> |DEPENDENT |tikv.engine_flow_bytes.read<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_engine_flow_bytes" && @.labels.db == "kv" && @.labels.type =~ 
"bytes_read|iter_bytes_read")].value.sum()`</p> | +|TiKV node |TiKV: Bytes write |<p>The total bytes of write in TiKV instance.</p> |DEPENDENT |tikv.engine_flow_bytes.write<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_engine_flow_bytes" && @.labels.db == "kv" && @.labels.type == "wal_file_bytes")].value.first()`</p> | +|TiKV node |TiKV: Storage: commands total, rate |<p>Total number of commands received per second.</p> |DEPENDENT |tikv.storage_command.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_storage_command_total")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: CPU util |<p>The CPU usage ratio on TiKV instance.</p> |DEPENDENT |tikv.cpu.util<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_thread_cpu_seconds_total")].value.sum()`</p><p>- CHANGE_PER_SECOND<p>- MULTIPLIER: `100`</p> | +|TiKV node |TiKV: RSS memory usage |<p>Resident memory size in bytes.</p> |DEPENDENT |tikv.rss_bytes<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "process_resident_memory_bytes")].value.first()`</p> | +|TiKV node |TiKV: Regions, count |<p>The number of regions collected in TiKV instance.</p> |DEPENDENT |tikv.region_count<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_raftstore_region_count" && @.labels.type == "region" )].value.first()`</p> | +|TiKV node |TiKV: Regions, leader |<p>The number of leaders in TiKV instance.</p> |DEPENDENT |tikv.region_leader<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_raftstore_region_count" && @.labels.type == "leader" )].value.first()`</p> | +|TiKV node |TiKV: Total query, rate |<p>The total QPS in TiKV instance.</p> |DEPENDENT |tikv.grpc_msg.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_grpc_msg_duration_seconds_count")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Total query errors, rate |<p>The total number of gRPC message handling failure per second.</p> |DEPENDENT 
|tikv.grpc_msg_fail.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_grpc_msg_fail_total")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Coprocessor: Errors, rate |<p>Total number of push down request error per second.</p> |DEPENDENT |tikv.coprocessor_request_error.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_request_error")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Coprocessor: Requests, rate |<p>Total number of coprocessor requests per second.</p> |DEPENDENT |tikv.coprocessor_request.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_request_duration_seconds_count")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Coprocessor: Scan keys, rate |<p>Total number of scan keys observed per request per second.</p> |DEPENDENT |tikv.coprocessor_scan_keys.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_scan_keys")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Coprocessor: RocksDB ops, rate |<p>Total number of RocksDB internal operations from PerfContext per second.</p> |DEPENDENT |tikv.coprocessor_rocksdb_perf.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_rocksdb_perf")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Coprocessor: Response size, rate |<p>The total size of coprocessor response per second.</p> |DEPENDENT |tikv.coprocessor_scan_keys.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_response_bytes")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Scheduler: Pending commands |<p>The total number of pending commands. 
The scheduler receives commands from clients, executes them against the MVCC layer storage engine.</p> |DEPENDENT |tikv.scheduler_contex<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_contex_total")].value.first()`</p> | +|TiKV node |TiKV: Scheduler: Busy, rate |<p>The total count of too busy schedulers per second.</p> |DEPENDENT |tikv.scheduler_too_busy.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_too_busy_total")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Scheduler: Commands total, rate |<p>Total number of commands per second.</p> |DEPENDENT |tikv.scheduler_commands.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_stage_total")].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Scheduler: Low priority commands total, rate |<p>Total count of low priority commands per second.</p> |DEPENDENT |tikv.commands_pri.low.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_commands_pri_total" && @.labels.priority == "low")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Scheduler: Normal priority commands total, rate |<p>Total count of normal priority commands per second.</p> |DEPENDENT |tikv.commands_pri.normal.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_commands_pri_total" && @.labels.priority == "normal")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Scheduler: High priority commands total, rate |<p>Total count of high priority commands per second.</p> |DEPENDENT |tikv.commands_pri.high.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_commands_pri_total" && @.labels.priority == "high")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Snapshot: Pending tasks |<p>The number of tasks currently running by the worker or pending.</p> |DEPENDENT 
|tikv.scheduler_contex<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_worker_pending_task_total")].value.first()`</p> | +|TiKV node |TiKV: Snapshot: Sending |<p>The total amount of raftstore snapshot traffic.</p> |DEPENDENT |tikv.snapshot.sending<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_raftstore_snapshot_traffic_total" && @.labels.type == "sending")].value.first()`</p> | +|TiKV node |TiKV: Snapshot: Receiving |<p>The total amount of raftstore snapshot traffic.</p> |DEPENDENT |tikv.snapshot.receiving<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_raftstore_snapshot_traffic_total" && @.labels.type == "receiving")].value.first()`</p> | +|TiKV node |TiKV: Snapshot: Applying |<p>The total amount of raftstore snapshot traffic.</p> |DEPENDENT |tikv.snapshot.applying<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_raftstore_snapshot_traffic_total" && @.labels.type == "applying")].value.first()`</p> | +|TiKV node |TiKV: Uptime |<p>The runtime of each TiKV instance.</p> |DEPENDENT |tikv.uptime<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name=="process_start_time_seconds")].value.first()`</p><p>- JAVASCRIPT: `//use boottime to calculate uptime return (Math.floor(Date.now()/1000)-Number(value)); `</p> | +|TiKV node |TiKV: Server: failure messages total, rate |<p>Total number of reporting failure messages per second.</p> |DEPENDENT |tikv.messages.failure.rate<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_server_report_failure_msg_total")].value.sum()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Query: {#TYPE}, rate |<p>The QPS per command in TiKV instance.</p> |DEPENDENT |tikv.grpc_msg.rate[{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_grpc_msg_duration_seconds_count" && @.labels.type == "{#TYPE}")].value.first()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> `</p> | +|TiKV node |TiKV: Coprocessor: {#REQ_TYPE} errors, rate |<p>Total number of push down 
request error per second.</p> |DEPENDENT |tikv.coprocessor_request_error.rate[{#REQ_TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_request_error" && @.labels.req == "{#REQ_TYPE}")].value.first()`</p><p>⛔️ON_FAIL: `DISCARD_VALUE -> `</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Coprocessor: {#REQ_TYPE} requests, rate |<p>Total number of coprocessor requests per second.</p> |DEPENDENT |tikv.coprocessor_request.rate[{#REQ_TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_request_duration_seconds_count" && @.labels.req == "{#REQ_TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Coprocessor: {#REQ_TYPE} scan keys, rate |<p>Total number of scan keys observed per request per second.</p> |DEPENDENT |tikv.coprocessor_scan_keys.rate[{#REQ_TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_scan_keys_count" && @.labels.req == "{#REQ_TYPE}")].value.first()`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Coprocessor: {#REQ_TYPE} RocksDB ops, rate |<p>Total number of RocksDB internal operations from PerfContext per second.</p> |DEPENDENT |tikv.coprocessor_rocksdb_perf.rate[{#REQ_TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_coprocessor_rocksdb_perf" && @.labels.req == "{#REQ_TYPE}")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Scheduler: commands {#STAGE}, rate |<p>Total number of commands on each stage per second.</p> |DEPENDENT |tikv.scheduler_stage.rate[{#STAGE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_scheduler_stage_total" && @.labels.stage == "{#STAGE}")].value.sum()`</p><p>⛔️ON_FAIL: `CUSTOM_VALUE -> 0`</p><p>- CHANGE_PER_SECOND | +|TiKV node |TiKV: Store_id {#STORE_ID}: failure messages "{#TYPE}", rate |<p>Total number of reporting failure messages. The metric has two labels: type and store_id. 
type represents the failure type, and store_id represents the destination peer store id.</p> |DEPENDENT |tikv.messages.failure.rate[{#STORE_ID},{#TYPE}]<p>**Preprocessing**:</p><p>- JSONPATH: `$[?(@.name == "tikv_server_report_failure_msg_total" && @.labels.store_id == "{#STORE_ID}" && @.labels.type == "{#TYPE}")].value.sum()`</p><p>- CHANGE_PER_SECOND | +|Zabbix_raw_items |TiKV: Get instance metrics |<p>Get TiKV instance metrics.</p> |HTTP_AGENT |tikv.get_metrics<p>**Preprocessing**:</p><p>- CHECK_NOT_SUPPORTED</p><p>- PROMETHEUS_TO_JSON | + +## Triggers + +|Name|Description|Expression|Severity|Dependencies and additional info| +|----|-----------|----|----|----| +|TiKV: Too many coprocessor request errors (over {$TIKV.COPOCESSOR.ERRORS.MAX.WARN} in 5m) |<p>-</p> |`{TEMPLATE_NAME:tikv.coprocessor_request_error.rate.min(5m)}>{$TIKV.COPOCESSOR.ERRORS.MAX.WARN}` |WARNING | | +|TiKV: Too many pending commands (over {$TIKV.PENDING_COMMANDS.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tikv.scheduler_contex.min(5m)}>{$TIKV.PENDING_COMMANDS.MAX.WARN}` |AVERAGE | | +|TiKV: Too many pending tasks (over {$TIKV.PENDING_TASKS.MAX.WARN} for 5m) |<p>-</p> |`{TEMPLATE_NAME:tikv.scheduler_contex.min(5m)}>{$TIKV.PENDING_TASKS.MAX.WARN}` |AVERAGE | | +|TiKV: has been restarted (uptime < 10m) |<p>Uptime is less than 10 minutes.</p> |`{TEMPLATE_NAME:tikv.uptime.last()}<10m` |INFO |<p>Manual close: YES</p> | +|TiKV: Store_id {#STORE_ID}: Too many failure messages "{#TYPE}" (over {$TIKV.STORE.ERRORS.MAX.WARN} in 5m) |<p>Indicates that the remote TiKV cannot be connected.</p> |`{TEMPLATE_NAME:tikv.messages.failure.rate[{#STORE_ID},{#TYPE}].min(5m)}>{$TIKV.STORE.ERRORS.MAX.WARN}` |WARNING | | + +## Feedback + +Please report any issues with the template at https://support.zabbix.com + +You can also provide feedback, discuss the template, or ask for help with it at [ZABBIX forums](https://www.zabbix.com/forum/zabbix-suggestions-and-feedback). 
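The items above share one pipeline: the master item fetches `{$TIKV.URL}:{$TIKV.PORT}/metrics`, a PROMETHEUS_TO_JSON step converts the exposition text to JSON, and each dependent item filters it with a JSONPath expression. The following is a minimal Python sketch of that chain for illustration only — the sample metrics text and helper names (`prometheus_to_json`, `first_value`) are hypothetical, not part of the template, and the parser is simplified (no escaped quotes or commas inside label values):

```python
# Sketch of the CHECK_NOT_SUPPORTED -> PROMETHEUS_TO_JSON -> JSONPATH chain
# used by the template's dependent items. Sample data is hypothetical.
import re

SAMPLE_METRICS = """\
# HELP tikv_store_size_bytes Size of storage.
# TYPE tikv_store_size_bytes gauge
tikv_store_size_bytes{type="available"} 1.847e+10
tikv_store_size_bytes{type="capacity"} 2.0e+10
process_resident_memory_bytes 4.2e+08
"""

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)

def prometheus_to_json(text):
    """Rough equivalent of the PROMETHEUS_TO_JSON preprocessing step:
    one {"name", "labels", "value"} object per sample line."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip HELP/TYPE comments and blanks
        m = LINE_RE.match(line)
        if not m:
            continue
        labels = {}
        if m.group('labels'):
            for pair in m.group('labels').split(','):
                k, v = pair.split('=', 1)
                labels[k] = v.strip('"')
        out.append({'name': m.group('name'),
                    'labels': labels,
                    'value': float(m.group('value'))})
    return out

def first_value(metrics, name, **labels):
    """Rough equivalent of a JSONPATH step like
    $[?(@.name == "..." && @.labels.type == "...")].value.first()"""
    for item in metrics:
        if item['name'] == name and all(
                item['labels'].get(k) == v for k, v in labels.items()):
            return item['value']
    return None

metrics = prometheus_to_json(SAMPLE_METRICS)
print(first_value(metrics, 'tikv_store_size_bytes', type='available'))  # 18470000000.0
```

Items that use `.sum()` instead of `.first()` (for example the CPU util and failure-message items) would aggregate all matching values rather than take the first; rate items then apply CHANGE_PER_SECOND on the Zabbix side.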
+ diff --git a/templates/db/tidb_http/tidb_tikv_http/template_db_tidb_tikv_http.yaml b/templates/db/tidb_http/tidb_tikv_http/template_db_tidb_tikv_http.yaml new file mode 100644 index 00000000000..74ae7c37684 --- /dev/null +++ b/templates/db/tidb_http/tidb_tikv_http/template_db_tidb_tikv_http.yaml @@ -0,0 +1,1005 @@ +zabbix_export: + version: '5.4' + date: '2021-04-08T09:02:42Z' + groups: + - + name: Templates/Databases + templates: + - + template: 'TiDB TiKV by HTTP' + name: 'TiDB TiKV by HTTP' + description: | + The template to monitor TiKV server of TiDB cluster by Zabbix that works without any external scripts. + Most of the metrics are collected in one go, thanks to Zabbix bulk data collection. + Don't forget to change the macros {$TIKV.URL}, {$TIKV.PORT}. + + Template `TiDB TiKV by HTTP` — collects metrics by HTTP agent from TiKV /metrics endpoint. + + You can discuss this template or leave feedback on our forum https://www.zabbix.com/forum/zabbix-suggestions-and-feedback + + Template tooling version used: 0.38 + groups: + - + name: Templates/Databases + applications: + - + name: 'TiKV node' + - + name: 'Zabbix raw items' + items: + - + name: 'TiKV: Scheduler: High priority commands total, rate' + type: DEPENDENT + key: tikv.commands_pri.high.rate + delay: '0' + history: 7d + value_type: FLOAT + description: 'Total count of high priority commands per second.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_scheduler_commands_pri_total" && @.labels.priority == "high")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Scheduler: Low priority commands total, rate' + type: DEPENDENT + key: tikv.commands_pri.low.rate + delay: '0' + history: 7d + value_type: FLOAT + description: 'Total count of low priority commands per second.' 
+ applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_scheduler_commands_pri_total" && @.labels.priority == "low")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Scheduler: Normal priority commands total, rate' + type: DEPENDENT + key: tikv.commands_pri.normal.rate + delay: '0' + history: 7d + value_type: FLOAT + description: 'Total count of normal priority commands per second.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_scheduler_commands_pri_total" && @.labels.priority == "normal")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Coprocessor: Requests, rate' + type: DEPENDENT + key: tikv.coprocessor_request.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'Total number of coprocessor requests per second.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_coprocessor_request_duration_seconds_count")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Coprocessor: Errors, rate' + type: DEPENDENT + key: tikv.coprocessor_request_error.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'Total number of push down request error per second.' 
+ applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_coprocessor_request_error")].value.sum()' + error_handler: DISCARD_VALUE + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + triggers: + - + expression: '{min(5m)}>{$TIKV.COPOCESSOR.ERRORS.MAX.WARN}' + name: 'TiKV: Too many coprocessor request errors (over {$TIKV.COPOCESSOR.ERRORS.MAX.WARN} in 5m)' + priority: WARNING + - + name: 'TiKV: Coprocessor: RocksDB ops, rate' + type: DEPENDENT + key: tikv.coprocessor_rocksdb_perf.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'Total number of RocksDB internal operations from PerfContext per second.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_coprocessor_rocksdb_perf")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Coprocessor: Response size, rate' + type: DEPENDENT + key: tikv.coprocessor_scan_keys.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Bps + description: 'The total size of coprocessor response per second.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_coprocessor_response_bytes")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: CPU util' + type: DEPENDENT + key: tikv.cpu.util + delay: '0' + history: 7d + value_type: FLOAT + units: '%' + description: 'The CPU usage ratio on TiKV instance.' 
+ applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_thread_cpu_seconds_total")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + - + type: MULTIPLIER + parameters: + - '100' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Bytes read' + type: DEPENDENT + key: tikv.engine_flow_bytes.read + delay: '0' + history: 7d + value_type: FLOAT + units: Bps + description: 'The total bytes of read in TiKV instance.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_engine_flow_bytes" && @.labels.db == "kv" && @.labels.type =~ "bytes_read|iter_bytes_read")].value.sum()' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Bytes write' + type: DEPENDENT + key: tikv.engine_flow_bytes.write + delay: '0' + history: 7d + value_type: FLOAT + units: Bps + description: 'The total bytes of write in TiKV instance.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_engine_flow_bytes" && @.labels.db == "kv" && @.labels.type == "wal_file_bytes")].value.first()' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Store size' + type: DEPENDENT + key: tikv.engine_size + delay: '0' + history: 7d + value_type: FLOAT + units: B + description: 'The storage size of TiKV instance.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_engine_size_bytes")].value.sum()' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Get instance metrics' + type: HTTP_AGENT + key: tikv.get_metrics + history: '0' + trends: '0' + value_type: TEXT + description: 'Get TiKV instance metrics.' 
+ applications: + - + name: 'Zabbix raw items' + preprocessing: + - + type: CHECK_NOT_SUPPORTED + parameters: + - '' + - + type: PROMETHEUS_TO_JSON + parameters: + - '' + url: '{$TIKV.URL}:{$TIKV.PORT}/metrics' + - + name: 'TiKV: Total query, rate' + type: DEPENDENT + key: tikv.grpc_msg.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The total QPS in TiKV instance.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_grpc_msg_duration_seconds_count")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Total query errors, rate' + type: DEPENDENT + key: tikv.grpc_msg_fail.rate + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The total number of gRPC message handling failure per second.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_grpc_msg_fail_total")].value.sum()' + error_handler: DISCARD_VALUE + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Server: failure messages total, rate' + type: DEPENDENT + key: tikv.messages.failure.rate + delay: '0' + history: 7d + value_type: FLOAT + description: 'Total number of reporting failure messages per second.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_server_report_failure_msg_total")].value.sum()' + error_handler: DISCARD_VALUE + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Regions, count' + type: DEPENDENT + key: tikv.region_count + delay: '0' + history: 7d + description: 'The number of regions collected in TiKV instance.' 
+ applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_raftstore_region_count" && @.labels.type == "region" )].value.first()' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Regions, leader' + type: DEPENDENT + key: tikv.region_leader + delay: '0' + history: 7d + description: 'The number of leaders in TiKV instance.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_raftstore_region_count" && @.labels.type == "leader" )].value.first()' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: RSS memory usage' + type: DEPENDENT + key: tikv.rss_bytes + delay: '0' + history: 7d + value_type: FLOAT + units: B + description: 'Resident memory size in bytes.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "process_resident_memory_bytes")].value.first()' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Scheduler: Commands total, rate' + type: DEPENDENT + key: tikv.scheduler_commands.rate + delay: '0' + history: 7d + value_type: FLOAT + description: 'Total number of commands per second.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_scheduler_stage_total")].value.sum()' + error_handler: CUSTOM_VALUE + error_handler_params: '0' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Snapshot: Pending tasks' + type: DEPENDENT + key: tikv.scheduler_contex + delay: '0' + history: 7d + description: 'The number of tasks currently running by the worker or pending.' 
+ applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_worker_pending_task_total")].value.first()' + master_item: + key: tikv.get_metrics + triggers: + - + expression: '{min(5m)}>{$TIKV.PENDING_COMMANDS.MAX.WARN}' + name: 'TiKV: Too many pending commands (over {$TIKV.PENDING_COMMANDS.MAX.WARN} for 5m)' + priority: AVERAGE + - + expression: '{min(5m)}>{$TIKV.PENDING_TASKS.MAX.WARN}' + name: 'TiKV: Too many pending tasks (over {$TIKV.PENDING_TASKS.MAX.WARN} for 5m)' + priority: AVERAGE + - + name: 'TiKV: Scheduler: Busy, rate' + type: DEPENDENT + key: tikv.scheduler_too_busy.rate + delay: '0' + history: 7d + value_type: FLOAT + description: 'The total count of too busy schedulers per second.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_scheduler_too_busy_total")].value.sum()' + error_handler: DISCARD_VALUE + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Snapshot: Applying' + type: DEPENDENT + key: tikv.snapshot.applying + delay: '0' + history: 7d + description: 'The total amount of raftstore snapshot traffic.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_raftstore_snapshot_traffic_total" && @.labels.type == "applying")].value.first()' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Snapshot: Receiving' + type: DEPENDENT + key: tikv.snapshot.receiving + delay: '0' + history: 7d + description: 'The total amount of raftstore snapshot traffic.' 
+ applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_raftstore_snapshot_traffic_total" && @.labels.type == "receiving")].value.first()' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Snapshot: Sending' + type: DEPENDENT + key: tikv.snapshot.sending + delay: '0' + history: 7d + description: 'The total amount of raftstore snapshot traffic.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_raftstore_snapshot_traffic_total" && @.labels.type == "sending")].value.first()' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Storage: commands total, rate' + type: DEPENDENT + key: tikv.storage_command.rate + delay: '0' + history: 7d + value_type: FLOAT + description: 'Total number of commands received per second.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_storage_command_total")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Available size' + type: DEPENDENT + key: tikv.store_size.available + delay: '0' + history: 7d + value_type: FLOAT + units: B + description: 'The available capacity of TiKV instance.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_store_size_bytes" && @.labels.type == "available")].value.first()' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Capacity size' + type: DEPENDENT + key: tikv.store_size.capacity + delay: '0' + history: 7d + value_type: FLOAT + units: B + description: 'The capacity size of TiKV instance.' 
+ applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_store_size_bytes" && @.labels.type == "capacity")].value.first()' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Uptime' + type: DEPENDENT + key: tikv.uptime + delay: '0' + history: 7d + value_type: FLOAT + units: uptime + description: 'The runtime of each TiKV instance.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name=="process_start_time_seconds")].value.first()' + - + type: JAVASCRIPT + parameters: + - | + //use boottime to calculate uptime + return (Math.floor(Date.now()/1000)-Number(value)); + master_item: + key: tikv.get_metrics + triggers: + - + expression: '{last()}<10m' + name: 'TiKV: has been restarted (uptime < 10m)' + priority: INFO + description: 'Uptime is less than 10 minutes' + manual_close: 'YES' + discovery_rules: + - + name: 'Coprocessor metrics discovery' + type: DEPENDENT + key: tikv.coprocessor.discovery + delay: '0' + description: 'Discovery coprocessor metrics.' + item_prototypes: + - + name: 'TiKV: Coprocessor: {#REQ_TYPE} requests, rate' + type: DEPENDENT + key: 'tikv.coprocessor_request.rate[{#REQ_TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'Total number of coprocessor requests per second.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_coprocessor_request_duration_seconds_count" && @.labels.req == "{#REQ_TYPE}")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Coprocessor: {#REQ_TYPE} errors, rate' + type: DEPENDENT + key: 'tikv.coprocessor_request_error.rate[{#REQ_TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'Total number of push down request error per second.' 
+ applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_coprocessor_request_error" && @.labels.req == "{#REQ_TYPE}")].value.first()' + error_handler: DISCARD_VALUE + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Coprocessor: {#REQ_TYPE} RocksDB ops, rate' + type: DEPENDENT + key: 'tikv.coprocessor_rocksdb_perf.rate[{#REQ_TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'Total number of RocksDB internal operations from PerfContext per second.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_coprocessor_rocksdb_perf" && @.labels.req == "{#REQ_TYPE}")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + - + name: 'TiKV: Coprocessor: {#REQ_TYPE} scan keys, rate' + type: DEPENDENT + key: 'tikv.coprocessor_scan_keys.rate[{#REQ_TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'Total number of scan keys observed per request per second.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_coprocessor_scan_keys_count" && @.labels.req == "{#REQ_TYPE}")].value.first()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + master_item: + key: tikv.get_metrics + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_coprocessor_request_duration_seconds_count")]' + - + type: JAVASCRIPT + parameters: + - | + output = JSON.parse(value).map(function(item){ + return { + "{#REQ_TYPE}": item.labels.req, + }}) + return JSON.stringify({"data": output}) + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + - + name: 'QPS metrics discovery' + type: DEPENDENT + key: tikv.qps.discovery + delay: '0' + description: 'Discovery QPS metrics.' 
+ item_prototypes: + - + name: 'TiKV: Query: {#TYPE}, rate' + type: DEPENDENT + key: 'tikv.grpc_msg.rate[{#TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + units: Ops + description: 'The QPS per command in TiKV instance.' + applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_grpc_msg_duration_seconds_count" && @.labels.type == "{#TYPE}")].value.first()' + error_handler: CUSTOM_VALUE + master_item: + key: tikv.get_metrics + master_item: + key: tikv.get_metrics + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_grpc_msg_duration_seconds_count")]' + - + type: JAVASCRIPT + parameters: + - | + output = JSON.parse(value).map(function(item){ + return { + "{#TYPE}": item.labels.type, + }}) + return JSON.stringify({"data": output}) + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + - + name: 'Scheduler metrics discovery' + type: DEPENDENT + key: tikv.scheduler.discovery + delay: '0' + description: 'Discovery scheduler metrics.' + item_prototypes: + - + name: 'TiKV: Scheduler: commands {#STAGE}, rate' + type: DEPENDENT + key: 'tikv.scheduler_stage.rate[{#STAGE}]' + delay: '0' + history: 7d + value_type: FLOAT + description: 'Total number of commands on each stage per second.' 
+ applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_scheduler_stage_total" && @.labels.stage == "{#STAGE}")].value.sum()' + error_handler: CUSTOM_VALUE + error_handler_params: '0' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + master_item: + key: tikv.get_metrics + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_scheduler_stage_total")]' + - + type: JAVASCRIPT + parameters: + - | + var lookup = {}, + result = []; + + JSON.parse(value).forEach(function (item) { + var stage = item.labels.stage; + if (!(lookup[stage])) { + lookup[stage] = 1; + result.push({ "{#STAGE}": stage }); + } + }) + + return JSON.stringify(result); + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + - + name: 'Server errors discovery' + type: DEPENDENT + key: tikv.server_report_failure.discovery + delay: '0' + description: 'Discovery server errors metrics.' + item_prototypes: + - + name: 'TiKV: Store_id {#STORE_ID}: failure messages "{#TYPE}", rate' + type: DEPENDENT + key: 'tikv.messages.failure.rate[{#STORE_ID},{#TYPE}]' + delay: '0' + history: 7d + value_type: FLOAT + description: 'Total number of reporting failure messages. The metric has two labels: type and store_id. type represents the failure type, and store_id represents the destination peer store id.' 
+ applications: + - + name: 'TiKV node' + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_server_report_failure_msg_total" && @.labels.store_id == "{#STORE_ID}" && @.labels.type == "{#TYPE}")].value.sum()' + - + type: CHANGE_PER_SECOND + parameters: + - '' + master_item: + key: tikv.get_metrics + trigger_prototypes: + - + expression: '{min(5m)}>{$TIKV.STORE.ERRORS.MAX.WARN}' + name: 'TiKV: Store_id {#STORE_ID}: Too many failure messages "{#TYPE}" (over {$TIKV.STORE.ERRORS.MAX.WARN} in 5m)' + discover: NO_DISCOVER + priority: WARNING + description: 'Indicates that the remote TiKV cannot be connected.' + master_item: + key: tikv.get_metrics + preprocessing: + - + type: JSONPATH + parameters: + - '$[?(@.name == "tikv_server_report_failure_msg_total")]' + error_handler: DISCARD_VALUE + - + type: JAVASCRIPT + parameters: + - | + output = JSON.parse(value).map(function(item){ + return { + "{#STORE_ID}": item.labels.store_id, + "{#TYPE}": item.labels.type, + + }}) + return JSON.stringify({"data": output}) + - + type: DISCARD_UNCHANGED_HEARTBEAT + parameters: + - 1h + overrides: + - + name: 'Too many unreachable messages trigger' + step: '1' + filter: + conditions: + - + macro: '{#TYPE}' + value: unreachable + formulaid: A + operations: + - + operationobject: TRIGGER_PROTOTYPE + operator: LIKE + value: 'Too many failure messages' + status: ENABLED + discover: DISCOVER + macros: + - + macro: '{$TIKV.COPOCESSOR.ERRORS.MAX.WARN}' + value: '1' + description: 'Maximum number of coprocessor request errors' + - + macro: '{$TIKV.PENDING_COMMANDS.MAX.WARN}' + value: '1' + description: 'Maximum number of pending commands' + - + macro: '{$TIKV.PENDING_TASKS.MAX.WARN}' + value: '1' + description: 'Maximum number of tasks currently running by the worker or pending' + - + macro: '{$TIKV.PORT}' + value: '20180' + description: 'The port of TiKV server metrics web endpoint' + - + macro: '{$TIKV.STORE.ERRORS.MAX.WARN}' + value: '1' + description: 'Maximum number 
of failure messages' + - + macro: '{$TIKV.URL}' + value: localhost + description: 'TiKV server URL' + graphs: + - + name: 'TiKV: Scheduler priority commands rate' + graph_items: + - + color: 1A7C11 + item: + host: 'TiDB TiKV by HTTP' + key: tikv.commands_pri.normal.rate + - + sortorder: '1' + color: 2774A4 + item: + host: 'TiDB TiKV by HTTP' + key: tikv.commands_pri.high.rate + - + sortorder: '2' + color: F63100 + item: + host: 'TiDB TiKV by HTTP' + key: tikv.commands_pri.low.rate + - + name: 'TiKV: Snapshot state count' + graph_items: + - + color: 1A7C11 + item: + host: 'TiDB TiKV by HTTP' + key: tikv.snapshot.applying + - + sortorder: '1' + color: 2774A4 + item: + host: 'TiDB TiKV by HTTP' + key: tikv.snapshot.receiving + - + sortorder: '2' + color: F63100 + item: + host: 'TiDB TiKV by HTTP' + key: tikv.snapshot.sending |