author | GitLab Bot <gitlab-bot@gitlab.com> | 2022-04-20 13:00:54 +0300 |
---|---|---|
committer | GitLab Bot <gitlab-bot@gitlab.com> | 2022-04-20 13:00:54 +0300 |
commit | 3cccd102ba543e02725d247893729e5c73b38295 (patch) | |
tree | f36a04ec38517f5deaaacb5acc7d949688d1e187 /doc/development/stage_group_observability/dashboards/stage_group_dashboard.md | |
parent | 205943281328046ef7b4528031b90fbda70c75ac (diff) |
Add latest changes from gitlab-org/gitlab@14-10-stable-ee (v14.10.0-rc42)
Diffstat (limited to 'doc/development/stage_group_observability/dashboards/stage_group_dashboard.md')
-rw-r--r-- | doc/development/stage_group_observability/dashboards/stage_group_dashboard.md | 200 |
1 files changed, 200 insertions, 0 deletions
diff --git a/doc/development/stage_group_observability/dashboards/stage_group_dashboard.md b/doc/development/stage_group_observability/dashboards/stage_group_dashboard.md
new file mode 100644
index 00000000000..c1831cfce69
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/stage_group_dashboard.md
@@ -0,0 +1,200 @@
+---
+stage: Platforms
+group: Scalability
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Stage group dashboard
+
+The stage group dashboard is a generated dashboard that contains metrics
+for common components used by most stage groups. The dashboard is
+fully customizable and owned by the stage groups.
+
+This page explains what is on these dashboards, how to use their
+contents, and how they can be customized.
+
+## Dashboard contents
+
+### Error budget panels
+
+![28 day budget](img/stage_group_dashboards_28d_budget.png)
+
+The top panels display the [error budget](../index.md#error-budget).
+These panels always show the 28 days before the end time selected in the
+[time range controls](index.md#time-range-controls). This data doesn't
+follow the selected range. It does respect the filters for environment
+and stage.
+
+### Metrics panels
+
+![Metrics panels](img/stage_group_dashboards_metrics.png)
+
+Although most of the metrics displayed in the panels are self-explanatory in their title and nearby
+description, note the following:
+
+- The events are counted, measured, accumulated, collected, and stored as
+  [time series](https://prometheus.io/docs/concepts/data_model/). The data is calculated using
+  statistical methods to produce metrics. It means that metrics are approximately correct and
+  meaningful over a time period. They help you get an overview of the state of a system over time.
+  They are not meant to give you precise numbers of a discrete event.
+
+  If you need a higher level of accuracy, use another monitoring tool, such as
+  [logs](https://about.gitlab.com/handbook/engineering/monitoring/#logs).
+  Read the following examples for more explanations.
+- All the rate metrics' units are `requests per second`. The default aggregate time frame is 1 minute.
+
+  For example, a panel shows the requests per second number at `2020-12-25 00:42:00` to be `34.13`.
+  It means that during minute 42 (from `2020-12-25 00:42:00` to `2020-12-25 00:42:59`), the web
+  servers processed approximately `34.13 * 60 = ~ 2047` requests.
+- You might frequently encounter gotchas related to decimal fractions and rounding, especially
+  in low-traffic cases. For example, the error rate of `RepositoryUpdateMirrorWorker` at
+  `2020-12-25 02:04:00` is `0.07`, equivalent to `4.2` jobs per minute. The raw result is
+  `0.06666666667`, equivalent to 4 jobs per minute.
+- All the rate metrics are more accurate when the data is big enough. The default floating-point
+  precision is 2. In some extremely low-traffic panels, you can see `0.00`, even though there is
+  still some real traffic.
+
+To inspect the raw data of the panel for further calculation, select **Inspect** from the dropdown
+list of a panel. Queries, raw data, and panel JSON structure are available.
+Read more at [Grafana panel inspection](https://grafana.com/docs/grafana/latest/panels/inspect-panel/).
+
+All the dashboards are powered by [Grafana](https://grafana.com/), a frontend for displaying metrics.
+Grafana consumes the data returned from queries to the backend Prometheus data source, then presents it
+with visualizations. The stage group dashboards are built to serve the most common use cases with a
+limited set of filters and pre-built queries. Grafana provides a way to explore and visualize the
+metrics data with [Grafana Explore](https://grafana.com/docs/grafana/latest/explore/). This requires
+some knowledge of the [Prometheus PromQL query language](https://prometheus.io/docs/prometheus/latest/querying/basics/).
+
+## Example: Debugging with dashboards
+
+Example debugging workflow:
+
+1. A team member in the Code Review group has merged an MR which got deployed to production.
+1. To verify the deployment, you can check the
+   [Code Review group's dashboard](https://dashboards.gitlab.net/d/stage-groups-code_review/stage-groups-group-dashboard-create-code-review?orgId=1).
+1. The Sidekiq Error Rate panel shows an elevated error rate, specifically for `UpdateMergeRequestsWorker`.
+
+   ![Debug 1](img/stage_group_dashboards_debug_1.png)
+
+1. If you select **Kibana: Kibana Sidekiq failed request logs** in the **Extra links** section, you can filter for `UpdateMergeRequestsWorker` and read through the logs.
+
+   ![Debug 2](img/stage_group_dashboards_debug_2.png)
+
+1. With [Sentry](https://sentry.gitlab.net/gitlab/gitlabcom/) you can find the exception where you
+   can filter by transaction type and `correlation_id` from Kibana's result item.
+
+   ![Debug 3](img/stage_group_dashboards_debug_3.png)
+
+1. A precise exception, including a stack trace, job arguments, and other information should now appear.
+
+Happy debugging!
+
+## Customizing the dashboard
+
+All Grafana dashboards at GitLab are generated from the [Jsonnet files](https://github.com/grafana/grafonnet-lib)
+stored in [the runbooks project](https://gitlab.com/gitlab-com/runbooks/-/tree/master/dashboards).
+Particularly, the stage group dashboards definitions are stored in
+[`/dashboards/stage-groups`](https://gitlab.com/gitlab-com/runbooks/-/tree/master/dashboards/stage-groups).
+
+By convention, each group has a corresponding Jsonnet file. The dashboards are synced with GitLab
+[stage group data](https://gitlab.com/gitlab-com/www-gitlab-com/-/raw/master/data/stages.yml) every
+month.
+
+Expansion and customization are among the key principles used when we designed this system.
+To customize your group's dashboard, edit the corresponding file and follow the
+[Runbook workflow](https://gitlab.com/gitlab-com/runbooks/-/tree/master/dashboards#dashboard-source).
+The dashboard is updated after the MR is merged.
+
+Looking at an autogenerated file, for example,
+[`product_planning.dashboard.jsonnet`](https://gitlab.com/gitlab-com/runbooks/-/blob/master/dashboards/stage-groups/product_planning.dashboard.jsonnet):
+
+```jsonnet
+// This file is autogenerated using scripts/update_stage_groups_dashboards.rb
+// Please feel free to customize this file.
+local stageGroupDashboards = import './stage-group-dashboards.libsonnet';
+
+stageGroupDashboards.dashboard('product_planning')
+.stageGroupDashboardTrailer()
+```
+
+We provide basic customization to filter out the components essential to your group's activities.
+By default, only the `web`, `api`, and `sidekiq` components are available in the dashboard, while
+`git` is hidden. See [how to enable available components and optional graphs](#optional-graphs).
+
+You can also append further information or custom metrics to a dashboard. The following example
+adds some links and a total request rate to the top of the page:
+
+```jsonnet
+local stageGroupDashboards = import './stage-group-dashboards.libsonnet';
+local grafana = import 'github.com/grafana/grafonnet-lib/grafonnet/grafana.libsonnet';
+local basic = import 'grafana/basic.libsonnet';
+
+stageGroupDashboards.dashboard('source_code')
+.addPanel(
+  grafana.text.new(
+    title='Group information',
+    mode='markdown',
+    content=|||
+      Useful links for the Source Code Management group dashboard:
+      - [Issue list](https://gitlab.com/groups/gitlab-org/-/issues?scope=all&state=opened&label_name%5B%5D=repository)
+      - [Epic list](https://gitlab.com/groups/gitlab-org/-/epics?label_name[]=repository)
+    |||,
+  ),
+  gridPos={ x: 0, y: 0, w: 24, h: 4 }
+)
+.addPanel(
+  basic.timeseries(
+    title='Total Request Rate',
+    yAxisLabel='Requests per Second',
+    decimals=2,
+    query=|||
+      sum (
+        rate(gitlab_transaction_duration_seconds_count{
+          env='$environment',
+          environment='$environment',
+          feature_category=~'source_code_management',
+        }[$__interval])
+      )
+    |||
+  ),
+  gridPos={ x: 0, y: 0, w: 24, h: 7 }
+)
+.stageGroupDashboardTrailer()
+```
+
+![Stage Group Dashboard Customization](img/stage_group_dashboards_time_customization.png)
+
+<i class="fa fa-youtube-play youtube" aria-hidden="true"></i>
+If you want to see the workflow in action, we've recorded a pairing session on customizing a dashboard,
+available on [GitLab Unfiltered](https://youtu.be/shEd_eiUjdI).
+
+For deeper customization and more complicated metrics, visit the
+[Grafonnet lib](https://github.com/grafana/grafonnet-lib) project and the
+[GitLab Prometheus Metrics](../../../administration/monitoring/prometheus/gitlab_metrics.md#gitlab-prometheus-metrics)
+documentation.
+
+### Optional graphs
+
+Some graphs aren't relevant for all groups, so they aren't added to
+the dashboard by default. They can be added by customizing the
+dashboard.
+
+By default, only the `web`, `api`, and `sidekiq` metrics are
+shown. If you wish to see the metrics from the `git` fleet (or any
+other component that might be added in the future), you can configure it as follows:
+
+```jsonnet
+stageGroupDashboards
+.dashboard('source_code', components=stageGroupDashboards.supportedComponents)
+.stageGroupDashboardTrailer()
+```
+
+If your group is interested in Sidekiq job durations and their
+thresholds, you can add these graphs by calling the `.addSidekiqJobDurationByUrgency` function:
+
+```jsonnet
+stageGroupDashboards
+.dashboard('access')
+.addSidekiqJobDurationByUrgency()
+.stageGroupDashboardTrailer()
+```
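
Reviewer note (not part of the commit): the two optional-graph snippets at the end of the added file are shown separately. A minimal sketch of how they might be combined in one group file follows. It uses only names that appear in the diff, and it assumes that the object returned by `.dashboard(...)` exposes `.addSidekiqJobDurationByUrgency()` regardless of which `components` are passed. The diff does not confirm this, so check it against `stage-group-dashboards.libsonnet` in the runbooks project before relying on it.

```jsonnet
// Sketch only: combines the two customizations documented in the diff above.
// Assumption (not confirmed by the diff): passing `components=` does not change
// which chained helpers, such as addSidekiqJobDurationByUrgency, are available.
local stageGroupDashboards = import './stage-group-dashboards.libsonnet';

stageGroupDashboards
// Show every supported component, including the `git` fleet.
.dashboard('source_code', components=stageGroupDashboards.supportedComponents)
// Add the optional Sidekiq job duration graphs.
.addSidekiqJobDurationByUrgency()
// Close the dashboard, as every example in the diff does.
.stageGroupDashboardTrailer()
```

If the helpers do compose this way, the ordering mirrors the examples in the diff: `.dashboard(...)` first, optional panels next, and `.stageGroupDashboardTrailer()` last.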