Add latest changes from gitlab-org/gitlab@14-10-stable-eev14.10.0-rc42

author: GitLab Bot <gitlab-bot@gitlab.com> 2022-04-20 13:00:54 +0300
committer: GitLab Bot <gitlab-bot@gitlab.com> 2022-04-20 13:00:54 +0300
commit: 3cccd102ba543e02725d247893729e5c73b38295 (patch)
tree: f36a04ec38517f5deaaacb5acc7d949688d1e187 /doc/development/stage_group_observability/dashboards
parent: 205943281328046ef7b4528031b90fbda70c75ac (diff)
16 files changed, 397 insertions, 0 deletions
diff --git a/doc/development/stage_group_observability/dashboards/error_budget_detail.md b/doc/development/stage_group_observability/dashboards/error_budget_detail.md
new file mode 100644
index 00000000000..19f98d404e7
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/error_budget_detail.md
@@ -0,0 +1,127 @@
+---
+stage: Platforms
+group: Scalability
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Error budget detail dashboard
+
+With error budget detailed dashboards you can explore the error budget
+spent at specific moments in time. By default, the dashboard shows
+the past 28 days. You can adjust it with the [time range controls](index.md#time-range-controls)
+or by selecting a range on one of the graphs.
+
+This dashboard is the same kind of dashboard we use for service level
+monitoring. For example, see the
+[overview dashboard for the web service](https://dashboards.gitlab.net/d/web-main) (GitLab internal).
+
+## Error budget panels
+
+On top of each dashboard, there's the same panel with the [error budget](../index.md#error-budget).
+Here, the time based targets adjust depending on the range.
+For example, while the budget was 20 minutes per 28 days, it is only 1/4 of that for 7 days:
+
+![5m budget in 7 days](img/error_budget_detail_7d_budget.png)
+
+Also, keep in mind that Grafana rounds the numbers. In this example the
+total time spent is 5 minutes and 24 seconds, so 24 seconds over
+budget.
+
+The attribution panels also show only failures that occurred
+within the selected range.
+
+These two panels represent a view of the "official" error budget: they
+take into account if an SLI was ignored.
+The [attribution panels](../index.md#check-where-budget-is-being-spent) show which components
+contributed the most over the selected period.
+
+The panels below take into account all SLIs that contribute to GitLab.com availability.
+This includes SLIs that are ignored for the official error budget.
+
+## Time series for aggregations
+
+The time series panels for aggregations all contain three panels:
+
+- Apdex: the [Apdex score](https://en.wikipedia.org/wiki/Apdex) for one or more SLIs. Higher score is better.
+- Error Ratio: the error ratio for one or more SLIs. Lower is better.
+- Requests Per Second: the number of operations per second. Higher means a bigger impact on the error budget.
+
+The Apdex and error-ratio panels also contain two alerting thresholds:
+
+- The one-hour threshold: the fast burn rate.
+
+  When this line is crossed, we've spent 2% of our monthly budget in the last hour.
+
+- The six-hour threshold: the slow burn rate.
+
+  When this line is crossed, we've spent 2% of our budget in the last six hours.
+
+If there is no error-ratio or Apdex for a certain SLI, the panel is hidden.
+
+Read more about these alerting windows in
+[Google SRE workbook](https://sre.google/workbook/alerting-on-slos/#recommended_time_windows_and_burn_rates_f).
+
+We don't have alerting on these metrics for stage groups.
+This work is being discussed in [epic 615](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/615).
+If this is something you would like for your group, let us know there.
+
+### Stage group aggregation
+
+![stage group aggregation graphs](img/error_budget_detail_stage_group_aggregation.png)
+
+The stage group aggregation shows a graph with the Apdex and errors
+portion of the error budget over time. The lower a dip in the Apdex
+graph or the higher a peak on the error ratio graph, the more budget
+was spent at that moment.
+
+The third graph shows the sum of all the request rates for all
+SLIs. Higher means there was more traffic.
+
+To zoom in on a particular moment where a lot of budget was spent, select the appropriate time in
+the graph.
+
+### Service-level indicators
+
+![Rails requests service level indicator](img/error_budget_detail_sli.png)
+
+This time series shows a breakdown of each SLI that could be contributing to the
+error budget for a stage group. Similar to the stage group
+aggregation, it contains an Apdex score, error ratio, and request
+rate.
+
+Here we also display an explanation panel, describing the SLI and
+linking to other monitoring tools. The links to logs (📖) or
+visualizations (📈) in Kibana are scoped to the feature categories
+for your stage group, and limited to the range selected. Keep in mind
+that we only keep logs in Kibana for seven days.
+
+In the graphs, there is a single line per service. In the previous example image,
+`rails_requests` is an SLI for the `web`, `api` and `git` services.
+
+Sidekiq is not included in this dashboard. We're tracking this in
+[epic 700](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/700).
+
+### SLI detail
+
+![Rails requests SLI detail](img/error_budget_detail_sli_detail.png)
+
+The SLI details row shows a breakdown of a specific SLI based on the
+labels present on the source metrics.
+
+For example, in the previous image, the `rails_requests` SLI has an `endpoint_id` label.
+We can show how much a certain endpoint was requested (RPS), and how much it contributed to the error
+budget spend.
+
+For Apdex we show the **Apdex Attribution** panel. The more prominent
+color is the one that contributed most to the spend. To see the
+top spending endpoint over the entire range, sort by the average.
+
+For error ratio we show an error rate. To see which label contributed most to the spend, sort by the
+average.
+
+We don't have endpoint information available for Rails errors. This work is being planned in
+[epic 663](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/663).
+
+The number of series to be loaded in the SLI details graphs is very
+high when compared to the other aggregations. Because of this, it's not possible to
+load more than a few days' worth of data.
diff --git a/doc/development/stage_group_observability/dashboards/img/error_budget_detail_7d_budget.png b/doc/development/stage_group_observability/dashboards/img/error_budget_detail_7d_budget.png
new file mode 100644
index 00000000000..1b2996d7d26
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/img/error_budget_detail_7d_budget.png
diff --git a/doc/development/stage_group_observability/dashboards/img/error_budget_detail_sli.png b/doc/development/stage_group_observability/dashboards/img/error_budget_detail_sli.png
new file mode 100644
index 00000000000..0472e35b0cb
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/img/error_budget_detail_sli.png
diff --git a/doc/development/stage_group_observability/dashboards/img/error_budget_detail_sli_detail.png b/doc/development/stage_group_observability/dashboards/img/error_budget_detail_sli_detail.png
new file mode 100644
index 00000000000..99530886ae9
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/img/error_budget_detail_sli_detail.png
diff --git a/doc/development/stage_group_observability/dashboards/img/error_budget_detail_stage_group_aggregation.png b/doc/development/stage_group_observability/dashboards/img/error_budget_detail_stage_group_aggregation.png
new file mode 100644
index 00000000000..d679637dcc4
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/img/error_budget_detail_stage_group_aggregation.png
diff --git a/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_28d_budget.png b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_28d_budget.png
new file mode 100644
index 00000000000..eb164dd3f68
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_28d_budget.png
diff --git a/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_annotation.png b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_annotation.png
new file mode 100644
index 00000000000..3776d87e5bb
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_annotation.png
diff --git a/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_debug_1.png b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_debug_1.png
new file mode 100644
index 00000000000..309fad89120
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_debug_1.png
diff --git a/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_debug_2.png b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_debug_2.png
new file mode 100644
index 00000000000..2aad9ab5592
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_debug_2.png
diff --git a/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_debug_3.png b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_debug_3.png
new file mode 100644
index 00000000000..38647410ffd
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_debug_3.png
diff --git a/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_filters.png b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_filters.png
new file mode 100644
index 00000000000..27a836bc36d
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_filters.png
diff --git a/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_metrics.png b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_metrics.png
new file mode 100644
index 00000000000..6b6faff6e3b
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_metrics.png
diff --git a/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_time_customization.png b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_time_customization.png
new file mode 100644
index 00000000000..49e61183b7c
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_time_customization.png
diff --git a/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_time_filter.png b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_time_filter.png
new file mode 100644
index 00000000000..81a3dc789f1
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/img/stage_group_dashboards_time_filter.png
diff --git a/doc/development/stage_group_observability/dashboards/index.md b/doc/development/stage_group_observability/dashboards/index.md
new file mode 100644
index 00000000000..f4e646c8634
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/index.md
@@ -0,0 +1,70 @@
+---
+stage: Platforms
+group: Scalability
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Dashboards for stage groups
+
+We generate a lot of dashboards acting as windows to the metrics we
+use to monitor GitLab.com. Most of our dashboards are generated from
+Jsonnet in the
+[runbooks repository](https://gitlab.com/gitlab-com/runbooks/-/tree/master/dashboards#dashboard-source).
+Anyone can contribute to these, adding new dashboards or modifying
+existing ones.
+
+When adding new dashboards for your stage groups, tagging them with
+`stage_group:<group name>` cross-links the dashboard on other
+dashboards with the same tag. You can create dashboards for stage groups
+in the [`dashboards/stage-groups`](https://gitlab.com/gitlab-com/runbooks/-/tree/master/dashboards/stage-groups)
+directory. Directories can't be nested more than one level deep.
+
+To see a list of all the dashboards for your stage group:
+
+1. In Grafana, go to the [Dashboard browser](https://dashboards.gitlab.net/dashboards?tag=stage-groups).
+1. To see all of the dashboards for a specific group, filter for `stage_group:<group name>`.
+
+Some generated dashboards are already available:
+
+1. [Stage group dashboard](stage_group_dashboard.md): a customizable
+   dashboard with tailored metrics per group.
+1. [Error budget detail dashboard](error_budget_detail.md): a
+   dashboard allowing to explore the error budget spend over time and
+   over multiple SLIs.
+
+## Time range controls
+
+![Default time filter](img/stage_group_dashboards_time_filter.png)
+
+By default, all the times are in UTC time zone.
+[We use UTC when communicating in Engineering.](https://about.gitlab.com/handbook/communication/#writing-style-guidelines)
+
+All metrics recorded in the GitLab production system have
+[one-year retention](https://gitlab.com/gitlab-cookbooks/gitlab-prometheus/-/blob/31526b03fef823e2f9b3cda7c75dcd28a12418a3/attributes/prometheus.rb#L40).
+
+You can also zoom in and filter the time range directly on a graph. For more information, see the
+[Grafana Time Range Controls](https://grafana.com/docs/grafana/latest/dashboards/time-range-controls/)
+documentation.
+
+## Filters and annotations
+
+On each dashboard, there are two filters and some annotation switches on the top of the page.
+
+Some special events are meaningful to development and operational activities.
+[Grafana annotations](https://grafana.com/docs/grafana/latest/dashboards/annotations/) mark them
+directly on the graphs.
+
+![Filters and annotations](img/stage_group_dashboards_filters.png)
+
+| Name            | Type       | Description |
+| --------------- | ---------- | ----------- |
+| `PROMETHEUS_DS` | filter     | Filter the selective [Prometheus data sources](https://about.gitlab.com/handbook/engineering/monitoring/#prometheus). The default value is `Global`, which aggregates the data from all available data sources. Most of the time, you don't need to care about this filter. |
+| `environment`   | filter     | Filter the environment the metrics are fetched from. The default setting is production (`gprd`). For other options, see [Production Environment mapping](https://about.gitlab.com/handbook/engineering/infrastructure/production/architecture/#environments). |
+| `stage`         | filter     | Filter metrics by stage: `main` or `cny` for canary. Default is `main` |
+| `deploy`        | annotation | Mark a deployment event on the GitLab.com SaaS platform. |
+| `canary-deploy` | annotation | Mark a [canary deployment](https://about.gitlab.com/handbook/engineering/#canary-testing) event on the GitLab.com SaaS platform. |
+| `feature-flags` | annotation | Mark the time point when a feature flag is updated. |
+
+Example of a feature flag annotation displayed on a dashboard panel:
+
+![Annotations](img/stage_group_dashboards_annotation.png)
diff --git a/doc/development/stage_group_observability/dashboards/stage_group_dashboard.md b/doc/development/stage_group_observability/dashboards/stage_group_dashboard.md
new file mode 100644
index 00000000000..c1831cfce69
--- /dev/null
+++ b/doc/development/stage_group_observability/dashboards/stage_group_dashboard.md
@@ -0,0 +1,200 @@
+---
+stage: Platforms
+group: Scalability
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Stage group dashboard
+
+The stage group dashboard is generated dashboard that contains metrics
+for common components used by most stage groups. The dashboard is
+fully customizable and owned by the stage groups.
+
+This page explains what is on these dashboards, how to use their
+contents, and how they can be customized.
+
+## Dashboard contents
+
+### Error budget panels
+
+![28 day budget](img/stage_group_dashboards_28d_budget.png)
+
+The top panels display the [error budget](../index.md#error-budget).
+These panels always show the 28 days before the end time selected in the
+[time range controls](index.md#time-range-controls). This data doesn't
+follow the selected range. It does respect the filters for environment
+and stage.
+
+### Metrics panels
+
+![Metrics panels](img/stage_group_dashboards_metrics.png)
+
+Although most of the metrics displayed in the panels are self-explanatory in their title and nearby
+description, note the following:
+
+- The events are counted, measured, accumulated, collected, and stored as
+  [time series](https://prometheus.io/docs/concepts/data_model/). The data is calculated using
+  statistical methods to produce metrics. It means that metrics are approximately correct and
+  meaningful over a time period. They help you get an overview of the stage of a system over time.
+  They are not meant to give you precise numbers of a discrete event.
+
+  If you need a higher level of accuracy, use another monitoring tool, such as
+  [logs](https://about.gitlab.com/handbook/engineering/monitoring/#logs).
+  Read the following examples for more explanations.
+- All the rate metrics' units are `requests per second`. The default aggregate time frame is 1 minute.
+
+  For example, a panel shows the requests per second number at `2020-12-25 00:42:00` to be `34.13`.
+  It means at the minute 42 (from `2020-12-25 00:42:00` to `2020-12-25 00:42:59` ), there are
+  approximately `34.13 * 60 = ~ 2047` requests processed by the web servers.
+- You might encounter some gotchas related to decimal fraction and rounding up frequently, especially
+  in low-traffic cases. For example, the error rate of `RepositoryUpdateMirrorWorker` at
+  `2020-12-25 02:04:00` is `0.07`, equivalent to `4.2` jobs per minute. The raw result is
+  `0.06666666667`, equivalent to 4 jobs per minute.
+- All the rate metrics are more accurate when the data is big enough. The default floating-point
+  precision is 2. In some extremely low panels, you can see `0.00`, even though there is still some
+  real traffic.
+
+To inspect the raw data of the panel for further calculation, select **Inspect** from the dropdown
+list of a panel. Queries, raw data, and panel JSON structure are available.
+Read more at [Grafana panel inspection](https://grafana.com/docs/grafana/latest/panels/inspect-panel/).
+
+All the dashboards are powered by [Grafana](https://grafana.com/), a frontend for displaying metrics.
+Grafana consumes the data returned from queries to backend Prometheus data source, then presents it
+with visualizations. The stage group dashboards are built to serve the most common use cases with a
+limited set of filters and pre-built queries. Grafana provides a way to explore and visualize the
+metrics data with [Grafana Explore](https://grafana.com/docs/grafana/latest/explore/). This requires
+some knowledge of the [Prometheus PromQL query language](https://prometheus.io/docs/prometheus/latest/querying/basics/).
+
+## Example: Debugging with dashboards
+
+Example debugging workflow:
+
+1. A team member in the Code Review group has merged an MR which got deployed to production.
+1. To verify the deployment, you can check the
+   [Code Review group's dashboard](https://dashboards.gitlab.net/d/stage-groups-code_review/stage-groups-group-dashboard-create-code-review?orgId=1).
+1. Sidekiq Error Rate panel shows an elevated error rate, specifically `UpdateMergeRequestsWorker`.
+
+  ![Debug 1](img/stage_group_dashboards_debug_1.png)
+
+1. If you select **Kibana: Kibana Sidekiq failed request logs** in the **Extra links** section, you can filter for `UpdateMergeRequestsWorker` and read through the logs.
+
+  ![Debug 2](img/stage_group_dashboards_debug_2.png)
+
+1. With [Sentry](https://sentry.gitlab.net/gitlab/gitlabcom/) you can find the exception where you
+   can filter by transaction type and `correlation_id` from Kibana's result item.
+
+  ![Debug 3](img/stage_group_dashboards_debug_3.png)
+
+1. A precise exception, including a stack trace, job arguments, and other information should now appear.
+
+Happy debugging!
+
+## Customizing the dashboard
+
+All Grafana dashboards at GitLab are generated from the [Jsonnet files](https://github.com/grafana/grafonnet-lib)
+stored in [the runbooks project](https://gitlab.com/gitlab-com/runbooks/-/tree/master/dashboards).
+Particularly, the stage group dashboards definitions are stored in
+[`/dashboards/stage-groups`](https://gitlab.com/gitlab-com/runbooks/-/tree/master/dashboards/stage-groups).
+
+By convention, each group has a corresponding Jsonnet file. The dashboards are synced with GitLab
+[stage group data](https://gitlab.com/gitlab-com/www-gitlab-com/-/raw/master/data/stages.yml) every
+month.
+
+Expansion and customization are one of the key principles used when we designed this system.
+To customize your group's dashboard, edit the corresponding file and follow the
+[Runbook workflow](https://gitlab.com/gitlab-com/runbooks/-/tree/master/dashboards#dashboard-source).
+The dashboard is updated after the MR is merged.
+
+Looking at an autogenerated file, for example,
+[`product_planning.dashboard.jsonnet`](https://gitlab.com/gitlab-com/runbooks/-/blob/master/dashboards/stage-groups/product_planning.dashboard.jsonnet):
+
+```jsonnet
+// This file is autogenerated using scripts/update_stage_groups_dashboards.rb
+// Please feel free to customize this file.
+local stageGroupDashboards = import './stage-group-dashboards.libsonnet';
+
+stageGroupDashboards.dashboard('product_planning')
+.stageGroupDashboardTrailer()
+```
+
+We provide basic customization to filter out the components essential to your group's activities.
+By default, only the `web`, `api`, and `sidekiq` components are available in the dashboard, while
+`git` is hidden. See [how to enable available components and optional graphs](#optional-graphs).
+
+You can also append further information or custom metrics to a dashboard. The following example
+adds some links and a total request rate to the top of the page:
+
+```jsonnet
+local stageGroupDashboards = import './stage-group-dashboards.libsonnet';
+local grafana = import 'github.com/grafana/grafonnet-lib/grafonnet/grafana.libsonnet';
+local basic = import 'grafana/basic.libsonnet';
+
+stageGroupDashboards.dashboard('source_code')
+.addPanel(
+  grafana.text.new(
+    title='Group information',
+    mode='markdown',
+    content=|||
+      Useful link for the Source Code Management group dashboard:
+      - [Issue list](https://gitlab.com/groups/gitlab-org/-/issues?scope=all&state=opened&label_name%5B%5D=repository)
+      - [Epic list](https://gitlab.com/groups/gitlab-org/-/epics?label_name[]=repository)
+    |||,
+  ),
+  gridPos={ x: 0, y: 0, w: 24, h: 4 }
+)
+.addPanel(
+  basic.timeseries(
+    title='Total Request Rate',
+    yAxisLabel='Requests per Second',
+    decimals=2,
+    query=|||
+      sum (
+        rate(gitlab_transaction_duration_seconds_count{
+          env='$environment',
+          environment='$environment',
+          feature_category=~'source_code_management',
+        }[$__interval])
+      )
+    |||
+  ),
+  gridPos={ x: 0, y: 0, w: 24, h: 7 }
+)
+.stageGroupDashboardTrailer()
+```
+
+![Stage Group Dashboard Customization](img/stage_group_dashboards_time_customization.png)
+
+<i class="fa fa-youtube-play youtube" aria-hidden="true"></i>
+If you want to see the workflow in action, we've recorded a pairing session on customizing a dashboard,
+available on [GitLab Unfiltered](https://youtu.be/shEd_eiUjdI).
+
+For deeper customization and more complicated metrics, visit the
+[Grafonnet lib](https://github.com/grafana/grafonnet-lib) project and the
+[GitLab Prometheus Metrics](../../../administration/monitoring/prometheus/gitlab_metrics.md#gitlab-prometheus-metrics)
+documentation.
+
+### Optional graphs
+
+Some graphs aren't relevant for all groups, so they aren't added to
+the dashboard by default. They can be added by customizing the
+dashboard.
+
+By default, only the `web`, `api`, and `sidekiq` metrics are
+shown. If you wish to see the metrics from the `git` fleet (or any
+other component that might be added in the future), you can configure it as follows:
+
+```jsonnet
+stageGroupDashboards
+.dashboard('source_code', components=stageGroupDashboards.supportedComponents)
+.stageGroupDashboardTrailer()
+```
+
+If your group is interested in Sidekiq job durations and their
+thresholds, you can add these graphs by calling the `.addSidekiqJobDurationByUrgency` function:
+
+```jsonnet
+stageGroupDashboards
+.dashboard('access')
+.addSidekiqJobDurationByUrgency()
+.stageGroupDashboardTrailer()
+```
author	GitLab Bot <gitlab-bot@gitlab.com>	2022-04-20 13:00:54 +0300
committer	GitLab Bot <gitlab-bot@gitlab.com>	2022-04-20 13:00:54 +0300
commit	3cccd102ba543e02725d247893729e5c73b38295 (patch)
tree	f36a04ec38517f5deaaacb5acc7d949688d1e187 /doc/development/stage_group_observability/dashboards
parent	205943281328046ef7b4528031b90fbda70c75ac (diff)