---
stage: Manage
group: Optimize
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Aggregated Value Stream Analytics

> - [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/335391) in GitLab 14.7.

DISCLAIMER:
This page contains information related to upcoming products, features, and functionality.
It is important to note that the information presented is for informational purposes only.
Please do not rely on this information for purchasing or planning purposes.
As with all projects, the items mentioned on this page are subject to change or delay.
The development, release, and timing of any products, features, or functionality remain at the
sole discretion of GitLab Inc.

This page provides a high-level overview of the aggregated backend for
Value Stream Analytics (VSA).

## Current status

As of GitLab 14.8, the aggregated VSA backend is used only in the `gitlab-org` group, for testing
purposes. We plan to gradually roll it out to the remaining groups in the next major release (15.0).

## Motivation

The aggregated backend aims to solve the performance limitations of the VSA feature and set it up
for long-term growth.

Our main database is not prepared for analytical workloads. Executing long-running queries can
affect the reliability of the application. For large groups, the current implementation (old
backend) is slow and, in some cases, doesn't load at all because it exceeds the configured
statement timeout (15 seconds).

The database queries in the old backend use the core domain models directly through the
`IssuableFinder` classes: [MergeRequestsFinder](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/finders/merge_requests_finder.rb) and [IssuesFinder](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/finders/issues_finder.rb).
With the requested change to the [date range filters](https://gitlab.com/groups/gitlab-org/-/epics/6046),
this approach was no longer viable from a performance point of view.

Benefits of the aggregated VSA backend:

- Simpler database queries (fewer `JOIN`s).
- Faster aggregations, because only a single table is accessed.
- Possibility to introduce further aggregations for improving the first page load time.
- Better performance for large groups (with many subgroups, projects, issues, and merge requests).
- Ready for database decomposition. The VSA-related database tables could live in a separate
database with minimal development effort.
- Ready for keyset pagination, which can be useful for exporting the data.
- Possibility to implement more complex event definitions.
  - For example, the start event could be two timestamp columns where the earliest value would be
  used by the system.
  - Example: `MIN(issues.created_at, issues.updated_at)`

## How does Value Stream Analytics work?

Value Stream Analytics calculates the duration between two timestamp columns or timestamp
expressions and runs various aggregations on the data.

Examples:

- Duration between the merge request creation time and the merge request merge time.
- Duration between the issue creation time and the issue close time.

This duration is exposed in various ways:

- Aggregation: median, average.
- Listing: the duration for individual merge request and issue records.

Apart from the durations, we also expose the record count within a stage.
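The duration and aggregation logic described above can be sketched in a few lines. This is
illustrative Python with made-up timestamps, not GitLab code; the real backend computes these
values in SQL:

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical (start_event_timestamp, end_event_timestamp) pairs for one stage,
# for example: merge request created -> merge request merged.
rows = [
    (datetime(2022, 1, 1, 9, 0), datetime(2022, 1, 2, 9, 0)),   # 1 day
    (datetime(2022, 1, 3, 9, 0), datetime(2022, 1, 6, 9, 0)),   # 3 days
    (datetime(2022, 1, 5, 9, 0), datetime(2022, 1, 10, 9, 0)),  # 5 days
]

# Duration between the two timestamps, in seconds.
durations = [(end - start).total_seconds() for start, end in rows]

count = len(durations)             # record count within the stage
median_seconds = median(durations) # median aggregation
average_seconds = mean(durations)  # average aggregation

print(count, median_seconds / 86400, average_seconds / 86400)  # 3 3.0 3.0
```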

### Stages

A stage represents an event pair (start and end events) with additional metadata, such as the
name of the stage. Stages are configurable by the user within the pairing rules defined in the
backend.

**Example stage: Code Review**

- Start event identifier: merge request creation time.
- Start event column: uses the `merge_requests.created_at` timestamp column.
- End event identifier: merge request merge time.
- End event column: uses the `merge_request_metrics.merged_at` timestamp column.
- Stage event hash ID: a calculated hash for the pair of start and end event identifiers.
  - If two stages have the same configuration of start and end events, then their stage event
  hash IDs are identical.
  - The stage event hash ID is later used to store the aggregated data in partitioned database
  tables.

### Value streams

Value streams are container objects for the stages. There can be multiple value streams per group
or project, focusing on different aspects of the DevOps lifecycle.

### Example configuration

![vsa object hierarchy example](img/object_hierarchy_example_V14_10.png)

In this example, two independent value streams are set up for two teams that use different
development workflows within the `Test Group` (top-level namespace).

The first value stream uses standard timestamp-based events for defining the stages. The second
value stream uses label events.

Each value stream and stage item from the example is persisted in the database. Notice that the
`Deployment` stage is identical for both value streams, which means that the underlying
`stage_event_hash_id` is the same for both stages. The `stage_event_hash_id` reduces the amount
of data the backend collects and plays a vital role in database partitioning.

We expect value streams and stages to change rarely. When stages (their start and end events)
are changed, the aggregated data gets stale.
This is fixed by the periodical aggregation,
which runs every day.

### Feature availability

The aggregated VSA feature is available on the group and project level; however, the aggregated
backend is only available for Premium and Ultimate customers due to the data storage and data
computation costs. Storing de-normalized, aggregated data requires significant disk space.

## Aggregated value stream analytics architecture

The main idea behind the aggregated VSA backend is separation: the VSA database tables and
queries do not use the core domain models directly (`Issue`, `MergeRequest`). This allows us to
scale and optimize VSA independently from the other parts of the application.

The architecture consists of two main mechanisms:

- Periodical data collection and loading (happens in the background).
- Querying the collected data (invoked by the user).

### Data loading

The aggregated nature of VSA comes from the periodical data loading. The system queries the core
domain models to collect the stage and timestamp data, which is periodically inserted into the
VSA database tables.

High-level overview for each top-level namespace with a Premium or Ultimate license:

1. Load all stages in the group.
1. Iterate over the issue and merge request records.
1. Based on the stage configurations (start and end event identifiers), collect the timestamp
data.
1. `INSERT` or `UPDATE` the data in the VSA database tables.

The data loading is implemented in the
[`Analytics::CycleAnalytics::DataLoaderService`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/analytics/cycle_analytics/data_loader_service.rb)
class. Some groups contain a lot of data, so to avoid overloading the primary database, the
service performs operations in batches and enforces strict application limits:

- Load records in batches.
- Insert records in batches.
- Stop processing when a limit is reached, and schedule a background job to continue the
processing later.
- Continue processing data from a specific point.

As of GitLab 14.7, the data loading is done manually. Once the feature is ready, the service will
be invoked periodically by the system via a cron job (this part is not implemented yet).

#### Record iteration

The batched iteration is implemented with the
[efficient IN operator](../database/efficient_in_operator_queries.md). The background job scans
all issue and merge request records in the group hierarchy, ordered by the `updated_at` and `id`
columns. For already aggregated groups, the `DataLoaderService` continues the aggregation from a
specific point, which saves time.

Collecting the timestamp data happens on every iteration. The `DataLoaderService` determines
which stage events are configured within the group hierarchy and builds a query that selects the
required timestamps. The stage record knows which events are configured, and the events know how
to select their timestamp columns.

Example set of collected stage events: merge request merged, merge request created, merge request
closed.

Generated SQL query for loading the timestamps:

```sql
SELECT
  -- the list of columns depends on the configured stages
  "merge_request_metrics"."merged_at",
  "merge_requests"."created_at",
  "merge_request_metrics"."latest_closed_at"
FROM "merge_requests"
LEFT OUTER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
WHERE "merge_requests"."id" IN (1, 2, 3, 4) -- ids are coming from the batching query
```

The `merged_at` column is located in a separate table (`merge_request_metrics`). The
`Gitlab::Analytics::CycleAnalytics::StageEvents::MergeRequestMerged` class adds itself to a scope
for loading the timestamp data without affecting the number of rows (it uses a `LEFT JOIN`).
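The query-building idea can be sketched roughly as follows. This is illustrative Python with
hypothetical class and function names, not the actual GitLab implementation (which builds
`ActiveRecord` scopes in Ruby):

```python
# Each stage event contributes its timestamp column and, when the column lives
# in another table, the JOIN needed to reach it.

class MergeRequestCreated:
    column = '"merge_requests"."created_at"'
    join = None  # column lives on the base table, no extra JOIN needed

class MergeRequestMerged:
    column = '"merge_request_metrics"."merged_at"'
    # LEFT OUTER JOIN keeps merge requests without a metrics row in the result set.
    join = ('LEFT OUTER JOIN "merge_request_metrics" '
            'ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"')

def build_query(events, batch_ids):
    # De-duplicate columns and joins so an event shared by several stages
    # is only selected once.
    columns = list(dict.fromkeys(e.column for e in events))
    joins = list(dict.fromkeys(e.join for e in events if e.join))
    ids = ", ".join(str(i) for i in batch_ids)
    return (f'SELECT {", ".join(columns)} FROM "merge_requests" '
            + " ".join(joins)
            + f' WHERE "merge_requests"."id" IN ({ids})')

# The merged event appears in two stages but is selected only once.
print(build_query([MergeRequestMerged, MergeRequestCreated, MergeRequestMerged],
                  [1, 2, 3, 4]))
```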
This behavior is implemented for each `StageEvent` class with the `include_in` method.

The data collection query works on the event level. It extracts the event timestamps from the
stages and ensures that we don't collect the same data multiple times. The events mentioned above
could come from the following stage configuration:

- Merge request created - merge request merged
- Merge request created - merge request closed

Other combinations might also be possible, but we prevent the ones that make no sense, for
example:

- Merge request merged - merge request created

Creation always happens first, so this stage would always report a negative duration.

#### Data scope

The data collection scans and processes all issue and merge request records in the group
hierarchy, starting from the top-level group. This means that if a group has only one value
stream in a subgroup, we nevertheless collect the data of all issues and merge requests in the
hierarchy of this group. This aims to simplify the data collection mechanism. Moreover, data
research shows that most group hierarchies have their stages configured at the top level.

During the data collection process, the collected timestamp data is transformed into rows. For
each configured stage, if the start event timestamp is present, the system inserts or updates one
event record. This allows us to determine the upper limit of the inserted rows per group by
counting all issues and merge requests and multiplying the sum by the stage count.

#### Data consistency concerns

Due to the asynchronous nature of the data collection, data consistency issues are bound to
happen. This is a trade-off that makes the query performance significantly faster. We think that
for an analytical workload, a slight lag in the data is acceptable.

Before the rollout, we plan to implement some indicators on the VSA page that show the most
recent backend activities.
For example, indicators that show the last data
collection timestamp and the last consistency check timestamp.

#### Database structure

VSA collects data for the following domain models: `Issue` and `MergeRequest`. To keep the
aggregated data separated, we use two additional database tables:

- `analytics_cycle_analytics_issue_stage_events`
- `analytics_cycle_analytics_merge_request_stage_events`

Both tables are hash partitioned by the `stage_event_hash_id` column. Each table uses 32
partitions. That number is arbitrary and could be changed; what's important is to keep each
partition under 100 GB in size (which gives the feature a lot of headroom).

| Column | Description |
|-|-|
| `stage_event_hash_id` | Partitioning key |
| `merge_request_id` or `issue_id` | Reference to the domain record (issuable) |
| `group_id` | Reference to the group (de-normalization) |
| `project_id` | Reference to the project |
| `milestone_id` | Duplicated data from the domain record table |
| `author_id` | Duplicated data from the domain record table |
| `state_id` | Duplicated data from the domain record table |
| `start_event_timestamp` | Timestamp derived from the stage configuration |
| `end_event_timestamp` | Timestamp derived from the stage configuration |

In accordance with the data separation requirements, the tables don't have any foreign keys. The
consistency is ensured by a background job (eventually consistent).
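The routing idea behind hash partitioning can be illustrated with a toy sketch. PostgreSQL's
actual hash function is more involved, but the modulo-routing principle is the same:

```python
# Toy model of hash partitioning: each row is routed to one of the 32
# partitions based on its partition key (stage_event_hash_id).

PARTITION_COUNT = 32

def partition_for(stage_event_hash_id: int) -> int:
    # PostgreSQL applies an internal hash function before taking the
    # remainder; a plain modulo is enough to show the routing idea.
    return stage_event_hash_id % PARTITION_COUNT

# Rows sharing a stage_event_hash_id (that is, the same start/end event pair,
# wherever it is configured) always land in the same partition, so a stage
# query only has to scan a single partition.
print(partition_for(16), partition_for(16 + PARTITION_COUNT))
```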

### Data querying

The base query always includes the following filters:

- `stage_event_hash_id` - partition key
- `project_id` or `group_id` - depending on whether it's a project-level or group-level query
- `end_event_timestamp` - date range filter (last 30 days)

Example: selecting the review stage duration for the GitLab project.

```sql
SELECT end_event_timestamp - start_event_timestamp
FROM analytics_cycle_analytics_merge_request_stage_events
WHERE
  stage_event_hash_id = 16 AND -- hits a specific partition
  project_id = 278964 AND
  end_event_timestamp > '2022-01-01' AND end_event_timestamp < '2022-01-30'
```

#### Query generation

The query backend is hidden behind the same interface that the old backend implementation uses.
Thanks to this, we can easily switch between the old and new query backends.

- `DataCollector`: entry point for querying VSA data.
  - `BaseQueryBuilder`: provides the base `ActiveRecord` scope (filters are applied here).
  - `average`: average aggregation.
  - `median`: median aggregation.
  - `count`: row counting.
  - `records`: list of issue or merge request records.

#### Filters

VSA supports various filters on the base query. Most of the filters require no additional `JOIN`s:

| Filter name | Description |
|-|-|
| `milestone_title` | The backend translates it to a `milestone_id` filter |
| `author_username` | The backend translates it to an `author_id` filter |
| `project_ids` | Only used on the group level |

Exceptions: the following filters are applied to other tables, which means we must `JOIN` them.

| Filter name | Description |
|-|-|
| `label_name` | Array filter, using the `label_links` table |
| `assignee_username` | Array filter, using the `*_assignees` table |

To fully decompose the database, the required ID values would need to be replicated in the VSA
database tables. This change could be implemented using array columns.

### Endpoints

The feature uses private JSON APIs to deliver the data to the frontend.
On the first page load, the
following requests are invoked:

- The initial HTML page load, which is mostly empty. Some configuration data is exposed via
`data` attributes.
- `value_streams` - Load the available value streams for the given group.
- `stages` - Load the stages for the currently selected value stream.
- `median` - For each stage, request the median duration.
- `count` - For each stage, request the number of items in the stage (this is a
[limit count](../merge_request_performance_guidelines.md#badge-counters), maximum 1000 rows).
- `average_duration_chart` - Data for the duration chart.
- `summary`, `time_summary` - Top-level aggregations. Most of these metrics use different
APIs and finders and don't invoke the aggregated backend.

When clicking on a specific stage, the `records` endpoint is invoked, which returns the related
records (paginated) for the chosen stage in a specific order.

### Database decomposition

By separating the query logic from the main application code, the feature is ready for database
decomposition. If we decide that VSA requires a separate database instance, then moving the
aggregated tables can be accomplished with little effort.

A different database technology could also be used to further improve the performance of the
feature, for example [Timescale DB](https://www.timescale.com).