Diffstat (limited to 'doc/development/snowplow')
-rw-r--r--  doc/development/snowplow/implementation.md |   5
-rw-r--r--  doc/development/snowplow/index.md          |  15
-rw-r--r--  doc/development/snowplow/infrastructure.md | 101
3 files changed, 107 insertions(+), 14 deletions(-)
diff --git a/doc/development/snowplow/implementation.md b/doc/development/snowplow/implementation.md
index f4123e3ba86..88fb1d5cfe4 100644
--- a/doc/development/snowplow/implementation.md
+++ b/doc/development/snowplow/implementation.md
@@ -464,7 +464,10 @@ Page titles are hardcoded as `GitLab` for the same reason.
#### Snowplow Inspector Chrome Extension
-Snowplow Inspector Chrome Extension is a browser extension for testing frontend events. This works in production, staging, and local development environments.
+Snowplow Inspector Chrome Extension is a browser extension for testing frontend events. This works in production, staging, and local development environments.
+
+<i class="fa fa-youtube-play youtube" aria-hidden="true"></i>
+For a video tutorial, see the [Snowplow plugin walk through](https://www.youtube.com/watch?v=g4rqnIZ1Mb4).
1. Install [Snowplow Inspector](https://chrome.google.com/webstore/detail/snowplow-inspector/maplkdomeamdlngconidoefjpogkmljm?hl=en).
1. To open the extension, select the Snowplow Inspector icon beside the address bar.
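
For context, the entries this extension lists are structured events fired by the Snowplow JavaScript tracker on the page. A minimal sketch of such an event, assuming the v2 tag API and using invented category, action, and label names:

```javascript
// Hypothetical structured event, shown only to illustrate what Snowplow Inspector
// captures; real events are fired by GitLab's own tracking helpers.
window.snowplow(
  'trackStructEvent',
  'projects:show',    // category: page or feature area
  'click_button',     // action: what the user did
  'example_label',    // label
  'example_property', // property
  1                   // value
);
```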
diff --git a/doc/development/snowplow/index.md b/doc/development/snowplow/index.md
index 9b684757fe1..d6a7b900629 100644
--- a/doc/development/snowplow/index.md
+++ b/doc/development/snowplow/index.md
@@ -78,6 +78,8 @@ sequenceDiagram
Snowflake DW->>Sisense Dashboards: Data available for querying
```
+For more details about the architecture, see [Snowplow infrastructure](infrastructure.md).
+
## Structured event taxonomy
Click events must be consistent. If each feature captures events differently, it can be difficult
@@ -184,19 +186,6 @@ LIMIT 100
Snowplow JavaScript adds [web-specific parameters](https://docs.snowplowanalytics.com/docs/collecting-data/collecting-from-own-applications/snowplow-tracker-protocol/#Web-specific_parameters) to all web events by default.
-## Snowplow monitoring
-
-For different stages in the processing pipeline, there are several tools that monitor Snowplow events tracking:
-
-- [Product Intelligence Grafana dashboard](https://dashboards.gitlab.net/d/product-intelligence-main/product-intelligence-product-intelligence?orgId=1) monitors backend events sent from GitLab.com instance to collectors fleet. This dashboard provides information about:
- - The number of events that successfully reach Snowplow collectors.
- - The number of events that failed to reach Snowplow collectors.
- - The number of backend events that were sent.
-- [AWS CloudWatch dashboard](https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=SnowPlow;start=P3D) monitors the state of the events processing pipeline. The pipeline starts from Snowplow collectors, through to enrichers and pseudonymization, and up to persistence on S3 bucket from which events are imported to Snowflake Data Warehouse. To view this dashboard AWS access is required, follow this [instruction](https://gitlab.com/gitlab-org/growth/product-intelligence/snowplow-pseudonymization#monitoring) if you are interested in getting one.
-- [SiSense dashboard](https://app.periscopedata.com/app/gitlab/417669/Snowplow-Summary-Dashboard) provides information about the number of good and bad events imported into the Data Warehouse, in addition to the total number of imported Snowplow events.
-
-For more information, see this [video walk-through](https://www.youtube.com/watch?v=NxPS0aKa_oU).
-
## Related topics
- [Snowplow data structure](https://docs.snowplowanalytics.com/docs/understanding-your-pipeline/canonical-event/)
diff --git a/doc/development/snowplow/infrastructure.md b/doc/development/snowplow/infrastructure.md
new file mode 100644
index 00000000000..28541874e98
--- /dev/null
+++ b/doc/development/snowplow/infrastructure.md
@@ -0,0 +1,101 @@
+---
+stage: Growth
+group: Product Intelligence
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Snowplow infrastructure
+
+Snowplow events on GitLab SaaS fired by a [tracker](implementation.md) go through an AWS pipeline, managed by GitLab.
+
+## Event flow in the AWS pipeline
+
+Every event goes through a collector, enricher, and pseudonymization lambda. The event is then dumped to S3 storage where it can be picked up by the Snowflake data warehouse.
+
+Deploying and managing the infrastructure is automated using Terraform in the [Terraform repository](https://gitlab.com/gitlab-com/gl-infra/config-mgmt/-/tree/master/environments/aws-snowplow).
+
+```mermaid
+graph LR
+ GL[GitLab.com]-->COL
+
+ subgraph aws-cloud[AWS]
+ COL[Collector]-->|snowplow-raw-good|ENR
+ COL[Collector]-->|snowplow-raw-bad|FRBE
+ subgraph firehoserbe[Firehose]
+ FRBE[AWS Lambda]
+ end
+ FRBE-->S3RBE
+
+ ENR[Enricher]-->|snowplow-enriched-bad|FEBE
+ subgraph firehoseebe[Firehose]
+ FEBE[AWS Lambda]
+ end
+ FEBE-->S3EBE
+
+ ENR[Enricher]-->|snowplow-enriched-good|FRGE
+ subgraph firehosege[Firehose]
+ FRGE[AWS Lambda]
+ end
+ FRGE-->S3GE
+ end
+
+ subgraph snowflake[Data warehouse]
+ S3RBE[S3 raw-bad]-->BE[gitlab_bad_events]
+ S3EBE[S3 enriched-bad]-->BE[gitlab_bad_events]
+ S3GE[S3 output]-->GE[gitlab_events]
+ end
+```
+
+See [Snowplow technology 101](https://github.com/snowplow/snowplow/#snowplow-technology-101) for Snowplow's own documentation and an overview of how collectors and enrichers work.
+
+### Pseudonymization
+
+In contrast to a typical Snowplow pipeline, after enrichment, GitLab Snowplow events go through a [pseudonymization service](https://gitlab.com/gitlab-org/growth/product-intelligence/snowplow-pseudonymization) in the form of an AWS Lambda service before they are stored in S3 storage.
+
+#### Why events need to be pseudonymized
+
+GitLab is bound by its [obligations to the community](https://about.gitlab.com/handbook/product/product-intelligence-guide/service-usage-data-commitment/)
+and by [legal regulations](https://about.gitlab.com/handbook/legal/privacy/services-usage-data/) to protect the privacy of its users.
+
+At the same time, GitLab must provide valuable insights for business decisions, which requires
+a better understanding of different users' behavior patterns. The pseudonymization process is a
+compromise between these two requirements: it protects user privacy while keeping events useful for analysis.
+
+Pseudonymization processes personally identifiable information inside a Snowplow event in an irreversible fashion:
+the output is deterministic for a given input, while any relation back to that input is masked.
+
+#### How events are pseudonymized
+
+Pseudonymization uses an allowlist that provides privacy by default. Therefore, each
+attribute received as part of a Snowplow event is pseudonymized unless the attribute
+is an allowed exception.
+
+Pseudonymization is done using the HMAC-SHA256 keyed hash algorithm.
+Attributes are combined with a secret salt to replace each piece of identifiable information with a pseudonym.
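+
+A minimal sketch of this allowlist plus keyed-hash approach, written in JavaScript with Node's built-in `crypto` module. The attribute names, allowlist contents, and salt handling below are illustrative assumptions, not the actual service configuration:
+
+```javascript
+const { createHmac } = require('crypto');
+
+// Hypothetical allowlist and salt source; the real service defines its own.
+const ALLOWLIST = new Set(['event_name', 'app_id']); // allowed exceptions pass through unchanged
+const SECRET_SALT = process.env.PSEUDONYMIZATION_SALT;
+
+// Deterministic for a given input, but the input cannot be recovered from the output.
+function pseudonymize(value) {
+  return createHmac('sha256', SECRET_SALT).update(String(value)).digest('hex');
+}
+
+// Every attribute is pseudonymized unless it is an allowed exception.
+function pseudonymizeEvent(event) {
+  return Object.fromEntries(
+    Object.entries(event).map(([attribute, value]) =>
+      ALLOWLIST.has(attribute) ? [attribute, value] : [attribute, pseudonymize(value)]
+    )
+  );
+}
+```
+
+Because the hash is keyed and deterministic, the same input always maps to the same pseudonym, so related events can still be correlated without exposing the original value.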
+
+### S3 bucket data lake to Snowflake
+
+See the [Data team's Snowplow Overview](https://about.gitlab.com/handbook/business-technology/data-team/platform/snowplow/) for further details on how data is ingested into our Snowflake data warehouse.
+
+## Monitoring
+
+Several tools monitor Snowplow event tracking at different stages of the processing pipeline:
+
+- [Product Intelligence Grafana dashboard](https://dashboards.gitlab.net/d/product-intelligence-main/product-intelligence-product-intelligence?orgId=1) monitors backend events sent from the GitLab.com instance to the collector fleet. This dashboard provides information about:
+ - The number of events that successfully reach Snowplow collectors.
+ - The number of events that failed to reach Snowplow collectors.
+ - The number of backend events that were sent.
+- [AWS CloudWatch dashboard](https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=SnowPlow;start=P3D) monitors the state of the events processing pipeline. The pipeline starts at the Snowplow collectors, passes through the enrichers and pseudonymization, and ends with persistence in an S3 bucket, from which events are imported into the Snowflake Data Warehouse. You must have AWS access rights to view this dashboard. For more information, see [monitoring](https://gitlab.com/gitlab-org/growth/product-intelligence/snowplow-pseudonymization#monitoring) in the Snowplow Events pseudonymization service documentation.
+- [Sisense dashboard](https://app.periscopedata.com/app/gitlab/417669/Snowplow-Summary-Dashboard) provides information about the number of good and bad events imported into the Data Warehouse, in addition to the total number of imported Snowplow events.
+
+For more information, see this [video walk-through](https://www.youtube.com/watch?v=NxPS0aKa_oU).
+
+## Related topics
+
+- [Snowplow technology 101](https://github.com/snowplow/snowplow/#snowplow-technology-101)
+- [Snowplow pseudonymization AWS Lambda project](https://gitlab.com/gitlab-org/growth/product-intelligence/snowplow-pseudonymization)
+- [Product Intelligence Guide](https://about.gitlab.com/handbook/product/product-intelligence-guide/)
+- [Data Infrastructure](https://about.gitlab.com/handbook/business-technology/data-team/platform/infrastructure/)
+- [Snowplow architecture overview (internal)](https://www.youtube.com/watch?v=eVYJjzspsLU)
+- [Snowplow architecture overview slide deck (internal)](https://docs.google.com/presentation/d/16gQEO5CAg8Tx4NBtfnZj-GF4juFI6HfEPWcZgH4Rn14/edit?usp=sharing)
+- [AWS Lambda implementation (internal)](https://youtu.be/cQd0mdMhkQA)