Diffstat (limited to 'doc/architecture')
-rw-r--r-- | doc/architecture/blueprints/cloud_native_build_logs/index.md | 141
-rw-r--r-- | doc/architecture/blueprints/cloud_native_gitlab_pages/index.md | 135
-rw-r--r-- | doc/architecture/blueprints/feature_flags_development/index.md | 140
-rw-r--r-- | doc/architecture/index.md | 9
4 files changed, 425 insertions, 0 deletions
diff --git a/doc/architecture/blueprints/cloud_native_build_logs/index.md b/doc/architecture/blueprints/cloud_native_build_logs/index.md
new file mode 100644
index 00000000000..25abfe36e88
--- /dev/null
+++ b/doc/architecture/blueprints/cloud_native_build_logs/index.md
@@ -0,0 +1,141 @@
+---
+comments: false
+description: 'Next iteration of build logs architecture at GitLab'
+---
+
+# Cloud Native Build Logs
+
+Cloud native and the adoption of Kubernetes has been recognised by GitLab to be
+one of the top two biggest tailwinds that are helping us grow faster as a
+company behind the project.
+
+This effort is described in more detail [in the infrastructure team
+handbook](https://about.gitlab.com/handbook/engineering/infrastructure/production/kubernetes/gitlab-com/).
+
+## Traditional build logs
+
+Traditional job logs depend a lot on the availability of local shared storage.
+
+Every time a GitLab Runner sends a new partial build output, we write this
+output to a file on a disk. This is simple, but this mechanism depends on
+shared local storage - the same file needs to be available on every GitLab web
+node machine, because GitLab Runner might connect to a different one every time
+it performs an API request. Sidekiq also needs access to the file because when
+a job is complete, the trace file contents are sent to the object store.
+
+## New architecture
+
+The new architecture writes data to Redis instead of writing build logs into a
+file.
+
+In order to make this performant and resilient enough, we implemented a chunked
+I/O mechanism - we store data in Redis in chunks, and migrate them to an object
+store once we reach a desired chunk size.
+
+A simplified sequence diagram is available below.
+
+```mermaid
+sequenceDiagram
+  autonumber
+  participant U as User
+  participant R as Runner
+  participant G as GitLab (rails)
+  participant I as Redis
+  participant D as Database
+  participant O as Object store
+
+  loop incremental trace update sent by a runner
+    Note right of R: Runner appends a build trace
+    R->>+G: PATCH trace [build.id, offset, data]
+    G->>+D: find or create chunk [chunk.index]
+    D-->>-G: chunk [id, index]
+    G->>I: append chunk data [chunk.index, data]
+    G-->>-R: 200 OK
+  end
+
+  Note right of R: User retrieves a trace
+  U->>+G: GET build trace
+  loop every trace chunk
+    G->>+D: find chunk [index]
+    D-->>-G: chunk [id]
+    G->>+I: read chunk data [chunk.index]
+    I-->>-G: chunk data [data, size]
+  end
+  G-->>-U: build trace
+
+  Note right of R: Trace chunk is full
+  R->>+G: PATCH trace [build.id, offset, data]
+  G->>+D: find or create chunk [chunk.index]
+  D-->>-G: chunk [id, index]
+  G->>I: append chunk data [chunk.index, data]
+  G->>G: chunk full [index]
+  G-->>-R: 200 OK
+  G->>+I: read chunk data [chunk.index]
+  I-->>-G: chunk data [data, size]
+  G->>O: send chunk data [data, size]
+  G->>+D: update data store type [chunk.id]
+  G->>+I: delete chunk data [chunk.index]
+```
+
+## NFS coupling
+
+In 2017, we experienced serious problems scaling our NFS infrastructure. We
+even tried to replace NFS with
+[CephFS](https://docs.ceph.com/docs/master/cephfs/) - unsuccessfully.
+
+Since that time it has become apparent that the cost of operations and
+maintenance of an NFS cluster is significant and that if we ever decide to
+migrate to Kubernetes [we need to decouple GitLab from shared local storage
+and
+NFS](https://gitlab.com/gitlab-org/gitlab-pages/-/issues/426#note_375646396).
+
+1. NFS might be a single point of failure
+1. NFS can only be reliably scaled vertically
+1. Moving to Kubernetes means increasing the number of mount points by an order
+   of magnitude
+1. NFS depends on an extremely reliable network which can be difficult to provide
+   in a Kubernetes environment
+1. Storing customer data on NFS involves additional security risks
+
+Moving GitLab to Kubernetes without NFS decoupling would result in an explosion
+of complexity, maintenance cost, and an enormous negative impact on availability.
+
+## Iterations
+
+1. ✓ Implement the new architecture in a way that it does not depend on shared local storage
+1. ✓ Evaluate performance and edge-cases, iterate to improve the new architecture
+1. ✓ Design cloud native build logs correctness verification mechanisms
+1. ✓ Build observability mechanisms around performance and correctness
+1. Roll out the feature into the production environment incrementally
+
+The work needed to make the new architecture production ready and enabled on
+GitLab.com is being tracked in the [Cloud Native Build Logs on
+GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/4275) epic.
+
+Enabling this feature on GitLab.com is a subtask of [making the new
+architecture generally
+available](https://gitlab.com/groups/gitlab-org/-/epics/3791) for everyone.
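The chunked write path from the "New architecture" section and its sequence diagram can be sketched roughly as follows. This is a minimal illustration in Python, not GitLab's implementation: the `ChunkedTrace` class, its method names, and the tiny `CHUNK_SIZE` are hypothetical, and plain dictionaries stand in for Redis, the database, and the object store.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 8  # hypothetical value, kept tiny so the behaviour is easy to follow


@dataclass
class ChunkedTrace:
    """Build trace stored in chunks; dictionaries stand in for the real stores."""

    database: dict = field(default_factory=dict)      # chunk index -> "redis" or "object_store"
    redis: dict = field(default_factory=dict)         # chunk index -> live chunk bytes
    object_store: dict = field(default_factory=dict)  # chunk index -> migrated chunk bytes

    def _read_chunk(self, index: int) -> bytes:
        """Read one chunk from whichever store the database row points at."""
        store = self.redis if self.database[index] == "redis" else self.object_store
        return store[index]

    def size(self) -> int:
        """Total number of bytes written so far."""
        return sum(len(self._read_chunk(i)) for i in self.database)

    def read(self) -> bytes:
        """Assemble the full trace by reading every chunk in index order."""
        return b"".join(self._read_chunk(i) for i in sorted(self.database))

    def _migrate(self, index: int) -> None:
        """Move a full chunk from Redis to the object store."""
        self.object_store[index] = self.redis[index]  # send chunk data
        self.database[index] = "object_store"         # update data store type
        del self.redis[index]                         # delete chunk data

    def append(self, offset: int, data: bytes) -> None:
        """Handle one incremental update: append data that starts at offset."""
        if offset != self.size():
            raise ValueError(f"expected offset {self.size()}, got {offset}")
        while data:
            index = offset // CHUNK_SIZE
            self.database.setdefault(index, "redis")  # find or create chunk
            room = CHUNK_SIZE - offset % CHUNK_SIZE
            piece, data = data[:room], data[room:]
            self.redis[index] = self.redis.get(index, b"") + piece
            offset += len(piece)
            if len(self.redis[index]) == CHUNK_SIZE:  # chunk full
                self._migrate(index)


trace = ChunkedTrace()
trace.append(0, b"hello world!")
print(trace.read(), trace.database)
```

Only the chunk that is still being written lives in the stand-in for Redis; once a chunk reaches the chunk size it is migrated, which is the property that removes the need for a shared local file.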
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role                         | Who
+|------------------------------|-------------------------|
+| Author                       | Grzegorz Bizon          |
+| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
+| Engineering Leader           | Darby Frey              |
+| Domain Expert                | Kamil Trzciński         |
+| Domain Expert                | Sean McGivern           |
+
+DRIs:
+
+| Role                         | Who
+|------------------------------|------------------------|
+| Product                      | Jason Yavorska         |
+| Leadership                   | Darby Frey             |
+| Engineering                  | Grzegorz Bizon         |
+
+<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md b/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
new file mode 100644
index 00000000000..37e69d46ae1
--- /dev/null
+++ b/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
@@ -0,0 +1,135 @@
+---
+comments: false
+description: 'Making GitLab Pages a Cloud Native application - architecture blueprint.'
+---
+
+# GitLab Pages New Architecture
+
+GitLab Pages is an important component of the GitLab product. It is mostly
+being used to serve static content, and has a limited set of well defined
+responsibilities. That being said, unfortunately it has become a blocker for
+the GitLab.com Kubernetes migration.
+
+Cloud Native and the adoption of Kubernetes has been recognised by GitLab to be
+one of the top two biggest tailwinds that are helping us grow faster as a
+company behind the project.
+
+This effort is described in more detail [in the infrastructure team handbook
+page](https://about.gitlab.com/handbook/engineering/infrastructure/production/kubernetes/gitlab-com/).
+
+GitLab Pages is tightly coupled with NFS and in order to unblock Kubernetes
+migration a significant change to GitLab Pages' architecture is required. This
+is ongoing work that we started more than a year ago. This blueprint
+might be useful to understand why it is important, and what the roadmap is.
+
+## How GitLab Pages Works
+
+GitLab Pages is a daemon designed to serve static content, written in
+[Go](https://golang.org/).
+
+Initially, GitLab Pages was designed to store static content on a local
+shared block storage (NFS) in a hierarchical group > project directory
+structure. Each directory, representing a project, was supposed to contain a
+configuration file and static content that the GitLab Pages daemon was supposed to
+read and serve.
+
+```mermaid
+graph LR
+  A(GitLab Rails) -- Writes new pages deployment --> B[(NFS)]
+  C(GitLab Pages) -. Reads static content .-> B
+```
+
+This initial design has become outdated for a few reasons - NFS coupling
+being one of them - and we decided to replace it with a more "decoupled
+service"-like architecture. The new architecture that we are working on is
+described in this blueprint.
+
+## NFS coupling
+
+In 2017, we experienced serious problems scaling our NFS infrastructure. We
+even tried to replace NFS with
+[CephFS](https://docs.ceph.com/docs/master/cephfs/) - unsuccessfully.
+
+Since that time it has become apparent that the cost of operations and
+maintenance of an NFS cluster is significant and that if we ever decide to
+migrate to Kubernetes [we need to decouple GitLab from shared local storage
+and
+NFS](https://gitlab.com/gitlab-org/gitlab-pages/-/issues/426#note_375646396).
+
+1. NFS might be a single point of failure
+1. NFS can only be reliably scaled vertically
+1. Moving to Kubernetes means increasing the number of mount points by an order
+   of magnitude
+1. NFS depends on an extremely reliable network which can be difficult to provide
+   in a Kubernetes environment
+1. Storing customer data on NFS involves additional security risks
+
+Moving GitLab to Kubernetes without NFS decoupling would result in an explosion
+of complexity, maintenance cost, and an enormous negative impact on availability.
+
+## New GitLab Pages Architecture
+
+- GitLab Pages is going to source domains' configuration from GitLab's internal
+  API, instead of reading `config.json` files from a local shared storage.
+- GitLab Pages is going to serve static content from Object Storage.
+
+```mermaid
+graph TD
+  A(User) -- Pushes pages deployment --> B{GitLab}
+  C((GitLab Pages)) -. Reads configuration from API .-> B
+  C -. Reads static content .-> D[(Object Storage)]
+  C -- Serves static content --> E(Visitors)
+```
+
+This new architecture is also briefly described in [the blog
+post](https://about.gitlab.com/blog/2020/08/03/how-gitlab-pages-uses-the-gitlab-api-to-serve-content/).
+
+## Iterations
+
+1. ✓ Redesign GitLab Pages configuration source to use GitLab's API
+1. ✓ Evaluate performance and build reliable caching mechanisms
+1. ✓ Incrementally roll out the new source on GitLab.com
+1. ✓ Make GitLab Pages API domains config source enabled by default
+1. Enable experimentation with different servings through feature flags
+1. Triangulate object store serving design through meaningful experiments
+1. Design pages migration mechanisms that can work incrementally
+1. Gradually migrate towards object storage serving on GitLab.com
+
+The [GitLab Pages Architecture](https://gitlab.com/groups/gitlab-org/-/epics/1316)
+epic, with a detailed roadmap, is also available.
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role                         | Who
+|------------------------------|-------------------------|
+| Author                       | Grzegorz Bizon          |
+| Architecture Evolution Coach | Kamil Trzciński         |
+| Engineering Leader           | Daniel Croft            |
+| Domain Expert                | Grzegorz Bizon          |
+| Domain Expert                | Vladimir Shushlin       |
+| Domain Expert                | Jaime Martinez          |
+
+DRIs:
+
+| Role                         | Who
+|------------------------------|------------------------|
+| Product                      | Jackie Porter          |
+| Leadership                   | Daniel Croft           |
+| Engineering                  | Kamil Trzciński        |
+
+Domain Experts:
+
+| Role                         | Who
+|------------------------------|------------------------|
+| Domain Expert                | Kamil Trzciński        |
+| Domain Expert                | Grzegorz Bizon         |
+| Domain Expert                | Vladimir Shushlin      |
+| Domain Expert                | Jaime Martinez         |
+| Domain Expert                | Krasimir Angelov       |
+
+<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/feature_flags_development/index.md b/doc/architecture/blueprints/feature_flags_development/index.md
new file mode 100644
index 00000000000..0aeb2b51b39
--- /dev/null
+++ b/doc/architecture/blueprints/feature_flags_development/index.md
@@ -0,0 +1,140 @@
+---
+comments: false
+description: 'Internal usage of Feature Flags for GitLab development'
+---
+
+# Usage of Feature Flags for GitLab development
+
+Usage of feature flags has become crucial for the development of GitLab. The
+feature flags are a convenient way to ship changes early, and safely roll
+them out to a wide audience, ensuring that the feature is stable and performant.
+
+Since the presence of a feature is controlled with a dedicated condition, a
+developer can decide on the best time for testing the feature, ensuring that
+the feature is not enabled prematurely.
+
+## Challenges
+
+The extensive usage of feature flags poses a few challenges:
+
+- Each feature flag that we add to the codebase is a ~"technical debt" as it adds a
+  matrix of configurations.
+- Testing each combination of feature flags is close to impossible, so we
+  instead try to optimise our testing of feature flags to the most common
+  scenarios.
+- There's a growing challenge of maintaining a growing number of feature flags.
+  We sometimes forget how our feature flags are configured or why we haven't
+  yet removed the feature flag.
+- The usage of feature flags can also be confusing to people outside of
+  development that might not fully understand the dependence of a ~feature or ~bug
+  fix on a feature flag and how this feature flag is configured, or whether the feature
+  should be announced as part of the release post.
+- Maintaining feature flags poses the additional challenge of having to manage
+  different configurations across different environments/targets. We have
+  different configurations of feature flags for testing, for development, for
+  staging, for production and what is being shipped to our customers as part of
+  the on-premise offering.
+
+## Goals
+
+The biggest challenge today with our feature flags usage is their implicit
+nature. Feature flags are part of the codebase, making them hard to understand
+outside of the development function.
+
+We should aim to make our feature flag based development accessible to
+any interested party.
+
+- developer / engineer
+  - can easily add a new feature flag, and configure its state
+  - can quickly find who to reach if they touch another feature flag
+  - can quickly find stale feature flags
+- engineering manager
+  - can understand what feature flags her/his group manages
+- engineering manager and director
+  - can understand how much ~"technical debt" is inflicted due to the number of feature flags that we have to manage
+  - can understand how many feature flags are added and removed in each release
+- product manager and documentation writer
+  - can understand what features are gated by what feature flags
+  - can understand if a feature, and thus its feature flag, is generally available on GitLab.com
+  - can understand if a feature, and thus its feature flag, is enabled by default for on-premise installations
+- delivery engineer
+  - can understand what feature flags are introduced and changed between subsequent deployments
+- support and reliability engineer
+  - can understand how feature flags changed between releases: which feature flags became enabled and which were removed
+  - can quickly find relevant information about a feature flag to identify individuals who might help with an ongoing support request or incident
+
+## Proposal
+
+To help with the above goals we should aim to make our feature flags usage explicit
+and understood by all involved parties.
+
+Introduce a YAML-described `feature-flags/<name-of-feature.yml>` that would
+allow us to have:
+
+1. A central place where all feature flags are documented,
+1. A description of why the given feature flag was introduced,
+1. The relevant issue and merge request it was introduced by,
+1. Build automated documentation with all feature flags in the codebase,
+1. Track how many feature flags there are per given group
+1. Track how many feature flags are added and removed between releases
+1. Make this information easily accessible for all
+1. Allow our customers to easily discover how to enable features and quickly
+   find out what changed between different releases
+
+### The `YAML`
+
+```yaml
+---
+name: ci_disallow_to_create_merge_request_pipelines_in_target_project
+introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/40724
+rollout_issue_url: https://gitlab.com/gitlab-org/gitlab/-/issues/235119
+group: group::progressive delivery
+type: development
+default_enabled: false
+```
+
+## Reasons
+
+These are the reasons why these changes are needed:
+
+- we have around 500 different feature flags today
+- we have a hard time tracking their usage
+- we have ambiguous usage of feature flags with different `default_enabled:` and
+  different `actors` used
+- we lack a clear indication of who owns what feature flag and where to find
+  relevant information
+- we do not emphasise the desire to create a feature flag rollout issue to
+  indicate that a feature flag is in fact a ~"technical debt"
+- we don't know exactly what feature flags we have in our codebase
+- we don't know exactly how our feature flags are configured for different
+  environments: what is being used for `test`, what we ship for `on-premise`,
+  what our settings are for `staging`, `qa` and `production`
+
+## Iterations
+
+This work is being done as part of a dedicated epic: [Improve internal usage of
+Feature Flags](https://gitlab.com/groups/gitlab-org/-/epics/3551). This epic
+describes the meta reasons for making these changes.
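To illustrate how per-flag definition files could support the tracking goals listed in the proposal (for example, counting feature flags per group), here is a small Python sketch. It is an assumption-laden illustration rather than GitLab's tooling: `parse_definition`, `flags_per_group`, and the `REQUIRED_KEYS` set are hypothetical names, treating every key from the example as required is an assumption, and the parser only handles flat `key: value` lines like the ones in the example definition.

```python
from collections import Counter

# Assumption: every key shown in the example definition is treated as required.
REQUIRED_KEYS = {
    "name",
    "introduced_by_url",
    "rollout_issue_url",
    "group",
    "type",
    "default_enabled",
}


def parse_definition(text: str) -> dict:
    """Parse a flat `key: value` definition; this is not a general YAML parser."""
    definition = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line == "---" or line.startswith("#"):
            continue
        # Split on the first colon only, so URLs and `group::name` stay intact.
        key, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"not a key: value line: {line!r}")
        definition[key.strip()] = value.strip()

    missing = REQUIRED_KEYS - definition.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if definition["default_enabled"] not in ("true", "false"):
        raise ValueError("default_enabled must be true or false")
    definition["default_enabled"] = definition["default_enabled"] == "true"
    return definition


def flags_per_group(definitions: list) -> Counter:
    """Count how many feature flags each group owns."""
    return Counter(definition["group"] for definition in definitions)


EXAMPLE = """\
---
name: ci_disallow_to_create_merge_request_pipelines_in_target_project
introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/40724
rollout_issue_url: https://gitlab.com/gitlab-org/gitlab/-/issues/235119
group: group::progressive delivery
type: development
default_enabled: false
"""

flag = parse_definition(EXAMPLE)
print(flag["name"], flags_per_group([flag]))
```

Because each flag is described by data rather than only by a condition in the code, reports such as "flags per group" or "flags added since the last release" reduce to simple aggregations over the definition files.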
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role                         | Who
+|------------------------------|-------------------------|
+| Author                       | Kamil Trzciński         |
+| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
+| Engineering Leader           | Kamil Trzciński         |
+| Domain Expert                | Shinya Maeda            |
+
+DRIs:
+
+| Role                         | Who
+|------------------------------|------------------------|
+| Product                      | Kenny Johnston         |
+| Leadership                   | Craig Gomes            |
+| Engineering                  | Kamil Trzciński        |
+
+<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/index.md b/doc/architecture/index.md
new file mode 100644
index 00000000000..0a2ade6b7b0
--- /dev/null
+++ b/doc/architecture/index.md
@@ -0,0 +1,9 @@
+---
+comments: false
+description: 'Architecture Practice at GitLab'
+---
+
+# Architecture at GitLab
+
+- [Architecture at GitLab](https://about.gitlab.com/handbook/engineering/architecture/)
+- [Architecture Workflow](https://about.gitlab.com/handbook/engineering/architecture/workflow/)