---
type: reference, dev
stage: none
group: Verify
---

# Contribute to Verify stage codebase

## What are we working on in Verify?

The Verify stage is building a comprehensive Continuous Integration platform
integrated into the GitLab product. Our goal is to empower our users to make
great technical and business decisions, by delivering a fast, reliable, secure
platform that verifies the assumptions our users make and checks them against
the criteria defined in CI/CD configuration. These checks could be unit tests,
end-to-end tests, benchmarking, performance validation, code coverage
enforcement, and so on.

The feedback delivered by GitLab CI/CD makes it possible for our users to make
well-informed decisions about the technological and business choices they need
to make to succeed. Why is Continuous Integration a mission-critical product?

GitLab CI/CD is our platform for delivering feedback to our users and customers.

They contribute their continuous integration configuration files
(`.gitlab-ci.yml`) to describe the questions they want answered. Each time
someone pushes a commit or triggers a pipeline, we need to find answers to the
very important questions that have been asked in the CI/CD configuration.

Failing to answer these questions or, even worse, providing false answers,
might result in a user making a wrong decision. Such wrong decisions can have
very severe consequences.

## Core principles of our CI/CD platform

Data produced by the platform should be:

1. Accurate.
1. Durable.
1. Accessible.

The platform itself should be:

1. Reliable.
1. Secure.
1. Deterministic.
1. Trustworthy.
1. Fast.
1. Simple.

Since the inception of GitLab CI/CD, we have lived by these principles,
and they serve us and our users well. Some examples of these principles are that:

- The feedback delivered by GitLab CI/CD, and the data produced by the platform, should be accurate.
  If a job fails and we notify a user that it was successful, it can have severe negative consequences.
- Feedback needs to be available when a user needs it, and data cannot disappear unexpectedly when engineers need it.
- None of this matters if the platform is not secure and we are leaking credentials or secrets.
- When a user provides a set of preconditions in the form of CI/CD configuration, the result should be deterministic each time a pipeline runs, because otherwise the platform might not be trustworthy.
- If the platform is fast, simple to use, and has a great UX, it will serve our users well.

## Building things in Verify

### Measure before you optimize, and make data-informed decisions

It is very difficult to optimize something that you cannot measure. How would
you know if you succeeded, or how significant the success was? If you are
working on a performance or reliability improvement, make sure that you
measure things before you optimize them.

The best way to measure things is to add a Prometheus metric. Counters, gauges,
and histograms are great ways to quickly get approximate results.
Unfortunately, this is not the best way to measure tail latency, because
Prometheus metrics, especially histograms, are usually approximations.

If you have to measure tail latency, like how slow something could be or how
large a request payload might be, consider adding custom application logs, and
always use structured logging.

It is also useful to use profiling and flamegraphs to understand what the code
execution path truly looks like.
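As a rough sketch of what this kind of instrumentation can look like, the
example below uses the standalone `prometheus-client` Ruby gem to record a
duration in a histogram, and then emits a structured JSON log line with the
exact value for tail-latency analysis. The metric name, buckets, and log
fields are illustrative, not existing GitLab metrics:

```ruby
require 'prometheus/client'
require 'json'

registry = Prometheus::Client.registry

# A histogram gives a cheap, approximate latency distribution.
pipeline_creation_duration = registry.histogram(
  :ci_pipeline_creation_duration_seconds, # hypothetical metric name
  docstring: 'Time taken to create a CI pipeline',
  buckets: [0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)

# Measure the work (a sleep stands in for the real work here).
started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
sleep(0.05)
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at

pipeline_creation_duration.observe(duration)

# A structured log line preserves the exact value, so outliers that the
# histogram buckets only approximate can still be analyzed precisely.
puts({ event: 'pipeline_created', duration_s: duration.round(4) }.to_json)
```

In the GitLab codebase you would go through its own metrics and logging
helpers rather than the gem and `puts` directly, but the shape of the
instrumentation is the same: a metric for the aggregate view, a structured log
entry for exact values.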
### Strive for simple solutions, avoid clever solutions

It is sometimes tempting to use a clever solution to deliver something more
quickly. We want to avoid shipping clever code, because it is usually more
difficult to understand and maintain in the long term. Instead, we want to
focus on boring solutions that make it easier to evolve the codebase and keep
the contribution barrier low. We want to find solutions that are as simple as
possible.

### Do not confuse boring solutions with easy solutions

Boring solutions are sometimes confused with easy solutions. Very often the
opposite is true. An easy solution might not be simple: for example, including
a complex new library to add a very small piece of functionality that could
otherwise be implemented quickly. It is easier to include the library than to
build the functionality yourself, but doing so would bring a lot of complexity
into the product.

On the other hand, it is also possible to over-engineer a solution when a
simple, well-tested, and well-maintained library is available; in that case,
using the library might make sense. We recognize that we are constantly
balancing simple and easy solutions, and that finding the right balance is
important.

### "Simple" is not mutually exclusive with "flexible"

Building simple things does not mean that more advanced and flexible solutions
will not be available. A good example here is the expanding complexity of
writing `.gitlab-ci.yml` configuration. For example, you can use a simple
method to define an environment name:

```yaml
deploy:
  environment: production
  script: cap deploy
```

But the `environment` keyword can also be expanded into another level of
configuration that offers more flexibility:

```yaml
deploy:
  environment:
    name: review/$CI_COMMIT_REF_SLUG
    url: https://prod.example.com
  script: cap deploy
```

This kind of approach shields new users from the complexities of the platform,
but still allows them to go deeper if they need to. This approach can be
applied to many other technical implementations.

### Make things observable

GitLab is a DevOps platform. We popularize DevOps because it helps companies
be more efficient and achieve better results. One important component of
DevOps culture is to take ownership over the features and code that you are
building. It is very difficult to do that when you don't know how your
features perform and behave in the production environment.

This is why we want to make our features and code observable. They should be
written in a way that allows an author to understand how well or how poorly a
feature or piece of code behaves in the production environment. We usually
accomplish that by introducing the proper mix of Prometheus metrics and
application loggers.

**TODO** document when to use Prometheus metrics, and when to use loggers.
Write a few sentences about histograms and counters. Write a few sentences
highlighting the importance of metrics when doing incremental rollouts.
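As a minimal sketch of the logging side of observability, assuming the
structured JSON application logger available in the GitLab codebase
(`Gitlab::AppJsonLogger`); the service class and log fields below are
illustrative:

```ruby
# Illustrative service that logs a structured event after doing its work.
# Each field becomes a searchable key once the logs are ingested.
class ProcessPipelineService # hypothetical class name
  def execute(pipeline)
    # ... the actual work happens here ...

    Gitlab::AppJsonLogger.info(
      class: self.class.name,
      message: 'Pipeline processed',
      pipeline_id: pipeline.id,
      project_id: pipeline.project_id
    )
  end
end
```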
### Protect customer data

Making the data produced by our CI/CD platform durable is important. We
recognize that the data generated by users and customers in the CI/CD platform
is important, and we must protect it. This data is important not only because
it can contain sensitive information; we also have compliance and auditing
responsibilities.

Therefore, we must take extra care when we are writing migrations that
permanently remove data from our database, or when we are defining new
retention policies.

As a general rule, when you are writing code that is supposed to remove data
from the database, file system, or object storage, you should get an extra
pair of eyes on your changes. When you are defining a new retention policy,
you should double-check with PMs and EMs.

### Get your changes reviewed

When your merge request is ready for review, you must assign reviewers and
then maintainers. Depending on the complexity of a change, you might want to
involve the people that know the most about the codebase area you are
changing. We have many domain experts in Verify, and it is absolutely
acceptable to ask them to review your code when you are not certain if a
reviewer or maintainer assigned by the Reviewer Roulette has enough context
about the change.

The Reviewer Roulette offers useful suggestions, but because assigning the
right reviewers is important, its suggestions should not be followed blindly
every time. It might not make sense to assign someone who knows nothing about
the area you are updating, because their feedback might be limited to code
style and syntax. Depending on the complexity and impact of a change,
assigning the right people to review your changes can be very important.

If you don't know who to assign, consult `git blame` or ask in the `#verify`
Slack channel (GitLab team members only).

### Incremental rollouts

After your merge request is merged by a maintainer, it is time to release it
to users and the wider community. We usually do this with feature flags. While
not every merge request needs a feature flag, most merge requests in Verify
should have feature flags. [**TODO** link to docs about what needs a feature
flag and what doesn't.]

If you already follow the advice on this page, you probably already have a few
metrics and perhaps a few loggers added that make your new code observable in
the production environment. You can now use these metrics to incrementally
roll out your changes!

A typical scenario involves enabling the feature for a few internal projects
while observing your metrics and loggers. Be aware that there might be a small
delay involved in ingesting logs in Elastic or Kibana. After you confirm the
feature works well with internal projects, you can start an incremental
rollout for other projects.

Avoid using "percent of time" incremental rollouts. These are error-prone,
especially when you are checking feature flags in a few places in the codebase
and you have not memoized the result of a check in a single place.
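A minimal sketch of what memoizing a flag check can look like, assuming
GitLab's `Feature.enabled?` API and `StrongMemoize` helper; the flag, class,
and method names are hypothetical:

```ruby
# Hypothetical service that branches on a feature flag.
class ProcessBuildService
  include Gitlab::Utils::StrongMemoize # GitLab's memoization helper

  def initialize(project)
    @project = project
  end

  def execute(build)
    if atomic_processing_enabled?
      # process `build` with the new code path
    else
      # process `build` with the old code path
    end
  end

  private

  # Memoize the check so that every call within this operation sees the same
  # answer. With a "percent of time" rollout, repeated Feature.enabled? calls
  # could randomly disagree and leave the operation in an inconsistent state.
  def atomic_processing_enabled?
    strong_memoize(:atomic_processing_enabled) do
      Feature.enabled?(:hypothetical_atomic_processing, @project)
    end
  end
end
```

Checking the flag against a single actor (`@project` here) and caching the
result means a percentage-based rollout gives every project one consistent
answer for the duration of the operation.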
### Do not cause our Universe to implode

During one of the first GitLab Contribute events, we had a discussion about
the importance of keeping CI/CD pipeline, stage, and job statuses accurate. We
considered a hypothetical scenario relating to software being built by one of
our [early customers](https://about.gitlab.com/blog/2016/11/23/gitlab-adoption-growing-at-cern/):

> What happens if software deployed to the
> [Large Hadron Collider (LHC)](https://en.wikipedia.org/wiki/Large_Hadron_Collider)
> breaks because of a bug in GitLab CI/CD that showed that a pipeline passed,
> but this data was not accurate and the software deployed was actually
> invalid? A problem like this could cause the LHC to malfunction, which could
> generate a new particle that would then cause the universe to implode.

That would be quite an undesirable outcome of a small bug in GitLab CI/CD
status processing. Please take extra care when you are working on CI/CD
statuses; we don't want to implode our Universe!

This is an extreme and unlikely scenario, but presenting data that is not
accurate can potentially cause a myriad of problems through the
[butterfly effect](https://en.wikipedia.org/wiki/Butterfly_effect). There are
much more likely scenarios that can have disastrous consequences. GitLab CI/CD
is being used by companies building medical, aviation, and automotive
software. Continuous Integration is a mission-critical part of software
engineering.

When you are working on a subsystem for pipeline processing and transitioning
CI/CD statuses, request an additional review from a domain expert and hold
others accountable for doing the same.