diff options
Diffstat (limited to '.gitlab/issue_templates/rca.md')
-rw-r--r-- | .gitlab/issue_templates/rca.md | 125 |
1 files changed, 125 insertions, 0 deletions
diff --git a/.gitlab/issue_templates/rca.md b/.gitlab/issue_templates/rca.md new file mode 100644 index 00000000000..238039bd712 --- /dev/null +++ b/.gitlab/issue_templates/rca.md @@ -0,0 +1,125 @@ +**Please note:** if the incident relates to sensitive data or is security-related, consider +labeling this issue with ~security and mark it confidential, or create it in a private repository. + +There is now a separate internal-only RCA template for SIRT issues referenced https://about.gitlab.com/handbook/security/root-cause-analysis.html +*** + +## Summary + +A brief summary of what happened. Try to make it as executive-friendly as possible. + +- Service(s) affected: +- Team attribution: +- Minutes downtime or degradation: + +## Impact & Metrics + +Start with the following: + +| Question | Answer | +| ----- | ----- | +| What was the impact? | (i.e. service outage, sub-service brown-out, exposure of sensitive data, ...) | +| Who was impacted? | (i.e. external customers, internal customers, specific teams, ...) | +| How did this impact customers? | (i.e. preventing them from doing X, incorrect display of Y, ...) | +| How many attempts made to access? | | +| How many customers affected? | | +| How many customers tried to access? | | + +Include any additional metrics that are of relevance. + +Provide any relevant graphs that could help understand the impact of the incident and its dynamics. + +## Detection & Response + +Start with the following: + +| Question | Answer | +| ----- | ----- | +| When was the incident detected? | YYYY-MM-DD UTC | +| How was the incident detected? | (i.e. DELKE, H1 Report, ...) | +| Did alarming work as expected? | | +| How long did it take from the start of the incident to its detection? | | +| How long did it take from detection to remediation? | | +| What steps were taken to remediate? | | +| Were there any issues with the response? | (i.e. bastion host used to access the service was not available, relevant team member wasn't page-able, ...) | + +## MR Checklist + +Consider these questions if a code change introduced the issue. + +| Question | Answer | +| ----- | ----- | +| Was the [MR acceptance checklist](https://docs.gitlab.com/ee/development/code_review.html#acceptance-checklist) marked as reviewed in the MR? | | +| Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? | | + +## Timeline + +YYYY-MM-DD + +- 00:00 UTC - something happened +- 00:01 UTC - something else happened +- ... + +YYYY-MM-DD+1 + +- 00:00 UTC - and then this happened +- 00:01 UTC - and more happened +- ... + + +## Root Cause Analysis + +The purpose of this document is to understand the reasons that caused an incident, and to create mechanisms to prevent it from recurring in the future. A root cause can **never be a person**, the way of writing has to refer to the system and the context rather than the specific actors. + +Follow the "**5 whys**" in a **blameless** manner as the core of the root cause analysis. + +For this, it is necessary to start with the incident and question why it happened. Keep iterating asking "why?" 5 times. While it's not a hard rule that it has to be 5 times, it helps to keep questions get deeper in finding the actual root cause. + +Keep in mind that from one "why?" there may come more than one answer, consider following the different branches. + +### Example of the usage of "5 whys" + +The vehicle will not start. (the problem) + +1. Why? - The battery is dead. +2. Why? - The alternator is not functioning. +3. Why? - The alternator belt has broken. +4. Why? - The alternator belt was well beyond its useful service life and not replaced. +5. Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause) + +## What went well + +Start with the following: + +- Identify the things that worked well or as expected. +- Any additional call-outs for what went particularly well. + +## What can be improved + +Start with the following: + +- Using the root cause analysis, explain what can be improved to prevent this from happening again. +- Is there anything that could have been done to improve the detection or time to detection? +- Is there anything that could have been done to improve the response or time to response? +- Is there an existing issue that would have either prevented this incident or reduced the impact? +- Did we have any indication or beforehand knowledge that this incident might take place? +- Was the [MR acceptance checklist](https://docs.gitlab.com/ee/development/code_review.html#acceptance-checklist) marked as reviewed in the MR? +- Should the checklist be updated to help reduce chances of future recurrences? + + + +## Corrective actions + +- List issues that have been created as corrective actions from this incident. +- For each issue, include the following: + - `<Bare issue link>` - Issue labeled as ~"corrective action". + - An estimated date of completion of the corrective action. + - The named individual who owns the delivery of the corrective action. + +## Guidelines + +- [Blameless RCA Guideline](https://about.gitlab.com/handbook/customer-success/professional-services-engineering/workflows/internal/root-cause-analysis.html) +- [5 whys](https://en.wikipedia.org/wiki/5_Whys) + +/confidential +/label ~RCA |