1 files changed, 137 insertions, 2 deletions
diff --git a/doc/development/testing_guide/flaky_tests.md b/doc/development/testing_guide/flaky_tests.md
index 45f1d0ddf7e..cc62a0ebf03 100644
--- a/doc/development/testing_guide/flaky_tests.md
+++ b/doc/development/testing_guide/flaky_tests.md
@@ -11,9 +11,144 @@ info: To determine the technical writer assigned to the Stage/Group associated w
 It's a test that sometimes fails, but if you retry it enough times, it passes,
 eventually.
 
+## What are the potential cause for a test to be flaky?
+
+### Unclean environment
+
+**Label:** `flaky-test::unclean environment`
+
+**Description:** The environment got dirtied by a previous test. The actual cause is probably not the flaky test here.
+
+**Difficulty to reproduce:** Moderate. Usually, running the same spec files until the one that's failing reproduces the problem.
+
+**Resolution:** Fix the previous tests and/or places where the environment is modified, so that
+it's reset to a pristine test after each test.
+
+**Examples:**
+
+- [Example 1](https://gitlab.com/gitlab-org/gitlab/-/issues/378414#note_1142026988): A migration
+  test might roll-back the database, perform its testing, and then roll-up the database in an
+  inconsistent state, so that following tests might not know about certain columns.
+- [Example 2](https://gitlab.com/gitlab-org/gitlab/-/issues/368500): A test modifies data that is
+  used by a following test.
+- [Example 3](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/103434#note_1172316521): A test for a database query passes in a fresh database, but in a 
+  CI/CD pipeline where the database is used to process previous test sequences, the test fails. This likely
+    means that the query itself needs to be updated to work in a non-clean database.
+
+### Ordering assertion
+
+**Label:** `flaky-test::ordering assertion`
+
+**Description:** The test is expecting a specific order in the data under test yet the data is in
+a non-deterministic order.
+
+**Difficulty to reproduce:** Easy. Usually, running the test locally several times would reproduce
+the problem.
+
+**Resolution:** Depending on the problem, you might want to:
+
+- loosen the assertion if the test shouldn't care about ordering but only on the elements
+- fix the test by specifying a deterministic ordering
+- fix the app code by specifying a deterministic ordering
+
+**Examples:**
+
+- [Example 1](https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/10148/diffs): Without
+  specifying `ORDER BY`, database will not give deterministic ordering, or data race happening
+  in the tests.
+
+### Dataset-specific
+
+**Label:** `flaky-test::dataset-specific`
+
+**Description:** The test assumes the dataset is in a particular (usually limited) state, which
+might not be true depending on when the test run during the test suite.
+
+**Difficulty to reproduce:** Moderate, as the amount of data needed to reproduce the issue might be
+difficult to achieve locally.
+
+**Resolution:** Fix the test to not assume that the dataset is in a particular state, don't hardcode IDs.
+
+**Examples:**
+
+- [Example 1](https://gitlab.com/gitlab-org/gitlab/-/issues/378381): The database is recreated when
+  any table has more than 500 columns. It could pass in the merge request, but fail later in
+  `master` if the order of tests changes.
+- [Example 2](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/91016/diffs): A test asserts
+  that trying to find a record with an unexisting ID retuns an error message. The test uses an
+  hardcoded ID that's supposed to not exist (e.g. `42`). If the test is run early in the test
+  suite, it might pass as not enough records were created before it, but as soon as it would run
+  later in the suite, there could be a record that actually has the ID `42`, hence the test would
+  start to fail.
+
+### Random input
+
+**Label:** `flaky-test::random input`
+
+**Description:** The test use random values, that sometimes match the expectations, and sometimes not.
+
+**Difficulty to reproduce:** Easy, as the test can be modified locally to use the "random value"
+used at the time the test failed
+
+**Resolution:** Once the problem is reproduced, it should be easy to debug and fix either the test
+or the app.
+
+**Examples:**
+
+- [Example 1](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/20121): The test isn't robust enough to handle a specific data, that only appears sporadically since the data input is random.
+
+### Unreliable DOM Selector
+
+**Label:** `flaky-test::unreliable dom selector`
+
+**Description:** The DOM selector used in the test is unreliable.
+
+**Difficulty to reproduce:** Moderate to difficult. Depending on whether the DOM selector is duplicated, or appears after a delay etc.
+
+**Resolution:** It really depends on the problem here. It could be to wait for requests to finish, to scroll down the page etc.
+
+**Examples:**
+
+- [Example 1](https://gitlab.com/gitlab-org/gitlab/-/issues/338341): A non-unique CSS selector
+  matching more than one element, or a non-waiting selector method that does not allow rendering
+  time before throwing an `element not found` error.
+- [Example 2](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/101728/diffs): A CSS selector
+  only appears after a GraphQL requests has finished, and the UI has updated.
+
+### Datetime-sensitive
+
+**Label:** `flaky-test::datetime-sensitive`
+
+**Description:** The test is assuming a specific date or time.
+
+**Difficulty to reproduce:** Easy to moderate, depending on whether the test consistently fails after a certain date, or only fails at a given time or date.
+
+**Resolution:** Freezing the time is usually a good solution.
+
+**Examples:**
+
+- [Example 1](https://gitlab.com/gitlab-org/gitlab/-/issues/118612): A test that breaks after some time passed.
+
+### Unstable infrastructure
+
+**Label:** `flaky-test::unstable infrastructure`
+
+**Description:** The test fails from time to time due to infrastructure issues.
+
+**Difficulty to reproduce:** Hard. It's really hard to reproduce CI infrastructure issues. It might
+be possible by using containers locally.
+
+**Resolution:** Starting a conversation with the Infrastructure department in a dedicated issue is
+usually a good idea.
+
+**Examples:**
+
+- [Example 1](https://gitlab.com/gitlab-org/gitlab/-/issues/363214): The runner is under heavy load at this time.
+- [Example 2](https://gitlab.com/gitlab-org/gitlab/-/issues/360559): The runner is having networking issues, making a job failing early
+
 ## Quarantined tests
 
-When a test frequently fails in `main`,
+When a test frequently fails in `master`,
 create [a ~"failure::flaky-test" issue](https://about.gitlab.com/handbook/engineering/workflow/#broken-master).
 
 If the test cannot be fixed in a timely fashion, there is an impact on the
@@ -62,7 +197,7 @@ For example, `FLAKY_RSPEC_GENERATE_REPORT=1 bin/rspec ...`.
 
 The `rspec/flaky/report-suite.json` report is:
 
-- Used for [automatically skipping known flaky tests](../pipelines.md#automatic-skipping-of-flaky-tests).
+- Used for [automatically skipping known flaky tests](../pipelines/index.md#automatic-skipping-of-flaky-tests).
 - [Imported into Snowflake](https://gitlab.com/gitlab-data/analytics/-/blob/master/extract/gitlab_flaky_tests/upload.py)
   once per day, for monitoring with the [internal dashboard](https://app.periscopedata.com/app/gitlab/888968/EP---Flaky-tests).