Document the need for feature flags

This adds a development guide explaining that we are going to use feature flags more often, why, what the benefits are, and so on. See https://gitlab.com/gitlab-org/gitlab-ce/issues/49619 for more information.
author: Yorick Peterse <yorickpeterse@gmail.com> 2018-09-03 18:35:44 +0300
committer: Yorick Peterse <yorickpeterse@gmail.com> 2018-09-18 17:52:44 +0300
commit: 9393ff2fb64cadc4f2f5527f07d29b4e1190aa2e (patch)
tree: 32a0f3bd4b8fab500894997c407741f77b5ca7ca
parent: 78b3eea7d248c6d3c48b615c9df24a95cb5fd1d8 (diff)
3 files changed, 195 insertions, 20 deletions
diff --git a/PROCESS.md b/PROCESS.md
index 583f36b820f..38ec01f9de0 100644
--- a/PROCESS.md
+++ b/PROCESS.md
@@ -74,14 +74,31 @@ star, smile, etc.). Some good tips about code reviews can be found in our
 
 ## Feature freeze on the 7th for the release on the 22nd
 
-After 7th at 23:59 (Pacific Time Zone) of each month, RC1 of the upcoming release (to be shipped on the 22nd) is created and deployed to GitLab.com and the stable branch for this release is frozen, which means master is no longer merged into it.
-Merge requests may still be merged into master during this period,
-but they will go into the _next_ release, unless they are manually cherry-picked into the stable branch.
+After the 7th at 23:59 (Pacific Time Zone) of each month, RC1 of the upcoming
+release (to be shipped on the 22nd) is created and deployed to GitLab.com and
+the stable branch for this release is frozen, which means master is no longer
+merged into it. Merge requests may still be merged into master during this
+period, but they will go into the _next_ release, unless they are manually
+cherry-picked into the stable branch.
 
-By freezing the stable branches 2 weeks prior to a release, we reduce the risk of a last minute merge request potentially breaking things.
+By freezing the stable branches 2 weeks prior to a release, we reduce the risk
+of a last minute merge request potentially breaking things.
 
-Any release candidate that gets created after this date can become a final release,
-hence the name release candidate.
+Any release candidate that gets created after this date can become a final
+release, hence the name release candidate.
+
+### Feature flags
+
+Merge requests that make changes hidden behind a feature flag, or remove an
+existing feature flag because a feature is deemed stable, may be merged (and
+picked into the stable branches) up to the 19th of the month. Such merge
+requests should have the ~"feature flag" label assigned, and don't require a
+corresponding exception request to be created.
+
+While rare, release managers may decide to reject picking a change into a stable
+branch, even when feature flags are used. This might be necessary if the changes
+are deemed problematic, too invasive, or there simply isn't enough time to
+properly test how the changes behave on GitLab.com.
 
 ### Between the 1st and the 7th
 
@@ -223,36 +240,36 @@ Check [this guide](https://gitlab.com/gitlab-org/release/docs/blob/master/genera
 
 A ~bug is a defect, error, failure which causes the system to behave incorrectly or prevents it from fulfilling the product requirements.
 
-The level of impact of a ~bug can vary from blocking a whole functionality 
-or a feature usability bug. A bug should always be linked to a severity level. 
+The level of impact of a ~bug can vary from blocking a whole functionality
+or a feature usability bug. A bug should always be linked to a severity level.
 Refer to our [severity levels](../CONTRIBUTING.md#severity-labels)
 
-Whether the bug is also a regression or not, the triage process should start as soon as possible. 
+Whether the bug is also a regression or not, the triage process should start as soon as possible.
 Ensure that the Engineering Manager and/or the Product Manager for the relative area is involved to prioritize the work as needed.
 
 ### Regressions
 
 A ~regression implies that a previously **verified working functionality** no longer works.
 Regressions are a subset of bugs. We use the ~regression label to imply that the defect caused the functionality to regress.
-The label tells us that something worked before and it needs extra attention from Engineering and Product Managers to schedule/reschedule. 
+The label tells us that something worked before and it needs extra attention from Engineering and Product Managers to schedule/reschedule.
 
-The regression label does not apply to ~bugs for new features for which functionality was **never verified as working**. 
-These, by definition, are not regressions. 
+The regression label does not apply to ~bugs for new features for which functionality was **never verified as working**.
+These, by definition, are not regressions.
 
 A regression should always have the `regression:xx.x` label on it to designate when it was introduced.
 
-Regressions should be considered high priority issues that should be solved as soon as possible, especially if they have severe impact on users. 
+Regressions should be considered high priority issues that should be solved as soon as possible, especially if they have severe impact on users.
 
 ### Managing bugs
 
-**Prioritization:** We give higher priority to regressions on features that worked in the last recent monthly release and the current release candidates. 
-The two scenarios below can [bypass the exception request in the release process](https://gitlab.com/gitlab-org/release/docs/blob/master/general/exception-request/process.md#after-the-7th), where the affected regression version matches the current monthly release version. 
+**Prioritization:** We give higher priority to regressions on features that worked in the last recent monthly release and the current release candidates.
+The two scenarios below can [bypass the exception request in the release process](https://gitlab.com/gitlab-org/release/docs/blob/master/general/exception-request/process.md#after-the-7th), where the affected regression version matches the current monthly release version.
 * A regression which worked in the **Last monthly release**
    * **Example:** In 11.0 we released a new `feature X` that is verified as working. Then in release 11.1 the feature no longer works, this is regression for 11.1. The issue should have the `regression:11.1` label.
    * *Note:* When we say `the last recent monthly release`, this can refer to either the version currently running on GitLab.com, or the most recent version available in the package repositories.
 * A regression which worked in the **Current release candidates**
    * **Example:** In 11.1-RC3 we shipped a new feature which has been verified as working. Then in 11.1-RC5 the feature no longer works, this is regression for 11.1. The issue should have the `regression:11.1` label.
-   * *Note:* Because GitLab.com runs release candidates of new releases, a regression can be reported in a release before its 'official' release date on the 22nd of the month. 
+   * *Note:* Because GitLab.com runs release candidates of new releases, a regression can be reported in a release before its 'official' release date on the 22nd of the month.
 
 When a bug is found:
 1. Create an issue describing the problem in the most detailed way possible.
@@ -264,11 +281,11 @@ When a bug is found:
 The counterpart Product Manager is included to weigh-in on prioritization as needed.
 1. If the ~bug is **NOT** a regression:
    1. The Engineering Manager decides which milestone the bug will be fixed. The appropriate milestone is applied.
-1. If the bug is a ~regression: 
+1. If the bug is a ~regression:
    1. Determine the release that the regression affects and add the corresponding `regression:xx.x` label.
       1. If the affected release version can't be determined, add the generic ~regression label for the time being.
-   1. If the affected version `xx.x` in `regression:xx.x` is the **current release**, it's recommended to schedule the fix for the current milestone. 
-      1. This falls under regressions which worked in the last release and the current RCs. More detailed explanations in the **Prioritization** section above. 
+   1. If the affected version `xx.x` in `regression:xx.x` is the **current release**, it's recommended to schedule the fix for the current milestone.
+      1. This falls under regressions which worked in the last release and the current RCs. More detailed explanations in the **Prioritization** section above.
    1. If the affected version `xx.x` in `regression:xx.x` is older than the **current release**
       1. If the regression is an ~S1 severity, it's recommended to schedule the fix for the current milestone. We would like to fix the highest severity regression as soon as we can.
       1. If the regression is an ~S2, ~S3 or ~S4 severity, the regression may be scheduled for later milestones at the discretion of the Engineering Manager and Product Manager.
diff --git a/doc/development/feature_flags.md b/doc/development/feature_flags.md
index 6f757f1ce7b..417298205f5 100644
--- a/doc/development/feature_flags.md
+++ b/doc/development/feature_flags.md
@@ -65,13 +65,18 @@ In the rare case that you need the feature flag to be on automatically, use
 Feature.enabled?(:feature_flag, project, default_enabled: true)
 ```
 
+For more information about rolling out changes using feature flags, refer to the
+[Rolling out changes using feature flags](rolling_out_changes_using_feature_flags.md)
+guide.
+
 ### Specs
 
 In the test environment `Feature.enabled?` is stubbed to always respond to `true`,
 so we make sure behavior under feature flag doesn't go untested in some non-specific
 contexts.
 
-If you need to test the feature flag in a different state, you need to stub it with:
+Whenever a feature flag is present, make sure to test _both_ states of the
+feature flag. You can stub a feature flag as follows:
 
 ```ruby
 stub_feature_flags(my_feature_flag: false)
diff --git a/doc/development/rolling_out_changes_using_feature_flags.md b/doc/development/rolling_out_changes_using_feature_flags.md
new file mode 100644
index 00000000000..905aa26a40b
--- /dev/null
+++ b/doc/development/rolling_out_changes_using_feature_flags.md
@@ -0,0 +1,153 @@
+# Rolling out changes using feature flags
+
+[Feature flags](feature_flags.md) can be used to gradually roll out changes, be
+it a new feature, or a performance improvement. By using feature flags, we can
+comfortably measure the impact of our changes, while still being able to easily
+disable those changes, without having to revert an entire release.
+
+## When to use feature flags
+
+Starting with GitLab 11.4, developers are required to use feature flags for
+non-trivial changes. Such changes include:
+
+* New features (e.g. a new merge request widget, epics, etc.).
+* Complex performance improvements that may require additional testing in
+  production, such as rewriting complex queries.
+* Invasive changes to the user interface, such as a new navigation bar or the
+  removal of a sidebar.
+* Adding support for importing projects from a third-party service.
+
+In all cases, those working on the changes can best decide if a feature flag is
+necessary. For example, changing the color of a button doesn't need a feature
+flag, while changing the navigation bar definitely needs one. In case you are
+uncertain if a feature flag is necessary, simply ask about this in the merge
+request, and those reviewing the changes will likely provide you with an answer.
+
+When using a feature flag for UI elements, make sure to _also_ use a feature
+flag for the underlying backend code, if there is any. This ensures there is
+absolutely no way to use the feature until it is enabled.
+
+## The cost of feature flags
+
+When reading the above, one might be tempted to think this procedure is going to
+add a lot of work. Fortunately, this is not the case, and we'll show why. For
+this example we'll specify the cost of the work to do as a number, ranging from
+0 to infinity. The greater the number, the more expensive the work is. The cost
+does _not_ translate to time; it's just a way of measuring complexity of one
+change relative to another.
+
+Let's say we are building a new feature, and we have determined that the cost of
+this is 10. We have also determined that the cost of adding a feature flag check
+in a variety of places is 1. If we do not use feature flags, and our feature
+works as intended, our total cost is 10. This however is the best case scenario.
+Optimising for the best case scenario is guaranteed to lead to trouble, whereas
+optimising for the worst case scenario is almost always better.
+
+To illustrate this, let's say our feature causes an outage, and there's no
+immediate way to resolve it. This means we'd have to take the following steps to
+resolve the outage:
+
+1. Revert the release.
+1. Perform any cleanups that might be necessary, depending on the changes that
+   were made.
+1. Revert the commit, ensuring the "master" branch remains stable. This is
+   especially necessary if solving the problem can take days or even weeks.
+1. Pick the revert commit into the appropriate stable branches, ensuring we
+   don't block any future releases until the problem is resolved.
+
+As history has shown, these steps are time consuming, complex, often involve
+many developers, and worst of all: our users will have a bad experience using
+GitLab.com until the problem is resolved.
+
+Now let's say that all of this has an associated cost of 10. This means that in
+the worst case scenario, which we should optimise for, our total cost is now 20.
+
+If we had used a feature flag, things would have been very different. We don't
+need to revert a release, and because feature flags are disabled by default we
+don't need to revert and pick any Git commits. In fact, all we have to do is
+disable the feature, and _maybe_ perform some cleanup. Let's say that the cost
+of this is 1. In this case, our best case cost is 11: 10 to build the feature,
+and 1 to add the feature flag. The worst case cost is now 12: 10 to build the
+feature, 1 to add the feature flag, and 1 to disable it.
+
+Here we can see that in the best case scenario the work necessary is only a tiny
+bit more compared to not using a feature flag. Meanwhile, the process of
+reverting our changes has been made significantly cheaper, to the point of being
+trivial.
+
+In other words, feature flags do not slow down the development process. Instead,
+they speed up the process as managing incidents now becomes _much_ easier. Once
+continuous deployments are easier to perform, the time to iterate on a feature
+is reduced even further, as you no longer need to wait weeks before your changes
+are available on GitLab.com.
+
+## Rolling out changes
+
+The procedure of using feature flags is straightforward, and similar to not
+using them. You add the necessary tests (make sure to test both the on and off
+states of your feature flag(s)), make sure they all pass, have the code
+reviewed, etc. You then submit your merge request, and add the ~"feature flag"
+label. This label is used to signal to release managers that your changes are
+hidden behind a feature flag and that it is safe to pick the MR into a stable
+branch, without the need for an exception request.
+
+When the changes are deployed it is time to start rolling out the feature to our
+users. The exact procedure of rolling out a change is unspecified, as this can
+vary from change to change. However, in general we recommend rolling out changes
+incrementally, instead of enabling them for everybody right away. We also
+recommend that you do _not_ enable a feature _before_ the code is deployed.
+This allows you to separate rolling out a feature from a deploy, making it
+easier to measure the impact of both separately.
+
+GitLab's feature library (using
+[Flipper](https://github.com/jnunemaker/flipper), and covered in the [Feature
+Flags](feature_flags.md) guide) supports rolling out changes to a percentage of
+users. This in turn can be controlled using [GitLab
+chatops](https://docs.gitlab.com/ee/ci/chatops/).
+
+For example, to enable a feature for 25% of all users, run the following in
+Slack:
+
+```
+/chatops run feature set new_navigation_bar 25
+```
+
+This will enable the feature for GitLab.com, with `new_navigation_bar` being the
+name of the feature. We can also enable the feature for <https://dev.gitlab.org>
+or <https://staging.gitlab.com>:
+
+```
+/chatops run feature set new_navigation_bar 25 --dev
+/chatops run feature set new_navigation_bar 25 --staging
+```
+
+If you are not certain what percentages to use, simply use the following steps:
+
+1. 25%
+1. 50%
+1. 75%
+1. 100%
+
+Between every step you'll want to wait a little while and monitor the
+appropriate graphs on <https://dashboards.gitlab.net>. The exact time to wait
+may differ. For some features a few minutes is enough, while for others you may
+want to wait several hours or even days. This is entirely up to you; just make
+sure it is clearly communicated to your team, and the Production team if you
+anticipate any potential problems.
+
+Once a change is deemed stable, submit a new merge request to remove the
+feature flag. This ensures the change is available to all users and self-hosted
+instances. Make sure to add the ~"feature flag" label to this merge request so
+release managers are aware the changes are hidden behind a feature flag. If the
+merge request has to be picked into a stable branch (e.g. after the 7th), make
+sure to also add the appropriate "Pick into X" label (e.g. "Pick into 11.4").
+
+One might be tempted to think this will delay the release of a feature by at
+least one month (= one release). This is not the case. A feature flag does not
+have to stick around for a specific amount of time (e.g. at least one release);
+instead, it should stick around until the feature is deemed stable. Stable
+means it works on GitLab.com without causing any problems, such as outages. In
+most cases this will translate to a feature (with a feature flag) being shipped
+in RC1, followed by the feature flag being removed in RC2. This in turn means
+the feature will be stable by the time we publish a stable package around the
+22nd of the month.
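
The cost comparison in the "The cost of feature flags" section of the new guide can be checked with a few lines of Ruby. This is only a sketch of the arithmetic, not part of the commit: the constant names and the `total_cost` helper are hypothetical, and the numbers are the illustrative ones from the guide (10 to build, 1 to add the flag, 10 to revert a release, 1 to disable a flag).

```ruby
# Hypothetical cost model using the illustrative numbers from the guide.
BUILD_COST   = 10 # building the feature
FLAG_COST    = 1  # adding feature flag checks in a variety of places
REVERT_COST  = 10 # reverting the release, cleanup, revert commit, picks
DISABLE_COST = 1  # disabling the flag and maybe performing some cleanup

# Returns the total relative cost of shipping a feature, with or without
# a feature flag, in the best case (no outage) or worst case (outage).
def total_cost(use_flag:, outage:)
  cost = BUILD_COST
  cost += FLAG_COST if use_flag
  cost += (use_flag ? DISABLE_COST : REVERT_COST) if outage
  cost
end

# Print all four combinations of flag usage and outcome.
[false, true].each do |use_flag|
  [false, true].each do |outage|
    cost = total_cost(use_flag: use_flag, outage: outage)
    puts "flag=#{use_flag} outage=#{outage} cost=#{cost}"
  end
end
```

With these numbers the flag adds 1 to the best case (10 versus 11) and removes 8 from the worst case (20 versus 12), which is the trade-off the guide argues for.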