field | value | date |
---|---|---|
author | GitLab Bot <gitlab-bot@gitlab.com> | 2022-10-20 12:40:42 +0300 |
committer | GitLab Bot <gitlab-bot@gitlab.com> | 2022-10-20 12:40:42 +0300 |
commit | ee664acb356f8123f4f6b00b73c1e1cf0866c7fb | |
tree | f8479f94a28f66654c6a4f6fb99bad6b4e86a40e /doc/architecture/blueprints/ci_scale/index.md | |
parent | 62f7d5c5b69180e82ae8196b7b429eeffc8e7b4f | |
Add latest changes from gitlab-org/gitlab@15-5-stable-ee v15.5.0-rc42
Diffstat (limited to 'doc/architecture/blueprints/ci_scale/index.md')
-rw-r--r-- | doc/architecture/blueprints/ci_scale/index.md | 149 |
1 files changed, 75 insertions, 74 deletions
diff --git a/doc/architecture/blueprints/ci_scale/index.md b/doc/architecture/blueprints/ci_scale/index.md index 75c4d05c334..c02fb35974b 100644 --- a/doc/architecture/blueprints/ci_scale/index.md +++ b/doc/architecture/blueprints/ci_scale/index.md @@ -17,11 +17,15 @@ and has become [one of the most beloved CI/CD solutions](https://about.gitlab.co GitLab CI/CD has come a long way since the initial release, but the design of the data storage for pipeline builds remains almost the same since 2012. We store all the builds in PostgreSQL in `ci_builds` table, and because we are -creating more than [2 million builds each day on GitLab.com](https://docs.google.com/spreadsheets/d/17ZdTWQMnTHWbyERlvj1GA7qhw_uIfCoI5Zfrrsh95zU), -we are reaching database limits that are slowing our development velocity down. +creating more than 5 million builds each day on GitLab.com we are reaching +database limits that are slowing our development velocity down. -On February 1st, 2021, GitLab.com surpassed 1 billion CI/CD builds created and the number of -builds continues to grow exponentially. +On February 1st, 2021, GitLab.com surpassed 1 billion CI/CD builds created. In +February 2022 we reached 2 billion of CI/CD build stored in the database. The +number of builds continues to grow exponentially. + +The screenshot below shows our forecast created at the beginning of 2021, that +turned out to be quite accurate. ![CI builds cumulative with forecast](ci_builds_cumulative_forecast.png) @@ -34,9 +38,9 @@ builds continues to grow exponentially. The current state of CI/CD product architecture needs to be updated if we want to sustain future growth. -### We are running out of the capacity to store primary keys +### We were running out of the capacity to store primary keys: DONE -The primary key in `ci_builds` table is an integer generated in a sequence. +The primary key in `ci_builds` table is an integer value, generated in a sequence. 
Historically, Rails used to use [integer](https://www.postgresql.org/docs/14/datatype-numeric.html) type when creating primary keys for a table. We did use the default when we [created the `ci_builds` table in 2012](https://gitlab.com/gitlab-org/gitlab/-/blob/046b28312704f3131e72dcd2dbdacc5264d4aa62/db/ci/migrate/20121004165038_create_builds.rb). @@ -45,34 +49,32 @@ since the release of Rails 5. The framework is now using `bigint` type that is 8 bytes long, however we have not migrated primary keys for `ci_builds` table to `bigint` yet. -We will run out of the capacity of the integer type to store primary keys in -`ci_builds` table before December 2021. When it happens without a viable -workaround or an emergency plan, GitLab.com will go down. - -`ci_builds` is just one of the tables that are running out of the primary keys -available in Int4 sequence. There are multiple other tables storing CI/CD data -that have the same problem. +In early 2021 we had estimated that would run out of the capacity of the integer +type to store primary keys in `ci_builds` table before December 2021. If it had +happened without a viable workaround or an emergency plan, GitLab.com would go +down. `ci_builds` was just one of many tables that were running out of the +primary keys available in Int4 sequence. -Primary keys problem will be tackled by our Database Team. +Before October 2021, our Database team had managed to migrate all the risky +tables' primary keys to big integers. -**Status**: In October 2021, the primary keys in CI tables were migrated -to big integers. See the [related Epic](https://gitlab.com/groups/gitlab-org/-/epics/5657) for more details. +See the [related Epic](https://gitlab.com/groups/gitlab-org/-/epics/5657) for more details. -### The table is too large +### Some CI/CD database tables are too large: IN PROGRESS -There is more than a billion rows in `ci_builds` table. 
We store more than 2 -terabytes of data in that table, and the total size of indexes is more than 1 -terabyte (as of February 2021). +There is more than two billion rows in `ci_builds` table. We store many +terabytes of data in that table, and the total size of indexes is measured in +terabytes as well. -This amount of data contributes to a significant performance problems we -experience on our primary PostgreSQL database. +This amount of data contributes to a significant number of performance +problems we experience on our CI PostgreSQL database. -Most of the problem are related to how PostgreSQL database works internally, +Most of the problems are related to how PostgreSQL database works internally, and how it is making use of resources on a node the database runs on. We are at -the limits of vertical scaling of the primary database nodes and we frequently -see a negative impact of the `ci_builds` table on the overall performance, -stability, scalability and predictability of the database GitLab.com depends -on. +the limits of vertical scaling of the CI primary database nodes and we +frequently see a negative impact of the `ci_builds` table on the overall +performance, stability, scalability and predictability of the CI database +GitLab.com depends on. The size of the table also hinders development velocity because queries that seem fine in the development environment may not work on GitLab.com. The @@ -90,41 +92,40 @@ environment. We also expect a significant, exponential growth in the upcoming years. One of the forecasts done using [Facebook's Prophet](https://facebook.github.io/prophet/) -shows that in the first half of -2024 we expect seeing 20M builds created on GitLab.com each day. In comparison -to around 2M we see created today, this is 10x growth our product might need to -sustain in upcoming years. +shows that in the first half of 2024 we expect seeing 20M builds created on +GitLab.com each day. In comparison to around 5M we see created today. 
This is +10x growth from numbers we saw in 2021. ![CI builds daily forecast](ci_builds_daily_forecast.png) **Status**: As of October 2021 we reduced the growth rate of `ci_builds` table -by writing build options and variables to `ci_builds_metadata` table. We plan -to ship further improvements that will be described in a separate blueprint. +by writing build options and variables to `ci_builds_metadata` table. We are +also working on partitioning the largest CI/CD database tables using +[time decay pattern](../ci_data_decay/index.md). -### Queuing mechanisms are using the large table +### Queuing mechanisms were using the large table: DONE -Because of how large the table is, mechanisms that we use to build queues of -pending builds (there is more than one queue), are not very efficient. Pending -builds represent a small fraction of what we store in the `ci_builds` table, -yet we need to find them in this big dataset to determine an order in which we -want to process them. +Because of how large the table is, mechanisms that we used to build queues of +pending builds (there is more than one queue), were not very efficient. Pending +builds represented a small fraction of what we store in the `ci_builds` table, +yet we needed to find them in this big dataset to determine an order in which we +wanted to process them. -This mechanism is very inefficient, and it has been causing problems on the -production environment frequently. This usually results in a significant drop -of the CI/CD Apdex score, and sometimes even causes a significant performance +This mechanism was very inefficient, and it had been causing problems on the +production environment frequently. This usually resulted in a significant drop +of the CI/CD Apdex score, and sometimes even caused a significant performance degradation in the production environment. -There are multiple other strategies that can improve performance and -reliability. 
We can use [Redis queuing](https://gitlab.com/gitlab-org/gitlab/-/issues/322972), or -[a separate table that will accelerate SQL queries used to build queues](https://gitlab.com/gitlab-org/gitlab/-/issues/322766) -and we want to explore them. +There were multiple other strategies that we considered to improve performance and +reliability. We evaluated using [Redis queuing](https://gitlab.com/gitlab-org/gitlab/-/issues/322972), or +[a separate table that would accelerate SQL queries used to build queues](https://gitlab.com/gitlab-org/gitlab/-/issues/322766). +We decided to proceed with the latter. -**Status**: As of October 2021 the new architecture -[has been implemented on GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/5909#note_680407908). -The following epic tracks making it generally available: -[Make the new pending builds architecture generally available](https://gitlab.com/groups/gitlab-org/-/epics/6954). +In October 2021 we finished shipping the new architecture of builds queuing +[on GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/5909#note_680407908). +We then made the new architecture [generally available](https://gitlab.com/groups/gitlab-org/-/epics/6954). -### Moving big amounts of data is challenging +### Moving big amounts of data is challenging: IN PROGRESS We store a significant amount of data in `ci_builds` table. Some of the columns in that table store a serialized user-provided data. Column `ci_builds.options` @@ -144,24 +145,27 @@ described in a separate architectural blueprint. ## Proposal -Making GitLab CI/CD product ready for the scale we expect to see in the -upcoming years is a multi-phase effort. - -First, we want to focus on things that are urgently needed right now. We need -to fix primary keys overflow risk and unblock other teams that are working on -database partitioning and sharding. 
- -We want to improve known bottlenecks, like -builds queuing mechanisms that is using the large table, and other things that -are holding other teams back. - -Extending CI/CD metrics is important to get a better sense of how the system -performs and to what growth should we expect. This will make it easier for us -to identify bottlenecks and perform more advanced capacity planning. - -Next step is to better understand how we can leverage strong time-decay -characteristic of CI/CD data. This might help us to partition CI/CD dataset to -reduce the size of CI/CD database tables. +Below you can find the original proposal made in early 2021 about how we want +to move forward with CI Scaling effort: + +> Making GitLab CI/CD product ready for the scale we expect to see in the +> upcoming years is a multi-phase effort. +> +> First, we want to focus on things that are urgently needed right now. We need +> to fix primary keys overflow risk and unblock other teams that are working on +> database partitioning and sharding. +> +> We want to improve known bottlenecks, like +> builds queuing mechanisms that is using the large table, and other things that +> are holding other teams back. +> +> Extending CI/CD metrics is important to get a better sense of how the system +> performs and to what growth should we expect. This will make it easier for us +> to identify bottlenecks and perform more advanced capacity planning. +> +> Next step is to better understand how we can leverage strong time-decay +> characteristic of CI/CD data. This might help us to partition CI/CD dataset to +> reduce the size of CI/CD database tables. ## Iterations @@ -170,15 +174,12 @@ Work required to achieve our next CI/CD scaling target is tracked in the 1. ✓ Migrate primary keys to big integers on GitLab.com. 1. ✓ Implement the new architecture of builds queuing on GitLab.com. -1. [Make the new builds queuing architecture generally available](https://gitlab.com/groups/gitlab-org/-/epics/6954). +1. 
✓ [Make the new builds queuing architecture generally available](https://gitlab.com/groups/gitlab-org/-/epics/6954). 1. [Partition CI/CD data using time-decay pattern](../ci_data_decay/index.md). ## Status -|-------------|--------------| -| Created at | 21.01.2021 | -| Approved at | 26.04.2021 | -| Updated at | 28.02.2022 | +Created at 21.01.2021, approved at 26.04.2021. Status: In progress. |
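
The primary-key capacity concern discussed in the changed blueprint can be illustrated with a little arithmetic. The following is a minimal sketch, not the forecast method used by GitLab: it assumes the sequence value roughly equals the number of builds created and that the creation rate stays constant, and the function name `days_until_exhaustion` is hypothetical. Because the blueprint describes exponential growth, a constant-rate estimate like this is optimistic and is not expected to reproduce the "before December 2021" estimate quoted in the text.

```python
# Largest values a PostgreSQL `integer` (4 bytes) and `bigint` (8 bytes) can hold.
INT4_MAX = 2**31 - 1
INT8_MAX = 2**63 - 1


def days_until_exhaustion(current_id: int, ids_per_day: int, max_id: int = INT4_MAX) -> int:
    """Whole days left before the sequence passes max_id at a constant rate.

    Assumption (for illustration only): ids are consumed at a fixed daily
    rate; real growth described in the blueprint is exponential.
    """
    if ids_per_day <= 0:
        raise ValueError("ids_per_day must be positive")
    remaining = max_id - current_id
    return max(remaining, 0) // ids_per_day


# Figures taken from the blueprint text: 1 billion builds on 2021-02-01,
# about 2 million builds created per day at that time.
print(days_until_exhaustion(1_000_000_000, 2_000_000))  # prints 573

# After the bigint migration, the same calculation with the later figures
# (2 billion builds, about 5 million per day) leaves ample headroom.
print(days_until_exhaustion(2_000_000_000, 5_000_000, max_id=INT8_MAX))
```

Even at a constant rate the 4-byte limit was less than two years away in early 2021, which is consistent with the urgency the blueprint assigns to the `bigint` migration.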