diff options
Diffstat (limited to 'doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md')
-rw-r--r-- | doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md | 238 |
1 files changed, 238 insertions, 0 deletions
diff --git a/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md b/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md new file mode 100644 index 00000000000..a73f6335218 --- /dev/null +++ b/doc/architecture/blueprints/container_registry_metadata_database_self_managed_rollout/index.md @@ -0,0 +1,238 @@ +--- +status: proposed +creation-date: "2023-06-09" +authors: [ "@hswimelar" ] +coach: "@grzesiek" +approvers: [ "@trizzi ", "@sgoldstein" ] +owning-stage: "~devops::package" +participating-stages: [] +--- + +<!-- Blueprints often contain forward-looking statements --> +<!-- vale gitlab.FutureTense = NO --> + +# Container Registry Self-Managed Database Rollout + +## Summary + +The latest iteration of the [Container Registry](https://gitlab.com/gitlab-org/container-registry) +has been rearchitected to use a PostgreSQL database and deployed on GitLab.com. +Now we must bring the advantages provided by the database to self-managed users. +While the container registry retains the capacity to run without the new database, +many new and highly desired features cannot be implemented without it. +Additionally, unifying the registry used for GitLab.com and for self-managed +allows us to provide a cohesive user experience and reduces the burden +associated with maintaining the old registry implementation. To accomplish this, +we plan to eventually require all self-managed to migrate to the new registry +database, so that we may deprecate and remove support for the old object storage +metadata subsystem. + +This document seeks to describe how we may use the proven core migration +functionality, which was used to migrate millions of container images on GitLab.com, +to enable self-managed users to enjoy the benefits of the metadata database. + +## Motivation + +Enabling self-managed users to migrate to the new metadata database allows these +users to take advantage of the new features that require the database. Additionally, +the greater adoption of the database allows the container registry team to focus +our knowledge and capacity, and will eventually allow us to fully remove the old +registry metadata subsystem, greatly improving the maintainability and stability +of the container registry for both GitLab.com and for self-managed users. + +### Goals + +- Progressively rollout the new dependency of a PostgreSQL database instance for the registry for charts and omnibus deployments. +- Progressively rollout automation for the registry PostgreSQL database instance for charts and omnibus deployments. +- Develop processes and tools that self-managed admins can use to migrate existing registry deployments to the metadata database. +- Develop processes and tools that self-managed admins can use spin up fresh installs of the Container Registry which use the metadata database. +- Create a plan which will eventually allow us to fully drop support for original object storage metadata subsystem. + +### Non-Goals + +- Developing new Container Registry features outside the scope of enabling admins to migrate to the metadata database. +- Determining lifecycle support decisions, such as when to default to the database, and when to end support for non-database registries. + +## Proposal + +There are two main components that must be further developed in order for +self-managed admins to move to the registry database: the deployment environment and +the registry migration tooling. + +For the deployment environments need to document what the user needs to do to set up their +deployment such that the registry has access to a suitable database given the +expected registry workload. As well as develop tooling and automation to ease +the setup and maintenance of the registry database for new and existing deploys. + +For the registry, we need to develop and validate import tooling which +coordinates with the core import functionality which was used to migrate all +container images on GitLab.com. Additionally, we must validate that each supported +storage driver works as expected with the import process and provide estimated +import times for admins. + +We can structure our work to meet the standards outlined in support for +Experiment, Beta, and Alpha features. Doing so will help to prioritize core +functionality and to allow users who wish to be early adopters to begin using +the database and providing us with invaluable feedback. + +These levels of support could be advertised to self-managed users via a simple +chart, allowing them to tell at a glance the status of this project as it relates +to their situation. + +| Installation | GCS | AWS | Filesystem | Azure | OSS | Swift| +| ------ | ------ |------ | ------ | ------ |------ | ------ | +| Omnibus | GA | GA | Beta | Experimental | Experimental | Experimental | +| Charts | GA | GA |Beta | Experimental | Experimental | Experimental | + +### Justification of Structuring Support by Driver + +It's possible that we could simplify the proposed support matrix by structuring +it only by deployment environment and not differentiate by storage driver. The +following two sections briefly summarize several points for and against. + +#### Arguments Opposed to Structuring Support by Driver + +Each storage driver is well abstracted in the code, specifically the import process +makes use of the following Methods: + +- Walk +- List +- GetContent +- Stat +- Reader + +Each of the methods is a read method we do not need to create or delete data via +the object storage methods. Additionally, all of these methods are standard API +methods. + +Given that we're not mutating data via object storage as part of the import +process, we should not need to double-check these drivers or try to predict +potential errors. Relying on user feedback during the beta to direct any efforts +we should be making here could prevent us from scheduling unnecessary work. + +#### Arguments in Favor of Structuring Support by Driver + +Our experience with enhancing and supporting offline garbage collection has +shown that while the storage driver implementation should not matter, it does. +The drivers have proven to have important differences in performance and +reliability. Many of the planned possible driver-related improvements are +related to testing and stability, rather than outright new work for each driver. + +In particular, retries and error reporting across storage drivers are not as +standardized as one would hope for, and therefore there is a potential that a +long-running import process could be interrupted by an error that could have +been retried. + +Creating import estimates based on combinations of the registry size and storage +driver, would also be of use to self-managed admins, looking to schedule their +migration. There will be a difference here between local filesystem storage and +object storage and there could be a difference between the object storage +providers as well. + +Also, we could work with the importer to smooth out the differences in the +storage drivers. Even without unified retryable error reporting from the storage +drivers, we could have the importer retry more time and for more errors. There's +a risk we would retry several times on non-retryable errors, but since no writes +are being made to object storage, this should not ultimately be harmful. + +Additionally, implementing [Validate Self-Managed Imports](https://gitlab.com/gitlab-org/container-registry/-/issues/938) +would perform a consistency check against a sample of images before and after +import which would lead to greater consistency across all storage driver implementations. + +## Design and Implementation Details + +### The Import Tool + +The [import tool](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs-gitlab/database-import-tool.md) +is a well-validated component of the Container Registry project that we have used +from the beginning as a way to perform local testing. This tool is a thin wrapper +over the core import functionality — the code which handles the import logic has +been extensively validated. + +While the core import functionality is solid, we must ensure that this tool and +the surrounding process will enable non-expert users to import their registries +with both minimal risk and with minimal support from GitLab team members. +Therefore, the most important work remaining is crafting the UX of this tooling +such that those goals are met. This +[epic](https://gitlab.com/groups/gitlab-org/-/epics/8602) captures many of the +proposed improvements. + +#### Design + +The tool is designed such that a single execution flow can support both users +with large registries with strict uptime requirements who can take advantage of +a more involved process to reduce read-only time to the absolute minimum as well +as users with small registries who benefit from a streamlined workflow. This is +achieved via the same pre import, then full import cycle that was used on +GitLab.com, along with an additional step to catalog all unreferenced blobs held +in common storage. + +##### One-Shot Import + +In most cases, a user can simply choose to run the import tool while the registry +is offline or read-only in mode. This will be similar to what admins must +already do in order to run offline garbage collection. Each step completes in +sequence, moving directly to the next. The command exits when the import process +is complete and the registry is ready to make full use of the metadata database. + +##### Minimal Downtime Import + +For users with large registries and who are interested in the minimum possible +downtime, each step can be ran independently when the tool is passed the appropriate +flag. The user will first run the pre-import step while the registry is +performing its usual workload. Once that has completed, and the user is ready +to stop writes to the registry, the tag import step can be ran. As with the GitLab.com +migration, importing tags requires that the registry be offline or in +read-only mode. This step does the minimum possible work to achieve fast and +efficient tag imports and will always be the fastest of the three steps, reducing +the downtime component to a fraction of the total import time. The user can then +bring up the registry configured to use the metadata database. After that, the +user is free to run the third step during standard registry operations. This step +makes any dangling blobs in common storage visible to the database and therefore +the online garbage collection process. + +### Distribution Paths + +Tooling, process, and documentation will need to be developed in order to +support users who wish to use the metadata database, especially in regards to +providing a foundation for the new database instance required for the migration. + +For new deployments, we should wait until we've moved to general support, have +automation in place for the registry database and migration, and have a major +GitLab version bump before enabling the database by default for self-managed. + +#### Omnibus + +#### Charts + +## Alternative Solutions + +### Do Nothing + +#### Pros + +- The database and associated features are generally most useful for large-scale, high-availability deployments. +- Eliminate the need to support an additional logical or physical database for self-managed deployments. + +#### Cons + +- The registry on GitLab.com and the registry used by self-managed will greatly diverge in supported features over time. +- The maintenance burden of supporting two registry implementations will reduce the velocity at which new registry features can be released. +- The registry on GitLab.com stops being an effective way to validate changes before they are released to self-managed. +- Large self-managed users continue to not be able to scale the registry to suit their needs. + +### Gradual Migration + +This approach would be to exactly replicate the GitLab.com migration on +self-managed. + +#### Pros + +- Replicate an already successful process. +- Scope downtime by repository, rather than instance. + +#### Cons + +- Dramatically increased complexity in all aspects of the migration process. +- Greatly increased possibility of data consistency issues. +- Less clear demarcation of registry migration progress. |