Diffstat (limited to 'doc/development/database')
-rw-r--r--  doc/development/database/avoiding_downtime_in_migrations.md      3
-rw-r--r--  doc/development/database/batched_background_migrations.md        3
-rw-r--r--  doc/development/database/clickhouse/clickhouse_within_gitlab.md  237
-rw-r--r--  doc/development/database/foreign_keys.md                         2
-rw-r--r--  doc/development/database/index.md                                2
-rw-r--r--  doc/development/database/multiple_databases.md                   116
-rw-r--r--  doc/development/database/not_null_constraints.md                 2
-rw-r--r--  doc/development/database/poc_tree_iterator.md                    475
8 files changed, 825 insertions, 15 deletions
diff --git a/doc/development/database/avoiding_downtime_in_migrations.md b/doc/development/database/avoiding_downtime_in_migrations.md
index 371df5b45ff..5a2343e883c 100644
--- a/doc/development/database/avoiding_downtime_in_migrations.md
+++ b/doc/development/database/avoiding_downtime_in_migrations.md
@@ -14,8 +14,7 @@ requiring downtime.
## Dropping columns
-Removing columns is tricky because running GitLab processes may still be using
-the columns. To work around this safely, you need three steps in three releases:
+Removing columns is tricky because running GitLab processes expect these columns to exist: ActiveRecord caches the table's schema, so even columns that are no longer referenced in code are still expected unless they are explicitly marked as ignored. To work around this safely, you need three steps in three releases:
1. [Ignoring the column](#ignoring-the-column-release-m) (release M)
1. [Dropping the column](#dropping-the-column-release-m1) (release M+1)
diff --git a/doc/development/database/batched_background_migrations.md b/doc/development/database/batched_background_migrations.md
index 10490df7b5e..a6d827df820 100644
--- a/doc/development/database/batched_background_migrations.md
+++ b/doc/development/database/batched_background_migrations.md
@@ -193,9 +193,10 @@ Because batched migrations are update heavy and there were few incidents in the
These database indicators are checked to throttle a migration. On getting a
stop signal, the migration is paused for a set time (10 minutes):
-- WAL queue pending archival crossing a threshold.
+- WAL queue pending archival crossing the threshold.
- Active autovacuum on the tables that the migration works on.
- Patroni apdex SLI dropping below the SLO.
+- WAL rate crossing the threshold.
It's an ongoing effort to add more indicators to further enhance the
database health check framework. For more details, see
diff --git a/doc/development/database/clickhouse/clickhouse_within_gitlab.md b/doc/development/database/clickhouse/clickhouse_within_gitlab.md
new file mode 100644
index 00000000000..297776429d7
--- /dev/null
+++ b/doc/development/database/clickhouse/clickhouse_within_gitlab.md
@@ -0,0 +1,237 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
+---
+
+# ClickHouse within GitLab
+
+This document gives a high-level overview of how to develop features using ClickHouse in the GitLab Rails application.
+
+NOTE:
+Most of the tooling and APIs are considered unstable.
+
+## GDK setup
+
+For instructions on how to set up a ClickHouse server locally, see the [ClickHouse installation documentation](https://clickhouse.com/docs/en/install).
+
+### Configure your Rails application
+
+1. Copy the example file and configure the credentials:
+
+ ```shell
+    cp config/click_house.yml.example config/click_house.yml
+ ```
+
+1. Create the database using the `clickhouse-client` CLI tool:
+
+ ```shell
+ clickhouse-client --password
+ ```
+
+ ```sql
+ create database gitlab_clickhouse_development;
+ ```
+
+### Validate your setup
+
+Run the Rails console and invoke a simple query:
+
+```ruby
+ClickHouse::Client.select('SELECT 1', :main)
+# => [{"1"=>1}]
+```
+
+## Database schema and migrations
+
+For the ClickHouse database there are no established schema migration procedures yet. We have very basic tooling to build up the database schema in the test environment from scratch using timestamp-prefixed SQL files.
+
+You can create a table by placing a new SQL file in the `db/click_house/main` folder:
+
+```sql
+-- 20230811124511_create_issues.sql
+CREATE TABLE issues
+(
+ id UInt64 DEFAULT 0,
+ title String DEFAULT ''
+)
+ENGINE = MergeTree
+PRIMARY KEY (id)
+```
+
+When you're working locally in your development environment, you can create or re-create your table schema by executing the respective `CREATE TABLE` statement. Alternatively, you can use the following snippet in the Rails console:
+
+```ruby
+require_relative 'spec/support/database/click_house/hooks.rb'
+
+# Drops and re-creates all tables
+ClickHouseTestRunner.new.ensure_schema
+```
+
+## Writing database queries
+
+For the ClickHouse database we don't use an ORM (Object Relational Mapping). The main reason is that the GitLab application has many customizations for the `ActiveRecord` PostgreSQL adapter, and the application generally assumes that all databases are using `PostgreSQL`. Because ClickHouse-related features are still in a very early stage of development, we decided to implement a simple HTTP client to avoid hard-to-discover bugs and long debugging times when dealing with multiple `ActiveRecord` adapters.
+
+Additionally, ClickHouse might not be used the same way as other adapters for `ActiveRecord`. The access patterns differ from traditional transactional databases, in that ClickHouse:
+
+- Uses nested aggregation `SELECT` queries with `GROUP BY` clauses.
+- Doesn't use single `INSERT` statements. Data is inserted in batches via background jobs.
+- Has different consistency characteristics and no transactions.
+- Has very few database-level validations.
+
+Database queries are written and executed with the help of the `ClickHouse::Client` gem.
+
+A simple query from the `events` table:
+
+```ruby
+rows = ClickHouse::Client.select('SELECT * FROM events', :main)
+```
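+
+For instance, the aggregation-heavy access pattern mentioned above might look like the following sketch. The `author_id` and `created_at` columns are assumptions for illustration only:
+
+```ruby
+# Hypothetical aggregation: count events per author per day.
+# The `author_id` and `created_at` columns are assumed to exist for this sketch.
+rows = ClickHouse::Client.select(<<~SQL, :main)
+  SELECT author_id, toDate(created_at) AS day, count() AS event_count
+  FROM events
+  GROUP BY author_id, day
+  ORDER BY author_id, day
+SQL
+```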
+
+When working with queries that have placeholders, you can use the `ClickHouse::Client::Query` object, where you specify the placeholder name and its data type. The actual variable substitution, quoting, and escaping are done by the ClickHouse server.
+
+```ruby
+raw_query = 'SELECT * FROM events WHERE id > {min_id:UInt64}'
+placeholders = { min_id: Integer(100) }
+query = ClickHouse::Client::Query.new(raw_query: raw_query, placeholders: placeholders)
+
+rows = ClickHouse::Client.select(query, :main)
+```
+
+When using placeholders, the client can provide a redacted version of the query, where the placeholder values are masked, which can be ingested by our logging system. You can see the redacted version of your query by calling the `to_redacted_sql` method:
+
+```ruby
+puts query.to_redacted_sql
+```
+
+ClickHouse allows only one statement per request. This means that the common SQL injection vulnerability where the statement is closed with a `;` character and then another query is "injected" cannot be exploited:
+
+```ruby
+ClickHouse::Client.select('SELECT 1; SELECT 2', :main)
+
+# ClickHouse::Client::DatabaseError: Code: 62. DB::Exception: Syntax error (Multi-statements are not allowed): failed at position 9 (end of query): ; SELECT 2. . (SYNTAX_ERROR) (version 23.4.2.11 (official build))
+```
+
+### Subqueries
+
+You can compose complex queries with the `ClickHouse::Client::Query` class by declaring a query placeholder with the special `Subquery` type. The library makes sure the queries and their placeholders are merged correctly:
+
+```ruby
+subquery = ClickHouse::Client::Query.new(raw_query: 'SELECT id FROM events WHERE id = {id:UInt64}', placeholders: { id: Integer(10) })
+
+raw_query = 'SELECT * FROM events WHERE id > {id:UInt64} AND id IN ({q:Subquery})'
+placeholders = { id: Integer(10), q: subquery }
+
+query = ClickHouse::Client::Query.new(raw_query: raw_query, placeholders: placeholders)
+rows = ClickHouse::Client.select(query, :main)
+
+# ClickHouse will replace the placeholders
+puts query.to_sql # SELECT * FROM events WHERE id > {id:UInt64} AND id IN (SELECT id FROM events WHERE id = {id:UInt64})
+
+puts query.to_redacted_sql # SELECT * FROM events WHERE id > $1 AND id IN (SELECT id FROM events WHERE id = $2)
+
+puts query.placeholders # { id: 10 }
+```
+
+If there are placeholders with the same name but different values, the query raises an error.
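+
+A minimal sketch of the conflicting case, based on the subquery example above:
+
+```ruby
+subquery = ClickHouse::Client::Query.new(
+  raw_query: 'SELECT id FROM events WHERE id = {id:UInt64}',
+  placeholders: { id: Integer(10) }
+)
+
+query = ClickHouse::Client::Query.new(
+  raw_query: 'SELECT * FROM events WHERE id > {id:UInt64} AND id IN ({q:Subquery})',
+  placeholders: { id: Integer(20), q: subquery }
+)
+
+# The `id` placeholder is bound to two different values (10 and 20),
+# so using this query raises an error.
+ClickHouse::Client.select(query, :main)
+```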
+
+### Writing query conditions
+
+When working with complex forms where multiple filter conditions are present, building queries by concatenating query fragments as strings can get out of hand very quickly. For queries with several conditions, you can use the `ClickHouse::QueryBuilder` class. The class uses the `Arel` gem to generate queries and provides a query interface similar to `ActiveRecord`.
+
+```ruby
+builder = ClickHouse::QueryBuilder.new('events')
+
+query = builder
+ .where(builder.table[:created_at].lteq(Date.today))
+ .where(id: [1,2,3])
+
+rows = ClickHouse::Client.select(query, :main)
+```
+
+## Inserting data
+
+The ClickHouse client supports inserting data through the standard query interface:
+
+```ruby
+raw_query = 'INSERT INTO events (id, target_type) VALUES ({id:UInt64}, {target_type:String})'
+placeholders = { id: 1, target_type: 'Issue' }
+
+query = ClickHouse::Client::Query.new(raw_query: raw_query, placeholders: placeholders)
+rows = ClickHouse::Client.execute(query, :main)
+```
+
+Inserting data this way is acceptable if:
+
+- The table contains settings or configuration data where we need to add one row.
+- For testing, test data has to be prepared in the database.
+
+When inserting data, we should always try to use batch processing where multiple rows are inserted at once. Building large `INSERT` queries in memory is discouraged because of the increased memory usage. Additionally, values specified within such queries cannot be redacted automatically by the client.
+
+To compress data and reduce memory usage, insert CSV data. You can do this with the internal [`CsvBuilder`](https://gitlab.com/gitlab-org/gitlab/-/tree/master/gems/csv_builder) gem:
+
+```ruby
+iterator = Event.find_each
+
+# insert from events table using only the id and the target_type columns
+column_mapping = {
+ id: :id,
+ target_type: :target_type
+}
+
+CsvBuilder::Gzip.new(iterator, column_mapping).render do |tempfile|
+ query = 'INSERT INTO events (id, target_type) FORMAT CSV'
+ ClickHouse::Client.insert_csv(query, File.open(tempfile.path), :main)
+end
+```
+
+NOTE:
+It's important to test and verify efficient batching of database records from PostgreSQL. Consider using the techniques described in [Iterating tables in batches](../iterating_tables_in_batches.md).
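+
+A minimal sketch, assuming the `events` PostgreSQL table is batched over its primary key with `each_batch` (the batch size is an arbitrary choice):
+
+```ruby
+# Enumerate events batch by batch so the whole table is never loaded into memory.
+iterator = Enumerator.new do |yielder|
+  Event.each_batch(of: 1000) do |relation|
+    relation.select(:id, :target_type).each { |event| yielder << event }
+  end
+end
+
+column_mapping = {
+  id: :id,
+  target_type: :target_type
+}
+
+CsvBuilder::Gzip.new(iterator, column_mapping).render do |tempfile|
+  query = 'INSERT INTO events (id, target_type) FORMAT CSV'
+  ClickHouse::Client.insert_csv(query, File.open(tempfile.path), :main)
+end
+```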
+
+## Testing
+
+ClickHouse is enabled in CI/CD, but to avoid significantly affecting the pipeline runtime, we've decided to run the ClickHouse server only for test cases tagged with `:click_house`.
+
+The `:click_house` tag ensures that the database schema is properly set up before every test case.
+
+```ruby
+RSpec.describe MyClickHouseFeature, :click_house do
+ it 'returns rows' do
+ rows = ClickHouse::Client.select('SELECT 1', :main)
+ expect(rows.size).to eq(1)
+ end
+end
+```
+
+## Multiple databases
+
+By design, the `ClickHouse::Client` library supports configuring multiple databases. Because we're still at a very early stage of development, we only have one database called `main`.
+
+Multi-database configuration example:
+
+```yaml
+development:
+ main:
+ database: gitlab_clickhouse_main_development
+ url: 'http://localhost:8123'
+ username: clickhouse
+ password: clickhouse
+
+ user_analytics: # made up database
+ database: gitlab_clickhouse_user_analytics_development
+ url: 'http://localhost:8123'
+ username: clickhouse
+ password: clickhouse
+```
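+
+With such a configuration, a query is routed to a database by passing the configured name to the client. A small sketch using the made-up `user_analytics` database from the example above:
+
+```ruby
+# The second argument selects the configured database.
+ClickHouse::Client.select('SELECT 1', :main)
+ClickHouse::Client.select('SELECT 1', :user_analytics) # hypothetical second database
+```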
+
+## Observability
+
+All queries executed via the `ClickHouse::Client` library are exposed, together with performance metrics (timings, read bytes), via `ActiveSupport::Notifications`.
+
+```ruby
+ActiveSupport::Notifications.subscribe('sql.click_house') do |_, _, _, _, data|
+ puts data.inspect
+end
+```
+
+Additionally, to view the ClickHouse queries executed during web interactions, open the performance bar and select the count next to the `ch` label.
diff --git a/doc/development/database/foreign_keys.md b/doc/development/database/foreign_keys.md
index 5dda3dd55a3..84ab32d0c0b 100644
--- a/doc/development/database/foreign_keys.md
+++ b/doc/development/database/foreign_keys.md
@@ -195,5 +195,5 @@ end
```
Using a foreign key as primary key saves space but can make
-[batch counting](../internal_analytics/service_ping/implement.md#batch-counters) in [Service Ping](../service_ping/index.md) less efficient.
+[batch counting](../internal_analytics/service_ping/implement.md#batch-counters) in [Service Ping](../internal_analytics/service_ping/index.md) less efficient.
Consider using a regular `id` column if the table is relevant for Service Ping.
diff --git a/doc/development/database/index.md b/doc/development/database/index.md
index 1ee6aeaa213..70681994229 100644
--- a/doc/development/database/index.md
+++ b/doc/development/database/index.md
@@ -109,6 +109,7 @@ including the major methods:
## ClickHouse
- [Introduction](clickhouse/index.md)
+- [ClickHouse within GitLab](clickhouse/clickhouse_within_gitlab.md)
- [Optimizing query execution](clickhouse/optimization.md)
- [Rebuild GitLab features using ClickHouse 1: Activity data](clickhouse/gitlab_activity_data.md)
- [Rebuild GitLab features using ClickHouse 2: Merge Request analytics](clickhouse/merge_request_analytics.md)
@@ -118,3 +119,4 @@ including the major methods:
- [Maintenance operations](maintenance_operations.md)
- [Update multiple database objects](setting_multiple_values.md)
+- [Batch iteration in a tree hierarchy proof of concept](poc_tree_iterator.md)
diff --git a/doc/development/database/multiple_databases.md b/doc/development/database/multiple_databases.md
index 7037ab22983..4387e19b6df 100644
--- a/doc/development/database/multiple_databases.md
+++ b/doc/development/database/multiple_databases.md
@@ -11,6 +11,9 @@ To allow GitLab to scale further we
The two databases are `main` and `ci`. GitLab supports being run with either one database or two databases.
On GitLab.com we are using two separate databases.
+For the purpose of building the [Cells](../../architecture/blueprints/cells/index.md) architecture, we are decomposing
+the databases further to introduce another database, `gitlab_main_clusterwide`.
+
## GitLab Schema
For properly discovering allowed patterns between different databases
@@ -23,17 +26,22 @@ that we cannot use PostgreSQL schema due to complex migration procedures. Instea
the concept of application-level classification.
Each table of GitLab needs to have a `gitlab_schema` assigned:
-- `gitlab_main`: describes all tables that are being stored in the `main:` database (for example, like `projects`, `users`).
-- `gitlab_ci`: describes all CI tables that are being stored in the `ci:` database (for example, `ci_pipelines`, `ci_builds`).
-- `gitlab_geo`: describes all Geo tables that are being stored in the `geo:` database (for example, like `project_registry`, `secondary_usage_data`).
-- `gitlab_shared`: describes all application tables that contain data across all decomposed databases (for example, `loose_foreign_keys_deleted_records`) for models that inherit from `Gitlab::Database::SharedModel`.
-- `gitlab_internal`: describes all internal tables of Rails and PostgreSQL (for example, `ar_internal_metadata`, `schema_migrations`, `pg_*`).
-- `gitlab_pm`: describes all tables that store `package_metadata` (it is an alias for `gitlab_main`).
-- `...`: more schemas to be introduced with additional decomposed databases
+| Schema | Description | Notes |
+| ------ | ----------- | ----- |
+| `gitlab_main` | All tables that are being stored in the `main:` database. | Currently, this is being replaced with `gitlab_main_cell`, for the purpose of building the [Cells](../../architecture/blueprints/cells/index.md) architecture. The `gitlab_main_cell` schema describes all tables that are local to a cell in a GitLab installation, for example `projects` and `groups`. |
+| `gitlab_main_clusterwide` | All tables that are being stored cluster-wide in a GitLab installation, in the [Cells](../../architecture/blueprints/cells/index.md) architecture, for example `users` and `application_settings`. | |
+| `gitlab_ci` | All CI tables that are being stored in the `ci:` database (for example, `ci_pipelines`, `ci_builds`). | |
+| `gitlab_geo` | All Geo tables that are being stored in the `geo:` database (for example, `project_registry`, `secondary_usage_data`). | |
+| `gitlab_shared` | All application tables that contain data across all decomposed databases (for example, `loose_foreign_keys_deleted_records`) for models that inherit from `Gitlab::Database::SharedModel`. | |
+| `gitlab_pm` | All tables that store `package_metadata`. | It is an alias for `gitlab_main`. |
+
+More schemas are expected to be introduced with additional decomposed databases.
The usage of schema enforces the base class to be used:
-- `ApplicationRecord` for `gitlab_main`
+- `ApplicationRecord` for `gitlab_main`/`gitlab_main_cell`.
+- `MainClusterwide::ApplicationRecord` for `gitlab_main_clusterwide`.
- `Ci::ApplicationRecord` for `gitlab_ci`
- `Geo::TrackingBase` for `gitlab_geo`
- `Gitlab::Database::SharedModel` for `gitlab_shared`
@@ -465,6 +473,20 @@ You can see a real example of using this method for fixing a cross-join in
#### Allowlist for existing cross-joins
+The easiest way of identifying a cross-join is via failing pipelines.
+
+As an example, in [!130038](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/130038/diffs) we moved the `notification_settings` table to the `gitlab_main_cell` schema, by marking it as such in the `db/docs/notification_settings.yml` file.
+
+The pipeline failed with the following [error](https://gitlab.com/gitlab-org/gitlab/-/jobs/4929130983):
+
+```ruby
+Database::PreventCrossJoins::CrossJoinAcrossUnsupportedTablesError:
+
+Unsupported cross-join across 'users, notification_settings' querying 'gitlab_main_clusterwide, gitlab_main_cell' discovered when executing query 'SELECT "users".* FROM "users" WHERE "users"."id" IN (SELECT "notification_settings"."user_id" FROM ((SELECT "notification_settings"."user_id" FROM "notification_settings" WHERE "notification_settings"."source_id" = 119 AND "notification_settings"."source_type" = 'Project' AND (("notification_settings"."level" = 3 AND EXISTS (SELECT true FROM "notification_settings" "notification_settings_2" WHERE "notification_settings_2"."user_id" = "notification_settings"."user_id" AND "notification_settings_2"."source_id" IS NULL AND "notification_settings_2"."source_type" IS NULL AND "notification_settings_2"."level" = 2)) OR "notification_settings"."level" = 2))) notification_settings)'
+```
+
+To make the pipeline green, this cross-join query must be allow-listed.
+
A cross-join across databases can be explicitly allowed by wrapping the code in the
`::Gitlab::Database.allow_cross_joins_across_databases` helper method. An alternative
way is to mark a given relation with `relation.allow_cross_joins_across_databases`.
@@ -494,6 +516,30 @@ def find_actual_head_pipeline
end
```
+In model associations or scopes, this can be used as in the following example:
+
+```ruby
+class Group < Namespace
+ has_many :users, -> {
+ allow_cross_joins_across_databases(url: "https://gitlab.com/gitlab-org/gitlab/-/issues/422405")
+ }, through: :group_members
+end
+```
+
+WARNING:
+Overriding an association can have unintended consequences and may even lead to data loss, as we noticed in [issue 424307](https://gitlab.com/gitlab-org/gitlab/-/issues/424307). Do not override existing ActiveRecord associations to mark a cross-join as allowed, as shown in the following counterexample.
+
+```ruby
+class Group < Namespace
+ has_many :users, through: :group_members
+
+ # DO NOT override an association like this.
+ def users
+ super.allow_cross_joins_across_databases(url: "https://gitlab.com/gitlab-org/gitlab/-/issues/422405")
+ end
+end
+```
+
The `url` parameter should point to an issue with a milestone for when we intend
to fix the cross-join. If the cross-join is being used in a migration, we do not
need to fix the code. See <https://gitlab.com/gitlab-org/gitlab/-/issues/340017>
@@ -530,7 +576,42 @@ more information, look at the
[transaction guidelines](transaction_guidelines.md#dangerous-example-third-party-api-calls)
page.
-#### Fixing cross-database errors
+#### Fixing cross-database transactions
+
+A transaction across databases can be explicitly allowed by wrapping the code in the
+`Gitlab::Database::QueryAnalyzers::PreventCrossDatabaseModification.temporary_ignore_tables_in_transaction` helper method.
+
+For cross-database transactions in Rails callbacks, the `cross_database_ignore_tables` method can be used.
+
+These methods should only be used for existing code.
+
+The `temporary_ignore_tables_in_transaction` helper method can be used as follows:
+
+```ruby
+class GroupMember < Member
+ def update_two_factor_requirement
+ return unless user
+
+ # To mark and ignore cross-database transactions involving members and users/user_details/user_preferences
+ Gitlab::Database::QueryAnalyzers::PreventCrossDatabaseModification.temporary_ignore_tables_in_transaction(
+ %w[users user_details user_preferences], url: 'https://gitlab.com/gitlab-org/gitlab/-/issues/424288'
+ ) do
+ user.update_two_factor_requirement
+ end
+ end
+end
+```
+
+The `cross_database_ignore_tables` method can be used as follows:
+
+```ruby
+class Namespace < ApplicationRecord
+ include CrossDatabaseIgnoredTables
+
+ # To mark and ignore cross-database transactions involving namespaces and routes/redirect_routes happening within Rails callbacks.
+ cross_database_ignore_tables %w[routes redirect_routes], url: 'https://gitlab.com/gitlab-org/gitlab/-/issues/424277'
+end
+```
##### Removing the transaction block
@@ -616,6 +697,23 @@ or records that point to nowhere, which might lead to bugs. As such we created
["loose foreign keys"](loose_foreign_keys.md) which is an asynchronous
process of cleaning up orphaned records.
+### Allowlist for existing cross-database foreign keys
+
+The easiest way of identifying a cross-database foreign key is via failing pipelines.
+
+As an example, in [!130038](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/130038/diffs) we moved the `notification_settings` table to the `gitlab_main_cell` schema, by marking it in the `db/docs/notification_settings.yml` file.
+
+`notification_settings.user_id` is a column that points to `users`, but the `users` table belongs to a different database, so this is now treated as a cross-database foreign key.
+
+We have a spec to capture such cases of cross-database foreign keys in [`no_cross_db_foreign_keys_spec.rb`](https://gitlab.com/gitlab-org/gitlab/-/blob/01d3a1e41513200368a22bbab5d4312174762ee0/spec/lib/gitlab/database/no_cross_db_foreign_keys_spec.rb), which would fail if such a cross-database foreign key is encountered.
+
+To make the pipeline green, this cross-database foreign key must be allow-listed.
+
+To do this, explicitly allow the existing cross-database foreign key to exist by adding it as an exception in the same spec (as in [this example](https://gitlab.com/gitlab-org/gitlab/-/blob/7d99387f399c548af24d93d564b35f2f9510662d/spec/lib/gitlab/database/no_cross_db_foreign_keys_spec.rb#L26)).
+This way, the spec will not fail.
+
+Later, this foreign key can be converted to a loose foreign key, like we did in [!130080](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/130080/diffs).
+
## Testing for multiple databases
In our testing CI pipelines, we test GitLab by default with multiple databases set up, using
diff --git a/doc/development/database/not_null_constraints.md b/doc/development/database/not_null_constraints.md
index e1b6868c68e..05b1081fc4d 100644
--- a/doc/development/database/not_null_constraints.md
+++ b/doc/development/database/not_null_constraints.md
@@ -72,8 +72,6 @@ The steps required are:
Depending on the size of the table, a background migration for cleanup could be required in the next release.
See the [`NOT NULL` constraints on large tables](not_null_constraints.md#not-null-constraints-on-large-tables) section for more information.
- - Create an issue for the next milestone to validate the `NOT NULL` constraint.
-
1. Release `N.M+1` (next release)
1. Make sure all existing records on GitLab.com have the attribute set. If not, go back to step 1 from Release `N.M`.
diff --git a/doc/development/database/poc_tree_iterator.md b/doc/development/database/poc_tree_iterator.md
new file mode 100644
index 00000000000..453f77f0cde
--- /dev/null
+++ b/doc/development/database/poc_tree_iterator.md
@@ -0,0 +1,475 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
+---
+
+# Batch iteration in a tree hierarchy (proof of concept)
+
+The group hierarchy in GitLab is represented with a tree, where the root element
+is the top-level namespace, and the child elements are the subgroups or the
+recently introduced `Namespaces::ProjectNamespace` records.
+
+The tree is implemented in the `namespaces` table, via the `parent_id` column.
+The column points to the parent namespace record. The top-level namespace has no
+`parent_id`.
+
+Partial hierarchy of `gitlab-org`:
+
+```mermaid
+flowchart TD
+ A("gitlab-org (9979)") --- B("quality (2750817)")
+ B --- C("engineering-productivity (16947798)")
+ B --- D("performance-testing (9453799)")
+ A --- F("charts (5032027)")
+ A --- E("ruby (14018648)")
+```
+
+Efficiently iterating over the group hierarchy has several potential use cases.
+This is especially true in background jobs, which need to perform queries on the
+group hierarchy and where stable and safe execution is more important than fast
+runtime. Batch iteration requires more network round-trips, but each batch provides
+similar performance characteristics.
+
+A few examples:
+
+- For each subgroup, do something.
+- For each project in the hierarchy, do something.
+- For each issue in the hierarchy, do something.
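+
+For example, processing each project in the hierarchy could be built on top of the batched namespace iterator implemented later in this document (the `NamespaceEachBatch` class). A hypothetical sketch:
+
+```ruby
+# Iterate over the hierarchy of the top-level namespace (id: 9970) in batches
+# of namespace IDs, then act on the projects in each batch.
+iterator = NamespaceEachBatch.new(namespace_id: 9970)
+
+iterator.each_batch do |namespace_ids|
+  Project.where(project_namespace_id: namespace_ids).find_each do |project|
+    # do something with the project
+  end
+end
+```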
+
+## Problem statement
+
+A group hierarchy could grow so big that a single query would not be able to load
+it in time. The query would fail with a statement timeout error.
+
+Addressing scalability issues related to very large groups requires us to store
+the same data in different formats (de-normalization). However, if we're unable
+to load the group hierarchy, then de-normalization cannot be implemented.
+
+One de-normalization technique would be to store all descendant group IDs for a
+given group. This would speed up queries where we need to load the group and its
+subgroups. Example:
+
+```mermaid
+flowchart TD
+ A(1) --- B(2)
+ A --- C(3)
+ C --- D(4)
+```
+
+| GROUP_ID | DESCENDANT_GROUP_IDS |
+|----------|------------------------|
+| 1 | `[2,3,4]` |
+| 2 | `[]` |
+| 3 | `[4]` |
+| 4 | `[]` |
+
+With this structure, determining all the subgroups would require us to read only
+one row from the database, instead of 4 rows. For a hierarchy as big as 1000 groups,
+this could make a huge difference.
+
+This de-normalization solves the problem of reading the hierarchy. However,
+we still need to find a way to persist this data in a table. Because a group and
+its hierarchy could grow very large, we cannot expect a single query to work here.
+
+```sql
+SELECT id FROM namespaces WHERE traversal_ids && ARRAY[9970]
+```
+
+The query above could time out for large groups, so we need to process the data in batches.
+
+Implementing batching logic in a tree is not something we've looked at before,
+and it's fairly complex. An `EachBatch` or `find_in_batches` based
+solution would not work because:
+
+- The data (group IDs) are not sorted in the hierarchy.
+- Groups in subgroups don't know about the top-level group ID.
+
+## Algorithm
+
+The batching query is implemented as a recursive CTE SQL query, where one batch
+would read a maximum of N rows. Due to the tree structure, reading N rows might
+not necessarily mean that we're reading N group IDs. If the tree is structured in
+a non-optimal way, a batch could return fewer (but never more) group IDs.
+
+The query implements [depth-first](https://en.wikipedia.org/wiki/Depth-first_search)
+tree-walking logic, where the database scans the first branch of the tree down to the
+leaf element. We're implementing a depth-first algorithm because, when a batch is finished,
+the query must return enough information for the next batch (cursor). In GitLab,
+we limit the depth of the tree to 20, which means that in the worst case, the
+query would return a cursor containing 19 elements.
+
+Implementing a [breadth-first](https://en.wikipedia.org/wiki/Breadth-first_search)
+tree-walking algorithm would be impractical, because a group can have an unbounded
+number of descendants, so we might end up with a huge cursor.
+
+1. Create an initializer row that contains:
+ 1. The currently processed group ID (top-level group ID)
+ 1. Two arrays (tree depth and the collected IDs)
+ 1. A counter for tracking the number of row reads in the query.
+1. Recursively process the row and do one of the following (whenever the condition matches):
+ - Load the first child namespace and update the currently processed namespace
+ ID if we're not at the leaf node. (Walking down a branch)
+ - Load the next namespace record on the current depth if there are any rows left.
+ - Walk up one node and process rows at one level higher.
+1. Continue the processing until the number of reads reaches our `LIMIT` (batch size).
+1. Find the last processed row which contains the data for the cursor, and all the collected record IDs.
+
+```sql
+WITH RECURSIVE result AS (
+ (
+ SELECT
+ 9970 AS current_id, /* current namespace id we're processing */
+ ARRAY[9970]::int[] AS depth, /* cursor */
+ ARRAY[9970]::int[] AS ids, /* collected ids */
+ 1::bigint AS reads,
+ 'initialize' AS action
+ ) UNION ALL
+ (
+ WITH cte AS ( /* trick for referencing the result cte multiple times */
+ select * FROM result
+ )
+ SELECT * FROM (
+ (
+ SELECT /* walk down the branch */
+ namespaces.id,
+ cte.depth || namespaces.id,
+ cte.ids || namespaces.id,
+ cte.reads + 1,
+ 'walkdown'
+ FROM namespaces, cte
+ WHERE
+ namespaces.parent_id = cte.current_id
+ ORDER BY namespaces.id ASC
+ LIMIT 1
+ ) UNION ALL
+ (
+ SELECT /* find next element on the same level */
+ namespaces.id,
+ cte.depth[:array_length(cte.depth, 1) - 1] || namespaces.id,
+ cte.ids || namespaces.id,
+ cte.reads + 1,
+ 'next'
+ FROM namespaces, cte
+ WHERE
+ namespaces.parent_id = cte.depth[array_length(cte.depth, 1) - 1] AND
+ namespaces.id > cte.depth[array_length(cte.depth, 1)]
+ ORDER BY namespaces.id ASC
+ LIMIT 1
+ ) UNION ALL
+ (
+ SELECT /* jump up one node when finished with the current level */
+ cte.current_id,
+ cte.depth[:array_length(cte.depth, 1) - 1],
+ cte.ids,
+ cte.reads + 1,
+ 'jump'
+ FROM cte
+ WHERE cte.depth <> ARRAY[]::int[]
+ LIMIT 1
+ )
+ ) next_row LIMIT 1
+ )
+)
+SELECT current_id, depth, ids, action
+FROM result
+```
+
+```plaintext
+ current_id | depth | ids | action
+------------+--------------+------------------------+------------
+ 24 | {24} | {24} | initialize
+ 25 | {24,25} | {24,25} | walkdown
+ 26 | {24,26} | {24,25,26} | next
+ 112 | {24,112} | {24,25,26,112} | next
+ 113 | {24,113} | {24,25,26,112,113} | next
+ 114 | {24,113,114} | {24,25,26,112,113,114} | walkdown
+ 114 | {24,113} | {24,25,26,112,113,114} | jump
+ 114 | {24} | {24,25,26,112,113,114} | jump
+ 114 | {} | {24,25,26,112,113,114} | jump
+```
+
+NOTE:
+Using this query to find all the namespace IDs in a group hierarchy is likely slower
+than other querying methods, such as the current `self_and_descendants` implementation
+based on the `traversal_ids` column. The query above should only be used when
+implementing batch iteration over the group hierarchy.
+
+Rudimentary batching implementation in Ruby:
+
+```ruby
+class NamespaceEachBatch
+ def initialize(namespace_id:, cursor: nil)
+ @namespace_id = namespace_id
+ @cursor = cursor || { current_id: namespace_id, depth: [namespace_id] }
+ end
+
+ def each_batch(of: 500)
+ current_cursor = cursor.dup
+
+ first_iteration = true
+ loop do
+ new_cursor, ids = load_batch(cursor: current_cursor, of: of, first_iteration: first_iteration)
+ first_iteration = false
+ current_cursor = new_cursor
+
+ yield ids
+
+ break if new_cursor[:depth].empty?
+ end
+ end
+
+ private
+
+ # yields array of namespace ids
+ def load_batch(cursor:, of:, first_iteration: false)
+ recursive_cte = Gitlab::SQL::RecursiveCTE.new(:result,
+ union_args: { remove_order: false, remove_duplicates: false })
+
+ ids = first_iteration ? namespace_id.to_s : ""
+
+ recursive_cte << Namespace.select(
+ Arel.sql(Integer(cursor.fetch(:current_id)).to_s).as('current_id'),
+ Arel.sql("ARRAY[#{cursor.fetch(:depth).join(',')}]::int[]").as('depth'),
+ Arel.sql("ARRAY[#{ids}]::int[]").as('ids'),
+ Arel.sql("1::bigint AS count")
+ ).from('(VALUES (1)) AS does_not_matter').limit(1)
+
+ cte = Gitlab::SQL::CTE.new(:cte, Namespace.select('*').from('result'))
+
+ union_query = Namespace.with(cte.to_arel).from_union(
+ walk_down,
+ next_elements,
+ up_one_level,
+ remove_duplicates: false,
+ remove_order: false
+ ).select('current_id', 'depth', 'ids', 'count').limit(1)
+
+ recursive_cte << union_query
+
+ scope = Namespace.with
+ .recursive(recursive_cte.to_arel)
+ .from(recursive_cte.alias_to(Namespace.arel_table))
+ .limit(of)
+ row = Namespace.from(scope.arel.as('namespaces')).order(count: :desc).limit(1).first
+
+ [
+ { current_id: row[:current_id], depth: row[:depth] },
+ row[:ids]
+ ]
+ end
+
+ attr_reader :namespace_id, :cursor
+
+ def walk_down
+ Namespace.select(
+ Arel.sql('namespaces.id').as('current_id'),
+ Arel.sql('cte.depth || namespaces.id').as('depth'),
+ Arel.sql('cte.ids || namespaces.id').as('ids'),
+ Arel.sql('cte.count + 1').as('count')
+ ).from('cte, LATERAL (SELECT id FROM namespaces WHERE parent_id = cte.current_id ORDER BY id LIMIT 1) namespaces')
+ end
+
+ def next_elements
+ Namespace.select(
+ Arel.sql('namespaces.id').as('current_id'),
+ Arel.sql('cte.depth[:array_length(cte.depth, 1) - 1] || namespaces.id').as('depth'),
+ Arel.sql('cte.ids || namespaces.id').as('ids'),
+ Arel.sql('cte.count + 1').as('count')
+ ).from('cte, LATERAL (SELECT id FROM namespaces WHERE namespaces.parent_id = cte.depth[array_length(cte.depth, 1) - 1] AND namespaces.id > cte.depth[array_length(cte.depth, 1)] ORDER BY id LIMIT 1) namespaces')
+ end
+
+ def up_one_level
+ Namespace.select(
+ Arel.sql('cte.current_id').as('current_id'),
+ Arel.sql('cte.depth[:array_length(cte.depth, 1) - 1]').as('depth'),
+ Arel.sql('cte.ids').as('ids'),
+ Arel.sql('cte.count + 1').as('count')
+ ).from('cte')
+ .where('cte.depth <> ARRAY[]::int[]')
+ .limit(1)
+ end
+end
+
+iterator = NamespaceEachBatch.new(namespace_id: 9970)
+all_ids = []
+iterator.each_batch do |ids|
+ all_ids.concat(ids)
+end
+
+# Test
+puts all_ids.count
+puts all_ids.sort == Namespace.where('traversal_ids && ARRAY[9970]').pluck(:id).sort
+```
+
+Example batch query:
+
+```sql
+SELECT
+ "namespaces".*
+FROM ( WITH RECURSIVE "result" AS ((
+ SELECT
+ 15847356 AS current_id,
+ ARRAY[9970,
+ 12061481,
+ 12128714,
+ 12445111,
+ 15847356]::int[] AS depth,
+ ARRAY[]::int[] AS ids,
+ 1::bigint AS count
+ FROM (
+ VALUES (1)) AS does_not_matter
+ LIMIT 1)
+ UNION ALL ( WITH "cte" AS MATERIALIZED (
+ SELECT
+ *
+ FROM
+ result
+)
+ SELECT
+ current_id,
+ depth,
+ ids,
+ count
+ FROM ((
+ SELECT
+ namespaces.id AS current_id,
+ cte.depth || namespaces.id AS depth,
+ cte.ids || namespaces.id AS ids,
+ cte.count + 1 AS count
+ FROM
+ cte,
+ LATERAL (
+ SELECT
+ id
+ FROM
+ namespaces
+ WHERE
+ parent_id = cte.current_id
+ ORDER BY
+ id
+ LIMIT 1
+) namespaces
+)
+ UNION ALL (
+ SELECT
+ namespaces.id AS current_id,
+ cte.depth[:array_length(
+ cte.depth, 1
+) - 1] || namespaces.id AS depth,
+ cte.ids || namespaces.id AS ids,
+ cte.count + 1 AS count
+ FROM
+ cte,
+ LATERAL (
+ SELECT
+ id
+ FROM
+ namespaces
+ WHERE
+ namespaces.parent_id = cte.depth[array_length(
+ cte.depth, 1
+) - 1]
+ AND namespaces.id > cte.depth[array_length(
+ cte.depth, 1
+)]
+ ORDER BY
+ id
+ LIMIT 1
+) namespaces
+)
+ UNION ALL (
+ SELECT
+ cte.current_id AS current_id,
+ cte.depth[:array_length(
+ cte.depth, 1
+) - 1] AS depth,
+ cte.ids AS ids,
+ cte.count + 1 AS count
+ FROM
+ cte
+ WHERE (
+ cte.depth <> ARRAY[]::int[]
+)
+ LIMIT 1
+)
+) namespaces
+ LIMIT 1
+))
+SELECT
+ "namespaces".*
+FROM
+ "result" AS "namespaces"
+LIMIT 500) namespaces
+ORDER BY
+ "count" DESC
+LIMIT 1
+```
+
+Execution plan:
+
+```plaintext
+ Limit (cost=16.36..16.36 rows=1 width=76) (actual time=436.963..436.970 rows=1 loops=1)
+ Buffers: shared hit=3721 read=423 dirtied=8
+ I/O Timings: read=412.590 write=0.000
+ -> Sort (cost=16.36..16.39 rows=11 width=76) (actual time=436.961..436.968 rows=1 loops=1)
+ Sort Key: namespaces.count DESC
+ Sort Method: top-N heapsort Memory: 27kB
+ Buffers: shared hit=3721 read=423 dirtied=8
+ I/O Timings: read=412.590 write=0.000
+ -> Limit (cost=15.98..16.20 rows=11 width=76) (actual time=0.005..436.394 rows=500 loops=1)
+ Buffers: shared hit=3718 read=423 dirtied=8
+ I/O Timings: read=412.590 write=0.000
+ CTE result
+ -> Recursive Union (cost=0.00..15.98 rows=11 width=76) (actual time=0.003..432.924 rows=500 loops=1)
+ Buffers: shared hit=3718 read=423 dirtied=8
+ I/O Timings: read=412.590 write=0.000
+ -> Limit (cost=0.00..0.01 rows=1 width=76) (actual time=0.002..0.003 rows=1 loops=1)
+ I/O Timings: read=0.000 write=0.000
+ -> Result (cost=0.00..0.01 rows=1 width=76) (actual time=0.001..0.002 rows=1 loops=1)
+ I/O Timings: read=0.000 write=0.000
+ -> Limit (cost=0.76..1.57 rows=1 width=76) (actual time=0.862..0.862 rows=1 loops=499)
+ Buffers: shared hit=3718 read=423 dirtied=8
+ I/O Timings: read=412.590 write=0.000
+ CTE cte
+ -> WorkTable Scan on result (cost=0.00..0.20 rows=10 width=76) (actual time=0.000..0.000 rows=1 loops=499)
+ I/O Timings: read=0.000 write=0.000
+ -> Append (cost=0.56..17.57 rows=21 width=76) (actual time=0.862..0.862 rows=1 loops=499)
+ Buffers: shared hit=3718 read=423 dirtied=8
+ I/O Timings: read=412.590 write=0.000
+ -> Nested Loop (cost=0.56..7.77 rows=10 width=76) (actual time=0.675..0.675 rows=0 loops=499)
+ Buffers: shared hit=1693 read=357 dirtied=1
+ I/O Timings: read=327.812 write=0.000
+ -> CTE Scan on cte (cost=0.00..0.20 rows=10 width=76) (actual time=0.001..0.001 rows=1 loops=499)
+ I/O Timings: read=0.000 write=0.000
+ -> Limit (cost=0.56..0.73 rows=1 width=4) (actual time=0.672..0.672 rows=0 loops=499)
+ Buffers: shared hit=1693 read=357 dirtied=1
+ I/O Timings: read=327.812 write=0.000
+ -> Index Only Scan using index_namespaces_on_parent_id_and_id on public.namespaces namespaces_1 (cost=0.56..5.33 rows=29 width=4) (actual time=0.671..0.671 rows=0 loops=499)
+ Index Cond: (namespaces_1.parent_id = cte.current_id)
+ Heap Fetches: 7
+ Buffers: shared hit=1693 read=357 dirtied=1
+ I/O Timings: read=327.812 write=0.000
+ -> Nested Loop (cost=0.57..9.45 rows=10 width=76) (actual time=0.208..0.208 rows=1 loops=442)
+ Buffers: shared hit=2025 read=66 dirtied=7
+ I/O Timings: read=84.778 write=0.000
+ -> CTE Scan on cte cte_1 (cost=0.00..0.20 rows=10 width=72) (actual time=0.000..0.000 rows=1 loops=442)
+ I/O Timings: read=0.000 write=0.000
+ -> Limit (cost=0.57..0.89 rows=1 width=4) (actual time=0.203..0.203 rows=1 loops=442)
+ Buffers: shared hit=2025 read=66 dirtied=7
+ I/O Timings: read=84.778 write=0.000
+ -> Index Only Scan using index_namespaces_on_parent_id_and_id on public.namespaces namespaces_2 (cost=0.57..3.77 rows=10 width=4) (actual time=0.201..0.201 rows=1 loops=442)
+ Index Cond: ((namespaces_2.parent_id = (cte_1.depth)[(array_length(cte_1.depth, 1) - 1)]) AND (namespaces_2.id > (cte_1.depth)[array_length(cte_1.depth, 1)]))
+ Heap Fetches: 35
+ Buffers: shared hit=2025 read=66 dirtied=6
+ I/O Timings: read=84.778 write=0.000
+ -> Limit (cost=0.00..0.03 rows=1 width=76) (actual time=0.003..0.003 rows=1 loops=59)
+ I/O Timings: read=0.000 write=0.000
+ -> CTE Scan on cte cte_2 (cost=0.00..0.29 rows=9 width=76) (actual time=0.002..0.002 rows=1 loops=59)
+ Filter: (cte_2.depth <> '{}'::integer[])
+ Rows Removed by Filter: 0
+ I/O Timings: read=0.000 write=0.000
+ -> CTE Scan on result namespaces (cost=0.00..0.22 rows=11 width=76) (actual time=0.005..436.240 rows=500 loops=1)
+ Buffers: shared hit=3718 read=423 dirtied=8
+ I/O Timings: read=412.590 write=0.000
+```