1 files changed, 171 insertions, 0 deletions
diff --git a/doc/administration/sidekiq/sidekiq_troubleshooting.md b/doc/administration/sidekiq/sidekiq_troubleshooting.md
index d2afe171e9c..b261e385949 100644
--- a/doc/administration/sidekiq/sidekiq_troubleshooting.md
+++ b/doc/administration/sidekiq/sidekiq_troubleshooting.md
@@ -56,6 +56,120 @@ gitlab_rails['env'] = {"SIDEKIQ_LOG_ARGUMENTS" => "0"}
 
 In GitLab 13.5 and earlier, set `SIDEKIQ_LOG_ARGUMENTS` to `1` to start logging arguments passed to Sidekiq.
 
+## Investigating Sidekiq queue backlogs or slow performance
+
+Symptoms of slow Sidekiq performance include problems with merge request status updates,
+and delays before CI pipelines start running.
+
+Potential causes include:
+
+- The GitLab instance may need more Sidekiq workers. By default, a single-node Omnibus GitLab
+  runs one worker, restricting the execution of Sidekiq jobs to a maximum of one CPU core.
+  [Read more about running multiple Sidekiq workers](extra_sidekiq_processes.md).
+
+- The instance is configured with more Sidekiq workers, but most of the extra workers are
+  not configured to run any job that is queued. This can result in a backlog of jobs
+  when the instance is busy, if the workload has changed in the months or years since
+  the workers were configured, or as a result of GitLab product changes.
+
+Gather data on the state of the Sidekiq workers with the following Ruby script.
+
+1. Create the script:
+
+   ```ruby
+   cat > /var/opt/gitlab/sidekiqcheck.rb <<EOF
+   require 'sidekiq/monitor'
+   Sidekiq::Monitor::Status.new.display('overview')
+   Sidekiq::Monitor::Status.new.display('processes'); nil
+   Sidekiq::Monitor::Status.new.display('queues'); nil
+   puts "----------- workers ----------- "
+   workers = Sidekiq::Workers.new
+   workers.each do |_process_id, _thread_id, work|
+     pp work
+   end
+   puts "----------- Queued Jobs ----------- "
+   Sidekiq::Queue.all.each do |queue|
+     queue.each do |job|
+       pp job
+     end
+   end ;nil
+   puts "----------- done! ----------- "
+   EOF
+   ```
+
+1. Execute and capture the output:
+
+   ```shell
+   sudo gitlab-rails runner /var/opt/gitlab/sidekiqcheck.rb > /tmp/sidekiqcheck_$(date '+%Y%m%d-%H:%M').out
+   ```
+
+   If the performance issue is intermittent:
+
+   - Run this in a cron job every five minutes. Write the files to a location with enough space: allow for 500KB per file.
+   - Refer back to the data to see what went wrong.
+
+1. Analyze the output. The following commands assume that you have a directory of output files.
+
+   1. `grep 'Busy: ' *` shows how many jobs were being run. `grep 'Enqueued: ' *`
+      shows the backlog of work at that time.
+
+   1. Look at the number of busy threads across the workers in samples where Sidekiq is under load:
+
+      ```shell
+      ls | while read f ; do if grep -q 'Enqueued: 0' $f; then :
+        else echo $f; egrep 'Busy:|Enqueued:|---- Processes' $f
+        grep 'Threads:' $f ; fi
+      done | more
+      ```
+
+      Example output:
+
+      ```plaintext
+      sidekiqcheck_20221024-14:00.out
+             Busy: 47
+         Enqueued: 363
+      ---- Processes (13) ----
+        Threads: 30 (0 busy)
+        Threads: 30 (0 busy)
+        Threads: 30 (0 busy)
+        Threads: 30 (0 busy)
+        Threads: 23 (0 busy)
+        Threads: 30 (0 busy)
+        Threads: 30 (0 busy)
+        Threads: 30 (0 busy)
+        Threads: 30 (0 busy)
+        Threads: 30 (0 busy)
+        Threads: 30 (0 busy)
+        Threads: 30 (24 busy)
+        Threads: 30 (23 busy)
+      ```
+
+      - In this output file, 47 threads were busy, and there was a backlog of 363 jobs.
+      - Of the 13 worker processes, only two were busy.
+      - This indicates that the other workers are configured too specifically.
+      - Look at the full output to work out which workers were busy.
+        Correlate with your `sidekiq_queues` configuration in `gitlab.rb`.
+      - An overloaded single-worker environment might look like this:
+
+        ```plaintext
+        sidekiqcheck_20221024-14:00.out
+               Busy: 25
+           Enqueued: 363
+        ---- Processes (1) ----
+          Threads: 25 (25 busy)
+        ```
+
+   1. Look at the `---- Queues (xxx) ----` section of the output file to
+      determine what jobs were queued up at the time.
+
+   1. The files also include low level details about the state of Sidekiq at the time.
+      This could be useful for identifying where spikes in workload are coming from.
+
+      - The `----------- workers -----------` section details the jobs that make up the
+        `Busy` count in the summary.
+      - The `----------- Queued Jobs -----------` section provides details on
+        jobs that are `Enqueued`.
+
 ## Thread dump
 
 Send the Sidekiq process ID the `TTIN` signal to output thread
@@ -379,3 +493,60 @@ has number of drawbacks, as mentioned in [Why Ruby's Timeout is dangerous (and T
 > - in any of your code, regardless of whether it could have possibly raised an exception before
 >
 > Nobody writes code to defend against an exception being raised on literally any line. That's not even possible. So Thread.raise is basically like a sneak attack on your code that could result in almost anything. It would probably be okay if it were pure-functional code that did not modify any state. But this is Ruby, so that's unlikely :)
+
+## Omnibus GitLab 14.0 and later: remove the `sidekiq-cluster` service
+
+Omnibus GitLab instances that were configured to run `sidekiq-cluster` prior to GitLab 14.0
+might still have this service running along side `sidekiq` in later releases.
+
+The code to manage `sidekiq-cluster` was removed in GitLab 14.0.
+The configuration files remain on disk so the `sidekiq-cluster` process continues
+to be started by the GitLab systemd service .
+
+The extra service can be identified as running by:
+
+- `gitlab-ctl status` showing both services:
+
+  ```plaintext
+  run: sidekiq: (pid 1386) 445s; run: log: (pid 1385) 445s
+  run: sidekiq-cluster: (pid 1388) 445s; run: log: (pid 1381) 445s
+  ```
+
+- `ps -ef | grep 'runsv sidekiq'` showing two processes:
+
+  ```plaintext
+  root     31047 31045  0 13:54 ?        00:00:00 runsv sidekiq-cluster
+  root     31054 31045  0 13:54 ?        00:00:00 runsv sidekiq
+  ```
+
+To remove the `sidekiq-cluster` service from servers running GitLab 14.0 and later:
+
+1. Stop GitLab and the systemd service:
+
+   ```shell
+   sudo gitlab-ctl stop
+   sudo systemctl stop gitlab-runsvdir.service
+   ```
+
+1. Remove the `runsv` service definition:
+
+   ```shell
+   sudo rm -rf /opt/gitlab/sv/sidekiq-cluster
+   ```
+
+1. Restart GitLab:
+
+   ```shell
+   sudo systemctl start gitlab-runsvdir.service
+   ```
+
+1. Check that all services are up, and the `sidekiq-cluster` service is not listed:
+
+   ```shell
+   sudo gitlab-ctl status
+   ```
+
+This change might reduce the amount of work Sidekiq can do. Symptoms like delays creating pipelines
+indicate that additional Sidekiq processes would be beneficial.
+Consider [adding additional Sidekiq processes](extra_sidekiq_processes.md)
+to compensate for removing the `sidekiq-cluster` service.