Diffstat (limited to 'doc/administration/sidekiq/sidekiq_troubleshooting.md')
-rw-r--r-- | doc/administration/sidekiq/sidekiq_troubleshooting.md | 171 |
1 files changed, 171 insertions, 0 deletions
diff --git a/doc/administration/sidekiq/sidekiq_troubleshooting.md b/doc/administration/sidekiq/sidekiq_troubleshooting.md
index d2afe171e9c..b261e385949 100644
--- a/doc/administration/sidekiq/sidekiq_troubleshooting.md
+++ b/doc/administration/sidekiq/sidekiq_troubleshooting.md
@@ -56,6 +56,120 @@ gitlab_rails['env'] = {"SIDEKIQ_LOG_ARGUMENTS" => "0"}
 In GitLab 13.5 and earlier, set `SIDEKIQ_LOG_ARGUMENTS` to `1` to start logging arguments passed to Sidekiq.
 
+## Investigating Sidekiq queue backlogs or slow performance
+
+Symptoms of slow Sidekiq performance include problems with merge request status updates,
+and delays before CI pipelines start running.
+
+Potential causes include:
+
+- The GitLab instance may need more Sidekiq workers. By default, a single-node Omnibus GitLab
+  runs one worker, restricting the execution of Sidekiq jobs to a maximum of one CPU core.
+  [Read more about running multiple Sidekiq workers](extra_sidekiq_processes.md).
+
+- The instance is configured with more Sidekiq workers, but most of the extra workers are
+  not configured to run any job that is queued. This can result in a backlog of jobs
+  when the instance is busy, if the workload has changed in the months or years since
+  the workers were configured, or as a result of GitLab product changes.
+
+Gather data on the state of the Sidekiq workers with the following Ruby script.
+
+1. Create the script:
+
+   ```ruby
+   cat > /var/opt/gitlab/sidekiqcheck.rb <<EOF
+   require 'sidekiq/monitor'
+   Sidekiq::Monitor::Status.new.display('overview')
+   Sidekiq::Monitor::Status.new.display('processes'); nil
+   Sidekiq::Monitor::Status.new.display('queues'); nil
+   puts "----------- workers ----------- "
+   workers = Sidekiq::Workers.new
+   workers.each do |_process_id, _thread_id, work|
+     pp work
+   end
+   puts "----------- Queued Jobs ----------- "
+   Sidekiq::Queue.all.each do |queue|
+     queue.each do |job|
+       pp job
+     end
+   end ;nil
+   puts "----------- done! ----------- "
+   EOF
+   ```
+
+1. Execute and capture the output:
+
+   ```shell
+   sudo gitlab-rails runner /var/opt/gitlab/sidekiqcheck.rb > /tmp/sidekiqcheck_$(date '+%Y%m%d-%H:%M').out
+   ```
+
+   If the performance issue is intermittent:
+
+   - Run this in a cron job every five minutes. Write the files to a location with enough space: allow for 500 KB per file.
+   - Refer back to the data to see what went wrong.
+
+1. Analyze the output. The following commands assume that you have a directory of output files.
+
+   1. `grep 'Busy: ' *` shows how many jobs were being run. `grep 'Enqueued: ' *`
+      shows the backlog of work at that time.
+
+   1. Look at the number of busy threads across the workers in samples where Sidekiq is under load:
+
+      ```shell
+      ls | while read f ; do if grep -q 'Enqueued: 0' $f; then :
+        else echo $f; egrep 'Busy:|Enqueued:|---- Processes' $f
+        grep 'Threads:' $f ; fi
+      done | more
+      ```
+
+      Example output:
+
+      ```plaintext
+      sidekiqcheck_20221024-14:00.out
+      Busy: 47
+      Enqueued: 363
+      ---- Processes (13) ----
+      Threads: 30 (0 busy)
+      Threads: 30 (0 busy)
+      Threads: 30 (0 busy)
+      Threads: 30 (0 busy)
+      Threads: 23 (0 busy)
+      Threads: 30 (0 busy)
+      Threads: 30 (0 busy)
+      Threads: 30 (0 busy)
+      Threads: 30 (0 busy)
+      Threads: 30 (0 busy)
+      Threads: 30 (0 busy)
+      Threads: 30 (24 busy)
+      Threads: 30 (23 busy)
+      ```
+
+      - In this output file, 47 threads were busy, and there was a backlog of 363 jobs.
+      - Of the 13 worker processes, only two were busy.
+      - This indicates that the other workers are configured too specifically.
+      - Look at the full output to work out which workers were busy.
+        Correlate with your `sidekiq_queues` configuration in `gitlab.rb`.
+      - An overloaded single-worker environment might look like this:
+
+        ```plaintext
+        sidekiqcheck_20221024-14:00.out
+        Busy: 25
+        Enqueued: 363
+        ---- Processes (1) ----
+        Threads: 25 (25 busy)
+        ```
+
+   1. Look at the `---- Queues (xxx) ----` section of the output file to
+      determine what jobs were queued up at the time.
+
+   1. The files also include low-level details about the state of Sidekiq at the time.
+      This could be useful for identifying where spikes in workload are coming from.
+
+      - The `----------- workers -----------` section details the jobs that make up the
+        `Busy` count in the summary.
+      - The `----------- Queued Jobs -----------` section provides details on
+        jobs that are `Enqueued`.
+
 ## Thread dump
 
 Send the Sidekiq process ID the `TTIN` signal to output thread
@@ -379,3 +493,60 @@ has number of drawbacks, as mentioned in [Why Ruby's Timeout is dangerous (and T
 > - in any of your code, regardless of whether it could have possibly raised an exception before
 >
 > Nobody writes code to defend against an exception being raised on literally any line. That's not even possible. So Thread.raise is basically like a sneak attack on your code that could result in almost anything. It would probably be okay if it were pure-functional code that did not modify any state. But this is Ruby, so that's unlikely :)
+
+## Omnibus GitLab 14.0 and later: remove the `sidekiq-cluster` service
+
+Omnibus GitLab instances that were configured to run `sidekiq-cluster` prior to GitLab 14.0
+might still have this service running alongside `sidekiq` in later releases.
+
+The code to manage `sidekiq-cluster` was removed in GitLab 14.0.
+The configuration files remain on disk, so the `sidekiq-cluster` process continues
+to be started by the GitLab systemd service.
+
+The extra service can be identified as running by:
+
+- `gitlab-ctl status` showing both services:
+
+  ```plaintext
+  run: sidekiq: (pid 1386) 445s; run: log: (pid 1385) 445s
+  run: sidekiq-cluster: (pid 1388) 445s; run: log: (pid 1381) 445s
+  ```
+
+- `ps -ef | grep 'runsv sidekiq'` showing two processes:
+
+  ```plaintext
+  root 31047 31045 0 13:54 ? 00:00:00 runsv sidekiq-cluster
+  root 31054 31045 0 13:54 ? 00:00:00 runsv sidekiq
+  ```
+
+To remove the `sidekiq-cluster` service from servers running GitLab 14.0 and later:
+
+1. Stop GitLab and the systemd service:
+
+   ```shell
+   sudo gitlab-ctl stop
+   sudo systemctl stop gitlab-runsvdir.service
+   ```
+
+1. Remove the `runsv` service definition:
+
+   ```shell
+   sudo rm -rf /opt/gitlab/sv/sidekiq-cluster
+   ```
+
+1. Restart GitLab:
+
+   ```shell
+   sudo systemctl start gitlab-runsvdir.service
+   ```
+
+1. Check that all services are up, and the `sidekiq-cluster` service is not listed:
+
+   ```shell
+   sudo gitlab-ctl status
+   ```
+
+This change might reduce the amount of work Sidekiq can do. Symptoms like delays creating pipelines
+indicate that additional Sidekiq processes would be beneficial.
+Consider [adding additional Sidekiq processes](extra_sidekiq_processes.md)
+to compensate for removing the `sidekiq-cluster` service.
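
Reviewer note, not part of the commit: the final verification step in the `sidekiq-cluster` removal procedure could be scripted along the following lines. This is a minimal sketch, assuming that `gitlab-ctl status` prints one `<state>: <service>:` line per service, as in the sample output quoted in the diff. The function name `check_sidekiq_cluster_removed` is hypothetical and does not exist in GitLab.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with GitLab): reads `gitlab-ctl status`
# output on stdin and reports whether a sidekiq-cluster service is listed.
# Assumes each service appears as "<state>: <service>:", as in the sample
# output shown in the documentation change.
check_sidekiq_cluster_removed() {
  # The trailing colon keeps the pattern from matching unrelated text.
  if grep -q 'sidekiq-cluster:'; then
    echo "sidekiq-cluster is still listed"
    return 1
  fi
  echo "sidekiq-cluster is not listed"
}

# Intended usage on an Omnibus GitLab server (left commented out because it
# needs a GitLab installation):
#   sudo gitlab-ctl status | check_sidekiq_cluster_removed
```

The function reads from standard input rather than calling `gitlab-ctl` itself, so it can be tried against saved status output before being run on a live server.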