doc/development/sidekiq/limited_capacity_worker.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84

---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Sidekiq limited capacity worker

It is possible to limit the number of concurrent running jobs for a worker class
by using the `LimitedCapacity::Worker` concern.

The worker must implement three methods:

- `perform_work`: The concern implements the usual `perform` method and calls
  `perform_work` if there's any available capacity.
- `remaining_work_count`: Number of jobs that have work to perform.
- `max_running_jobs`: Maximum number of jobs allowed to run concurrently.

```ruby
class MyDummyWorker
  include ApplicationWorker
  include LimitedCapacity::Worker

  def perform_work(*args)
  end

  def remaining_work_count(*args)
    5
  end

  def max_running_jobs
    25
  end
end
```

Additional to the regular worker, a cron worker must be defined as well to
backfill the queue with jobs. the arguments passed to `perform_with_capacity`
are passed to the `perform_work` method.

```ruby
class ScheduleMyDummyCronWorker
  include ApplicationWorker
  include CronjobQueue

  def perform(*args)
    MyDummyWorker.perform_with_capacity(*args)
  end
end
```

## How many jobs are running?

It runs `max_running_jobs` at almost all times.

The cron worker checks the remaining capacity on each execution and it
schedules at most `max_running_jobs` jobs. Those jobs on completion
re-enqueue themselves immediately, but not on failure. The cron worker is in
charge of replacing those failed jobs.

## Handling errors and idempotence

This concern disables Sidekiq retries, logs the errors, and sends the job to the
dead queue. This is done to have only one source that produces jobs and because
the retry would occupy a slot with a job to perform in the distant future.

We let the cron worker enqueue new jobs, this could be seen as our retry and
back off mechanism because the job might fail again if executed immediately.
This means that for every failed job, we run at a lower capacity
until the cron worker fills the capacity again. If it is important for the
worker not to get a backlog, exceptions must be handled in `#perform_work` and
the job should not raise.

The jobs are deduplicated using the `:none` strategy, but the worker is not
marked as `idempotent!`.

## Metrics

This concern exposes three Prometheus metrics of gauge type with the worker class
name as label:

- `limited_capacity_worker_running_jobs`
- `limited_capacity_worker_max_running_jobs`
- `limited_capacity_worker_remaining_work_count`