author | Patrick Steinhardt <psteinhardt@gitlab.com> | 2021-08-20 14:32:33 +0300 |
---|---|---|
committer | Patrick Steinhardt <psteinhardt@gitlab.com> | 2021-08-20 14:32:33 +0300 |
commit | 4a2ac0ed486f088fd34fdbeddef0c73cdec77e12 (patch) | |
tree | 90a7917c2d6a14ed61a1e4038701f1cd4fd91537 | |
parent | 48d7984d9912c935a2c2abba3b55593cf0be2d8e (diff) |
datastore: Fix acknowledgement of stale jobs considering timezones
For quite some time, the test suite for the Postgres replication event queue
has been failing in one specific test, which queues up a replication job,
dequeues it and then immediately tries to acknowledge stale jobs.
The acknowledged jobs are never in the correct state though: they're
marked as failed, even though they should still be in progress. While
this could be a race, the fact that this only occurs for some developers
strongly hints at the fact that there may be something else going on:
even bumping the threshold to an hour wouldn't fix it.
If one bumps the timeout to slightly above two hours though, then the test
starts to pass. This is a strong indicator of it being timezone-related,
given I'm located at UTC+2. And indeed: while we always make sure to
insert and compare SQL timestamps in the replication queue as UTC, we
don't when acknowledging stale jobs. Depending on the timezone, this
either means that we're taking way too long to update jobs (if in a
positive timezone) or that we always mark jobs as failed immediately (if
in a negative timezone).
Fix the bug by correctly using UTC timezone when acknowledging stale jobs.
Changelog: fixed
-rw-r--r-- | internal/praefect/datastore/queue.go | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/internal/praefect/datastore/queue.go b/internal/praefect/datastore/queue.go
index 9e21da9a9..dcff2567e 100644
--- a/internal/praefect/datastore/queue.go
+++ b/internal/praefect/datastore/queue.go
@@ -475,7 +475,7 @@ func (rq PostgresReplicationEventQueue) StartHealthUpdate(ctx context.Context, t
 func (rq PostgresReplicationEventQueue) AcknowledgeStale(ctx context.Context, staleAfter time.Duration) error {
 	query := `
 		WITH stale_job_lock AS (
-			DELETE FROM replication_queue_job_lock WHERE triggered_at < NOW() - INTERVAL '1 MILLISECOND' * $1
+			DELETE FROM replication_queue_job_lock WHERE triggered_at < NOW() AT TIME ZONE 'UTC' - INTERVAL '1 MILLISECOND' * $1
 			RETURNING job_id, lock_id
 		)
 		, update_job AS (