From 4a2ac0ed486f088fd34fdbeddef0c73cdec77e12 Mon Sep 17 00:00:00 2001
From: Patrick Steinhardt
Date: Fri, 20 Aug 2021 13:32:33 +0200
Subject: datastore: Fix acknowledgement of stale jobs considering timezones

For quite some time, tests for the Postgres replication event queue have
been failing in one specific test which queues up a replication job,
dequeues it and then immediately tries to acknowledge stale jobs. The
acknowledged jobs are never in the correct state though: they're marked
as failed, even though they should still be in progress.

While this could be a race, the fact that this only occurs for some
developers strongly hints that something else is going on: even bumping
the threshold to an hour doesn't fix it. If one bumps the timeout to
slightly above two hours though, then the test starts to pass. This is a
strong indicator of it being timezone-related, given I'm located at
UTC+2.

And indeed: while we always make sure to insert and compare SQL
timestamps in the replication queue as UTC, we don't when acknowledging
stale jobs. Depending on the timezone, this either means that we always
mark jobs as failed immediately (if in a positive timezone, as observed
here at UTC+2) or that we're taking way too long to update jobs (if in a
negative timezone).

Fix the bug by correctly using UTC timezone when acknowledging stale
jobs.

Changelog: fixed
---
 internal/praefect/datastore/queue.go | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/internal/praefect/datastore/queue.go b/internal/praefect/datastore/queue.go
index 9e21da9a9..dcff2567e 100644
--- a/internal/praefect/datastore/queue.go
+++ b/internal/praefect/datastore/queue.go
@@ -475,7 +475,7 @@ func (rq PostgresReplicationEventQueue) StartHealthUpdate(ctx context.Context, t
 func (rq PostgresReplicationEventQueue) AcknowledgeStale(ctx context.Context, staleAfter time.Duration) error {
 	query := `
 		WITH stale_job_lock AS (
-			DELETE FROM replication_queue_job_lock WHERE triggered_at < NOW() - INTERVAL '1 MILLISECOND' * $1
+			DELETE FROM replication_queue_job_lock WHERE triggered_at < NOW() AT TIME ZONE 'UTC' - INTERVAL '1 MILLISECOND' * $1
 			RETURNING job_id, lock_id
 		)
 		, update_job AS (
-- 
cgit v1.2.3