datastore: Fix acknowledgement of stale jobs considering timezones

For quite some time, tests for the Postgres replication event queue have been failing in one specific test, which queues a replication job, dequeues it and then immediately tries to acknowledge stale jobs. The acknowledged jobs are never in the correct state though: they're marked as failed, even though they should still be in progress.

While this could be a race, the fact that this only occurs for some developers strongly hints that something else is going on: even bumping the threshold to an hour wouldn't fix it. If one bumps the threshold to slightly above two hours though, then the test starts to pass. This is a strong indicator of it being timezone-related, given I'm located at UTC+2.

And indeed: while we always make sure to insert and compare SQL timestamps in the replication queue as UTC, we don't when acknowledging stale jobs. Depending on the timezone, this either means that we always mark jobs as failed immediately (if in a positive timezone) or that we're taking way too long to update jobs (if in a negative timezone).

Fix the bug by correctly using the UTC timezone when acknowledging stale jobs.

Changelog: fixed
author: Patrick Steinhardt <psteinhardt@gitlab.com> 2021-08-20 14:32:33 +0300
committer: Patrick Steinhardt <psteinhardt@gitlab.com> 2021-08-20 14:32:33 +0300
commit: 4a2ac0ed486f088fd34fdbeddef0c73cdec77e12 (patch)
tree: 90a7917c2d6a14ed61a1e4038701f1cd4fd91537
parent: 48d7984d9912c935a2c2abba3b55593cf0be2d8e (diff)
1 files changed, 1 insertions, 1 deletions
diff --git a/internal/praefect/datastore/queue.go b/internal/praefect/datastore/queue.go
index 9e21da9a9..dcff2567e 100644
--- a/internal/praefect/datastore/queue.go
+++ b/internal/praefect/datastore/queue.go
@@ -475,7 +475,7 @@ func (rq PostgresReplicationEventQueue) StartHealthUpdate(ctx context.Context, t
 func (rq PostgresReplicationEventQueue) AcknowledgeStale(ctx context.Context, staleAfter time.Duration) error {
 	query := `
 		WITH stale_job_lock AS (
-			DELETE FROM replication_queue_job_lock WHERE triggered_at < NOW() - INTERVAL '1 MILLISECOND' * $1
+			DELETE FROM replication_queue_job_lock WHERE triggered_at < NOW() AT TIME ZONE 'UTC' - INTERVAL '1 MILLISECOND' * $1
 			RETURNING job_id, lock_id
 		)
 		, update_job AS (