| author | Bob Van Landuyt <bob@gitlab.com> | 2019-08-13 23:52:01 +0300 |
|---|---|---|
| committer | Douwe Maan <douwe@gitlab.com> | 2019-08-13 23:52:01 +0300 |
| commit | 452bc36d603ed89d3fa5e3409338dd905230bd2f (patch) | |
| tree | 3ef260430db93ef2b9fa9236ea601a0b3e53adee /app/workers | |
| parent | 1c3b570c117cc41f5f4838a8366c4367ef0749cb (diff) | |
Rework retry strategy for remote mirrors
**Prevention of running 2 simultaneous updates**
Instead of using `RemoteMirror#update_status` and raising an error when
an update is already running to prevent the same mirror from being
updated twice at the same time, we now use `Gitlab::ExclusiveLease` for
that.
When we fail to obtain a lease in 3 tries, 30 seconds apart, we bail
and reschedule. We'll reschedule faster for protected-branch mirrors.
If the mirror already ran since it was scheduled, the job will be
skipped.
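Condensed from the worker change in the diff below, the flow is roughly the following (`in_lock` comes from `Gitlab::ExclusiveLeaseHelpers`, and `remote_mirror_update_lock` builds the lease key from the worker class name and the mirror id):

```ruby
def perform(remote_mirror_id, scheduled_time, tries = 0)
  remote_mirror = RemoteMirror.find_by_id(remote_mirror_id)
  return unless remote_mirror
  # Skip the job entirely if a newer update already ran since it was scheduled.
  return if remote_mirror.updated_since?(scheduled_time)

  # Take an exclusive lease: 3 attempts, 30 seconds apart (~90 seconds total).
  in_lock(remote_mirror_update_lock(remote_mirror.id),
          retries: 3,
          ttl: remote_mirror.max_runtime,
          sleep_sec: 30.seconds) do
    update_mirror(remote_mirror, scheduled_time, tries)
  end
rescue Gitlab::ExclusiveLeaseHelpers::FailedToObtainLockError
  # Bail and reschedule with the mirror's backoff delay; the rescheduled job
  # is a no-op if the concurrent update already covered the triggering change.
  self.class.perform_in(remote_mirror.backoff_delay, remote_mirror.id, scheduled_time, tries)
end
```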
**Error handling: Remote side**
When an update fails because of a `Gitlab::Git::CommandError`, we won't
track the error in Sentry, since the cause could be on the remote side:
for example, when branches have diverged.
In this case, we'll retry up to 3 times, scheduled 1 or 5 minutes apart.
In between retries, the mirror is marked as "to_retry", and the error is
visible to the user when they visit the settings page.
After 3 tries, we'll mark the mirror as failed and notify the user.
We won't track this error in Sentry, as it's unlikely we can do
anything about it.
The next event that would normally trigger a mirror update will simply start a fresh attempt.
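The retry spacing comes from `RemoteMirror#backoff_delay`, and the "to_retry"/failed bookkeeping also lives on the model, which is outside this `app/workers`-limited diff. Below is a minimal sketch of that side, assuming the 1-and-5-minute schedule above; the helper name `hard_retry_or_fail` is an assumption, not something this diff confirms:

```ruby
# Hypothetical sketch -- the real model change is not part of this diff.
class RemoteMirror < ApplicationRecord
  MAX_RETRIES = 3 # assumed to mirror the worker's MAX_TRIES

  # "We'll reschedule faster for protected-branch mirrors."
  def backoff_delay
    only_protected_branches? ? 1.minute : 5.minutes
  end

  # Remote-side failure: keep the mirror in "to_retry" (the error shows up
  # on the settings page) until the limit is hit, then fail and notify.
  def hard_retry_or_fail(message, tries)
    if tries < MAX_RETRIES
      update(update_status: 'to_retry', last_error: message)
    else
      mark_as_failed(message) # also triggers the user notification
    end
  end
end
```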
**Error handling: our side**
If an unexpected error occurs, we mark the mirror as failed but still
retry the job using the regular Sidekiq retries with backoff, the same
as before.
The error is reported to Sentry, since it's likely we need to do
something about it.
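Which errors reach Sentry is decided in `Projects::UpdateRemoteMirrorService`, whose `execute` now takes the `tries` count (see the diff below). A hedged sketch of that shape, reusing the hypothetical `hard_retry_or_fail` from the sketch above:

```ruby
# Hypothetical sketch of the service-side handling -- not part of this diff.
def execute(remote_mirror, tries)
  update_mirror(remote_mirror)
  success
rescue Gitlab::Git::CommandError => e
  # Remote-side failure (e.g. diverged branches): swallow the exception so
  # it never reaches Sentry; the worker schedules up to 3 retries based on
  # the returned error result and the mirror's "to_retry" state.
  remote_mirror.hard_retry_or_fail(e.message, tries)
  error(e.message)
rescue => e
  # Unexpected failure on our side: mark the mirror as failed, then re-raise
  # so Sidekiq retries with backoff and the error is reported to Sentry.
  remote_mirror.mark_as_failed(e.message)
  raise
end
```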
Diffstat (limited to 'app/workers')
| -rw-r--r-- | app/workers/remote_mirror_notification_worker.rb | 2 |
| -rw-r--r-- | app/workers/repository_update_remote_mirror_worker.rb | 61 |

2 files changed, 33 insertions, 30 deletions
diff --git a/app/workers/remote_mirror_notification_worker.rb b/app/workers/remote_mirror_notification_worker.rb
index 5bafe8e2046..368abfeda99 100644
--- a/app/workers/remote_mirror_notification_worker.rb
+++ b/app/workers/remote_mirror_notification_worker.rb
@@ -4,7 +4,7 @@ class RemoteMirrorNotificationWorker
   include ApplicationWorker
 
   def perform(remote_mirror_id)
-    remote_mirror = RemoteMirrorFinder.new(id: remote_mirror_id).execute
+    remote_mirror = RemoteMirror.find_by_id(remote_mirror_id)
 
     # We check again if there's an error because a newer run since this job was
     # fired could've completed successfully.
diff --git a/app/workers/repository_update_remote_mirror_worker.rb b/app/workers/repository_update_remote_mirror_worker.rb
index 03a7ff2cd7a..d13c7641eb3 100644
--- a/app/workers/repository_update_remote_mirror_worker.rb
+++ b/app/workers/repository_update_remote_mirror_worker.rb
@@ -1,50 +1,53 @@
 # frozen_string_literal: true
 
 class RepositoryUpdateRemoteMirrorWorker
-  UpdateAlreadyInProgressError = Class.new(StandardError)
   UpdateError = Class.new(StandardError)
 
   include ApplicationWorker
+  include Gitlab::ExclusiveLeaseHelpers
 
   sidekiq_options retry: 3, dead: false
-  sidekiq_retry_in { |count| 30 * count }
+  LOCK_WAIT_TIME = 30.seconds
+  MAX_TRIES = 3
 
-  sidekiq_retries_exhausted do |msg, _|
-    Sidekiq.logger.warn "Failed #{msg['class']} with #{msg['args']}: #{msg['error_message']}"
-  end
-
-  def perform(remote_mirror_id, scheduled_time)
-    remote_mirror = RemoteMirrorFinder.new(id: remote_mirror_id).execute
+  def perform(remote_mirror_id, scheduled_time, tries = 0)
+    remote_mirror = RemoteMirror.find_by_id(remote_mirror_id)
+    return unless remote_mirror
 
     return if remote_mirror.updated_since?(scheduled_time)
-    raise UpdateAlreadyInProgressError if remote_mirror.update_in_progress?
 
+    # If the update is already running, wait for it to finish before running again
+    # This will wait for a total of 90 seconds in 3 steps
+    in_lock(remote_mirror_update_lock(remote_mirror.id),
+            retries: 3,
+            ttl: remote_mirror.max_runtime,
+            sleep_sec: LOCK_WAIT_TIME) do
+      update_mirror(remote_mirror, scheduled_time, tries)
+    end
+  rescue Gitlab::ExclusiveLeaseHelpers::FailedToObtainLockError
+    # If an update runs longer than 1.5 minutes, we'll reschedule it
+    # with a backoff. The next run will check if the previous update would
+    # include the changes that triggered this update and become a no-op.
+    self.class.perform_in(remote_mirror.backoff_delay, remote_mirror.id, scheduled_time, tries)
+  end
 
-    remote_mirror.update_start
+  private
 
-    project = remote_mirror.project
+  def update_mirror(mirror, scheduled_time, tries)
+    project = mirror.project
     current_user = project.creator
 
-    result = Projects::UpdateRemoteMirrorService.new(project, current_user).execute(remote_mirror)
-    raise UpdateError, result[:message] if result[:status] == :error
-
-    remote_mirror.update_finish
-  rescue UpdateAlreadyInProgressError
-    raise
-  rescue UpdateError => ex
-    fail_remote_mirror(remote_mirror, ex.message)
-    raise
-  rescue => ex
-    return unless remote_mirror
+    result = Projects::UpdateRemoteMirrorService.new(project, current_user).execute(mirror, tries)
 
-    fail_remote_mirror(remote_mirror, ex.message)
-    raise UpdateError, "#{ex.class}: #{ex.message}"
+    if result[:status] == :error && mirror.to_retry?
+      schedule_retry(mirror, scheduled_time, tries)
+    end
   end
 
-  private
-
-  def fail_remote_mirror(remote_mirror, message)
-    remote_mirror.mark_as_failed(message)
-
-    Rails.logger.error(message) # rubocop:disable Gitlab/RailsLogger
+  def remote_mirror_update_lock(mirror_id)
+    [self.class.name, mirror_id].join(':')
+  end
+
+  def schedule_retry(mirror, scheduled_time, tries)
+    self.class.perform_in(mirror.backoff_delay, mirror.id, scheduled_time, tries + 1)
+  end
 end