gitlab.com/gitlab-org/gitlab-foss.git
Diffstat (limited to 'doc/development/github_importer.md')
-rw-r--r--  doc/development/github_importer.md  68
1 file changed, 59 insertions, 9 deletions
diff --git a/doc/development/github_importer.md b/doc/development/github_importer.md
index 9ce95cf7da1..56a4ea898df 100644
--- a/doc/development/github_importer.md
+++ b/doc/development/github_importer.md
@@ -1,7 +1,7 @@
---
stage: none
group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
+info: Any user with at least the Maintainer role can merge updates to this content. For details, see https://docs.gitlab.com/ee/development/development_processes.html#development-guidelines-review.
---
# GitHub importer developer documentation
@@ -209,7 +209,8 @@ When you schedule a job, `AdvanceStageWorker`
is given a project ID, a list of Redis keys, and the name of the next
stage. The Redis keys (produced by `Gitlab::JobWaiter`) are used to check if the
running stage has been completed or not. If the stage has not yet been
-completed `AdvanceStageWorker` reschedules itself. After a stage finishes
+completed `AdvanceStageWorker` reschedules itself. After a stage finishes,
+or if more jobs have finished since the last invocation,
`AdvanceStageWorker` refreshes the import JID (more on this below) and
schedules the worker of the next stage.
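+
+For illustration, here is a minimal sketch (not the actual implementation) of
+how the `Gitlab::JobWaiter` keys tie the two sides together; `waiter_key`,
+`remaining_jobs`, and `jid` are placeholders:
+
+```ruby
+# Each job in the running stage reports completion against a shared key.
+Gitlab::JobWaiter.notify(waiter_key, jid)
+
+# The advancing worker waits on that key for a short period. If work is still
+# outstanding when the wait times out, it reschedules itself instead of
+# scheduling the next stage.
+waiter = Gitlab::JobWaiter.new(remaining_jobs, waiter_key)
+waiter.wait(30)
+```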
@@ -221,18 +222,19 @@ also reduces pressure on the system as a whole.
## Refreshing import job IDs
GitLab includes a worker called `Gitlab::Import::StuckProjectImportJobsWorker`
-that periodically runs and marks project imports as failed if they have been
-running for more than 24 hours. For GitHub projects, this poses a bit of a
-problem: importing large projects could take several hours depending on how
+that periodically runs and marks project imports as failed if they have not been
+refreshed for more than 24 hours. For GitHub projects, this poses a bit of a
+problem: importing large projects could take several days depending on how
often we hit the GitHub rate limit (more on this below), but we don't want
`Gitlab::Import::StuckProjectImportJobsWorker` to mark our import as failed because of this.
To prevent this from happening we periodically refresh the expiration time of
-the import process. This works by storing the JID of the import job in the
+the import. This works by storing the JID of the import job in the
database, then refreshing this JID TTL at various stages throughout the import
-process. This is done by calling `ProjectImportState#refresh_jid_expiration`. By
-refreshing this TTL we can ensure our import does not get marked as failed so
-long we're still performing work.
+process. This is done either by calling `ProjectImportState#refresh_jid_expiration`,
+or by using `RefreshImportJidWorker` and passing in the current worker's JID.
+By refreshing this TTL we can ensure our import does not get marked as failed so
+long as we're still performing work.
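+
+As an illustration only, refreshing the TTL from inside a long-running worker
+could look like the sketch below; the class names follow the prose above and
+the arguments are placeholders:
+
+```ruby
+# Refresh the TTL directly through the project's import state.
+project.import_state.refresh_jid_expiration
+
+# Or enqueue the refresh asynchronously through the dedicated worker.
+RefreshImportJidWorker.perform_async(project.id, jid)
+```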
## GitHub rate limit
@@ -297,6 +299,54 @@ The code for this resides in:
- `lib/gitlab/github_import/user_finder.rb`
- `lib/gitlab/github_import/caching.rb`
+## Increasing Sidekiq interrupts
+
+When a Sidekiq process shuts down, it waits for a period of time for running
+jobs to finish before interrupting them. An interrupt terminates
+the job and requeues it. Our
+[vendored `sidekiq-reliable-fetcher` gem](https://gitlab.com/gitlab-org/gitlab/-/blob/master/vendor/gems/sidekiq-reliable-fetch/README.md)
+puts a limit of `3` interrupts before a job is no longer requeued and is
+permanently terminated. Jobs that have been interrupted log a
+`json.interrupted_count` in Kibana.
+
+This limit offers protection from jobs that can never be completed in
+the time between Sidekiq restarts.
+
+For large imports, our GitHub [stage](#stages) workers (namespaced in
+`Stage::`) take many hours to finish. By default, the import is at risk
+of failing because of `sidekiq-reliable-fetcher` permanently stopping these
+workers before they can complete.
+
+Stage workers that pick up from where they left off when restarted can
+increase the interrupt limit of `sidekiq-reliable-fetcher` to `20` by
+calling `.resumes_work_when_interrupted!`:
+
+```ruby
+module Gitlab
+  module GithubImport
+    module Stage
+      class MyWorker
+        resumes_work_when_interrupted!
+
+        # ...
+      end
+    end
+  end
+end
+```
+
+Stage workers that do not fully resume their work when restarted should not
+call this method. An example is a worker that skips already imported objects
+but still starts its loop from the beginning each time.
+
+Examples of stage workers that do resume work fully are ones that
+execute services that:
+
+- [Continue paging](https://gitlab.com/gitlab-org/gitlab/-/blob/487521cc/lib/gitlab/github_import/parallel_scheduling.rb#L114-117)
+  an endpoint from where it left off.
+- [Continue their loop](https://gitlab.com/gitlab-org/gitlab/-/blob/487521cc26c1e2bdba4fc67c14478d2b2a5f2bfa/lib/gitlab/github_import/importer/attachments/issues_importer.rb#L27)
+  from where it left off.
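+
+As a minimal sketch of the continue-paging pattern above (the helper names are
+simplified and illustrative rather than a verbatim copy of the importer code),
+a resumable collection importer can remember the last processed page so a
+requeued job does not restart from page 1:
+
+```ruby
+# Track the last page that was fully processed for this collection.
+page_counter = PageCounter.new(project, :issues)
+
+client.each_page(:issues, project.import_source, page: page_counter.current) do |page|
+  # Skip pages that were already handled before the job was interrupted.
+  next unless page_counter.set(page.number)
+
+  import_objects(page.objects)
+end
+```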
+
## Mapping labels and milestones
To reduce pressure on the database we do not query it when setting labels and