author | GitLab Bot <gitlab-bot@gitlab.com> | 2020-03-13 21:09:39 +0300 |
---|---|---|
committer | GitLab Bot <gitlab-bot@gitlab.com> | 2020-03-13 21:09:39 +0300 |
commit | 00fa950a34b1c94617110b150b8b2517d5241249 (patch) | |
tree | 8f2d8683879079da8f520f7867ebd49b8beaadef /doc/development/github_importer.md | |
parent | c36152ff8c41fad2f413f253eb7ac5c927e47c56 (diff) |
Add latest changes from gitlab-org/gitlab@master
Diffstat (limited to 'doc/development/github_importer.md')
-rw-r--r-- | doc/development/github_importer.md | 30 |
1 files changed, 15 insertions, 15 deletions
diff --git a/doc/development/github_importer.md b/doc/development/github_importer.md
index 6b8c083d55f..5d37d2f119f 100644
--- a/doc/development/github_importer.md
+++ b/doc/development/github_importer.md
@@ -9,7 +9,7 @@ importer and a parallel importer. The Rake task `import:github` uses the
 sequential importer, while everything else uses the parallel importer. The
 difference between these two importers is quite simple: the sequential importer
 does all work in a single thread, making it more useful for debugging purposes
-or Rake tasks. The parallel importer on the other hand uses Sidekiq.
+or Rake tasks. The parallel importer, on the other hand, uses Sidekiq.
 
 ## Requirements
 
@@ -31,9 +31,9 @@ The importer's codebase is broken up into the following directories:
 
 ## Architecture overview
 
-When a GitHub project is imported we schedule and execute a job for the
-`RepositoryImportworker` worker as all other importers. However, unlike other
-importers we don't immediately perform the work necessary. Instead work is
+When a GitHub project is imported, we schedule and execute a job for the
+`RepositoryImportWorker` worker as all other importers. However, unlike other
+importers, we don't immediately perform the work necessary. Instead work is
 divided into separate stages, with each stage consisting out of a set of Sidekiq
 jobs that are executed. Between every stage a job is scheduled that periodically
 checks if all work of the current stage is completed, advancing the import
@@ -65,9 +65,9 @@ This worker will import all pull requests. For every pull request a job for the
 
 ### 5. Stage::ImportIssuesAndDiffNotesWorker
 
-This worker will import all issues and pull request comments. For every issue we
+This worker will import all issues and pull request comments. For every issue, we
 schedule a job for the `Gitlab::GithubImport::ImportIssueWorker` worker. For
-pull request comments we instead schedule jobs for the
+pull request comments, we instead schedule jobs for the
 `Gitlab::GithubImport::DiffNoteImporter` worker.
 
 This worker processes both issues and diff notes in parallel so we don't need to
@@ -82,7 +82,7 @@ project.
 ### 6. Stage::ImportNotesWorker
 
 This worker imports regular comments for both issues and pull requests. For
-every comment we schedule a job for the
+every comment, we schedule a job for the
 `Gitlab::GithubImport::ImportNoteWorker` worker.
 
 Regular comments have to be imported at the end since the GitHub API used
@@ -116,14 +116,14 @@ schedule the worker of the next stage.
 
 To reduce the number of `AdvanceStageWorker` jobs scheduled this worker will
 briefly wait for jobs to complete before deciding what the next action should
-be. For small projects this may slow down the import process a bit, but it will
+be. For small projects, this may slow down the import process a bit, but it will
 also reduce pressure on the system as a whole.
 
 ## Refreshing import JIDs
 
 GitLab includes a worker called `StuckImportJobsWorker` that will periodically
 run and mark project imports as failed if they have been running for more than
-15 hours. For GitHub projects this poses a bit of a problem: importing large
+15 hours. For GitHub projects, this poses a bit of a problem: importing large
 projects could take several hours depending on how often we hit the GitHub rate
 limit (more on this below), but we don't want `StuckImportJobsWorker` to mark
 our import as failed because of this.
@@ -137,7 +137,7 @@ long we're still performing work.
 
 ## GitHub rate limit
 
-GitHub has a rate limit of 5 000 API calls per hour. The number of requests
+GitHub has a rate limit of 5,000 API calls per hour. The number of requests
 necessary to import a project is largely dominated by the number of unique users
 involved in a project (e.g. issue authors). Other data such as issue pages and
 comments typically only requires a few dozen requests to import. This is
@@ -176,11 +176,11 @@ There are two types of lookups we cache:
 in our GitLab database.
 
 The expiration time of these keys is 24 hours. When retrieving the cache of a
-positive lookups we refresh the TTL automatically. The TTL of false lookups is
+positive lookup, we refresh the TTL automatically. The TTL of false lookups is
 never refreshed.
 
-Because of this caching layer it's possible newly registered GitLab accounts
-won't be linked to their corresponding GitHub accounts. This however will sort
+Because of this caching layer, it's possible newly registered GitLab accounts
+won't be linked to their corresponding GitHub accounts. This, however, will sort
 itself out once the cached keys expire.
 
 The user cache lookup is shared across projects. This means that the more
@@ -194,12 +194,12 @@ The code for this resides in:
 ## Mapping labels and milestones
 
 To reduce pressure on the database we do not query it when setting labels and
-milestones on issues and merge requests. Instead we cache this data when we
+milestones on issues and merge requests. Instead, we cache this data when we
 import labels and milestones, then we reuse this cache when assigning them to
 issues/merge requests. Similar to the user lookups these cache keys are expired
 automatically after 24 hours of not being used.
 
-Unlike the user lookup caches these label and milestone caches are scoped to the
+Unlike the user lookup caches, these label and milestone caches are scoped to the
 project that is being imported.
 
 The code for this resides in: