From 8c7f4e9d5f36cff46365a7f8c4b9c21578c1e781 Mon Sep 17 00:00:00 2001 From: GitLab Bot Date: Thu, 18 Jun 2020 11:18:50 +0000 Subject: Add latest changes from gitlab-org/gitlab@13-1-stable-ee --- doc/development/elasticsearch.md | 63 +++++++++++++++++++++++++++++++++++----- 1 file changed, 55 insertions(+), 8 deletions(-) (limited to 'doc/development/elasticsearch.md') diff --git a/doc/development/elasticsearch.md b/doc/development/elasticsearch.md index 185f536fc01..9f54386f1af 100644 --- a/doc/development/elasticsearch.md +++ b/doc/development/elasticsearch.md @@ -24,19 +24,19 @@ See the [Elasticsearch GDK setup instructions](https://gitlab.com/gitlab-org/git - `gitlab:elastic:test:index_size`: Tells you how much space the current index is using, as well as how many documents are in the index. - `gitlab:elastic:test:index_size_change`: Outputs index size, reindexes, and outputs index size again. Useful when testing improvements to indexing size. -Additionally, if you need large repos or multiple forks for testing, please consider [following these instructions](rake_tasks.md#extra-project-seed-options) +Additionally, if you need large repositories or multiple forks for testing, please consider [following these instructions](rake_tasks.md#extra-project-seed-options) ## How does it work? -The Elasticsearch integration depends on an external indexer. We ship an [indexer written in Go](https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer). The user must trigger the initial indexing via a Rake task but, after this is done, GitLab itself will trigger reindexing when required via `after_` callbacks on create, update, and destroy that are inherited from [/ee/app/models/concerns/elastic/application_versioned_search.rb](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/app/models/concerns/elastic/application_versioned_search.rb). +The Elasticsearch integration depends on an external indexer. We ship an [indexer written in Go](https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer). The user must trigger the initial indexing via a Rake task but, after this is done, GitLab itself will trigger reindexing when required via `after_` callbacks on create, update, and destroy that are inherited from [`/ee/app/models/concerns/elastic/application_versioned_search.rb`](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/app/models/concerns/elastic/application_versioned_search.rb). -After initial indexing is complete, create, update, and delete operations for all models except projects (see [#207494](https://gitlab.com/gitlab-org/gitlab/issues/207494)) are tracked in a Redis [`ZSET`](https://redis.io/topics/data-types#sorted-sets). A regular `sidekiq-cron` `ElasticIndexBulkCronWorker` processes this queue, updating many Elasticsearch documents at a time with the [Bulk Request API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html). +After initial indexing is complete, create, update, and delete operations for all models except projects (see [#207494](https://gitlab.com/gitlab-org/gitlab/-/issues/207494)) are tracked in a Redis [`ZSET`](https://redis.io/topics/data-types#sorted-sets). A regular `sidekiq-cron` `ElasticIndexBulkCronWorker` processes this queue, updating many Elasticsearch documents at a time with the [Bulk Request API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html). -Search queries are generated by the concerns found in [ee/app/models/concerns/elastic](https://gitlab.com/gitlab-org/gitlab/tree/master/ee/app/models/concerns/elastic). These concerns are also in charge of access control, and have been a historic source of security bugs so please pay close attention to them! +Search queries are generated by the concerns found in [`ee/app/models/concerns/elastic`](https://gitlab.com/gitlab-org/gitlab/tree/master/ee/app/models/concerns/elastic). These concerns are also in charge of access control, and have been a historic source of security bugs so please pay close attention to them! ## Existing Analyzers/Tokenizers/Filters -These are all defined in [ee/lib/elastic/latest/config.rb](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/lib/elastic/latest/config.rb) +These are all defined in [`ee/lib/elastic/latest/config.rb`](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/lib/elastic/latest/config.rb) ### Analyzers @@ -54,7 +54,7 @@ Please see the `sha_tokenizer` explanation later below for an example. #### `code_analyzer` -Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the filters: [`code`](#code), [`edgeNGram_filter`](#edgengram_filter), `lowercase`, and `asciifolding` +Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the filters: [`code`](#code), `lowercase`, and `asciifolding` The `whitespace` tokenizer was selected in order to have more control over how tokens are split. For example the string `Foo::bar(4)` needs to generate tokens like `Foo` and `bar(4)` in order to be properly searched. @@ -71,7 +71,7 @@ Not directly used for indexing, but rather used to transform a search input. Use #### `sha_tokenizer` -This is a custom tokenizer that uses the [`edgeNGram` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenizer.html) to allow SHAs to be searcheable by any sub-set of it (minimum of 5 chars). +This is a custom tokenizer that uses the [`edgeNGram` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenizer.html) to allow SHAs to be searchable by any sub-set of it (minimum of 5 chars). Example: @@ -149,7 +149,7 @@ These proxy objects would talk to Elasticsearch server directly (see top half of ![Elasticsearch Architecture](img/elasticsearch_architecture.svg) -In the planned new design, each model would have a pair of corresponding subclassed proxy objects, in which model-specific logic is located. For example, `Snippet` would have `SnippetClassProxy` and `SnippetInstanceProxy` (being subclass of `Elasticsearch::Model::Proxy::ClassMethodsProxy` and `Elasticsearch::Model::Proxy::InstanceMethodsProxy`, respectively). +In the planned new design, each model would have a pair of corresponding sub-classed proxy objects, in which model-specific logic is located. For example, `Snippet` would have `SnippetClassProxy` and `SnippetInstanceProxy` (being subclass of `Elasticsearch::Model::Proxy::ClassMethodsProxy` and `Elasticsearch::Model::Proxy::InstanceMethodsProxy`, respectively). `__elasticsearch__` would represent another layer of proxy object, keeping track of multiple actual proxy objects. It would forward method calls to the appropriate index. For example: @@ -175,6 +175,53 @@ If the current version is `v12p1`, and we need to create a new version for `v12p 1. Change the namespace for files under `v12p1` folder from `Latest` to `V12p1` 1. Make changes to files under the `latest` folder as needed +## Performance Monitoring + +### Prometheus + +GitLab exports [Prometheus +metrics](../administration/monitoring/prometheus/gitlab_metrics.md) relating to +the number of requests and timing for all web/API requests and Sidekiq jobs, +which can help diagnose performance trends and compare how Elasticsearch timing +is impacting overall performance relative to the time spent doing other things. + +#### Indexing queues + +GitLab also exports [Prometheus +metrics](../administration/monitoring/prometheus/gitlab_metrics.md) for +indexing queues, which can help diagnose performance bottlenecks and determine +whether or not your GitLab instance or Elasticsearch server can keep up with +the volume of updates. + +### Logs + +All of the indexing happens in Sidekiq, so much of the relevant logs for the +Elasticsearch integration can be found in +[`sidekiq.log`](../administration/logs.md#sidekiqlog). In particular, all +Sidekiq workers that make requests to Elasticsearch in any way will log the +number of requests and time taken querying/writing to Elasticsearch. This can +be useful to understand whether or not your cluster is keeping up with +indexing. + +Searching Elasticsearch is done via ordinary web workers handling requests. Any +requests to load a page or make an API request, which then make requests to +Elasticsearch, will log the number of requests and the time taken to +[`production_json.log`](../administration/logs.md#production_jsonlog). These +logs will also include the time spent on Database and Gitaly requests, which +may help to diagnose which part of the search is performing poorly. + +There are additional logs specific to Elasticsearch that are sent to +[`elasticsearch.log`](../administration/logs.md#elasticsearchlog-starter-only) +that may contain information to help diagnose performance issues. + +### Performance Bar + +Elasticsearch requests will be displayed in the [`Performance +Bar`](../administration/monitoring/performance/performance_bar.md), which can +be used both locally in development and on any deployed GitLab instance to +diagnose poor search performance. This will show the exact queries being made, +which is useful to diagnose why a search might be slow. + ## Troubleshooting ### Getting `flood stage disk watermark [95%] exceeded` -- cgit v1.2.3