---
stage: Data Stores
group: Global Search
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---

# Advanced Search development

This page includes information about developing and working with Elasticsearch.

Information on how to enable Elasticsearch and perform the initial indexing is in
the [Elasticsearch integration documentation](../integration/advanced_search/elasticsearch.md#enable-advanced-search).

## Deep Dive

In June 2019, Mario de la Ossa hosted a Deep Dive (GitLab team members only: `https://gitlab.com/gitlab-org/create-stage/issues/1`) on the GitLab [Elasticsearch integration](../integration/advanced_search/elasticsearch.md) to share his domain-specific knowledge with anyone who may work in this part of the codebase in the future. You can find the <i class="fa fa-youtube-play youtube" aria-hidden="true"></i> [recording on YouTube](https://www.youtube.com/watch?v=vrvl-tN2EaA), and the slides on [Google Slides](https://docs.google.com/presentation/d/1H-pCzI_LNrgrL5pJAIQgvLX8Ji0-jIKOg1QeJQzChug/edit) and in [PDF](https://gitlab.com/gitlab-org/create-stage/uploads/c5aa32b6b07476fa8b597004899ec538/Elasticsearch_Deep_Dive.pdf). Everything covered in this deep dive was accurate as of GitLab 12.0, and while specific details might have changed, it should still serve as a good introduction.

In August 2020, a second Deep Dive was hosted, focusing on [GitLab-specific architecture for multi-indices support](#zero-downtime-reindexing-with-multiple-indices). The <i class="fa fa-youtube-play youtube" aria-hidden="true"></i> [recording on YouTube](https://www.youtube.com/watch?v=0WdPR9oB2fg) and the [slides](https://lulalala.gitlab.io/gitlab-elasticsearch-deepdive/) are available. Everything covered in this deep dive was accurate as of GitLab 13.3.

## Supported Versions

See [Version Requirements](../integration/advanced_search/elasticsearch.md#version-requirements).

Developers making significant changes to Elasticsearch queries should test their features against all our supported versions.

## Setting up development environment

See the [Elasticsearch GDK setup instructions](https://gitlab.com/gitlab-org/gitlab-development-kit/blob/main/doc/howto/elasticsearch.md).

## Helpful Rake tasks

- `gitlab:elastic:test:index_size`: Tells you how much space the current index is using, as well as how many documents are in the index.
- `gitlab:elastic:test:index_size_change`: Outputs index size, reindexes, and outputs index size again. Useful when testing improvements to indexing size.
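
For example, from your GitLab checkout (in the GDK these run through Bundler):

```shell
# Report current index size and document count.
bundle exec rake gitlab:elastic:test:index_size

# Reindex, then report the size change.
bundle exec rake gitlab:elastic:test:index_size_change
```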

Additionally, if you need large repositories or multiple forks for testing, consider [following these instructions](rake_tasks.md#extra-project-seed-options).

## How does it work?

The Elasticsearch integration depends on an external indexer. We ship an [indexer written in Go](https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer). The user must trigger the initial indexing via a Rake task, but after this is done, GitLab itself triggers reindexing when required via `after_` callbacks on create, update, and destroy, inherited from [`/ee/app/models/concerns/elastic/application_versioned_search.rb`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/models/concerns/elastic/application_versioned_search.rb).

After initial indexing is complete, create, update, and delete operations for all models except projects (see [#207494](https://gitlab.com/gitlab-org/gitlab/-/issues/207494)) are tracked in a Redis [`ZSET`](https://redis.io/docs/manual/data-types/#sorted-sets). A regular `sidekiq-cron` `ElasticIndexBulkCronWorker` processes this queue, updating many Elasticsearch documents at a time with the [Bulk Request API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html).
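
To inspect this queue locally, you can look at the sorted sets in Redis directly. The key pattern below is an assumption; scan for the actual key names rather than hard-coding them:

```shell
# List candidate bookkeeping keys (the pattern is a guess; adjust as needed).
redis-cli --scan --pattern 'elastic*'

# Count the queued items in one of the returned ZSETs.
redis-cli ZCARD <key-from-the-scan-above>
```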

Search queries are generated by the concerns found in [`ee/app/models/concerns/elastic`](https://gitlab.com/gitlab-org/gitlab/-/tree/master/ee/app/models/concerns/elastic). These concerns are also in charge of access control, and have been a historical source of security bugs, so pay close attention to them!

## Existing Analyzers/Tokenizers/Filters

These are all defined in [`ee/lib/elastic/latest/config.rb`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/lib/elastic/latest/config.rb).

### Analyzers

#### `path_analyzer`

Used when indexing blobs' paths. Uses the `path_tokenizer` and the `lowercase` and `asciifolding` filters.

See the `path_tokenizer` explanation below for an example.

#### `sha_analyzer`

Used in blobs and commits. Uses the `sha_tokenizer` and the `lowercase` and `asciifolding` filters.

See the `sha_tokenizer` explanation below for an example.

#### `code_analyzer`

Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the [`code`](#code), `lowercase`, and `asciifolding` filters.

The `whitespace` tokenizer was selected to give more control over how tokens are split. For example, the string `Foo::bar(4)` needs to generate tokens like `Foo` and `bar(4)` to be properly searched.

See the `code` filter for an explanation of how tokens are split.
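
You can see what the `whitespace` tokenizer does on its own by calling the `_analyze` API against any running Elasticsearch instance:

```shell
curl -XPOST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"tokenizer": "whitespace", "text": "Foo::bar(4)"}'
# Returns the single token "Foo::bar(4)"; the code filter then splits it further.
```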

NOTE:
The [Elasticsearch `code_analyzer` doesn't account for all code cases](../integration/advanced_search/elasticsearch_troubleshooting.md#elasticsearch-code_analyzer-doesnt-account-for-all-code-cases).

#### `code_search_analyzer`

Not directly used for indexing, but rather used to transform a search input. Uses the `whitespace` tokenizer and the `lowercase` and `asciifolding` filters.

### Tokenizers

#### `sha_tokenizer`

This is a custom tokenizer that uses the [`edgeNGram` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenizer.html) to make SHAs searchable by any subset of at least five characters.

Example:

`240c29dc7e` becomes:

- `240c2`
- `240c29`
- `240c29d`
- `240c29dc`
- `240c29dc7`
- `240c29dc7e`
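
You can reproduce this with a transient `edge_ngram` tokenizer in an `_analyze` call. The gram bounds below are assumptions; the real values live in `config.rb`:

```shell
curl -XPOST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "tokenizer": {"type": "edge_ngram", "min_gram": 5, "max_gram": 40},
    "text": "240c29dc7e"
  }'
```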

#### `path_tokenizer`

This is a custom tokenizer that uses the [`path_hierarchy` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pathhierarchy-tokenizer.html) with `reverse: true` to allow searches to find paths no matter how much or how little of the path is given as input.

Example:

`'/some/path/application.js'` becomes:

- `'/some/path/application.js'`
- `'some/path/application.js'`
- `'path/application.js'`
- `'application.js'`
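
Again, a transient tokenizer in an `_analyze` call reproduces this behavior (a sketch; the production definition in `config.rb` may set additional options):

```shell
curl -XPOST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "tokenizer": {"type": "path_hierarchy", "reverse": true},
    "text": "/some/path/application.js"
  }'
```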

### Filters

#### `code`

Uses a [Pattern Capture token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern-capture-tokenfilter.html) to split tokens into more easily searched versions of themselves.

Patterns:

- `"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)"`: captures CamelCase and lowerCamelCase strings as separate tokens
- `"(\\d+)"`: extracts digits
- `"(?=([\\p{Lu}]+[\\p{L}]+))"`: captures CamelCase strings recursively. For example: `ThisIsATest` => `[ThisIsATest, IsATest, ATest, Test]`
- `'"((?:\\"|[^"]|\\")*)"'`: captures terms inside quotes, removing the quotes
- `"'((?:\\'|[^']|\\')*)'"`: same as above, for single quotes
- `'\.([^.]+)(?=\.|\s|\Z)'`: separates terms joined by periods
- `'([\p{L}_.-]+)'`: matches characters common in file names, keeping the whole filename intact (for example `my_file-ñame.txt`)
- `'([\p{L}\d_]+)'`: letters, numbers and underscores are the most common tokens in programming. Always capture them greedily regardless of context.
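
To experiment with any of these patterns, you can attach a transient `pattern_capture` filter in an `_analyze` call. A minimal sketch for the recursive CamelCase pattern (`preserve_original: true` is an assumption; check `config.rb` for the real setting):

```shell
curl -XPOST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "tokenizer": "whitespace",
    "filter": [{
      "type": "pattern_capture",
      "preserve_original": true,
      "patterns": ["(?=([\\p{Lu}]+[\\p{L}]+))"]
    }],
    "text": "ThisIsATest"
  }'
# Expected tokens: ThisIsATest, IsATest, ATest, Test
```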

## Gotchas

- Searches can have their own analyzers. Remember to check them when editing analyzers.
- `Character` filters (as opposed to token filters) always replace the original character, so they're not a good choice as they can hinder exact searches.

## Zero downtime reindexing with multiple indices

NOTE:
This is not applicable yet as multiple indices functionality is not fully implemented.

Currently, GitLab can only handle a single version of settings. Any setting or schema change requires reindexing everything from scratch. Since reindexing can take a long time, this can cause search functionality downtime.

To avoid downtime, GitLab is working to support multiple indices that
can function at the same time. Whenever the schema changes, the administrator
will be able to create a new index and reindex to it, while searches
continue to go to the older, stable index. Any data updates will be
forwarded to both indices. Once the new index is ready, an administrator can
mark it active, which will direct all searches to it, and remove the old
index.

This is also helpful for migrating to new servers, for example, moving to/from AWS.

We are currently in the process of migrating to this new design. Everything is hardwired to work with a single version for now.

### Architecture

The traditional setup, provided by `elasticsearch-rails`, is to communicate through its internal proxy classes. Developers would write model-specific logic in a module that the model includes (for example, `SnippetsSearch`). The `__elasticsearch__` methods would return a proxy object, for example:

- `Issue.__elasticsearch__` returns an instance of `Elasticsearch::Model::Proxy::ClassMethodsProxy`
- `Issue.first.__elasticsearch__` returns an instance of `Elasticsearch::Model::Proxy::InstanceMethodsProxy`.

These proxy objects would talk to the Elasticsearch server directly (see the top half of the diagram).

![Elasticsearch Architecture](img/elasticsearch_architecture.svg)

In the planned new design, each model would have a pair of corresponding sub-classed proxy objects, in which model-specific logic is located. For example, `Snippet` would have `SnippetClassProxy` and `SnippetInstanceProxy` (subclasses of `Elasticsearch::Model::Proxy::ClassMethodsProxy` and `Elasticsearch::Model::Proxy::InstanceMethodsProxy`, respectively).

`__elasticsearch__` would represent another layer of proxy object, keeping track of multiple actual proxy objects. It would forward method calls to the appropriate index. For example:

- `model.__elasticsearch__.search` would be forwarded to the one stable index, since it is a read operation.
- `model.__elasticsearch__.update_document` would be forwarded to all indices, to keep all indices up-to-date.

The global configurations per version are now in the `Elastic::(Version)::Config` class. You can change mappings there.

### Creating a new version of the schema

NOTE:
This is not applicable yet as multiple indices functionality is not fully implemented.

Folders like `ee/lib/elastic/v12p1` contain snapshots of search logic from different versions. To keep a continuous Git history, the latest version lives under `ee/lib/elastic/latest`, but its classes are aliased under an actual version (for example, `ee/lib/elastic/v12p3`). When referencing these classes, never use the `Latest` namespace directly, but use the actual version (for example, `V12p3`).

The version name follows the GitLab release version. If the settings change in 12.3, we create a new namespace called `V12p3` (p stands for "point"). Raise an issue if there is a need to name a version differently.

If the current version is `v12p1`, and we need to create a new version for `v12p3`, the steps are as follows (see the shell sketch after this list):

1. Copy the entire folder of `v12p1` as `v12p3`
1. Change the namespace for files under the `v12p3` folder from `V12p1` to `V12p3` (which are still aliased to `Latest`)
1. Delete the `v12p1` folder
1. Copy the entire folder of `latest` as `v12p1`
1. Change the namespace for files under the `v12p1` folder from `Latest` to `V12p1`
1. Make changes to files under the `latest` folder as needed
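
A sketch of these steps as shell commands (illustrative only; review the renames rather than relying on a blind `sed`, and note that `sed -i` needs an argument on macOS):

```shell
cd ee/lib/elastic

# 1-2. Copy the current version to the new version name and rename its namespace.
cp -r v12p1 v12p3
grep -rl 'V12p1' v12p3 | xargs sed -i 's/V12p1/V12p3/g'

# 3-5. Replace the old version folder with a snapshot of latest.
rm -rf v12p1
cp -r latest v12p1
grep -rl 'Latest' v12p1 | xargs sed -i 's/Latest/V12p1/g'

# 6. Edit files under latest/ as needed.
```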

## Performance Monitoring

### Prometheus

GitLab exports [Prometheus metrics](../administration/monitoring/prometheus/gitlab_metrics.md)
relating to the number of requests and timing for all web/API requests and Sidekiq jobs,
which can help diagnose performance trends and compare how Elasticsearch timing
is impacting overall performance relative to the time spent doing other things.

#### Indexing queues

GitLab also exports [Prometheus metrics](../administration/monitoring/prometheus/gitlab_metrics.md)
for indexing queues, which can help diagnose performance bottlenecks and determine
whether your GitLab instance or Elasticsearch server can keep up with
the volume of updates.

### Logs

All of the indexing happens in Sidekiq, so many of the relevant logs for the
Elasticsearch integration can be found in
[`sidekiq.log`](../administration/logs/index.md#sidekiqlog). In particular, all
Sidekiq workers that make requests to Elasticsearch in any way log the
number of requests and time taken querying/writing to Elasticsearch. This can
be useful to understand whether your cluster is keeping up with
indexing.
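
For example, to pull the Elasticsearch timings out of the structured log (the field names `elasticsearch_calls` and `elasticsearch_duration_s` are assumptions; check a log line from your own instance first):

```shell
jq 'select(.elasticsearch_calls != null)
    | {class, elasticsearch_calls, elasticsearch_duration_s}' \
  log/sidekiq.log
```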

Searching Elasticsearch is done via ordinary web workers handling requests. Any
request to load a page or make an API request that then makes requests to
Elasticsearch logs the number of requests and the time taken to
[`production_json.log`](../administration/logs/index.md#production_jsonlog). These
logs also include the time spent on database and Gitaly requests, which
may help to diagnose which part of the search is performing poorly.

There are additional logs specific to Elasticsearch that are sent to
[`elasticsearch.log`](../administration/logs/index.md#elasticsearchlog)
that may contain information to help diagnose performance issues.

### Performance Bar

Elasticsearch requests are displayed in the
[`Performance Bar`](../administration/monitoring/performance/performance_bar.md), which can
be used both locally in development and on any deployed GitLab instance to
diagnose poor search performance. It shows the exact queries being made,
which is useful to diagnose why a search might be slow.

### Correlation ID and `X-Opaque-Id`

Our [correlation ID](distributed_tracing.md#developer-guidelines-for-working-with-correlation-ids)
is forwarded by all requests from Rails to Elasticsearch as the
[`X-Opaque-Id`](https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html#_identifying_running_tasks)
header, which allows us to track any
[tasks](https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html)
in the cluster back to the request in GitLab.
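
For example, to look for a given correlation ID among the currently running tasks (the ID below is a placeholder; the header is only visible while a task is running):

```shell
curl "http://localhost:9200/_tasks?detailed=true&pretty" \
  | grep -B 3 'X-Opaque-Id.*aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'
```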

## Troubleshooting

### Getting `flood stage disk watermark [95%] exceeded`

You might get an error such as:

```plaintext
[2018-10-31T15:54:19,762][WARN ][o.e.c.r.a.DiskThresholdMonitor] [pval5Ct]
 flood stage disk watermark [95%] exceeded on
 [pval5Ct7SieH90t5MykM5w][pval5Ct][/usr/local/var/lib/elasticsearch/nodes/0] free: 56.2gb[3%],
 all indices on this node will be marked read-only
```

This means you have exceeded the disk space threshold: based on the default 95% flood-stage watermark, Elasticsearch thinks you don't have enough disk space left.

In addition, the `read_only_allow_delete` setting is set to `true`, which blocks indexing, `forcemerge`, and other write operations. You can check the current index settings with:

```shell
curl "http://localhost:9200/gitlab-development/_settings?pretty"
```
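
After freeing up disk space (or adjusting the watermarks as shown below), you can clear the block manually instead of waiting:

```shell
curl -XPUT "http://localhost:9200/_all/_settings?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'
```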

Add this to your `elasticsearch.yml` file:

```yaml
# turn off the disk allocator
cluster.routing.allocation.disk.threshold_enabled: false
```

_or_

```yaml
# set your own limits
cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.flood_stage: 5gb # ES 6.x only
cluster.routing.allocation.disk.watermark.low: 15gb
cluster.routing.allocation.disk.watermark.high: 10gb
```

Restart Elasticsearch, and the `read_only_allow_delete` setting clears on its own.

_from "Disk-based Shard Allocation | Elasticsearch Reference" [5.6](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/disk-allocator.html#disk-allocator) and [6.x](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/disk-allocator.html)_

### Disaster recovery/data loss/backups

The use of Elasticsearch in GitLab is only ever as a secondary data store.
All of the data stored in Elasticsearch can always be derived
again from other data sources, specifically PostgreSQL and Gitaly. Therefore, if
the Elasticsearch data store is ever corrupted for whatever reason, you can reindex
everything from scratch.

If your Elasticsearch index is very large, reindexing from scratch may be too
time-consuming or cause too much downtime. There aren't any built-in
mechanisms for automatically finding discrepancies and resyncing an
Elasticsearch index if it gets out of sync, but one useful technique is to
inspect the logs for all the updates that occurred in a time range you
believe may have been missed. This information is very low level and only
useful for operators that are familiar with the GitLab codebase. It is
documented here in case it is useful for others. The relevant logs that could
theoretically be used to figure out what needs to be replayed are:

1. All non-repository updates that were synced can be found in
   [`elasticsearch.log`](../administration/logs/index.md#elasticsearchlog) by
   searching for
   [`track_items`](https://gitlab.com/gitlab-org/gitlab/-/blob/1e60ea99bd8110a97d8fc481e2f41cab14e63d31/ee/app/services/elastic/process_bookkeeping_service.rb#L25),
   and these can be replayed by sending the items again through
   `::Elastic::ProcessBookkeepingService.track!` (see the sketch after this list).
1. All repository updates that occurred can be found in
   [`elasticsearch.log`](../administration/logs/index.md#elasticsearchlog) by
   searching for
   [`indexing_commit_range`](https://gitlab.com/gitlab-org/gitlab/-/blob/6f9d75dd3898536b9ec2fb206e0bd677ab59bd6d/ee/lib/gitlab/elastic/indexer.rb#L41).
   Replaying these requires resetting the
   [`IndexStatus#last_commit/last_wiki_commit`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/models/index_status.rb)
   to the oldest `from_sha` in the logs and then triggering another index of
   the project using
   [`ElasticCommitIndexerWorker`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/workers/elastic_commit_indexer_worker.rb).
1. All project deletes that occurred can be found in
   [`sidekiq.log`](../administration/logs/index.md#sidekiqlog) by searching for
   [`ElasticDeleteProjectWorker`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/workers/elastic_delete_project_worker.rb).
   These updates can be replayed by triggering another
   `ElasticDeleteProjectWorker`.
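
As an illustration only, replaying a single missed non-repository update from item 1 might look like this (the issue ID is hypothetical):

```shell
# Re-track one item so the next bulk cron run re-indexes it.
sudo gitlab-rails runner '::Elastic::ProcessBookkeepingService.track!(Issue.find(42))'
```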

With the above methods, and by taking regular
[Elasticsearch snapshots](https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html),
we should be able to recover from different kinds of data loss issues in a
relatively short period of time compared to indexing everything from
scratch.