From 6c44b676312eb6cdffadef45f9ca3e29a8cc92ab Mon Sep 17 00:00:00 2001
From: GitLab Bot
Date: Fri, 21 Jul 2023 12:08:33 +0000
Subject: Add latest changes from gitlab-org/gitlab@master

---
 doc/development/advanced_search.md | 31 +++++--------------------------
 1 file changed, 5 insertions(+), 26 deletions(-)

(limited to 'doc/development/advanced_search.md')

diff --git a/doc/development/advanced_search.md b/doc/development/advanced_search.md
index 30e1874f1ed..805459cb4ee 100644
--- a/doc/development/advanced_search.md
+++ b/doc/development/advanced_search.md
@@ -52,9 +52,9 @@ during indexing and searching operations. Some of the benefits and tradeoffs to
 - Routing is not used if too many shards would be hit for global and group scoped searches.
 - Shard size imbalance might occur.
 
-## Existing Analyzers/Tokenizers/Filters
+## Existing analyzers and tokenizers
 
-These are all defined in [`ee/lib/elastic/latest/config.rb`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/lib/elastic/latest/config.rb)
+The following analyzers and tokenizers are defined in [`ee/lib/elastic/latest/config.rb`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/lib/elastic/latest/config.rb).
 
 ### Analyzers
 
@@ -72,7 +72,7 @@ Please see the `sha_tokenizer` explanation later below for an example.
 
 #### `code_analyzer`
 
-Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the filters: [`code`](#code), `lowercase`, and `asciifolding`
+Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the [`word_delimiter_graph`](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html), `lowercase`, and `asciifolding`
 filters.
 
 The `whitespace` tokenizer was selected to have more control over how tokens are split. For example the string `Foo::bar(4)` needs to generate tokens like `Foo` and `bar(4)` to be properly searched.
@@ -81,10 +81,6 @@ Please see the `code` filter for an explanation on how tokens are split.
 NOTE:
 The [Elasticsearch `code_analyzer` doesn't account for all code cases](../integration/advanced_search/elasticsearch_troubleshooting.md#elasticsearch-code_analyzer-doesnt-account-for-all-code-cases).
 
-#### `code_search_analyzer`
-
-Not directly used for indexing, but rather used to transform a search input. Uses the `whitespace` tokenizer and the `lowercase` and `asciifolding` filters.
-
 ### Tokenizers
 
 #### `sha_tokenizer`
@@ -115,27 +111,10 @@ Example:
 
 - `'path/application.js'`
 - `'application.js'`
 
-### Filters
-
-#### `code`
-
-Uses a [Pattern Capture token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern-capture-tokenfilter.html) to split tokens into more easily searched versions of themselves.
-
-Patterns:
-
-- `"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)"`: captures CamelCase and lowerCamelCase strings as separate tokens
-- `"(\\d+)"`: extracts digits
-- `"(?=([\\p{Lu}]+[\\p{L}]+))"`: captures CamelCase strings recursively. For example: `ThisIsATest` => `[ThisIsATest, IsATest, ATest, Test]`
-- `'"((?:\\"|[^"]|\\")*)"'`: captures terms inside quotes, removing the quotes
-- `"'((?:\\'|[^']|\\')*)'"`: same as above, for single-quotes
-- `'\.([^.]+)(?=\.|\s|\Z)'`: separate terms with periods in-between
-- `'([\p{L}_.-]+)'`: some common chars in file names to keep the whole filename intact (for example `my_file-ñame.txt`)
-- `'([\p{L}\d_]+)'`: letters, numbers and underscores are the most common tokens in programming. Always capture them greedily regardless of context.
-
 ## Gotchas
 
-- Searches can have their own analyzers. Remember to check when editing analyzers
-- `Character` filters (as opposed to token filters) always replace the original character, so they're not a good choice as they can hinder exact searches
+- Searches can have their own analyzers. Remember to check when editing analyzers.
+- `Character` filters (as opposed to token filters) always replace the original character. These filters can hinder exact searches.
 
 ## Zero downtime reindexing with multiple indices
--
cgit v1.2.3
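
The `code_analyzer` chain the patched doc describes (a `whitespace` tokenizer followed by the `word_delimiter_graph`, `lowercase`, and `asciifolding` token filters) can be approximated with a short sketch. This is a rough, illustrative simulation only, not the actual Elasticsearch analyzer: the simplified sub-word splitter here does not reproduce the many `word_delimiter_graph` options, and `asciifold` only strips combining marks.

```python
import re
import unicodedata


def asciifold(token):
    # Approximation of the asciifolding filter: decompose to NFKD and
    # drop combining marks, so a token like "name" with a tilde folds
    # to plain ASCII "name".
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def code_analyzer_sketch(text):
    # Rough simulation of the pipeline described in the doc:
    # 1. whitespace tokenizer: split only on whitespace, so "Foo::bar(4)"
    #    survives as a single token at this stage.
    # 2. simplified word_delimiter_graph stand-in: additionally emit the
    #    alphanumeric sub-words of each token.
    # 3. lowercase and asciifolding token filters.
    tokens = []
    for raw in text.split():
        parts = re.findall(r"[^\W_]+", raw)
        for candidate in [raw] + [p for p in parts if p != raw]:
            tokens.append(asciifold(candidate.lower()))
    return tokens


print(code_analyzer_sketch("Foo::bar(4)"))
# → ['foo::bar(4)', 'foo', 'bar', '4']
```

This shows why the `whitespace` tokenizer was chosen: the original token `Foo::bar(4)` is preserved alongside its sub-words, so both exact and partial matches remain searchable.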