Welcome to mirror list, hosted at ThFree Co, Russian Federation.

git.kernel.org/pub/scm/git/git.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
path: root/grep.c
AgeCommit message (Collapse)Author
2023-03-30Merge branch 'mk/workaround-pcre-jit-ucp-bug'Junio C Hamano
A recent-ish change to allow unicode character classes to be used with "grep -P" triggered a JIT bug in older pcre2 libraries. The problematic change in Git built with these older libraries has been disabled to work around the bug. * mk/workaround-pcre-jit-ucp-bug: grep: work around UTF-8 related JIT bug in PCRE2 <= 10.34
2023-03-23grep: work around UTF-8 related JIT bug in PCRE2 <= 10.34Mathias Krause
Stephane is reporting[1] a regression introduced in git v2.40.0 that leads to 'git grep' segfaulting in his CI pipeline. It turns out, he's using an older version of libpcre2 that triggers a wild pointer dereference in the generated JIT code that was fixed in PCRE2 10.35. Instead of completely disabling the JIT compiler for the buggy version, just mask out the Unicode property handling as we used to do prior to commit acabd2048ee0 ("grep: correctly identify utf-8 characters with \{b,w} in -P"). [1] https://lore.kernel.org/git/7E83DAA1-F9A9-4151-8D07-D80EA6D59EEA@clumio.com/ Reported-by: Stephane Odul <stephane@clumio.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-24cache.h: remove dependence on hex.h; make other files include it explicitlyElijah Newren
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23Merge branch 'ab/various-leak-fixes'Junio C Hamano
Leak fixes. * ab/various-leak-fixes: push: free_refs() the "local_refs" in set_refspecs() push: refactor refspec_append_mapped() for subsequent leak-fix receive-pack: release the linked "struct command *" list grep API: plug memory leaks by freeing "header_list" grep.c: refactor free_grep_patterns() builtin/merge.c: free "&buf" on "Your local changes..." error builtin/merge.c: use fixed strings, not "strbuf", fix leak show-branch: free() allocated "head" before return commit-graph: fix a parse_options_concat() leak http-backend.c: fix cmd_main() memory leak, refactor reg{exec,free}() http-backend.c: fix "dir" and "cmd_arg" leaks in cmd_main() worktree: fix a trivial leak in prune_worktrees() repack: fix leaks on error with "goto cleanup" name-rev: don't xstrdup() an already dup'd string various: add missing clear_pathspec(), fix leaks clone: use free() instead of UNLEAK() commit-graph: use free_commit_graph() instead of UNLEAK() bundle.c: don't leak the "args" in the "struct child_process" tests: mark tests as passing with SANITIZE=leak
2023-02-16Merge branch 'cb/grep-fallback-failing-jit'Junio C Hamano
In an environment where dynamically generated code is prohibited to run (e.g. SELinux), failure to JIT pcre patterns is expected. Fall back to interpreted execution in such a case. * cb/grep-fallback-failing-jit: grep: fall back to interpreter if JIT memory allocation fails
2023-02-07grep API: plug memory leaks by freeing "header_list"Ævar Arnfjörð Bjarmason
When the "header_list" struct member was added in [1], freeing this field was neglected. Fix that now, so that commands like ./git -P log -1 --color=always --author=A origin/master will run leak-free. 1. 80235ba79ef ("log --author=me --grep=it" should find intersection, not union, 2010-01-17) Helped-by: Elijah Newren <newren@gmail.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-07grep.c: refactor free_grep_patterns()Ævar Arnfjörð Bjarmason
Refactor the free_grep_patterns() function to split out the freeing of the "struct grep_pat" it contains. Right now we're only freeing the "pattern_list", but we should be freeing another member of the same type, which we'll do in the subsequent commit. Let's also replace the "return" if we don't have an "opt->pattern_expression" with a conditional call of free_pattern_expr(). Before db84376f981 (grep.c: remove "extended" in favor of "pattern_expression", fix segfault, 2022-10-11) the pattern here was: if (!x) return; free_pattern_expr(y); While at it, instead of: if (!x) return; free_pattern_expr(x); Let's instead do: if (x) free_pattern_expr(x); This will make it easier to free additional members from free_grep_patterns() in the future. Helped-by: Elijah Newren <newren@gmail.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-31grep: fall back to interpreter if JIT memory allocation failsMathias Krause
Under Linux systems with SELinux's 'deny_execmem' or PaX's MPROTECT enabled, the allocation of PCRE2's JIT rwx memory may be prohibited, making pcre2_jit_compile() fail with PCRE2_ERROR_NOMEMORY (-48): [user@fedora git]$ git grep -c PCRE2_JIT grep.c:1 [user@fedora git]$ # Enable SELinux's W^X policy [user@fedora git]$ sudo semanage boolean -m -1 deny_execmem [user@fedora git]$ # JIT memory allocation fails, breaking 'git grep' [user@fedora git]$ git grep -c PCRE2_JIT fatal: Couldn't JIT the PCRE2 pattern 'PCRE2_JIT', got '-48' Instead of failing hard in this case and making 'git grep' unusable on such systems, simply fall back to interpreter mode, leading to a much better user experience. As having a functional PCRE2 JIT compiler is a legitimate use case for performance reasons, we'll only do the fallback if the supposedly available JIT is found to be non-functional by attempting to JIT compile a very simple pattern. If this fails, JIT is deemed to be non-functional and we do the interpreter fallback. For all other cases, i.e. the simple pattern can be compiled but the user provided cannot, we fail hard as we do now as the reason for the failure must be the pattern itself. To aid users in helping themselves change the error message to include a hint about the '(*NO_JIT)' prefix. Also clip the pattern at 64 characters to ensure the hint will be seen by the user and not internally truncated by the die() function. Cc: Carlo Marcelo Arenas Belón <carenas@gmail.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-19grep: correctly identify utf-8 characters with \{b,w} in -PCarlo Marcelo Arenas Belón
When UTF is enabled for a PCRE match, the corresponding flags are added to the pcre2_compile() call, but PCRE2_UCP wasn't included. This prevents extending the meaning of the character classes to include those new valid characters and therefore result in failed matches for expressions that rely on that extention, for ex: $ git grep -P '\bÆvar' Add PCRE2_UCP so that \w will include Æ and therefore \b could correctly match the beginning of that word. This has an impact on performance that has been estimated to be between 20% to 40% and that is shown through the added performance test. Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com> Acked-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-21Merge branch 'ab/grep-simplify-extended-expression'Junio C Hamano
Giving "--invert-grep" and "--all-match" without "--grep" to the "git log" command resulted in an attempt to access grep pattern expression structure that has not been allocated, which has been corrected. * ab/grep-simplify-extended-expression: grep.c: remove "extended" in favor of "pattern_expression", fix segfault
2022-10-11grep.c: remove "extended" in favor of "pattern_expression", fix segfaultÆvar Arnfjörð Bjarmason
Since 79d3696cfb4 (git-grep: boolean expression on pattern matching., 2006-06-30) the "pattern_expression" member has been used for complex queries (AND/OR...), with "pattern_list" being used for the simple OR queries. Since then we've used both "pattern_expression" and its associated boolean "extended" member to see if we have a complex expression. Since f41fb662f57 (revisions API: have release_revisions() release "grep_filter", 2022-04-13) we've had a subtle bug relating to that: If we supplied options that were only used for "complex queries", but didn't supply the query itself we'd set "opt->extended", but would have a NULL "pattern_expression". As a result these would segfault as we tried to call "free_grep_patterns()" from "release_revisions()": git -P log -1 --invert-grep git -P log -1 --all-match The root cause of this is that we were conflating the state management we needed in "compile_grep_patterns()" itself with whether or not we had an "opt->pattern_expression" later on. In this cases as we're going through "compile_grep_patterns()" we have no "opt->pattern_list" but have "opt->no_body_match" or "opt->all_match". So we'd set "opt->extended = 1", but not "return" on "opt->extended" as that's an "else if" in the same "if" statement. That behavior is intentional and required, as the common case is that we have an "opt->pattern_list" that we're about to parse into the "opt->pattern_expression". But we don't need to keep track of this "extended" flag beyond the state management in compile_grep_patterns() itself. It needs it, but once we're out of that function we can rely on "opt->pattern_expression" being non-NULL instead for using these extended patterns. As 79d3696cfb4 itself shows we've assumed that there's a one-to-one mapping between the two since the very beginning. I.e. "match_line()" would check "opt->extended" to see if it should call "match_expr()", and the first thing we do in that function is assume that we have a "opt->pattern_expression". We'd then call "match_expr_eval()", which would have died if that "opt->pattern_expression" was NULL. The "die" was added in c922b01f54c (grep: fix segfault when "git grep '('" is given, 2009-04-27), and can now be removed as it's now clearly unreachable. We still do the right thing in the case that prompted that fix: git grep '(' fatal: unmatched parenthesis Arguably neither the "--invert-grep" option added in [1] nor the earlier "--all-match" option added in [2] were intended to be used stand-alone, and another approach[3] would be to error out in those cases. But since we've been treating them as a NOOP when given without --grep for a long time let's keep doing that. We could also return in "free_pattern_expr()" if the argument is non-NULL, as an alternative fix for this segfault does [4]. That would be more elegant in making the "free_*()" function behave like "free()", but it would also remove a sanity check: The "free_pattern_expr()" function calls itself recursively, and only the top-level is allowed to be NULL, let's not conflate those two conditions. 1. 22dfa8a23de (log: teach --invert-grep option, 2015-01-12) 2. 0ab7befa31d (grep --all-match, 2006-09-27) 3. https://lore.kernel.org/git/patch-1.1-f4b90799fce-20221010T165711Z-avarab@gmail.com/ 4. http://lore.kernel.org/git/7e094882c2a71894416089f894557a9eae07e8f8.1665423686.git.me@ttaylorr.com Reported-by: orygaw <orygaw@protonmail.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-06-22grep: add --max-count command line optionCarlos López
This patch adds a command line option analogous to that of GNU grep(1)'s -m / --max-count, which users might already be used to. This makes it possible to limit the amount of matches shown in the output while keeping the functionality of other options such as -C (show code context) or -p (show containing function), which would be difficult to do with a shell pipeline (e.g. head(1)). Signed-off-by: Carlos López 00xc@protonmail.com Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-26Merge branch 'rs/pcre-invalid-utf8-fix-fix'Junio C Hamano
Workaround we have for versions of PCRE2 before their version 10.36 were in effect only for their versions newer than 10.36 by mistake, which has been corrected. * rs/pcre-invalid-utf8-fix-fix: grep: fix triggering PCRE2_NO_START_OPTIMIZE workaround
2022-02-26Merge branch 'ab/grep-patterntype'Junio C Hamano
Some code clean-up in the "git grep" machinery. * ab/grep-patterntype: grep: simplify config parsing and option parsing grep.c: do "if (bool && memchr())" not "if (memchr() && bool)" grep.h: make "grep_opt.pattern_type_option" use its enum grep API: call grep_config() after grep_init() grep.c: don't pass along NULL callback value built-ins: trust the "prefix" from run_builtin() grep tests: add missing "grep.patternType" config tests grep tests: create a helper function for "BRE" or "ERE" log tests: check if grep_config() is called by "log"-like cmds grep.h: remove unused "regex_t regexp" from grep_opt
2022-02-18grep: fix triggering PCRE2_NO_START_OPTIMIZE workaroundRené Scharfe
PCRE2 bug 2642 was fixed in version 10.36. Our 95ca1f987e (grep/pcre2: better support invalid UTF-8 haystacks, 2021-01-24) worked around it on older versions by setting the flag PCRE2_NO_START_OPTIMIZE. 797c359978 (grep/pcre2: use compile-time PCREv2 version test, 2021-02-18) switched it around to set the flag on 10.36 and higher instead, while it claimed to use "the same test done at compile-time". Switch the condition back to apply the workaround on PCRE2 versions _before_ 10.36. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-16grep: simplify config parsing and option parsingÆvar Arnfjörð Bjarmason
Simplify the parsing of "grep.patternType" and "grep.extendedRegexp". This changes no behavior, but gets rid of complex parsing logic that isn't needed anymore. When "grep.patternType" was introduced in 84befcd0a4a (grep: add a grep.patternType configuration setting, 2012-08-03) we promised that: 1. You can set "grep.patternType", and "[setting it to] 'default' will return to the default matching behavior". In that context "the default" meant whatever the configuration system specified before that change, i.e. via grep.extendedRegexp. 2. We'd support the existing "grep.extendedRegexp" option, but ignore it when the new "grep.patternType" option is set. We said we'd only ignore the older "grep.extendedRegexp" option "when the `grep.patternType` option is set to a value other than 'default'". In a preceding commit we changed grep_config() to be called after grep_init(), which means that much of the complexity here can go away. As before both "grep.patternType" and "grep.extendedRegexp" are last-one-wins variable, with "grep.extendedRegexp" yielding to "grep.patternType", except when "grep.patternType=default". Note that as the previously added tests indicate this cannot be done on-the-fly as we see the config variables, without introducing more state keeping. I.e. if we see: -c grep.extendedRegexp=false -c grep.patternType=default -c extendedRegexp=true We need to select ERE, since grep.patternType=default unselects that variable, which normally has higher precedence, but we also need to select BRE in cases of: -c grep.extendedRegexp=true \ -c grep.extendedRegexp=false Which would not be the case for this, which select ERE: -c grep.patternType=extended \ -c grep.extendedRegexp=false Therefore we cannot do this on-the-fly in grep_config without also introducing tracking variables for not only the pattern type, but what the source of that pattern type was. So we need to decide on the pattern after our config was fully parsed. Let's do that by deferring the decision on the pattern type until it's time to compile it in compile_regexp(). By that time we've not only parsed the config, but also handled the command-line options. Those will set "opt.pattern_type_option" (*not* "opt.extended_regexp_option"!). At that point all we need to do is see if "grep.patternType" was UNSPECIFIED in the end (including an explicit "=default"), if so we'll use the "grep.extendedRegexp" configuration, if any. See my 07a3d411739 (grep: remove regflags from the public grep_opt API, 2017-06-29) for addition of the two comments being removed here, i.e. the complexity noted in that commit is now going away. 1. https://lore.kernel.org/git/patch-v8-09.10-c211bb0c69d-20220118T155211Z-avarab@gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-16grep.c: do "if (bool && memchr())" not "if (memchr() && bool)"Ævar Arnfjörð Bjarmason
Change code in compile_regexp() to check the cheaper boolean "!opt->pcre2" condition before the "memchr()" search. This doesn't noticeably optimize anything, but makes the code more obvious and conventional. The line wrapping being added here also makes a subsequent commit smaller. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-16grep API: call grep_config() after grep_init()Ævar Arnfjörð Bjarmason
The grep_init() function used the odd pattern of initializing the passed-in "struct grep_opt" with a statically defined "grep_defaults" struct, which would be modified in-place when we invoked grep_config(). So we effectively (b) initialized config, (a) then defaults, (c) followed by user options. Usually those are ordered as "a", "b" and "c" instead. As the comments being removed here show the previous behavior needed to be carefully explained as we'd potentially share the populated configuration among different instances of grep_init(). In practice we didn't do that, but now that it can't be a concern anymore let's remove those comments. This does not change the behavior of any of the configuration variables or options. That would have been the case if we didn't move around the grep_config() call in "builtin/log.c". But now that we call "grep_config" after "git_log_config" and "git_format_config" we'll need to pass in the already initialized "struct grep_opt *". See 6ba9bb76e02 (grep: copy struct in one fell swoop, 2020-11-29) and 7687a0541e0 (grep: move the configuration parsing logic to grep.[ch], 2012-10-09) for the commits that added the comments. The memcpy() pattern here will be optimized away and follows the convention of other *_init() functions. See 5726a6b4012 (*.c *_init(): define in terms of corresponding *_INIT macro, 2021-07-01). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-16built-ins: trust the "prefix" from run_builtin()Ævar Arnfjörð Bjarmason
Change code in "builtin/grep.c" and "builtin/ls-tree.c" to trust the "prefix" passed from "run_builtin()". The "prefix" we get from setup.c is either going to be NULL or a string of length >0, never "". So we can drop the "prefix && *prefix" checks added for "builtin/grep.c" in 0d042fecf2f (git-grep: show pathnames relative to the current directory, 2006-08-11), and for "builtin/ls-tree.c" in a69dd585fca (ls-tree: chomp leading directories when run from a subdirectory, 2005-12-23). As seen in code in revision.c that was added in cd676a51367 (diff --relative: output paths as relative to the current subdirectory, 2008-02-12) we already have existing code that does away with this assertion. This makes it easier to reason about a subsequent change to the "prefix_length" code in grep.c in a subsequent commit, and since we're going to the trouble of doing that let's leave behind an assert() to promise this to any future callers. For "builtin/grep.c" it would be painful to pass the "prefix" down the callchain of: cmd_grep -> grep_tree -> grep_submodule -> grep_cache -> grep_oid -> grep_source_name So for the code that needs it in grep_source_name() let's add a "grep_prefix" variable similar to the existing "ls_tree_prefix". While at it let's move the code in cmd_ls_tree() around so that we assign to the "ls_tree_prefix" right after declaring the variables, and stop assigning to "prefix". We only subsequently used that variable later in the function after clobbering it. Let's just use our own "grep_prefix" instead. Let's also add an assert() in git.c, so that we'll make this promise about the "prefix" to any current and future callers, as well as to any readers of the code. Code history: * The strlen() in "grep.c" hasn't been used since 493b7a08d80 (grep: accept relative paths outside current working directory, 2009-09-05). When that code was added in 0d042fecf2f (git-grep: show pathnames relative to the current directory, 2006-08-11) we used the length. But since 493b7a08d80 we haven't used it for anything except a boolean check that we could have done on the "prefix" member itself. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-05Merge branch 'rs/grep-expr-cleanup'Junio C Hamano
Code clean-up. * rs/grep-expr-cleanup: grep: use grep_and_expr() in compile_pattern_and() grep: extract grep_binexp() from grep_or_expr() grep: use grep_not_expr() in compile_pattern_not() grep: use grep_or_expr() in compile_pattern_or()
2022-01-10Merge branch 'lh/use-gnu-color-in-grep'Junio C Hamano
The color palette used by "git grep" has been updated to match that of GNU grep. * lh/use-gnu-color-in-grep: grep: align default colors with GNU grep ones
2022-01-07grep: use grep_and_expr() in compile_pattern_and()Taylor Blau
In a similar spirit as a previous commit, use the new `grep_and_expr()` to construct the AND node in `compile_pattern_and()`. Unlike the aforementioned previous commit, this is not about code duplication, since this is the only spot in grep.c where an AND node is constructed. Rather, this is about visual consistency with the other `compile_pattern_xyz()` functions. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-07grep: extract grep_binexp() from grep_or_expr()Taylor Blau
When constructing an OR node, the grep.c code uses `grep_or_expr()` to make a node, assign its kind, and set its left and right children. The same is not done for AND nodes. Prepare to introduce a new `grep_and_expr()` function which will share code with the existing implementation of `grep_or_expr()` by introducing a new function which compiles either kind of binary expression, and reimplement `grep_or_expr()` in terms of it. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-06grep: use grep_not_expr() in compile_pattern_not()René Scharfe
Move the definition of grep_not_expr() up and use this function in compile_pattern_not() to simplify the code and reduce duplication. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-06grep: use grep_or_expr() in compile_pattern_or()René Scharfe
Move the definition of grep_or_expr() up and use this function in compile_pattern_or() to reduce code duplication. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-06Merge branch 'rs/pcre2-utf'Junio C Hamano
"git grep --perl-regexp" failed to match UTF-8 characters with wildcard when the pattern consists only of ASCII letters, which has been corrected. * rs/pcre2-utf: grep/pcre2: factor out literal variable grep/pcre2: use PCRE2_UTF even with ASCII patterns
2022-01-05grep: align default colors with GNU grep onesLénaïc Huard
git-grep shares a lot of options with the standard grep tool. Like GNU grep, it has coloring options to highlight the matching text. And like it, it has options to customize the various colored parts. This patch updates the default git-grep colors to make them match the GNU grep default ones [1]. It was possible to get the same result by setting the various `color.grep.<slot>` options, but this patch makes `git grep --color` share the same color scheme as `grep --color` by default without any user configuration. [1] https://www.man7.org/linux/man-pages/man1/grep.1.html#ENVIRONMENT Signed-off-by: Lénaïc Huard <lenaic@lhuard.fr> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-20grep/pcre2: factor out literal variableRené Scharfe
Patterns that contain no wildcards and don't have to be case-folded are literal. Give this condition a name to increase the readability of the boolean expression for enabling the option PCRE2_UTF. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-20grep/pcre2: use PCRE2_UTF even with ASCII patternsRené Scharfe
compile_pcre2_pattern() currently uses the option PCRE2_UTF only for patterns with non-ASCII characters. Patterns with ASCII wildcards can match non-ASCII strings, though. Without that option PCRE2 mishandles UTF-8 input, though -- it matches parts of multi-byte characters. Fix that by using PCRE2_UTF even for ASCII-only patterns. This is a remake of the reverted ae39ba431a (grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data, 2021-10-15). The change to the condition and the test are simplified and more targeted. Original-patch-by: Hamza Mahfooz <someguy@effective-light.com> Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-18log: let --invert-grep only invert --grepRené Scharfe
The option --invert-grep is documented to filter out commits whose messages match the --grep filters. However, it also affects the header matches (--author, --committer), which is not intended. Move the handling of that option to grep.c, as only the code there can distinguish between matches in the header from those in the message body. If --invert-grep is given then enable extended expressions (not the regex type, we just need git grep's --not to work), negate the body patterns and check if any of them match by piggy-backing on the collect_hits mechanism of grep_source_1(). Collecting the matches in struct grep_opt is a bit iffy, but with "last_shown" we have a precedent for writing state information to that struct. Reported-by: Dotan Cohen <dotancohen@gmail.com> Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-19Revert "grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data"Junio C Hamano
This reverts commit ae39ba431ab861548eb60b4bd2e1d8b8813db76f, as it breaks "grep" when looking for a string in non UTF-8 haystack, when linked with certain versions of PCREv2 library. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-15grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 dataHamza Mahfooz
If we attempt to grep non-ascii log message text with an ascii pattern, we run into the following issue: $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author grep: (standard input): binary file matches So, to fix this teach the grep code to use PCRE2_UTF, as long as the log output is encoded in UTF-8. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29grep: refactor next_match() and match_one_pattern() for external useHamza Mahfooz
These changes are made in preparation of, the colorization support for the "git log" subcommands that, rely on regex functionality (i.e. "--author", "--committer" and "--grep"). These changes are necessary primarily because match_one_pattern() expects header lines to be prefixed, however, in pretty, the prefixes are stripped from the lines because the name-email pairs need to go through additional parsing, before they can be printed and because next_match() doesn't handle the case of "ctx == GREP_CONTEXT_HEAD" at all. So, teach next_match() how to handle the new case and move match_one_pattern()'s core logic to headerless_match_one_pattern() while preserving match_one_pattern()'s uses that depend on the additional processing. Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-22grep: store grep_source buffer as constJeff King
Our grep_buffer() function takes a non-const buffer, which is confusing: we don't take ownership of nor write to the buffer. This mostly comes from the fact that the underlying grep_source struct in which we store the buffer uses non-const pointer. The memory pointed to by the struct is sometimes owned by us (for FILE or OID sources), and sometimes not (for BUF sources). Let's store it as const, which lets us err on the side of caution (i.e., the compiler will warn us if any of our code writes to or tries to free it). As a result, we must annotate the one place where we do free it by casting away the constness. But that's a small price to pay for the extra safety and clarity elsewhere (and indeed, it already had a comment explaining why GREP_SOURCE_BUF _didn't_ free it). And then we can mark grep_buffer() as taking a const buffer. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-22grep: mark "haystack" buffers as constJeff King
When we're grepping in a buffer, we don't need to modify it. So we can take "const char *" buffers, rather than "char *". This can avoid some awkward casts in our callers, and make our expectations more clear (we will not take ownership of the memory, nor will we ever write to it). These spots don't all necessarily have to be converted at the same time, but some of them are dependent on others, because we pass pointers-to-pointers in a few cases. And none of this should change any behavior, since we're just adding "const" qualifiers (and likewise, the compiler will let us know if we missed any spots). So it's relatively low-risk to just do this all at once. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-22grep: stop modifying buffer in grep_source_1()Jeff King
We find the end of each matching line of a buffer, and then temporarily write a NUL to turn it into a regular C string. But we don't need to do so, because the only thing we do in the interim is pass the line and its length (via an "eol" pointer) to match_line(). And that function should only look at the bytes we passed it, whether it has a terminating NUL or not. We can drop this temporary write in order to simplify the code and make it easier to use const buffers in more of grep.c. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-22grep: stop modifying buffer in show_line()Jeff King
When showing lines via grep (or looking for funcnames), we call show_line() on a multi-line buffer. It finds the end of line and marks it with a NUL. However, we don't need to do so, as the resulting line is only used along with its "eol" marker: - we pass both to next_match(), which takes care to look at only the bytes we specified - we pass the line to output_color() without its matching eol marker. However, we do use the "match" struct we got from next_match() to tell it how many bytes to look at (which can never exceed the string we passed it). So we can stop setting and restoring this NUL marker. That makes the code simpler, and will allow us to take a const buffer in a future patch. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-22grep: stop modifying buffer in strip_timestampJeff King
When grepping for headers in commit objects, we receive individual lines (e.g., "author Name <email> 1234 -0000"), and then strip off the timestamp to do our match. We do so by writing a NUL byte over the whitespace separator, and then remembering to restore it later. We had to do it this way when this was added back in a4d7d2c6db (log --author/--committer: really match only with name part, 2008-09-04), because we fed the result directly to regexec(), which expects a NUL-terminated string. But since b7d36ffca0 (regex: use regexec_buf(), 2016-09-21), we have a function which can match part of a buffer. So instead of modifying the string, we can instead just move the "eol" pointer, and the rest of the code will do the right thing. This will let further patches mark more buffers as "const". Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-08grep: add repository to OID grep sourcesJonathan Tan
Record the repository whenever an OID grep source is created, and teach the worker threads to explicitly provide the repository when accessing objects. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Reviewed-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-08grep: typesafe versions of grep_source_initJonathan Tan
grep_source_init() can create "struct grep_source" objects and, depending on the value of the type passed, some void-pointer parameters have different meanings. Because one of these types (GREP_SOURCE_OID) will require an additional parameter in a subsequent patch, take the opportunity to increase clarity and type safety by replacing this function with individual functions for each type. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Reviewed-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-14Merge branch 'rs/grep-parser-fix'Junio C Hamano
"git grep --and -e foo" ought to have been diagnosed as an error but instead segfaulted, which has been corrected. * rs/grep-parser-fix: grep: report missing left operand of --and
2021-07-01grep: report missing left operand of --andRené Scharfe
Git grep allows combining two patterns with --and. It checks and reports if the second pattern is missing when compiling the expression. A missing first pattern, however, is only reported later at match time. Thus no error is returned if no matching is done, e.g. because no file matches the also given pathspec. When that happens we get an expression tree with an GREP_NODE_AND node and a NULL pointer to the missing left child. free_pattern_expr() tries to dereference it during the cleanup at the end, which results in a segmentation fault. Fix this by verifying the presence of the left operand at expression compilation time. Reported-by: Matthew Hughes <matthewhughes934@gmail.com> Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-27hash: provide per-algorithm null OIDsbrian m. carlson
Up until recently, object IDs did not have an algorithm member, only a hash. Consequently, it was possible to share one null (all-zeros) object ID among all hash algorithms. Now that we're going to be handling objects from multiple hash algorithms, it's important to make sure that all object IDs have a correct algorithm field. Introduce a per-algorithm null OID, and add it to struct hash_algo. Introduce a wrapper function as well, and use it everywhere we used to use the null_oid constant. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-23Merge branch 'ab/grep-pcre2-allocfix'Junio C Hamano
Updates to memory allocation code around the use of pcre2 library. * ab/grep-pcre2-allocfix: grep/pcre2: move definitions of pcre2_{malloc,free} grep/pcre2: move back to thread-only PCREv2 structures grep/pcre2: actually make pcre2 use custom allocator grep/pcre2: use pcre2_maketables_free() function grep/pcre2: use compile-time PCREv2 version test grep/pcre2: add GREP_PCRE2_DEBUG_MALLOC debug mode grep/pcre2: prepare to add debugging to pcre2_malloc() grep/pcre2: correct reference to grep_init() in comment grep/pcre2: drop needless assignment to NULL grep/pcre2: drop needless assignment + assert() on opt->pcre2
2021-03-14use CALLOC_ARRAYRené Scharfe
Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18grep/pcre2: move definitions of pcre2_{malloc,free}Ævar Arnfjörð Bjarmason
Move the definitions of the pcre2_{malloc,free} functions above the compile_pcre2_pattern() function they're used in. Before the preceding commit they used to be needed earlier, but now we can move them to be adjacent to the other PCREv2 functions. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18grep/pcre2: move back to thread-only PCREv2 structuresÆvar Arnfjörð Bjarmason
Change the setup of the "pcre2_general_context" to happen per-thread in compile_pcre2_pattern() instead of in grep_init(). This change brings it in line with how the rest of the pcre2_* members in the grep_pat structure are set up. As noted in the preceding commit the approach 513f2b0bbd4 (grep: make PCRE2 aware of custom allocator, 2019-10-16) took to allocate the pcre2_general_context seems to have been initially based on a misunderstanding of how PCREv2 memory allocation works. The approach of creating a global context in grep_init() is just added complexity for almost zero gain. On my system it's 24 bytes saved per-thread. For comparison PCREv2 will then go on to allocate at least a kilobyte for its own thread-local state. As noted in 6d423dd542f (grep: don't redundantly compile throwaway patterns under threading, 2017-05-25) the grep code is intentionally not trying to micro-optimize allocations by e.g. sharing some PCREv2 structures globally, while making others thread-local. So let's remove this special case and make all of them thread-local again for simplicity. With this change we could move the pcre2_{malloc,free} functions around to live closer to their current use. I'm not doing that here to keep this change small, that cleanup will be done in a follow-up commit. See also the discussion in 94da9193a6 (grep: add support for PCRE v2, 2017-06-01) about thread safety, and Johannes's comments[1] to the effect that we should be doing what this patch is doing. 1. https://lore.kernel.org/git/nycvar.QRO.7.76.6.1908052120302.46@tvgsbejvaqbjf.bet/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18grep/pcre2: actually make pcre2 use custom allocatorÆvar Arnfjörð Bjarmason
Continue work started in 513f2b0bbd4 (grep: make PCRE2 aware of custom allocator, 2019-10-16) and make PCREv2 use our pcre2_{malloc,free}(). functions for allocation. We'll now use it for all PCREv2 allocations. The reason 513f2b0bbd4 worked as a bugfix for the USE_NED_ALLOCATOR issue is because it targeted the allocation freed via free(), as opposed to by a pcre2_*free() function. I.e. the pcre2_maketables() and pcre2_maketables_free() pair. For most of the rest we continued allocating with stock malloc() inside PCREv2 itself, but didn't segfault because we'd use its corresponding free(). In a preceding commit of mine I changed the free() to pcre2_maketables_free() on versions of PCREv2 10.34 and newer. So as far as fixing the segfault goes we could revert 513f2b0bbd4. But then we wouldn't use the desired allocator, let's just use it instead. Before this patch we'd on e.g.: grep --threads=1 -iP æ.*var.*xyz Only use pcre2_{malloc,free}() for 2 malloc() calls and 2 corresponding free() calls. Now it's 12 calls to each. This can be observed with the GREP_PCRE2_DEBUG_MALLOC debug mode. Reading the history of how this bug got introduced it wasn't present in Johannes's original patch[1] to fix the issue. My reading of that thread is that the approach the follow-up patches to Johannes's original pursued were based on misunderstanding of how the PCREv2 API works. In particular this part of [2]: "most of the time (like when using UTF-8) the chartable (and therefore the global context) is not needed (even when using alternate allocators)" That's simply not how PCREv2 memory allocation works. It's easy to see how the misunderstanding came about. It's because (as noted above) the issue was noticed because of our use of free() in our own grep.c for freeing the memory allocated by pcre2_maketables(). Thus the misunderstanding that PCREv2's compile context is something only needed for pcre2_maketables(), and e.g. an aborted earlier attempt[3] to only set it up when we ourselves called pcre2_maketables(). That's not what PCREv2's compile context is. To quote PCREv2's documentation: "This context just contains pointers to (and data for) external memory management functions that are called from several places in the PCRE2 library." Thus the failed attempts to go down the route of only creating the general context in cases where we ourselves call pcre2_maketables(), before finally settling on the approach 513f2b0bbd4 took of always creating it, but then mostly not using it. Instead we should always create it, and then pass the general context to those functions that accept it, so that they'll consistently use our preferred memory allocation functions. 1. https://lore.kernel.org/git/3397e6797f872aedd18c6d795f4976e1c579514b.1565005867.git.gitgitgadget@gmail.com/ 2. https://lore.kernel.org/git/CAPUEsphMh_ZqcH3M7PXC9jHTfEdQN3mhTAK2JDkdvKBp53YBoA@mail.gmail.com/ 3. https://lore.kernel.org/git/20190806085014.47776-3-carenas@gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18grep/pcre2: use pcre2_maketables_free() functionÆvar Arnfjörð Bjarmason
Make use of the pcre2_maketables_free() function to free the memory allocated by pcre2_maketables(). At first sight it's strange that 10da030ab75 (grep: avoid leak of chartables in PCRE2, 2019-10-16) which added the free() call here doesn't make use of the pcre2_free() the author introduced in the preceding commit in 513f2b0bbd4 (grep: make PCRE2 aware of custom allocator, 2019-10-16). The reason is that at the time the function didn't exist. It was first introduced in PCREv2 version 10.34, released on 2019-11-21. Let's make use of it behind a macro. I don't think this matters for anything to do with custom allocators, but it makes our use of PCREv2 more discoverable. At some distant point in the future we'll be able to drop the version guard, as nobody will be running a version older than 10.34. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18grep/pcre2: use compile-time PCREv2 version testÆvar Arnfjörð Bjarmason
Replace a use of pcre2_config(PCRE2_CONFIG_VERSION, ...) which I added in 95ca1f987ed (grep/pcre2: better support invalid UTF-8 haystacks, 2021-01-24) with the same test done at compile-time. It might be cuter to do this at runtime since we don't have to do the "major >= 11 || (major >= 10 && ...)" test. But in the next commit we'll add another version comparison that absolutely needs to be done at compile-time, so we're better of being consistent across the board. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>