blame: add a fingerprint heuristic to match ignored lines

This algorithm will replace the heuristic used to identify lines from ignored commits with one that finds likely candidate lines in the parent's version of the file. The actual replacement occurs in an upcoming commit. The old heuristic simply assigned lines in the target to the same line number (plus offset) in the parent. The new function uses a fingerprinting algorithm to detect similarity between lines. The new heuristic is designed to accurately match changes made mechanically by formatting tools such as clang-format and clang-tidy. These tools make changes such as breaking up lines to fit within a character limit or changing identifiers to fit with a naming convention. The heuristic is not intended to match more extensive refactoring changes and may give misleading results in such cases. In most cases formatting tools preserve line ordering, so the heuristic is optimised for such cases. (Some types of changes do reorder lines e.g. sorting keep the line content identical, the git blame -M option can already be used to address this). The reason that it is advantageous to rely on ordering is due to source code repeating the same character sequences often e.g. declaring an identifier on one line and using that identifier on several subsequent lines. This means that lines can look very similar to each other which presents a problem when doing fuzzy matching. Relying on ordering gives us extra clues to point towards the true match. The heuristic operates on a single diff chunk change at a time. It creates a “fingerprint” for each line on each side of the change. Fingerprints are described in detail in the comment for `struct fingerprint`, but essentially are a multiset of the character pairs in a line. The heuristic first identifies the line in the target entry whose fingerprint is most clearly matched to a line fingerprint in the parent entry. Where fingerprints match identically, the position of the lines is used as a tie-break. The heuristic locks in the best match, and subtracts the fingerprint of the line in the target entry from the fingerprint of the line in the parent entry to prevent other lines being matched on the same parts of that line. It then repeats the process recursively on the section of the chunk before the match, and then the section of the chunk after the match. Here's an example of the difference the fingerprinting makes. Consider a file with two commits: commit-a 1) void func_1(void *x, void *y); commit-b 2) void func_2(void *x, void *y); After a commit 'X', we have: commit-X 1) void func_1(void *x, commit-X 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); When we blame-ignored with the old algorithm, we get: commit-a 1) void func_1(void *x, commit-b 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); Where commit-b is blamed for 2 instead of 3. With the fingerprint algorithm, we get: commit-a 1) void func_1(void *x, commit-a 2) void *y); commit-b 3) void func_2(void *x, commit-b 4) void *y); Note line 2 could be matched with either commit-a or commit-b as it is equally similar to both lines, but is matched with commit-a because its position as a fraction of the new line range is more similar to commit-a as a fraction of the old line range. Line 4 is also equally similar to both lines, but as it appears after line 3 which will be matched first it cannot be matched with an earlier line. For many more examples, see t/t8014-blame-ignore-fuzzy.sh which contains example parent and target files and the line numbers in the parent that must be matched. Signed-off-by: Michael Platings <michael@platin.gs> Signed-off-by: Junio C Hamano <gitster@pobox.com>
author: Michael Platings <michael@platin.gs> 2019-06-20 19:38:18 +0300
committer: Junio C Hamano <gitster@pobox.com> 2019-06-20 23:38:08 +0300
commit: 1d028dc682d0cb4420fb124419ebc60e913d421c (patch)
tree: 867fcb0a2ad521ff5e5828044642f2892dd91aac /t/t8014-blame-ignore-fuzzy.sh
parent: 1fc73384bac289b4907f0adba2a25d0644affa73 (diff)
1 files changed, 440 insertions, 0 deletions
diff --git a/t/t8014-blame-ignore-fuzzy.sh b/t/t8014-blame-ignore-fuzzy.sh
new file mode 100755
index 0000000000..8443966152
--- /dev/null
+++ b/t/t8014-blame-ignore-fuzzy.sh
@@ -0,0 +1,440 @@
+#!/bin/sh
+
+test_description='git blame ignore fuzzy heuristic'
+. ./test-lib.sh
+
+# short circuit until blame has the fuzzy capabilities
+test_done
+
+pick_author='s/^[0-9a-f^]* *(\([^ ]*\) .*/\1/'
+
+# Each test is composed of 4 variables:
+# titleN - the test name
+# aN - the initial content
+# bN - the final content
+# expectedN - the line numbers from aN that we expect git blame
+#             on bN to identify, or "Final" if bN itself should
+#             be identified as the origin of that line.
+
+# We start at test 2 because setup will show as test 1
+title2="Regression test for partially overlapping search ranges"
+cat <<EOF >a2
+1
+2
+3
+abcdef
+5
+6
+7
+ijkl
+9
+10
+11
+pqrs
+13
+14
+15
+wxyz
+17
+18
+19
+EOF
+cat <<EOF >b2
+abcde
+ijk
+pqr
+wxy
+EOF
+cat <<EOF >expected2
+4
+8
+12
+16
+EOF
+
+title3="Combine 3 lines into 2"
+cat <<EOF >a3
+if ((maxgrow==0) ||
+	( single_line_field && (field->dcols < maxgrow)) ||
+	(!single_line_field && (field->drows < maxgrow)))
+EOF
+cat <<EOF >b3
+if ((maxgrow == 0) || (single_line_field && (field->dcols < maxgrow)) ||
+	(!single_line_field && (field->drows < maxgrow))) {
+EOF
+cat <<EOF >expected3
+2
+3
+EOF
+
+title4="Add curly brackets"
+cat <<EOF >a4
+	if (rows) *rows = field->rows;
+	if (cols) *cols = field->cols;
+	if (frow) *frow = field->frow;
+	if (fcol) *fcol = field->fcol;
+EOF
+cat <<EOF >b4
+	if (rows) {
+		*rows = field->rows;
+	}
+	if (cols) {
+		*cols = field->cols;
+	}
+	if (frow) {
+		*frow = field->frow;
+	}
+	if (fcol) {
+		*fcol = field->fcol;
+	}
+EOF
+cat <<EOF >expected4
+1
+1
+Final
+2
+2
+Final
+3
+3
+Final
+4
+4
+Final
+EOF
+
+
+title5="Combine many lines and change case"
+cat <<EOF >a5
+for(row=0,pBuffer=field->buf;
+	row<height;
+	row++,pBuffer+=width )
+{
+	if ((len = (int)( After_End_Of_Data( pBuffer, width ) - pBuffer )) > 0)
+	{
+		wmove( win, row, 0 );
+		waddnstr( win, pBuffer, len );
+EOF
+cat <<EOF >b5
+for (Row = 0, PBuffer = field->buf; Row < Height; Row++, PBuffer += Width) {
+	if ((Len = (int)(afterEndOfData(PBuffer, Width) - PBuffer)) > 0) {
+		wmove(win, Row, 0);
+		waddnstr(win, PBuffer, Len);
+EOF
+cat <<EOF >expected5
+1
+5
+7
+8
+EOF
+
+title6="Rename and combine lines"
+cat <<EOF >a6
+bool need_visual_update = ((form != (FORM *)0)      &&
+	(form->status & _POSTED) &&
+	(form->current==field));
+
+if (need_visual_update)
+	Synchronize_Buffer(form);
+
+if (single_line_field)
+{
+	growth = field->cols * amount;
+	if (field->maxgrow)
+		growth = Minimum(field->maxgrow - field->dcols,growth);
+	field->dcols += growth;
+	if (field->dcols == field->maxgrow)
+EOF
+cat <<EOF >b6
+bool NeedVisualUpdate = ((Form != (FORM *)0) && (Form->status & _POSTED) &&
+	(Form->current == field));
+
+if (NeedVisualUpdate) {
+	synchronizeBuffer(Form);
+}
+
+if (SingleLineField) {
+	Growth = field->cols * amount;
+	if (field->maxgrow) {
+		Growth = Minimum(field->maxgrow - field->dcols, Growth);
+	}
+	field->dcols += Growth;
+	if (field->dcols == field->maxgrow) {
+EOF
+cat <<EOF >expected6
+1
+3
+4
+5
+6
+Final
+7
+8
+10
+11
+12
+Final
+13
+14
+EOF
+
+# Both lines match identically so position must be used to tie-break.
+title7="Same line twice"
+cat <<EOF >a7
+abc
+abc
+EOF
+cat <<EOF >b7
+abcd
+abcd
+EOF
+cat <<EOF >expected7
+1
+2
+EOF
+
+title8="Enforce line order"
+cat <<EOF >a8
+abcdef
+ghijkl
+ab
+EOF
+cat <<EOF >b8
+ghijk
+abcd
+EOF
+cat <<EOF >expected8
+2
+3
+EOF
+
+title9="Expand lines and rename variables"
+cat <<EOF >a9
+int myFunction(int ArgumentOne, Thing *ArgTwo, Blah XuglyBug) {
+	Squiggle FabulousResult = squargle(ArgumentOne, *ArgTwo,
+		XuglyBug) + EwwwGlobalWithAReallyLongNameYepTooLong;
+	return FabulousResult * 42;
+}
+EOF
+cat <<EOF >b9
+int myFunction(int argument_one, Thing *arg_asdfgh,
+	Blah xugly_bug) {
+	Squiggle fabulous_result = squargle(argument_one,
+		*arg_asdfgh, xugly_bug)
+		+ g_ewww_global_with_a_really_long_name_yep_too_long;
+	return fabulous_result * 42;
+}
+EOF
+cat <<EOF >expected9
+1
+1
+2
+3
+3
+4
+5
+EOF
+
+title10="Two close matches versus one less close match"
+cat <<EOF >a10
+abcdef
+abcdef
+ghijkl
+EOF
+cat <<EOF >b10
+gh
+abcdefx
+EOF
+cat <<EOF >expected10
+Final
+2
+EOF
+
+# The first line of b matches best with the last line of a, but the overall
+# match is better if we match it with the the first line of a.
+title11="Piggy in the middle"
+cat <<EOF >a11
+abcdefg
+ijklmn
+abcdefgh
+EOF
+cat <<EOF >b11
+abcdefghx
+ijklm
+EOF
+cat <<EOF >expected11
+1
+2
+EOF
+
+title12="No trailing newline"
+printf "abc\ndef" >a12
+printf "abx\nstu" >b12
+cat <<EOF >expected12
+1
+Final
+EOF
+
+title13="Reorder includes"
+cat <<EOF >a13
+#include "c.h"
+#include "b.h"
+#include "a.h"
+#include "e.h"
+#include "d.h"
+EOF
+cat <<EOF >b13
+#include "a.h"
+#include "b.h"
+#include "c.h"
+#include "d.h"
+#include "e.h"
+EOF
+cat <<EOF >expected13
+3
+2
+1
+5
+4
+EOF
+
+last_test=13
+
+test_expect_success setup '
+	{ for i in $(test_seq 2 $last_test)
+	do
+		# Append each line in a separate commit to make it easy to
+		# check which original line the blame output relates to.
+
+		line_count=0 &&
+		{ while IFS= read line
+		do
+			line_count=$((line_count+1)) &&
+			echo "$line" >>"$i" &&
+			git add "$i" &&
+			test_tick &&
+			GIT_AUTHOR_NAME="$line_count" git commit -m "$line_count"
+		done } <"a$i"
+	done } &&
+
+	{ for i in $(test_seq 2 $last_test)
+	do
+		# Overwrite the files with the final content.
+		cp b$i $i &&
+		git add $i
+	done } &&
+	test_tick &&
+
+	# Commit the final content all at once so it can all be
+	# referred to with the same commit ID.
+	GIT_AUTHOR_NAME=Final git commit -m Final &&
+
+	IGNOREME=$(git rev-parse HEAD)
+'
+
+for i in $(test_seq 2 $last_test); do
+	eval title="\$title$i"
+	test_expect_success "$title" \
+	"git blame -M9 --ignore-rev $IGNOREME $i >output &&
+	sed -e \"$pick_author\" output >actual &&
+	test_cmp expected$i actual"
+done
+
+# This invoked a null pointer dereference when the chunk callback was called
+# with a zero length parent chunk and there were no more suspects.
+test_expect_success 'Diff chunks with no suspects' '
+	test_write_lines xy1 A B C xy1 >file &&
+	git add file &&
+	test_tick &&
+	GIT_AUTHOR_NAME=1 git commit -m 1 &&
+
+	test_write_lines xy2 A B xy2 C xy2 >file &&
+	git add file &&
+	test_tick &&
+	GIT_AUTHOR_NAME=2 git commit -m 2 &&
+	REV_2=$(git rev-parse HEAD) &&
+
+	test_write_lines xy3 A >file &&
+	git add file &&
+	test_tick &&
+	GIT_AUTHOR_NAME=3 git commit -m 3 &&
+	REV_3=$(git rev-parse HEAD) &&
+
+	test_write_lines 1 1 >expected &&
+
+	git blame --ignore-rev $REV_2 --ignore-rev $REV_3 file >output &&
+	sed -e "$pick_author" output >actual &&
+
+	test_cmp expected actual
+	'
+
+test_expect_success 'position matching' '
+	test_write_lines abc def >file2 &&
+	git add file2 &&
+	test_tick &&
+	GIT_AUTHOR_NAME=1 git commit -m 1 &&
+
+	test_write_lines abc def abc def >file2 &&
+	git add file2 &&
+	test_tick &&
+	GIT_AUTHOR_NAME=2 git commit -m 2 &&
+
+	test_write_lines abcx defx abcx defx >file2 &&
+	git add file2 &&
+	test_tick &&
+	GIT_AUTHOR_NAME=3 git commit -m 3 &&
+	REV_3=$(git rev-parse HEAD) &&
+
+	test_write_lines abcy defy abcx defx >file2 &&
+	git add file2 &&
+	test_tick &&
+	GIT_AUTHOR_NAME=4 git commit -m 4 &&
+	REV_4=$(git rev-parse HEAD) &&
+
+	test_write_lines 1 1 2 2 >expected &&
+
+	git blame --ignore-rev $REV_3 --ignore-rev $REV_4 file2 >output &&
+	sed -e "$pick_author" output >actual &&
+
+	test_cmp expected actual
+	'
+
+# This fails if each blame entry is processed independently instead of
+# processing each diff change in full.
+test_expect_success 'preserve order' '
+	test_write_lines bcde >file3 &&
+	git add file3 &&
+	test_tick &&
+	GIT_AUTHOR_NAME=1 git commit -m 1 &&
+
+	test_write_lines bcde fghij >file3 &&
+	git add file3 &&
+	test_tick &&
+	GIT_AUTHOR_NAME=2 git commit -m 2 &&
+
+	test_write_lines bcde fghij abcd >file3 &&
+	git add file3 &&
+	test_tick &&
+	GIT_AUTHOR_NAME=3 git commit -m 3 &&
+
+	test_write_lines abcdx fghijx bcdex >file3 &&
+	git add file3 &&
+	test_tick &&
+	GIT_AUTHOR_NAME=4 git commit -m 4 &&
+	REV_4=$(git rev-parse HEAD) &&
+
+	test_write_lines abcdx fghijy bcdex >file3 &&
+	git add file3 &&
+	test_tick &&
+	GIT_AUTHOR_NAME=5 git commit -m 5 &&
+	REV_5=$(git rev-parse HEAD) &&
+
+	test_write_lines 1 2 3 >expected &&
+
+	git blame --ignore-rev $REV_4 --ignore-rev $REV_5 file3 >output &&
+	sed -e "$pick_author" output >actual &&
+
+	test_cmp expected actual
+	'
+
+test_done
author	Michael Platings <michael@platin.gs>	2019-06-20 19:38:18 +0300
committer	Junio C Hamano <gitster@pobox.com>	2019-06-20 23:38:08 +0300
commit	1d028dc682d0cb4420fb124419ebc60e913d421c (patch)
tree	867fcb0a2ad521ff5e5828044642f2892dd91aac /t/t8014-blame-ignore-fuzzy.sh
parent	1fc73384bac289b4907f0adba2a25d0644affa73 (diff)