author | Patrick Steinhardt <psteinhardt@gitlab.com> | 2021-06-15 09:09:01 +0300 |
---|---|---|
committer | Patrick Steinhardt <psteinhardt@gitlab.com> | 2021-06-21 08:49:55 +0300 |
commit | d2870b204c6801317a6e4c4fba09968fa6fd283d (patch) | |
tree | 2cbc0fe2a27c028fbd788e4dfa83847f7b1e9f89 | |
parent | 7a33a3366d9b6b67dadec40e64b15f57e45bce04 (diff) |
blob: Speed up LFS pointer search via object type filters
The `ListLFSPointers()` RPC returns all LFS pointers referenced by a set
of revisions. This filtering is quite expensive: we first need to
enumerate all reachable objects, then for each object we need to see
whether it's a blob and whether its size indicates that it can be an LFS
pointer, and finally we need to read each candidate blob's contents to
test whether it really is an LFS pointer.
To optimize this a bit, we already set a blob size limit of 200 bytes,
which is the maximum size an LFS pointer can have. While this greatly
reduces the number of candidate blobs, one issue we have is that
git-rev-list(1) will still unconditionally list all the other object
types. Effectively, we're thus needlessly retrieving object info of all
tags, commits and trees only to notice that they aren't blobs in the
first place. It goes without saying that this is a huge waste of time.
To tackle this problem, we have upstreamed two new options for
git-rev-list(1):
    - By default, git-rev-list(1) will always unconditionally print
      objects which have directly been received either via the command
      line or via stdin. A new option `--filter-provided-objects` has
      been added which changes this behaviour and also causes provided
      revisions to be filtered.

    - A new object type filter `--filter=object:type=<type>` has been
      added which will cause git-rev-list(1) to only list objects whose
      type matches the given type.
Used in combination, this brings down the number of potential LFS
pointer candidates by a significant factor. Executed on linux.git:
    $ git rev-list --objects --filter=blob:limit=200 --all | wc -l
    7146677

    $ git rev-list --objects --filter=blob:limit=200 --all \
        --filter=object:type=blob --filter-provided-objects | wc -l
    15217
For this particular repo, we thus have roughly 470 times fewer objects
(7146677 versus 15217) to check for being an LFS pointer. Naturally,
this is an artificial demonstration only because we don't typically
search for LFS objects with `--all`. But we can expect that this
translates to speedups at a smaller scale by not having to do pointless
work.
So let's use this by setting up the new `withObjectTypeFilter()` option
in case we're running a Git version which supports it. No new feature
flag is introduced given that we only implement it on the new pipeline
code, which is already guarded by a feature flag anyway.
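A version gate of this kind could look roughly like the sketch below. It assumes that the object type filter first shipped with Git 2.32.0. The `version` type and its methods are hypothetical stand-ins and not the code behind `SupportsObjectTypeFilter()`.

```go
package main

import "fmt"

// version is a hypothetical, simplified representation of a Git version.
type version struct {
	major, minor, patch int
}

// parseVersion parses the leading "major.minor.patch" part of a version
// string. Any suffix such as ".rc1" is ignored by this sketch.
func parseVersion(s string) (version, error) {
	var v version
	if _, err := fmt.Sscanf(s, "%d.%d.%d", &v.major, &v.minor, &v.patch); err != nil {
		return version{}, fmt.Errorf("parsing version %q: %w", s, err)
	}
	return v, nil
}

// lessThan compares two versions component by component.
func (v version) lessThan(other version) bool {
	if v.major != other.major {
		return v.major < other.major
	}
	if v.minor != other.minor {
		return v.minor < other.minor
	}
	return v.patch < other.patch
}

// minObjectTypeFilterVersion is an assumption: the first release taken to
// support `--filter=object:type=<type>`.
var minObjectTypeFilterVersion = version{major: 2, minor: 32, patch: 0}

// supportsObjectTypeFilter reports whether the object type filter may be used.
func (v version) supportsObjectTypeFilter() bool {
	return !v.lessThan(minObjectTypeFilterVersion)
}

func main() {
	for _, s := range []string{"2.31.1", "2.32.0"} {
		v, err := parseVersion(s)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(s, v.supportsObjectTypeFilter())
	}
}
```

Older Git versions simply keep using the blob size filter alone, so results stay the same and only the amount of work differs.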
Changelog: performance
-rw-r--r-- | internal/gitaly/service/blob/lfs_pointers.go | 12 |
1 files changed, 11 insertions, 1 deletions
diff --git a/internal/gitaly/service/blob/lfs_pointers.go b/internal/gitaly/service/blob/lfs_pointers.go
index f1a5c625b..200377d77 100644
--- a/internal/gitaly/service/blob/lfs_pointers.go
+++ b/internal/gitaly/service/blob/lfs_pointers.go
@@ -73,7 +73,17 @@ func (s *server) ListLFSPointers(in *gitalypb.ListLFSPointersRequest, stream git
 		return helper.ErrInternal(fmt.Errorf("creating catfile process: %w", err))
 	}
 
-	revlistChan := revlist(ctx, repo, in.GetRevisions(), withBlobLimit(lfsPointerMaxSize))
+	gitVersion, err := git.CurrentVersion(ctx, s.gitCmdFactory)
+	if err != nil {
+		return helper.ErrInternalf("cannot determine Git version: %v", err)
+	}
+
+	revlistOptions := []revlistOption{withBlobLimit(lfsPointerMaxSize)}
+	if gitVersion.SupportsObjectTypeFilter() {
+		revlistOptions = append(revlistOptions, withObjectTypeFilter(objectTypeBlob))
+	}
+
+	revlistChan := revlist(ctx, repo, in.GetRevisions(), revlistOptions...)
 	catfileInfoChan := catfileInfo(ctx, catfileProcess, revlistChan)
 	catfileInfoChan = catfileInfoFilter(ctx, catfileInfoChan, func(r catfileInfoResult) bool {
 		return r.objectInfo.Type == "blob" && r.objectInfo.Size <= lfsPointerMaxSize