git.kernel.org/pub/scm/git/git.git
author Jeff King <peff@peff.net> 2021-02-09 13:53:50 +0300
committer Junio C Hamano <gitster@pobox.com> 2021-02-11 20:57:55 +0300
commit 16950f8384afa5106b1ce57da07a964c2aaef3f7
tree 608e6e56eef19255cfa4637cd54b49af26f4f3e1 /t/t6115-rev-list-du.sh
parent 3803a3a0993045605d7f3db363188ce377e917c8
rev-list: add --disk-usage option for calculating disk usage
It can sometimes be useful to see which refs are contributing to the
overall repository size (e.g., does some branch have a bunch of objects
not found elsewhere in history, which indicates that deleting it would
shrink the size of a clone).

You can find that out by generating a list of objects, getting their
sizes from cat-file, and then summing them, like:

    git rev-list --objects --no-object-names main..branch |
    git cat-file --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'

Though note that the caveats from git-cat-file(1) apply here. We "blame"
base objects more than their deltas, even though the relationship could
easily be flipped. Still, it can be a useful rough measure.

But one problem is that it's slow to run. Teaching rev-list to sum up
the sizes can be much faster for two reasons:

  1. It skips all of the piping of object names and sizes.

  2. If bitmaps are in use, for objects that are in the
     bitmapped packfile we can skip the oid_object_info()
     lookup entirely, and just ask the revindex for the
     on-disk size.

This patch implements a --disk-usage option which produces the same
answer in a fraction of the time. Here are some timings using a clone of
torvalds/linux:

  [rev-list piped to cat-file, no bitmaps]
  $ time git rev-list --objects --no-object-names --all |
    git cat-file --buffer --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'
  1459938510
  real    0m29.635s
  user    0m38.003s
  sys     0m1.093s

  [internal, no bitmaps]
  $ time git rev-list --disk-usage --objects --all
  1459938510
  real    0m31.262s
  user    0m30.885s
  sys     0m0.376s

Even though the wall-clock time is slightly worse due to parallelism,
notice the CPU savings between the two. We saved 21% of the CPU just by
avoiding the pipes.

But the real win is with bitmaps. If we use them without the new option:

  [rev-list piped to cat-file, bitmaps]
  $ time git rev-list --objects --no-object-names --all --use-bitmap-index |
    git cat-file --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'
  1459938510
  real    0m6.244s
  user    0m8.452s
  sys     0m0.311s

then we're faster to generate the list of objects, but we still spend a
lot of time piping and looking things up. But if we do both together:

  [internal, bitmaps]
  $ time git rev-list --disk-usage --objects --all --use-bitmap-index
  1459938510
  real    0m0.219s
  user    0m0.169s
  sys     0m0.049s

then we get the same answer much faster.

For "--all", that answer will correspond closely to "du objects/pack",
of course. But we're actually checking reachability here, so we're still
fast when we ask for more interesting things:

  $ time git rev-list --disk-usage --use-bitmap-index v5.0..v5.10
  374798628
  real    0m0.429s
  user    0m0.356s
  sys     0m0.072s

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Diffstat (limited to 't/t6115-rev-list-du.sh')
-rwxr-xr-x t/t6115-rev-list-du.sh | 51
1 file changed, 51 insertions, 0 deletions
diff --git a/t/t6115-rev-list-du.sh b/t/t6115-rev-list-du.sh
new file mode 100755
index 0000000000..b4aef32b71
--- /dev/null
+++ b/t/t6115-rev-list-du.sh
@@ -0,0 +1,51 @@
+#!/bin/sh
+
+test_description='basic tests of rev-list --disk-usage'
+. ./test-lib.sh
+
+# we want a mix of reachable and unreachable, as well as
+# objects in the bitmapped pack and some outside of it
+test_expect_success 'set up repository' '
+	test_commit --no-tag one &&
+	test_commit --no-tag two &&
+	git repack -adb &&
+	git reset --hard HEAD^ &&
+	test_commit --no-tag three &&
+	test_commit --no-tag four &&
+	git reset --hard HEAD^
+'
+
+# We don't want to hardcode sizes, because they depend on the exact details of
+# packing, zlib, etc. We'll assume that the regular rev-list and cat-file
+# machinery works and compare the --disk-usage output to that.
+disk_usage_slow () {
+	git rev-list --no-object-names "$@" |
+	git cat-file --batch-check="%(objectsize:disk)" |
+	perl -lne '$total += $_; END { print $total}'
+}
+
+# check behavior with given rev-list options; note that
+# whitespace is not preserved in args
+check_du () {
+	args=$*
+
+	test_expect_success "generate expected size ($args)" "
+		disk_usage_slow $args >expect
+	"
+
+	test_expect_success "rev-list --disk-usage without bitmaps ($args)" "
+		git rev-list --disk-usage $args >actual &&
+		test_cmp expect actual
+	"
+
+	test_expect_success "rev-list --disk-usage with bitmaps ($args)" "
+		git rev-list --disk-usage --use-bitmap-index $args >actual &&
+		test_cmp expect actual
+	"
+}
+
+check_du HEAD
+check_du --objects HEAD
+check_du --objects HEAD^..HEAD
+
+test_done