Welcome to mirror list, hosted at ThFree Co, Russian Federation.

git.kernel.org/pub/scm/git/git.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-12-15pack-revindex: implement `midx_pair_to_pack_pos()`Taylor Blau
Now that we have extracted the `midx_key_to_pack_pos()` function, we can implement the `midx_pair_to_pack_pos()` function which accepts (pack_id, offset) tuples and returns an index into the psuedo-pack order. This will be used in a following commit in order to figure out whether or not the MIDX chose a given delta's base object from the same pack as the delta resides in. It will do so by locating the base object's offset in the pack, and then performing a binary search using the same pack ID with the base object's offset. If (and only if) it finds a match (at any position) we can guarantee that the MIDX selected both halves of the delta/base pair from the same pack. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-15pack-revindex: factor out `midx_key_to_pack_pos()` helperTaylor Blau
The `midx_to_pack_pos()` function implements a binary search over objects in the MIDX between lexical and pseudo-pack order. It does this by taking in an index into the lexical order (i.e. the same argument you'd use for `nth_midxed_object_id()` and similar) and spits out a position in the pseudo-pack order. This works for all callers, since they currently all are translating from lexical order to pseudo-pack order. But future callers may want to translate a known (offset, pack_id) tuple into an index into the psuedo-pack order, without knowing where that (offset, pack_id) tuple appears in lexical order. Prepare for implementing a function that translates between a (offset, pack_id) tuple into an index into the psuedo-pack order by extracting a helper function which does just that, and then reimplementing midx_to_pack_pos() in terms of it. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-15midx: implement `midx_preferred_pack()`Taylor Blau
When performing a binary search over the objects in a MIDX's bitmap (i.e. in pseudo-pack order), the reader reconstructs the pseudo-pack ordering using a combination of (a) the preferred pack, (b) the pack's lexical position in the MIDX based on pack names, and (c) the object offset within the pack. In order to perform this binary search, the reader must know the identity of the preferred pack. This could be stored in the MIDX, but isn't for historical reasons, mostly because it can easily be inferred at read-time by looking at the object in the first bit position and finding out which pack it was selected from in the MIDX, like so: nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0)); In midx_to_pack_pos() which performs this binary search, we look up the identity of the preferred pack before each search. This is relatively quick, since it involves two table-driven lookups (one in the MIDX's revindex for `pack_pos_to_midx()`, and another in the MIDX's object table for `nth_midxed_pack_int_id()`). But since the preferred pack does not change after the MIDX is written, it is safe to cache this value on the MIDX itself. Write a helper to do just that, and rewrite all of the existing call-sites that care about the identity of the preferred pack in terms of this new helper. This will prepare us for a subsequent patch where we will need to binary search through the MIDX's pseudo-pack order multiple times. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-23Merge branch 'jk/chunk-bounds'Junio C Hamano
The codepaths that read "chunk" formatted files have been corrected to pay attention to the chunk size and notice broken files. * jk/chunk-bounds: (21 commits) t5319: make corrupted large-offset test more robust chunk-format: drop pair_chunk_unsafe() commit-graph: detect out-of-order BIDX offsets commit-graph: check bounds when accessing BIDX chunk commit-graph: check bounds when accessing BDAT chunk commit-graph: bounds-check generation overflow chunk commit-graph: check size of generations chunk commit-graph: bounds-check base graphs chunk commit-graph: detect out-of-bounds extra-edges pointers commit-graph: check size of commit data chunk midx: check size of revindex chunk midx: bounds-check large offset chunk midx: check size of object offset chunk midx: enforce chunk alignment on reading midx: check size of pack names chunk commit-graph: check consistency of fanout table midx: check size of oid lookup chunk commit-graph: check size of oid fanout chunk midx: stop ignoring malformed oid fanout chunk t: add library for munging chunk-format files ...
2023-10-10midx: check size of revindex chunkJeff King
When we load a revindex from disk, we check the size of the file compared to the number of objects we expect it to have. But when we use a RIDX chunk stored directly in the midx, we just access the memory directly. This can lead to out-of-bounds memory access for a corrupted or malicious multi-pack-index file. We can catch this by recording the RIDX chunk size, and then checking it against the expected size when we "load" the revindex. Note that this check is much simpler than the one that load_revindex_from_disk() does, because we just have the data array with no header (so we do not need to account for the header size, and nor do we need to bother validating the header values). The test confirms both that we catch this case, and that we continue the process (the revindex is required to use the midx bitmaps, but we fallback to a non-bitmap traversal). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-30parse: separate out parsing functions from config.hCalvin Wan
The files config.{h,c} contain functions that have to do with parsing, but not config. In order to further reduce all-in-one headers, separate out functions in config.c that do not operate on config into its own file, parse.h, and update the include directives in the .c files that need only such functions accordingly. Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21object-store-ll.h: split this header out of object-store.hElijah Newren
The vast majority of files including object-store.h did not need dir.h nor khash.h. Split the header into two files, and let most just depend upon object-store-ll.h, while letting the two callers that need it depend on the full object-store.h. After this patch: $ git grep -h include..object-store | sort | uniq -c 2 #include "object-store.h" 129 #include "object-store-ll.h" Diff best viewed with `--color-moved`. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-28Merge branch 'ds/fsck-pack-revindex'Junio C Hamano
"git fsck" learned to validate the on-disk pack reverse index files. * ds/fsck-pack-revindex: fsck: validate .rev file header fsck: check rev-index position values fsck: check rev-index checksums fsck: create scaffolding for rev-index checks
2023-04-28Merge branch 'tb/pack-revindex-on-disk'Junio C Hamano
The on-disk reverse index that allows mapping from the pack offset to the object name for the object stored at the offset has been enabled by default. * tb/pack-revindex-on-disk: t: invert `GIT_TEST_WRITE_REV_INDEX` config: enable `pack.writeReverseIndex` by default pack-revindex: introduce `pack.readReverseIndex` pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK pack-revindex: make `load_pack_revindex` take a repository t5325: mark as leak-free pack-write.c: plug a leak in stage_tmp_packfiles()
2023-04-18fsck: validate .rev file headerDerrick Stolee
While parsing a .rev file, we check the header information to be sure it makes sense. This happens before doing any additional validation such as a checksum or value check. In order to differentiate between a bad header and a non-existent file, we need to update the API for loading a reverse index. Make load_pack_revindex_from_disk() non-static and specify that a positive value means "the file does not exist" while other errors during parsing are negative values. Since an invalid header prevents setting up the structures we would use for further validations, we can stop at that point. The place where we can distinguish between a missing file and a corrupt file is inside load_revindex_from_disk(), which is used both by pack rev-indexes and multi-pack-index rev-indexes. Some tests in t5326 demonstrate that it is critical to take some conditions to allow positive error signals. Add tests that check the three header values. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-18fsck: check rev-index position valuesDerrick Stolee
When checking a rev-index file, it may be helpful to identify exactly which positions are incorrect. Compare the rev-index to a freshly-computed in-memory rev-index and report the comparison failures. This additional check (on top of the checksum validation) can help find files that were corrupt by a single bit flip on-disk or perhaps were written incorrectly due to a bug in Git. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-18fsck: check rev-index checksumsDerrick Stolee
The previous change added calls to verify_pack_revindex() in builtin/fsck.c, but the implementation of the method was left empty. Add the first and most-obvious check to this method: checksum verification. While here, create a helper method in the test script that makes it easy to adjust the .rev file and check that 'git fsck' reports the correct error message. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-18fsck: create scaffolding for rev-index checksDerrick Stolee
The 'fsck' builtin checks many of Git's on-disk data structures, but does not currently validate the pack rev-index files (a .rev file to pair with a .pack and .idx file). Before doing a more-involved check process, create the scaffolding within builtin/fsck.c to have a new error type and add that error type when the API method verify_pack_revindex() returns an error. That method does nothing currently, but we will add checks to it in later changes. For now, check that 'git fsck' succeeds without any errors in the normal case. Future checks will be paired with tests that corrupt the .rev file appropriately. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-13pack-revindex: introduce `pack.readReverseIndex`Taylor Blau
Since 1615c567b8 (Documentation/config/pack.txt: advertise 'pack.writeReverseIndex', 2021-01-25), we have had the `pack.writeReverseIndex` configuration option, which tells Git whether or not it is allowed to write a ".rev" file when indexing a pack. Introduce a complementary configuration knob, `pack.readReverseIndex` to control whether or not Git will read any ".rev" file(s) that may be available on disk. This option is useful for debugging, as well as disabling the effect of ".rev" files in certain instances. This is useful because of the trade-off[^1] between the time it takes to generate a reverse index (slow from scratch, fast when reading an existing ".rev" file), and the time it takes to access a record (the opposite). For example, even though it is faster to use the on-disk reverse index when computing the on-disk size of a packed object, it is slower to enumerate the same value for all objects. Here are a couple of examples from linux.git. When computing the above for a single object, using the on-disk reverse index is significantly faster: $ git rev-parse HEAD >in $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" <in' Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in Time (mean ± σ): 302.5 ms ± 12.5 ms [User: 258.7 ms, System: 43.6 ms] Range (min … max): 291.1 ms … 328.1 ms 10 runs Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in Time (mean ± σ): 3.9 ms ± 0.3 ms [User: 1.6 ms, System: 2.4 ms] Range (min … max): 2.0 ms … 4.4 ms 801 runs Summary 'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in' ran 77.29 ± 7.14 times faster than 'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in' , but when instead trying to compute the on-disk object size for all objects in the repository, using the ".rev" file is a disadvantage over creating the reverse index from scratch: $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" --batch-all-objects' Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" --batch-all-objects Time (mean ± σ): 8.258 s ± 0.035 s [User: 7.949 s, System: 0.308 s] Range (min … max): 8.199 s … 8.293 s 10 runs Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" --batch-all-objects Time (mean ± σ): 16.976 s ± 0.107 s [User: 16.706 s, System: 0.268 s] Range (min … max): 16.839 s … 17.105 s 10 runs Summary 'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" --batch-all-objects' ran 2.06 ± 0.02 times faster than 'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" --batch-all-objects' Luckily, the results when running `git cat-file` with `--unordered` are closer together: $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects' Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects Time (mean ± σ): 5.066 s ± 0.105 s [User: 4.792 s, System: 0.274 s] Range (min … max): 4.943 s … 5.220 s 10 runs Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects Time (mean ± σ): 6.193 s ± 0.069 s [User: 5.937 s, System: 0.255 s] Range (min … max): 6.145 s … 6.356 s 10 runs Summary 'git.compile -c pack.readReverseIndex=false cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects' ran 1.22 ± 0.03 times faster than 'git.compile -c pack.readReverseIndex=true cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects' Because the equilibrium point between these two is highly machine- and repository-dependent, allow users to configure whether or not they will read any ".rev" file(s) with this configuration knob. [^1]: Generating a reverse index in memory takes O(N) time (where N is the number of objects in the repository), since we use a radix sort. Reading an entry from an on-disk ".rev" file is slower since each operation is bound by disk I/O instead of memory I/O. In order to compute the on-disk size of a packed object, we need to find the offset of our object, and the adjacent object (the on-disk size difference of these two). Finding the first offset requires a binary search. Finding the latter involves a single .rev lookup. Signed-off-by: Taylor Blau <me@ttaylorr.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-13pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISKTaylor Blau
In ec8e7760ac (pack-revindex: ensure that on-disk reverse indexes are given precedence, 2021-01-25), we introduced GIT_TEST_REV_INDEX_DIE_IN_MEMORY to abort the process when Git generated a reverse index from scratch. ec8e7760ac was about ensuring that Git prefers a .rev file when available over generating the same information in memory from scratch. In a subsequent patch, we'll introduce `pack.readReverseIndex`, which may be used to disable reading ".rev" files when available. In order to ensure that those files are indeed being ignored, introduce an analogous option to abort the process when Git reads a ".rev" file from disk. Signed-off-by: Taylor Blau <me@ttaylorr.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-13pack-revindex: make `load_pack_revindex` take a repositoryTaylor Blau
In a future commit, we will introduce a `pack.readReverseIndex` configuration, which forces Git to generate the reverse index from scratch instead of loading it from disk. In order to avoid reading this configuration value more than once, we'll use the `repo_settings` struct to lazily load this value. In order to access the `struct repo_settings`, add a repository argument to `load_pack_revindex`, and update all callers to pass the correct instance (in all cases, `the_repository`). In certain instances, a new function-local variable is introduced to take the place of a `struct repository *` argument to the function itself to avoid propagating the new parameter even further throughout the tree. Co-authored-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11treewide: remove cache.h inclusion due to object-file.h changesElijah Newren
Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11object-file.h: move declarations for object-file.c functions from cache.hElijah Newren
Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11treewide: be explicit about dependence on trace.h & trace2.hElijah Newren
Dozens of files made use of trace and trace2 functions, without explicitly including trace.h or trace2.h. This made it more difficult to find which files could remove a dependence on cache.h. Make C files explicitly include trace.h or trace2.h if they are using them. Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21treewide: be explicit about dependence on gettext.hElijah Newren
Dozens of files made use of gettext functions, without explicitly including gettext.h. This made it more difficult to find which files could remove a dependence on cache.h. Make C files explicitly include gettext.h if they are using it. However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an include of gettext.h, it was left out to avoid conflicting with an in-flight topic. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-27midx: read `RIDX` chunk when presentTaylor Blau
When a MIDX contains the new `RIDX` chunk, ensure that the reverse index is read from it instead of the on-disk .rev file. Since we need to encode the object order in the MIDX itself for correctness reasons, there is no point in storing the same data again outside of the MIDX. So, this patch stops writing separate .rev files, and reads it out of the MIDX itself. This is possible to do with relatively little new code, since the format of the RIDX chunk is identical to the data in the .rev file. In other words, we can implement this by pointing the `revindex_data` field at the reverse index chunk of the MIDX instead of the .rev file without any other changes. Note that we have two knobs that are adjusted for the new tests: GIT_TEST_MIDX_WRITE_REV and GIT_TEST_MIDX_READ_RIDX. The former controls whether the MIDX .rev is written at all, and the latter controls whether we read the MIDX's RIDX chunk. Both are necessary to ensure that the test added at the beginning of this series continues to work. This is because we always need to write the RIDX chunk in the MIDX in order to change its checksum, but we want to make sure reading the existing .rev file still works (since the RIDX chunk takes precedence by default). Arguably this isn't a very interesting mode to test, because the precedence rules mean that we'll always read the RIDX chunk over the .rev file. But it makes it impossible for a user to induce corruption in their repository by adjusting the test knobs (since if we had an either/or knob they could stop writing the RIDX chunk, allowing them to tweak the MIDX's object order without changing its checksum). Signed-off-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-27pack-revindex.c: instrument loading on-disk reverse indexTaylor Blau
In a subsequent commit, we'll use the MIDX's new 'RIDX' chunk as a source for the reverse index's data. But it will be useful for tests to be able to determine whether the reverse index was loaded from the separate .rev file, or from a chunk within the MIDX. To instrument this, add a trace2 event which the tests can look for in order to determine the reverse index's source. Signed-off-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-29midx.c: write MIDX filenames to strbufTaylor Blau
To ask for the name of a MIDX and its corresponding .rev file, callers invoke get_midx_filename() and get_midx_rev_filename(), respectively. These both invoke xstrfmt(), allocating a chunk of memory which must be freed later on. This makes callers in pack-bitmap.c somewhat awkward. Specifically, midx_bitmap_filename(), which is implemented like: return xstrfmt("%s-%s.bitmap", get_midx_filename(midx->object_dir), hash_to_hex(get_midx_checksum(midx))); this leaks the second argument to xstrfmt(), which itself was allocated with xstrfmt(). This caller could assign both the result of get_midx_filename() and the outer xstrfmt() to a temporary variable, remembering to free() the former before returning. But that involves a wasteful copy. Instead, get_midx_filename() and get_midx_rev_filename() take a strbuf as an output parameter. This way midx_bitmap_filename() can manipulate and pass around a temporary buffer which it detaches back to its caller. That allows us to implement the function without copying or open-coding get_midx_filename() in a way that doesn't leak. Update the other callers of get_midx_filename() and get_midx_rev_filename() accordingly. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-08Merge branch 'tb/reverse-midx'Junio C Hamano
An on-disk reverse-index to map the in-pack location of an object back to its object name across multiple packfiles is introduced. * tb/reverse-midx: midx.c: improve cache locality in midx_pack_order_cmp() pack-revindex: write multi-pack reverse indexes pack-write.c: extract 'write_rev_file_order' pack-revindex: read multi-pack reverse indexes Documentation/technical: describe multi-pack reverse indexes midx: make some functions non-static midx: keep track of the checksum midx: don't free midx_name early midx: allow marking a pack as preferred t/helper/test-read-midx.c: add '--show-objects' builtin/multi-pack-index.c: display usage on unrecognized command builtin/multi-pack-index.c: don't enter bogus cmd_mode builtin/multi-pack-index.c: split sub-commands builtin/multi-pack-index.c: define common usage with a macro builtin/multi-pack-index.c: don't handle 'progress' separately builtin/multi-pack-index.c: inline 'flags' with options
2021-04-01pack-revindex: read multi-pack reverse indexesTaylor Blau
Implement reading for multi-pack reverse indexes, as described in the previous patch. Note that these functions don't yet have any callers, and won't until multi-pack reachability bitmaps are introduced in a later patch series. In the meantime, this patch implements some of the infrastructure necessary to support multi-pack bitmaps. There are three new functions exposed by the revindex API: - load_midx_revindex(): loads the reverse index corresponding to the given multi-pack index. - midx_to_pack_pos() and pack_pos_to_midx(): these convert between the multi-pack index and pseudo-pack order. load_midx_revindex() and pack_pos_to_midx() are both relatively straightforward. load_midx_revindex() needs a few functions to be exposed from the midx API. One to get the checksum of a midx, and another to get the .rev's filename. Similar to recent changes in the packed_git struct, three new fields are added to the multi_pack_index struct: one to keep track of the size, one to keep track of the mmap'd pointer, and another to point past the header and at the reverse index's data. pack_pos_to_midx() simply reads the corresponding entry out of the table. midx_to_pack_pos() is the trickiest, since it needs to find an object's position in the psuedo-pack order, but that order can only be recovered in the .rev file itself. This mapping can be implemented with a binary search, but note that the thing we're binary searching over isn't an array of values, but rather a permuted order of those values. So, when comparing two items, it's helpful to keep in mind the difference. Instead of a traditional binary search, where you are comparing two things directly, here we're comparing a (pack, offset) tuple with an index into the multi-pack index. That index describes another (pack, offset) tuple, and it is _those_ two tuples that are compared. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-27pack-revindex.c: don't close unopened file descriptorsTaylor Blau
When opening a reverse index, load_revindex_from_disk() jumps to the 'cleanup' label in case something goes wrong: the reverse index had the wrong size, an unrecognized version, or similar. It also jumps to this label when the reverse index couldn't be opened in the first place, which will cause an error with the unguarded close() call in the label. Guard this call with "if (fd >= 0)" to make sure that we have a valid file descriptor to close before attempting to close it. Reported-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-26pack-revindex: ensure that on-disk reverse indexes are given precedenceTaylor Blau
When an on-disk reverse index exists, there is no need to generate one in memory. In fact, doing so can be slow, and require large amounts of the heap. Let's make sure that we treat the on-disk reverse index with precedence (i.e., that when it exists, we don't bother trying to generate an equivalent one in memory) by teaching Git how to conditionally die() when generating a reverse index in memory. Then, add a test to ensure that when (a) an on-disk reverse index exists, and (b) when setting GIT_TEST_REV_INDEX_DIE_IN_MEMORY, that we do not die, implying that we read from the on-disk one. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-26packfile: prepare for the existence of '*.rev' filesTaylor Blau
Specify the format of the on-disk reverse index 'pack-*.rev' file, as well as prepare the code for the existence of such files. The reverse index maps from pack relative positions (i.e., an index into the array of object which is sorted by their offsets within the packfile) to their position within the 'pack-*.idx' file. Today, this is done by building up a list of (off_t, uint32_t) tuples for each object (the off_t corresponding to that object's offset, and the uint32_t corresponding to its position in the index). To convert between pack and index position quickly, this array of tuples is radix sorted based on its offset. This has two major drawbacks: First, the in-memory cost scales linearly with the number of objects in a pack. Each 'struct revindex_entry' is sizeof(off_t) + sizeof(uint32_t) + padding bytes for a total of 16. To observe this, force Git to load the reverse index by, for e.g., running 'git cat-file --batch-check="%(objectsize:disk)"'. When asking for a single object in a fresh clone of the kernel, Git needs to allocate 120+ MB of memory in order to hold the reverse index in memory. Second, the cost to sort also scales with the size of the pack. Luckily, this is a linear function since 'load_pack_revindex()' uses a radix sort, but this cost still must be paid once per pack per process. As an example, it takes ~60x longer to print the _size_ of an object as it does to print that entire object's _contents_: Benchmark #1: git.compile cat-file --batch <obj Time (mean ± σ): 3.4 ms ± 0.1 ms [User: 3.3 ms, System: 2.1 ms] Range (min … max): 3.2 ms … 3.7 ms 726 runs Benchmark #2: git.compile cat-file --batch-check="%(objectsize:disk)" <obj Time (mean ± σ): 210.3 ms ± 8.9 ms [User: 188.2 ms, System: 23.2 ms] Range (min … max): 193.7 ms … 224.4 ms 13 runs Instead, avoid computing and sorting the revindex once per process by writing it to a file when the pack itself is generated. The format is relatively straightforward. It contains an array of uint32_t's, the length of which is equal to the number of objects in the pack. The ith entry in this table contains the index position of the ith object in the pack, where "ith object in the pack" is determined by pack offset. One thing that the on-disk format does _not_ contain is the full (up to) eight-byte offset corresponding to each object. This is something that the in-memory revindex contains (it stores an off_t in 'struct revindex_entry' along with the same uint32_t that the on-disk format has). Omit it in the on-disk format, since knowing the index position for some object is sufficient to get a constant-time lookup in the pack-*.idx file to ask for an object's offset within the pack. This trades off between the on-disk size of the 'pack-*.rev' file for runtime to chase down the offset for some object. Even though the lookup is constant time, the constant is heavier, since it can potentially involve two pointer walks in v2 indexes (one to access the 4-byte offset table, and potentially a second to access the double wide offset table). Consider trying to map an object's pack offset to a relative position within that pack. In a cold-cache scenario, more page faults occur while switching between binary searching through the reverse index and searching through the *.idx file for an object's offset. Sure enough, with a cold cache (writing '3' into '/proc/sys/vm/drop_caches' after 'sync'ing), printing out the entire object's contents is still marginally faster than printing its size: Benchmark #1: git.compile cat-file --batch-check="%(objectsize:disk)" <obj >/dev/null Time (mean ± σ): 22.6 ms ± 0.5 ms [User: 2.4 ms, System: 7.9 ms] Range (min … max): 21.4 ms … 23.5 ms 41 runs Benchmark #2: git.compile cat-file --batch <obj >/dev/null Time (mean ± σ): 17.2 ms ± 0.7 ms [User: 2.8 ms, System: 5.5 ms] Range (min … max): 15.6 ms … 18.2 ms 45 runs (Numbers taken in the kernel after cheating and using the next patch to generate a reverse index). There are a couple of approaches to improve cold cache performance not pursued here: - We could include the object offsets in the reverse index format. Predictably, this does result in fewer page faults, but it triples the size of the file, while simultaneously duplicating a ton of data already available in the .idx file. (This was the original way I implemented the format, and it did show `--batch-check='%(objectsize:disk)'` winning out against `--batch`.) On the other hand, this increase in size also results in a large block-cache footprint, which could potentially hurt other workloads. - We could store the mapping from pack to index position in more cache-friendly way, like constructing a binary search tree from the table and writing the values in breadth-first order. This would result in much better locality, but the price you pay is trading O(1) lookup in 'pack_pos_to_index()' for an O(log n) one (since you can no longer directly index the table). So, neither of these approaches are taken here. (Thankfully, the format is versioned, so we are free to pursue these in the future.) But, cold cache performance likely isn't interesting outside of one-off cases like asking for the size of an object directly. In real-world usage, Git is often performing many operations in the revindex (i.e., asking about many objects rather than a single one). The trade-off is worth it, since we will avoid the vast majority of the cost of generating the revindex that the extra pointer chase will look like noise in the following patch's benchmarks. This patch describes the format and prepares callers (like in pack-revindex.c) to be able to read *.rev files once they exist. An implementation of the writer will appear in the next patch, and callers will gradually begin to start using the writer in the patches that follow after that. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-14pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()'Taylor Blau
To prepare for on-disk reverse indexes, remove a spot in 'offset_to_pack_pos()' that looks at the 'revindex' array in 'struct packed_git'. Even though this use of the revindex pointer is within pack-revindex.c, this clean up is still worth doing. Since the 'revindex' pointer will be NULL when reading from an on-disk reverse index (instead the 'revindex_data' pointer will be mmaped to the 'pack-*.rev' file), this call-site would have to include a conditional to lookup the offset for position 'mi' each iteration through the search. So instead of open-coding 'pack_pos_to_offset()', call it directly from within 'offset_to_pack_pos()'. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-14pack-revindex: hide the definition of 'revindex_entry'Taylor Blau
Now that all spots outside of pack-revindex.c that reference 'struct revindex_entry' directly have been removed, it is safe to hide the implementation by moving it from pack-revindex.h to pack-revindex.c. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-14pack-revindex: remove unused 'find_revindex_position()'Taylor Blau
Now that all 'find_revindex_position()' callers have been removed (and converted to the more descriptive 'offset_to_pack_pos()'), it is almost safe to get rid of 'find_revindex_position()' entirely. Almost, except for the fact that 'offset_to_pack_pos()' calls 'find_revindex_position()'. Inline 'find_revindex_position()' into 'offset_to_pack_pos()', and then remove 'find_revindex_position()' entirely. This is a straightforward refactoring with one minor snag. 'offset_to_pack_pos()' used to load the index before calling 'find_revindex_position()'. That means that by the time 'find_revindex_position()' starts executing, 'p->num_objects' can be safely read. After inlining, be careful to not read 'p->num_objects' until _after_ 'load_pack_revindex()' (which loads the index as a side-effect) has been called. Another small fix that is included is converting the upper- and lower-bounds to be unsigned's instead of ints. This dates back to 92e5c77c37 (revindex: export new APIs, 2013-10-24)--ironically, the last time we introduced new APIs here--but this unifies the types. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-14pack-revindex: remove unused 'find_pack_revindex()'Taylor Blau
Now that no callers of 'find_pack_revindex()' remain, remove the function's declaration and implementation entirely. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-14pack-revindex: introduce a new APITaylor Blau
In the next several patches, we will prepare for loading a reverse index either in memory (mapping the inverse of the .idx's contents in-core), or directly from a yet-to-be-introduced on-disk format. To prepare for that, we'll introduce an API that avoids the caller explicitly indexing the revindex pointer in the packed_git structure. There are four ways to interact with the reverse index. Accordingly, four functions will be exported from 'pack-revindex.h' by the time that the existing API is removed. A caller may: 1. Load the pack's reverse index. This involves opening up the index, generating an array, and then sorting it. Since opening the index can fail, this function ('load_pack_revindex()') returns an int. Accordingly, it takes only a single argument: the 'struct packed_git' the caller wants to build a reverse index for. This function is well-suited for both the current and new API. Callers will have to continue to open the reverse index explicitly, but this function will eventually learn how to detect and load a reverse index from the on-disk format, if one exists. Otherwise, it will fallback to generating one in memory from scratch. 2. Convert a pack position into an offset. This operation is now called `pack_pos_to_offset()`. It takes a pack and a position, and returns the corresponding off_t. Any error simply calls BUG(), since the callers are not well-suited to handle a failure and keep going. 3. Convert a pack position into an index position. Same as above; this takes a pack and a position, and returns a uint32_t. This operation is known as `pack_pos_to_index()`. The same thinking about error conditions applies here as well. 4. Find the pack position for a given offset. This operation is now known as `offset_to_pack_pos()`. It takes a pack, an offset, and a pointer to a uint32_t where the position is written, if an object exists at that offset. Otherwise, -1 is returned to indicate failure. Unlike some of the callers that used to access '->offset' and '->nr' directly, the error checking around this call is somewhat more robust. This is important since callers should always pass an offset which points at the boundary of two objects. The API, unlike direct access, enforces that that is the case. This will become important in a subsequent patch where a caller which does not but could check the return value treats the signed `-1` from `find_revindex_position()` as an index into the 'revindex' array. Two design warts are carried over into the new API: - Asking for the index position of an out-of-bounds object will result in a BUG() (since no such object exists), but asking for the offset of the non-existent object at the end of the pack returns the total size of the pack. This makes it convenient for callers who always want to take the difference of two adjacent object's offsets (to compute the on-disk size) but don't want to worry about boundaries at the end of the pack. - offset_to_pack_pos() lazily loads the reverse index, but pack_pos_to_index() doesn't (callers of the former are well-suited to handle errors, but callers of the latter are not). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-17compute pack .idx byte offsets using size_tJeff King
A pack and its matching .idx file are limited to 2^32 objects, because the pack format contains a 32-bit field to store the number of objects. Hence we use uint32_t in the code. But the byte count of even a .idx file can be much larger than that, because it stores at least a hash and an offset for each object. So using SHA-1, a v2 .idx file will cross the 4GB boundary at 153,391,650 objects. This confuses load_idx(), which computes the minimum size like this: unsigned long min_size = 8 + 4*256 + nr*(hashsz + 4 + 4) + hashsz + hashsz; Even though min_size will be big enough on most 64-bit platforms, the actual arithmetic is done as a uint32_t, resulting in a truncation. We actually exceed that min_size, but then we do: unsigned long max_size = min_size; if (nr) max_size += (nr - 1)*8; to account for the variable-sized table. That computation doesn't overflow quite so low, but with the truncation for min_size, we end up with a max_size that is much smaller than our actual size. So we complain that the idx is invalid, and can't find any of its objects. We can fix this case by casting "nr" to a size_t, which will do the multiplication in 64-bits (assuming you're on a 64-bit platform; this will never work on a 32-bit system since we couldn't map the whole .idx anyway). Likewise, we don't have to worry about further additions, because adding a smaller number to a size_t will convert the other side to a size_t. A few notes: - obviously we could just declare "nr" as a size_t in the first place (and likewise, packed_git.num_objects). But it's conceptually a uint32_t because of the on-disk format, and we correctly treat it that way in other contexts that don't need to compute byte offsets (e.g., iterating over the set of objects should and generally does use a uint32_t). Switching to size_t would make all of those other cases look wrong. - it could be argued that the proper type is off_t to represent the file offset. But in practice the .idx file must fit within memory, because we mmap the whole thing. And the rest of the code (including the idx_size variable we're comparing against) uses size_t. - we'll add the same cast to the max_size arithmetic line. Even though we're adding to a larger type, which will convert our result, the multiplication is still done as a 32-bit value and can itself overflow. I didn't check this with my test case, since it would need an even larger pack (~530M objects), but looking at compiler output shows that it works this way. The standard should agree, but I couldn't find anything explicit in 6.3.1.8 ("usual arithmetic conversions"). The case in load_idx() was the most immediate one that I was able to trigger. After fixing it, looking up actual objects (including the very last one in sha1 order) works in a test repo with 153,725,110 objects. That's because bsearch_hash() works with uint32_t entry indices, and the actual byte access: int cmp = hashcmp(table + mi * stride, sha1); is done with "stride" as a size_t, causing the uint32_t "mi" to be promoted to a size_t. This is the way most code will access the index data. However, I audited all of the other byte-wise accesses of packed_git.index_data, and many of the others are suspect (they are similar to the max_size one, where we are adding to a properly sized offset or directly to a pointer, but the multiplication in the sub-expression can overflow). I didn't trigger any of these in practice, but I believe they're potential problems, and certainly adding in the cast is not going to hurt anything here. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-16pack-revindex: open index if necessaryJeff King
We can't create a pack revindex if we haven't actually looked at the index. Normally we would never get as far as creating a revindex without having already been looking in the pack, so this code never bothered to double-check that pack->index_data had been loaded. But with the new multi-pack-index feature, many code paths might not load the individual pack .idx at all (they'd find objects via the midx and then open the .pack, but not its index). This can't yet be triggered in practice, because a bug in the midx code means we accidentally open up the individual .idx files anyway. But in preparation for fixing that, let's have the revindex code check that everything it needs has been loaded. In most cases this will just be a quick noop. But note that this does introduce a possibility of error (if we have to open the index and it's corrupt), so load_pack_revindex() now returns a result code, and callers need to handle the error. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-02-04various: tighten constness of some local variablesShahzad Lone
Signed-off-by: Shahzad Lone <shahzadlone@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-15pack-revindex: express constants in terms of the_hash_algobrian m. carlson
Express the various constants used in terms of the_hash_algo. While we're at it, fix a comment style issue as well. Reviewed-by: Stefan Beller <sbeller@google.com> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-26object-store: move packed_git and packed_git_mru to object storeStefan Beller
In a process with multiple repositories open, packfile accessors should be associated to a single repository and not shared globally. Move packed_git and packed_git_mru into the_repository and adjust callers to reflect this. [nd: while at there, wrap access to these two fields in get_packed_git() and get_packed_git_mru(). This allows us to lazily initialize these fields without caller doing that explicitly] Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-01-19packfile: use get_be64() for large offsetsDerrick Stolee
The pack-index version 2 format uses two 4-byte integers in network-byte order to represent one 8-byte value. The current implementation has several code clones for stitching these integers together. Use get_be64() to create an 8-byte integer from two 4-byte integers represented this way. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-31use SWAP macroRené Scharfe
Apply the semantic patch swap.cocci to convert hand-rolled swaps to use the macro SWAP. The resulting code is shorter and easier to read, the object code is effectively unchanged. The patch for object.c had to be hand-edited in order to preserve the comment before the change; Coccinelle tried to eat it for some reason. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-26use COPY_ARRAYRené Scharfe
Add a semantic patch for converting certain calls of memcpy(3) to COPY_ARRAY() and apply that transformation to the code base. The result is shorter and safer code. For now only consider calls where source and destination have the same type, or in other words: easy cases. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-27Merge branch 'jk/tighten-alloc'Junio C Hamano
Update various codepaths to avoid manually-counted malloc(). * jk/tighten-alloc: (22 commits) ewah: convert to REALLOC_ARRAY, etc convert ewah/bitmap code to use xmalloc diff_populate_gitlink: use a strbuf transport_anonymize_url: use xstrfmt git-compat-util: drop mempcpy compat code sequencer: simplify memory allocation of get_message test-path-utils: fix normalize_path_copy output buffer size fetch-pack: simplify add_sought_entry fast-import: simplify allocation in start_packfile write_untracked_extension: use FLEX_ALLOC helper prepare_{git,shell}_cmd: use argv_array use st_add and st_mult for allocation size computation convert trivial cases to FLEX_ARRAY macros use xmallocz to avoid size arithmetic convert trivial cases to ALLOC_ARRAY convert manual allocations to argv_array argv-array: add detach function add helpers for allocating flex-array structs harden REALLOC_ARRAY and xcalloc against size_t overflow tree-diff: catch integer overflow in combine_diff_path allocation ...
2016-02-23convert trivial cases to ALLOC_ARRAYJeff King
Each of these cases can be converted to use ALLOC_ARRAY or REALLOC_ARRAY, which has two advantages: 1. It automatically checks the array-size multiplication for overflow. 2. It always uses sizeof(*array) for the element-size, so that it can never go out of sync with the declared type of the array. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-12-22pack-revindex: store entries directly in packed_gitJeff King
A pack_revindex struct has two elements: the revindex entries themselves, and a pointer to the packed_git. We need both to do lookups, because only the latter knows things like the number of objects in the pack. Now that packed_git contains the pack_revindex struct it's just as easy to pass around the packed_git itself, and we do not need the extra back-pointer. We can instead just store the entries directly in the pack. All functions which took a pack_revindex now just take a packed_git. We still lazy-load in find_pack_revindex, so most callers are unaffected. The exception is the bitmap code, which computes the revindex and caches the pointer when we load the bitmaps. We can continue to load, drop the extra cache pointer, and just access bitmap_git.pack.revindex directly. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-12-22pack-revindex: drop hash tableJeff King
The main entry point to the pack-revindex code is find_pack_revindex(). This calls revindex_for_pack(), which lazily computes and caches the revindex for the pack. We store the cache in a very simple hash table. It's created by init_pack_revindex(), which inserts an entry for every packfile we know about, and we never grow or shrink the hash. If we ever need the revindex for a pack that isn't in the hash, we die() with an internal error. This can lead to a race, because we may load more packs after having called init_pack_revindex(). For example, imagine we have one process which needs to look at the revindex for a variety of objects (e.g., cat-file's "%(objectsize:disk)" format). Simultaneously, git-gc is running, which is doing a `git repack -ad`. We might hit a sequence like: 1. We need the revidx for some packed object. We call find_pack_revindex() and end up in init_pack_revindex() to create the hash table for all packs we know about. 2. We look up another object and can't find it, because the repack has removed the pack it's in. We re-scan the pack directory and find a new pack containing the object. It gets added to our packed_git list. 3. We call find_pack_revindex() for the new object, which hits revindex_for_pack() for our new pack. It can't find the packed_git in the revindex hash, and dies. You could also replace the `repack` above with a push or fetch to create a new pack, though these are less likely (you would have to somehow learn about the new objects to look them up). Prior to 1a6d8b9 (do not discard revindex when re-preparing packfiles, 2014-01-15), this was safe, as we threw away the revindex whenever we re-scanned the pack directory (and thus re-created the revindex hash on the fly). However, we don't want to simply revert that commit, as it was solving a different race. So we have a few options: - We can fix the race in 1a6d8b9 differently, by having the bitmap code look in the revindex hash instead of caching the pointer. But this would introduce a lot of extra hash lookups for common bitmap operations. - We could teach the revindex to dynamically add new packs to the hash table. This would perform the same, but would mean adding extra code to the revindex hash (which currently cannot be resized at all). - We can get rid of the hash table entirely. There is exactly one revindex per pack, so we can just store it in the packed_git struct. Since it's initialized lazily, it does not add to the startup cost. This is the best of both worlds: less code and fewer hash table lookups. The original code likely avoided this in the name of encapsulation. But the packed_git and reverse_index code are fairly intimate already, so it's not much of a loss. This patch implements the final option. It's a minimal conversion that retains the pack_revindex struct. No callers need to change, and we can do further cleanup in a follow-on patch. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-10-26Silence GCC's "cast of pointer to integer of a different size" warningJohannes Schindelin
When calculating hashes from pointers, it actually makes sense to cut off the most significant bits. In that case, said warning does not make a whole lot of sense. So let's just work around it by casting the pointer first to intptr_t and then casting up/down to the final integral type. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-05-28pack-revindex.c: rearrange xcalloc argumentsBrian Gesiak
xcalloc() takes two arguments: the number of elements and their size. init_pack_revindex() passes the arguments in reverse order, passing the size of a pack_revindex, followed by the number to allocate. Rearrange them so they are in the correct order. Signed-off-by: Brian Gesiak <modocache@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-01-17do not discard revindex when re-preparing packfilesJeff King
When an object lookup fails, we re-read the objects/pack directory to pick up any new packfiles that may have been created since our last read. We also discard any pack revindex structs we've allocated. The discarding is a problem for the pack-bitmap code, which keeps a pointer to the revindex for the bitmapped pack. After the discard, the pointer is invalid, and we may read free()d memory. Other revindex users do not keep a bare pointer to the revindex; instead, they always access it through revindex_for_pack(), which lazily builds the revindex. So one solution is to teach the pack-bitmap code a similar trick. It would be slightly less efficient, but probably not all that noticeable. However, it turns out this discarding is not actually necessary. When we call reprepare_packed_git, we do not throw away our old pack list. We keep the existing entries, and only add in new ones. So there is no safety problem; we will still have the pack struct that matches each revindex. The packfile itself may go away, of course, but we are already prepared to handle that, and it may happen outside of reprepare_packed_git anyway. Throwing away the revindex may save some RAM if the pack never gets reused (about 12 bytes per object). But it also wastes some CPU time (to regenerate the index) if the pack does get reused. It's hard to say which is more valuable, but in either case, it happens very rarely (only when we race with a simultaneous repack). Just leaving the revindex in place is simple and safe both for current and future code. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-10-25revindex: export new APIsVicent Marti
Allow users to efficiently lookup consecutive entries that are expected to be found on the same revindex by exporting `find_revindex_position`: this function takes a pointer to revindex itself, instead of looking up the proper revindex for a given packfile on each call. Signed-off-by: Vicent Marti <tanoku@gmail.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-07-12pack-revindex: radix-sort the revindexJeff King
The pack revindex stores the offsets of the objects in the pack in sorted order, allowing us to easily find the on-disk size of each object. To compute it, we populate an array with the offsets from the sha1-sorted idx file, and then use qsort to order it by offsets. That does O(n log n) offset comparisons, and profiling shows that we spend most of our time in cmp_offset. However, since we are sorting on a simple off_t, we can use numeric sorts that perform better. A radix sort can run in O(k*n), where k is the number of "digits" in our number. For a 64-bit off_t, using 16-bit "digits" gives us k=4. On the linux.git repo, with about 3M objects to sort, this yields a 400% speedup. Here are the best-of-five numbers for running echo HEAD | git cat-file --batch-check="%(objectsize:disk) on a fully packed repository, which is dominated by time spent building the pack revindex: before after real 0m0.834s 0m0.204s user 0m0.788s 0m0.164s sys 0m0.040s 0m0.036s This matches our algorithmic expectations. log(3M) is ~21.5, so a traditional sort is ~21.5n. Our radix sort runs in k*n, where k is the number of radix digits. In the worst case, this is k=4 for a 64-bit off_t, but we can quit early when the largest value to be sorted is smaller. For any repository under 4G, k=2. Our algorithm makes two passes over the list per radix digit, so we end up with 4n. That should yield ~5.3x speedup. We see 4x here; the difference is probably due to the extra bucket book-keeping the radix sort has to do. On a smaller repo, the difference is less impressive, as log(n) is smaller. For git.git, with 173K objects (but still k=2), we see a 2.7x improvement: before after real 0m0.046s 0m0.017s user 0m0.036s 0m0.012s sys 0m0.008s 0m0.000s On even tinier repos (e.g., a few hundred objects), the speedup goes away entirely, as the small advantage of the radix sort gets erased by the book-keeping costs (and at those sizes, the cost to generate the the rev-index gets lost in the noise anyway). Signed-off-by: Jeff King <peff@peff.net> Reviewed-by: Brandon Casey <drafnel@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>