Welcome to mirror list, hosted at ThFree Co, Russian Federation.

git.kernel.org/pub/scm/git/git.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
path: root/midx.h
AgeCommit message (Collapse)Author
2024-01-13Merge branch 'tb/multi-pack-verbatim-reuse'Junio C Hamano
Streaming spans of packfile data used to be done only from a single, primary, pack in a repository with multiple packfiles. It has been extended to allow reuse from other packfiles, too. * tb/multi-pack-verbatim-reuse: (26 commits) t/perf: add performance tests for multi-pack reuse pack-bitmap: enable reuse from all bitmapped packs pack-objects: allow setting `pack.allowPackReuse` to "single" t/test-lib-functions.sh: implement `test_trace2_data` helper pack-objects: add tracing for various packfile metrics pack-bitmap: prepare to mark objects from multiple packs for reuse pack-revindex: implement `midx_pair_to_pack_pos()` pack-revindex: factor out `midx_key_to_pack_pos()` helper midx: implement `midx_preferred_pack()` git-compat-util.h: implement checked size_t to uint32_t conversion pack-objects: include number of packs reused in output pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack reuse pack-objects: prepare `write_reused_pack()` for multi-pack reuse pack-objects: pass `bitmapped_pack`'s to pack-reuse functions pack-objects: keep track of `pack_start` for each reuse pack pack-objects: parameterize pack-reuse routines over a single pack pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature ewah: implement `bitmap_is_empty()` pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions ...
2023-12-26treewide: remove unnecessary includes from header filesElijah Newren
There are three kinds of unnecessary includes: * includes which aren't directly needed, but which include some other forgotten include * includes which could be replaced by a simple forward declaration of some structs * includes which aren't needed at all Remove the third kind of include. Subsequent commits (and a subsequent series) will work on removing some of the other kinds of includes. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-15midx: implement `midx_preferred_pack()`Taylor Blau
When performing a binary search over the objects in a MIDX's bitmap (i.e. in pseudo-pack order), the reader reconstructs the pseudo-pack ordering using a combination of (a) the preferred pack, (b) the pack's lexical position in the MIDX based on pack names, and (c) the object offset within the pack. In order to perform this binary search, the reader must know the identity of the preferred pack. This could be stored in the MIDX, but isn't for historical reasons, mostly because it can easily be inferred at read-time by looking at the object in the first bit position and finding out which pack it was selected from in the MIDX, like so: nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0)); In midx_to_pack_pos() which performs this binary search, we look up the identity of the preferred pack before each search. This is relatively quick, since it involves two table-driven lookups (one in the MIDX's revindex for `pack_pos_to_midx()`, and another in the MIDX's object table for `nth_midxed_pack_int_id()`). But since the preferred pack does not change after the MIDX is written, it is safe to cache this value on the MIDX itself. Write a helper to do just that, and rewrite all of the existing call-sites that care about the identity of the preferred pack in terms of this new helper. This will prepare us for a subsequent patch where we will need to binary search through the MIDX's pseudo-pack order multiple times. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-15midx: implement `midx_locate_pack()`Taylor Blau
The multi-pack index API exposes a `midx_contains_pack()` function that takes in a string ending in either ".idx" or ".pack" and returns whether or not the MIDX contains a given pack corresponding to that string. There is no corresponding function to locate the position of a pack within the MIDX's pack order (sorted lexically by pack filename). We could add an optional out parameter to `midx_contains_pack()` that is filled out with the pack's position when the parameter is non-NULL. To minimize the amount of fallout from this change, instead introduce a new function by renaming `midx_contains_pack()` to `midx_locate_pack()`, adding that output parameter, and then reimplementing `midx_contains_pack()` in terms of it. Future patches will make use of this new function. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-15midx: implement `BTMP` chunkTaylor Blau
When a multi-pack bitmap is used to implement verbatim pack reuse (that is, when verbatim chunks from an on-disk packfile are copied directly[^1]), it does so by using its "preferred pack" as the source for pack-reuse. This allows repositories to pack the majority of their objects into a single (often large) pack, and then use it as the single source for verbatim pack reuse. This increases the amount of objects that are reused verbatim (and consequently, decrease the amount of time it takes to generate many packs). But this performance comes at a cost, which is that the preferred packfile must pace its growth with that of the entire repository in order to maintain the utility of verbatim pack reuse. As repositories grow beyond what we can reasonably store in a single packfile, the utility of verbatim pack reuse diminishes. Or, at the very least, it becomes increasingly more expensive to maintain as the pack grows larger and larger. It would be beneficial to be able to perform this same optimization over multiple packs, provided some modest constraints (most importantly, that the set of packs eligible for verbatim reuse are disjoint with respect to the subset of their objects being sent). If we assume that the packs which we treat as candidates for verbatim reuse are disjoint with respect to any of their objects we may output, we need to make only modest modifications to the verbatim pack-reuse code itself. Most notably, we need to remove the assumption that the bits in the reachability bitmap corresponding to objects from the single reuse pack begin at the first bit position. Future patches will unwind these assumptions and reimplement their existing functionality as special cases of the more general assumptions (e.g. that reuse bits can start anywhere within the bitset, but happen to start at 0 for all existing cases). This patch does not yet relax any of those assumptions. Instead, it implements a foundational data-structure, the "Bitampped Packs" (`BTMP`) chunk of the multi-pack index. The `BTMP` chunk's contents are described in detail here. Importantly, the `BTMP` chunk contains information to map regions of a multi-pack index's reachability bitmap to the packs whose objects they represent. For now, this chunk is only written, not read (outside of the test-tool used in this patch to test the new chunk's behavior). Future patches will begin to make use of this new chunk. [^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the pack that wasn't used verbatim. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-10midx: check size of revindex chunkJeff King
When we load a revindex from disk, we check the size of the file compared to the number of objects we expect it to have. But when we use a RIDX chunk stored directly in the midx, we just access the memory directly. This can lead to out-of-bounds memory access for a corrupted or malicious multi-pack-index file. We can catch this by recording the RIDX chunk size, and then checking it against the expected size when we "load" the revindex. Note that this check is much simpler than the one that load_revindex_from_disk() does, because we just have the data array with no header (so we do not need to account for the header size, and nor do we need to bother validating the header values). The test confirms both that we catch this case, and that we continue the process (the revindex is required to use the midx bitmaps, but we fallback to a non-bitmap traversal). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-10midx: bounds-check large offset chunkJeff King
When we see a large offset bit in the regular midx offset table, we use the entry as an index into a separate large offset table (just like a pack idx does). But we don't bounds-check the access to that large offset table (nor even record its size when we parse the chunk!). The equivalent code for a regular pack idx is in check_pack_index_ptr(). But things are a bit simpler here because of the chunked format: we can just check our array index directly. As a bonus, we can get rid of the st_mult() here. If our array bounds-check is successful, then we know that the result will fit in a size_t (and the bounds check uses a division to avoid overflow entirely). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-10midx: check size of pack names chunkJeff King
We parse the pack-name chunk as a series of NUL-terminated strings. But since we don't look at the chunk size, there's nothing to guarantee that we don't parse off the end of the chunk (or even off the end of the mapped file). We can record the length, and then as we parse make sure that we never walk past it. The new test exercises the case, though note that it does not actually segfault before this patch. It hits a NUL byte somewhere in one of the other chunks, and comes up with a garbage pack name. You could construct one that reads out-of-bounds (e.g., a PNAM chunk at the end of file), but this case is simple and sufficient to check that we detect the problem. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-26pack-bitmap-write: learn pack.writeBitmapLookupTable and add testsAbhradeep Chakraborty
Teach Git to provide a way for users to enable/disable bitmap lookup table extension by providing a config option named 'writeBitmapLookupTable'. Default is false. Also add test to verify writting of lookup table. Mentored-by: Taylor Blau <me@ttaylorr.com> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com> Co-Authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-27midx: read `RIDX` chunk when presentTaylor Blau
When a MIDX contains the new `RIDX` chunk, ensure that the reverse index is read from it instead of the on-disk .rev file. Since we need to encode the object order in the MIDX itself for correctness reasons, there is no point in storing the same data again outside of the MIDX. So, this patch stops writing separate .rev files, and reads it out of the MIDX itself. This is possible to do with relatively little new code, since the format of the RIDX chunk is identical to the data in the .rev file. In other words, we can implement this by pointing the `revindex_data` field at the reverse index chunk of the MIDX instead of the .rev file without any other changes. Note that we have two knobs that are adjusted for the new tests: GIT_TEST_MIDX_WRITE_REV and GIT_TEST_MIDX_READ_RIDX. The former controls whether the MIDX .rev is written at all, and the latter controls whether we read the MIDX's RIDX chunk. Both are necessary to ensure that the test added at the beginning of this series continues to work. This is because we always need to write the RIDX chunk in the MIDX in order to change its checksum, but we want to make sure reading the existing .rev file still works (since the RIDX chunk takes precedence by default). Arguably this isn't a very interesting mode to test, because the precedence rules mean that we'll always read the RIDX chunk over the .rev file. But it makes it impossible for a user to induce corruption in their repository by adjusting the test knobs (since if we had an either/or knob they could stop writing the RIDX chunk, allowing them to tweak the MIDX's object order without changing its checksum). Signed-off-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-29midx.c: write MIDX filenames to strbufTaylor Blau
To ask for the name of a MIDX and its corresponding .rev file, callers invoke get_midx_filename() and get_midx_rev_filename(), respectively. These both invoke xstrfmt(), allocating a chunk of memory which must be freed later on. This makes callers in pack-bitmap.c somewhat awkward. Specifically, midx_bitmap_filename(), which is implemented like: return xstrfmt("%s-%s.bitmap", get_midx_filename(midx->object_dir), hash_to_hex(get_midx_checksum(midx))); this leaks the second argument to xstrfmt(), which itself was allocated with xstrfmt(). This caller could assign both the result of get_midx_filename() and the outer xstrfmt() to a temporary variable, remembering to free() the former before returning. But that involves a wasteful copy. Instead, get_midx_filename() and get_midx_rev_filename() take a strbuf as an output parameter. This way midx_bitmap_filename() can manipulate and pass around a temporary buffer which it detaches back to its caller. That allows us to implement the function without copying or open-coding get_midx_filename() in a way that doesn't leak. Update the other callers of get_midx_filename() and get_midx_rev_filename() accordingly. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-19Merge branch 'tb/repack-write-midx'Junio C Hamano
"git repack" has been taught to generate multi-pack reachability bitmaps. * tb/repack-write-midx: test-read-midx: fix leak of bitmap_index struct builtin/repack.c: pass `--refs-snapshot` when writing bitmaps builtin/repack.c: make largest pack preferred builtin/repack.c: support writing a MIDX while repacking builtin/repack.c: extract showing progress to a variable builtin/repack.c: rename variables that deal with non-kept packs builtin/repack.c: keep track of existing packs unconditionally midx: preliminary support for `--refs-snapshot` builtin/multi-pack-index.c: support `--stdin-packs` mode midx: expose `write_midx_file_only()` publicly
2021-09-29midx: preliminary support for `--refs-snapshot`Taylor Blau
To figure out which commits we can write a bitmap for, the multi-pack index/bitmap code does a reachability traversal, marking any commit which can be found in the MIDX as eligible to receive a bitmap. This approach will cause a problem when multi-pack bitmaps are able to be generated from `git repack`, since the reference tips can change during the repack. Even though we ignore commits that don't exist in the MIDX (when doing a scan of the ref tips), it's possible that a commit in the MIDX reaches something that isn't. This can happen when a multi-pack index contains some pack which refers to loose objects (e.g., if a pack was pushed after starting the repack but before generating the MIDX which depends on an object which is stored as loose in the repository, and by definition isn't included in the multi-pack index). By taking a snapshot of the references before we start repacking, we can close that race window. In the above scenario (where we have a packed object pointing at a loose one), we'll either (a) take a snapshot of the references before seeing the packed one, or (b) take it after, at which point we can guarantee that the loose object will be packed and included in the MIDX. This patch does just that. It writes a temporary "reference snapshot", which is a list of OIDs that are at the ref tips before writing a multi-pack bitmap. References that are "preferred" (i.e,. are a suffix of at least one value of the 'pack.preferBitmapTips' configuration) are marked with a special '+'. The format is simple: one line per commit at each tip, with an optional '+' at the beginning (for preferred references, as described above). When provided, the reference snapshot is used to drive bitmap selection instead of the MIDX code doing its own traversal. When it isn't provided, the usual traversal takes place instead. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29midx: expose `write_midx_file_only()` publiclyTaylor Blau
Expose a variant of the write_midx_file() function which ignores packs that aren't included in an explicit "allow" list. This will be used in an upcoming patch to power a new `--stdin-packs` mode of `git multi-pack-index write` for callers that only want to include certain packs in a MIDX (and ignore any packs which may have happened to enter the repository independently, e.g., from pushes). Those patches will provide test coverage for this new function. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-15midx.c: respect 'pack.writeBitmapHashcache' when writing bitmapsTaylor Blau
In the previous commit, the bitmap writing code learned to propagate values from an existing hash-cache extension into the bitmap that it is writing. Now that that functionality exists, let's expose it by teaching the 'git multi-pack-index' builtin to respect the `pack.writeBitmapHashCache` option so that the hash-cache may be written at all. Two minor points worth noting here: - The 'git multi-pack-index write' sub-command didn't previously read any configuration (instead this is handled in the base command). A separate handler is added here to respect this write-specific config option. - I briefly considered adding a 'bitmap_flags' field to the static options struct, but decided against it since it would require plumbing through a new parameter to the write_midx_file() function. Instead, a new MIDX-specific flag is added, which is translated to the corresponding bitmap one. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP'Taylor Blau
Introduce a new 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' environment variable to also write a multi-pack bitmap when 'GIT_TEST_MULTI_PACK_INDEX' is set. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01pack-bitmap: write multi-pack bitmapsTaylor Blau
Write multi-pack bitmaps in the format described by Documentation/technical/bitmap-format.txt, inferring their presence with the absence of '--bitmap'. To write a multi-pack bitmap, this patch attempts to reuse as much of the existing machinery from pack-objects as possible. Specifically, the MIDX code prepares a packing_data struct that pretends as if a single packfile has been generated containing all of the objects contained within the MIDX. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01pack-bitmap: read multi-pack bitmapsTaylor Blau
This prepares the code in pack-bitmap to interpret the new multi-pack bitmaps described in Documentation/technical/bitmap-format.txt, which mostly involves converting bit positions to accommodate looking them up in a MIDX. Note that there are currently no writers who write multi-pack bitmaps, and that this will be implemented in the subsequent commit. Note also that get_midx_checksum() and get_midx_filename() are made non-static so they can be called from pack-bitmap.c. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01pack-revindex: write multi-pack reverse indexesTaylor Blau
Implement the writing half of multi-pack reverse indexes. This is nothing more than the format describe a few patches ago, with a new set of helper functions that will be used to clear out stale .rev files corresponding to old MIDXs. Unfortunately, a very similar comparison function as the one implemented recently in pack-revindex.c is reimplemented here, this time accepting a MIDX-internal type. An effort to DRY these up would create more indirection and overhead than is necessary, so it isn't pursued here. Currently, there are no callers which pass the MIDX_WRITE_REV_INDEX flag, meaning that this is all dead code. But, that won't be the case for long, since subsequent patches will introduce the multi-pack bitmap, which will begin passing this field. (In midx.c:write_midx_internal(), the two adjacent if statements share a conditional, but are written separately since the first one will eventually also handle the MIDX_WRITE_BITMAP flag, which does not yet exist.) Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01pack-revindex: read multi-pack reverse indexesTaylor Blau
Implement reading for multi-pack reverse indexes, as described in the previous patch. Note that these functions don't yet have any callers, and won't until multi-pack reachability bitmaps are introduced in a later patch series. In the meantime, this patch implements some of the infrastructure necessary to support multi-pack bitmaps. There are three new functions exposed by the revindex API: - load_midx_revindex(): loads the reverse index corresponding to the given multi-pack index. - midx_to_pack_pos() and pack_pos_to_midx(): these convert between the multi-pack index and pseudo-pack order. load_midx_revindex() and pack_pos_to_midx() are both relatively straightforward. load_midx_revindex() needs a few functions to be exposed from the midx API. One to get the checksum of a midx, and another to get the .rev's filename. Similar to recent changes in the packed_git struct, three new fields are added to the multi_pack_index struct: one to keep track of the size, one to keep track of the mmap'd pointer, and another to point past the header and at the reverse index's data. pack_pos_to_midx() simply reads the corresponding entry out of the table. midx_to_pack_pos() is the trickiest, since it needs to find an object's position in the psuedo-pack order, but that order can only be recovered in the .rev file itself. This mapping can be implemented with a binary search, but note that the thing we're binary searching over isn't an array of values, but rather a permuted order of those values. So, when comparing two items, it's helpful to keep in mind the difference. Instead of a traditional binary search, where you are comparing two things directly, here we're comparing a (pack, offset) tuple with an index into the multi-pack index. That index describes another (pack, offset) tuple, and it is _those_ two tuples that are compared. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01midx: make some functions non-staticTaylor Blau
In a subsequent commit, pack-revindex.c will become responsible for sorting a list of objects in the "MIDX pack order" (which will be defined in the following patch). To do so, it will need to be know the pack identifier and offset within that pack for each object in the MIDX. The MIDX code already has functions for doing just that (nth_midxed_offset() and nth_midxed_pack_int_id()), but they are statically declared. Since there is no reason that they couldn't be exposed publicly, and because they are already doing exactly what the caller in pack-revindex.c will want, expose them publicly so that they can be reused there. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01midx: allow marking a pack as preferredTaylor Blau
When multiple packs in the multi-pack index contain the same object, the MIDX machinery must make a choice about which pack it associates with that object. Prior to this patch, the lowest-ordered[1] pack was always selected. Pack selection for duplicate objects is relatively unimportant today, but it will become important for multi-pack bitmaps. This is because we can only invoke the pack-reuse mechanism when all of the bits for reused objects come from the reuse pack (in order to ensure that all reused deltas can find their base objects in the same pack). To encourage the pack selection process to prefer one pack over another (the pack to be preferred is the one a caller would like to later use as a reuse pack), introduce the concept of a "preferred pack". When provided, the MIDX code will always prefer an object found in a preferred pack over any other. No format changes are required to store the preferred pack, since it will be able to be inferred with a corresponding MIDX bitmap, by looking up the pack associated with the object in the first bit position (this ordering is described in detail in a subsequent commit). [1]: the ordering is specified by MIDX internals; for our purposes we can consider the "lowest ordered" pack to be "the one with the most-recent mtime. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-01Merge branch 'ds/multi-pack-index'Junio C Hamano
The multi-pack-index left mmapped file descriptors open when it does not have to. * ds/multi-pack-index: multi-pack-index: close file descriptor after mmap
2020-04-24multi-pack-index: close file descriptor after mmapDerrick Stolee
The multi-pack-index subsystem was not closing its file descriptor after memory-mapping the file contents. After this mmap() succeeds, there is no need to keep the file descriptor open. In fact, there is signficant reason to close it so we do not run out of descriptors. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-23midx: add MIDX_PROGRESS flagWilliam Baker
Add the MIDX_PROGRESS flag and update the write|verify|expire|repack functions in midx.h to accept a flags parameter. The MIDX_PROGRESS flag indicates whether the caller of the function would like progress information to be displayed. This patch only changes the method prototypes and does not change the functionality. The functionality change will be handled by a later patch. Signed-off-by: William Baker <William.Baker@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-11multi-pack-index: prepare 'repack' subcommandDerrick Stolee
In an environment where the multi-pack-index is useful, it is due to many pack-files and an inability to repack the object store into a single pack-file. However, it is likely that many of these pack-files are rather small, and could be repacked into a slightly larger pack-file without too much effort. It may also be important to ensure the object store is highly available and the repack operation does not interrupt concurrent git commands. Introduce a 'repack' subcommand to 'git multi-pack-index' that takes a '--batch-size' option. The subcommand will inspect the multi-pack-index for referenced pack-files whose size is smaller than the batch size, until collecting a list of pack-files whose sizes sum to larger than the batch size. Then, a new pack-file will be created containing the objects from those pack-files that are referenced by the multi-pack-index. The resulting pack is likely to actually be smaller than the batch size due to compression and the fact that there may be objects in the pack- files that have duplicate copies in other pack-files. The current change introduces the command-line arguments, and we add a test that ensures we parse these options properly. Since we specify a small batch size, we will guarantee that future implementations do not change the list of pack-files. In addition, we hard-code the modified times of the packs in the pack directory to ensure the list of packs sorted by modified time matches the order if sorted by size (ascending). This will be important in a future test. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-11multi-pack-index: prepare for 'expire' subcommandDerrick Stolee
The multi-pack-index tracks objects in a collection of pack-files. Only one copy of each object is indexed, using the modified time of the pack-files to determine tie-breakers. It is possible to have a pack-file with no referenced objects because all objects have a duplicate in a newer pack-file. Introduce a new 'expire' subcommand to the multi-pack-index builtin. This subcommand will delete these unused pack-files and rewrite the multi-pack-index to no longer refer to those files. More details about the specifics will follow as the method is implemented. Add a test that verifies the 'expire' subcommand is correctly wired, but will still be valid when the verb is implemented. Specifically, create a set of packs that should all have referenced objects and should not be removed during an 'expire' operation. The packs are created carefully to ensure they have a specific order when sorted by size. This will be important in a later test. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-07midx: pass a repository pointerDerrick Stolee
Much of the multi-pack-index code focuses on the multi_pack_index struct, and so we only pass a pointer to the current one. However, we will insert a dependency on the packed_git linked list in a future change, so we will need a repository reference. Inserting these parameters is a significant enough change to split out. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-16midx: check both pack and index names for containmentJeff King
A midx file (and the struct we parse from it) contains a list of all of the covered packfiles, mentioned by their ".idx" names (e.g., "pack-1234.idx", etc). And thus calls to midx_contains_pack() expect callers to provide the idx name. This works for most of the calls, but the one in open_packed_git_1() tries to feed a packed_git->pack_name, which is the ".pack" name, meaning we'll never find a match (even if the pack is covered by the midx). We can fix this by converting the ".pack" to ".idx" in the caller. However, that requires allocating a new string. Instead, let's make midx_contains_pack() a bit friendlier, and allow it take _either_ the .pack or .idx variant. All cleverness in the matching code is credited to René. Bugs are mine. There's no test here, because while this does fix _a_ bug, it's masked by another bug in that same caller. That will be covered (with a test) in the next patch. Helped-by: René Scharfe <l.s.r@web.de> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-13Merge branch 'ds/test-multi-pack-index'Junio C Hamano
Tests for the recently introduced multi-pack index machinery. * ds/test-multi-pack-index: packfile: close multi-pack-index in close_all_packs multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX midx: close multi-pack-index on repack midx: fix broken free() in close_midx()
2018-10-22multi-pack-index: define GIT_TEST_MULTI_PACK_INDEXDerrick Stolee
The multi-pack-index feature is tested in isolation by t5319-multi-pack-index.sh, but there are many more interesting scenarios in the test suite surrounding pack-file data shapes and interactions. Since the multi-pack-index is an optional data structure, it does not make sense to include it by default in those tests. Instead, add a new GIT_TEST_MULTI_PACK_INDEX environment variable that enables core.multiPackIndex and writes a multi-pack-index after each 'git repack' command. This adds extra test coverage when needed. There are a few spots in the test suite that need to react to this change: * t5319-multi-pack-index.sh: there is a test that checks that 'git repack' deletes the multi-pack-index. Disable the environment variable to ensure this still happens. * t5310-pack-bitmaps.sh: One test moves a pack-file from the object directory to an alternate. This breaks the multi-pack-index, so delete the multi-pack-index at this point, if it exists. * t9300-fast-import.sh: One test verifies the number of files in the .git/objects/pack directory is exactly 8. Exclude the multi-pack-index from this count so it is still 8 in all cases. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-22midx: close multi-pack-index on repackDerrick Stolee
When repacking, we may remove pack-files. This invalidates the multi-pack-index (if it exists). Previously, we removed the multi-pack-index file before removing any pack-file. In some cases, the repack command may load the multi-pack-index into memory. This may lead to later in-memory references to the non-existent pack- files. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-18headers: normalize the spelling of some header guardsRamsay Jones
Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-16Merge branch 'rj/header-check'Junio C Hamano
Header files clean-up. * rj/header-check: delta-islands.h: add missing forward declarations (hdr-check) midx.h: add missing forward declarations (hdr-check) refs/refs-internal.h: add missing declarations (hdr-check) refs/packed-backend.h: add missing declaration (hdr-check) refs/ref-cache.h: add missing declarations (hdr-check) ewah/ewok_rlw.h: add missing include (hdr-check) json-writer.h: add missing include (hdr-check) Makefile: add a hdr-check target
2018-09-20midx.h: add missing forward declarations (hdr-check)Ramsay Jones
Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-17multi-pack-index: add 'verify' verbDerrick Stolee
The multi-pack-index builtin writes multi-pack-index files, and uses a 'write' verb to do so. Add a 'verify' verb that checks this file matches the contents of the pack-indexes it replaces. The current implementation is a no-op, but will be extended in small increments in later commits. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-21packfile: add all_packs listDerrick Stolee
If a repo contains a multi-pack-index, then the packed_git list does not contain the packfiles that are covered by the multi-pack-index. This is important for doing object lookups, abbreviations, and approximating object count. However, there are many operations that really want to iterate over all packfiles. Create a new 'all_packs' linked list that contains this list, starting with the packfiles in the multi-pack-index and then continuing along the packed_git linked list. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-21multi-pack-index: store local propertyDerrick Stolee
A pack-file is 'local' if it is stored within the usual object directory. If it is stored in an alternate, it is non-local. Pack-files are stored using a 'pack_local' member in the packed_git struct. Add a similar 'local' member to the multi_pack_index struct and 'local' parameters to the methods that load and prepare multi- pack-indexes. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20midx: clear midx on repackDerrick Stolee
If a 'git repack' command replaces existing packfiles, then we must clear the existing multi-pack-index before moving the packfiles it references. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20midx: use existing midx when writing new oneDerrick Stolee
Due to how Windows handles replacing a lockfile when there is an open handle, create the close_midx() method to close the existing midx before writing the new one. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20midx: use midx in abbreviation calculationsDerrick Stolee
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20midx: read objects from multi-pack-indexDerrick Stolee
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20config: create core.multiPackIndex settingDerrick Stolee
The core.multiPackIndex config setting controls the multi-pack- index (MIDX) feature. If false, the setting will disable all reads from the multi-pack-index file. Read this config setting in the new prepare_multi_pack_index_one() which is called during prepare_packed_git(). This check is run once per repository. Add comparison commands in t5319-multi-pack-index.sh to check typical Git behavior remains the same as the config setting is turned on and off. This currently includes 'git rev-list' and 'git log' commands to trigger several object database reads. Currently, these would only catch an error in the prepare_multi_pack_index_one(), but with later commits will catch errors in object lookups, abbreviations, and approximate object counts. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20midx: write object offsetsDerrick Stolee
The final pair of chunks for the multi-pack-index file stores the object offsets. We default to using 32-bit offsets as in the pack-index version 1 format, but if there exists an offset larger than 32-bits, we use a trick similar to the pack-index version 2 format by storing all offsets at least 2^31 in a 64-bit table; we use the 32-bit table to point into that 64-bit table as necessary. We only store these 64-bit offsets if necessary, so create a test that manipulates a version 2 pack-index to fake a large offset. This allows us to test that the large offset table is created, but the data does not match the actual packfile offsets. The multi-pack-index offset does match the (corrupted) pack-index offset, so a future feature will compare these offsets during a 'verify' step. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20midx: write object id fanout chunkDerrick Stolee
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20midx: write object ids in a chunkDerrick Stolee
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20midx: read pack names into arrayDerrick Stolee
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20multi-pack-index: write pack names in chunkDerrick Stolee
The multi-pack-index needs to track which packfiles it indexes. Store these in our first required chunk. Since filenames are not well structured, add padding to keep good alignment in later chunks. Modify the 'git multi-pack-index read' subcommand to output the existence of the pack-file name chunk. Modify t5319-multi-pack-index.sh to reflect this new output and the new expected number of chunks. Defense in depth: A pattern we are using in the multi-pack-index feature is to verify the data as we write it. We want to ensure we never write invalid data to the multi-pack-index. There are many checks that verify that the values we are writing fit the format definitions. This mainly helps developers while working on the feature, but it can also identify issues that only appear when dealing with very large data sets. These large sets are hard to encode into test cases. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20multi-pack-index: load into memoryDerrick Stolee
Create a new multi_pack_index struct for loading multi-pack-indexes into memory. Create a test-tool builtin for reading basic information about that multi-pack-index to verify the correct data is written. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20multi-pack-index: add 'write' verbDerrick Stolee
In anticipation of writing multi-pack-indexes, add a skeleton 'git multi-pack-index write' subcommand and send the options to a write_midx_file() method. Also create a skeleton test script that tests the 'write' subcommand. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>