Welcome to mirror list, hosted at ThFree Co, Russian Federation.

git.kernel.org/pub/scm/git/git.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-09-21midx.c: avoid cruft packs with non-zero `repack --batch-size`Taylor Blau
Apply similar treatment with respect to cruft packs as in a few commits ago to `repack` with a non-zero `--batch-size`. Since the case of a non-zero `--batch-size` is handled separately (in `fill_included_packs_batch()` instead of `fill_included_packs_all()`), a separate fix must be applied for this case. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-21midx.c: avoid cruft packs with `repack --batch-size=0`Taylor Blau
The `repack` sub-command of the `git multi-pack-index` builtin creates a new pack aggregating smaller packs contained in the MIDX up to some given `--batch-size`. When `--batch-size=0`, this instructs the MIDX builtin to repack everything contained in the MIDX into a single pack. In similar spirit as a previous commit, it is undesirable to repack the contents of a cruft pack in this step. Teach `repack` to ignore any cruft pack(s) when `--batch-size=0` for the same reason(s). (The case of a non-zero `--batch-size` will be handled in a subsequent commit). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-21midx.c: prevent `expire` from removing the cruft packTaylor Blau
The `expire` sub-command unlinks any packs that are (a) contained in the MIDX, but (b) have no objects referenced by the MIDX. This sub-command ignores `.keep` packs, which remain on-disk even if they have no objects referenced by the MIDX. Cruft packs, however, aren't given the same treatment: if none of the objects contained in the cruft pack are selected from the cruft pack by the MIDX, then the cruft pack is eligible to be expired. This is less than desireable, since the cruft pack has important metadata about the individual object mtimes, which is useful to determine how quickly an object should age out of the repository when pruning. Ordinarily, we wouldn't expect the contents of a cruft pack to duplicated across non-cruft packs (and we'd expect to see the MIDX select all cruft objects from other sources even less often). But nonetheless, it is still possible to trick the `expire` sub-command into removing the `.mtimes` file in this circumstance. Teach the `expire` sub-command to ignore cruft packs in the same manner as it does `.keep` packs, in order to keep their metadata around, even when they are unreferenced by the MIDX. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-04Merge branch 'es/test-chain-lint'Junio C Hamano
Broken &&-chains in the test scripts have been corrected. * es/test-chain-lint: t6000-t9999: detect and signal failure within loop t5000-t5999: detect and signal failure within loop t4000-t4999: detect and signal failure within loop t0000-t3999: detect and signal failure within loop tests: simplify by dropping unnecessary `for` loops tests: apply modern idiom for exiting loop upon failure tests: apply modern idiom for signaling test failure tests: fix broken &&-chains in `{...}` groups tests: fix broken &&-chains in `$(...)` command substitutions tests: fix broken &&-chains in compound statements tests: use test_write_lines() to generate line-oriented output tests: simplify construction of large blocks of text t9107: use shell parameter expansion to avoid breaking &&-chain t6300: make `%(raw:size) --shell` test more robust t5516: drop unnecessary subshell and command invocation t4202: clarify intent by creating expected content less cleverly t1020: avoid aborting entire test script when one test fails t1010: fix unnoticed failure on Windows t/lib-pager: use sane_unset() to avoid breaking &&-chain
2021-12-13t5000-t5999: detect and signal failure within loopEric Sunshine
Failures within `for` and `while` loops can go unnoticed if not detected and signaled manually since the loop itself does not abort when a contained command fails, nor will a failure necessarily be detected when the loop finishes since the loop returns the exit code of the last command it ran on the final iteration, which may not be the command which failed. Therefore, detect and signal failures manually within loops using the idiom `|| return 1` (or `|| exit 1` within subshells). Signed-off-by: Eric Sunshine <sunshine@sunshineco.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-11Merge branch 'jk/t5319-midx-corruption-test-deflake'Junio C Hamano
Test fix. * jk/t5319-midx-corruption-test-deflake: t5319: corrupt more bytes of the midx checksum
2021-11-19t5319: corrupt more bytes of the midx checksumJeff King
One of the tests in t5319 corrupts the checksum of the midx file by writing a single 0xff over the final byte, and then confirms that we detect the problem. This usually works fine, but would break if the actual checksum ended with that same byte already. It seems like this should happen in 1 out of 256 test runs, but it turns out to be less often in practice. The contents of the midx are mostly deterministic because it's based on the objects, and we remove most sources of randomness by setting GIT_COMMITTER_DATE, etc. However, there's still some randomness: some objects are duplicated between packs, and the midx must decide which to use, which can be based on timing. So very occasionally we can end up with a real 0xff byte, and the test fails. The most robust fix would be to read out the final byte and then change it to something else (e.g., adding 1 mod 256). But that's awkward to do in shell. Let's just blindly corrupt 10 bytes instead of 1, which reduces our chances of an accidental noop to 1 in 2^80. Reported-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Jeff King <peff@peff.net> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-01Merge branch 'gc/use-repo-settings'Junio C Hamano
It is wrong to read some settings directly from the config subsystem, as things like feature.experimental can affect their default values. * gc/use-repo-settings: gc: perform incremental repack when implictly enabled fsck: verify multi-pack-index when implictly enabled fsck: verify commit graph when implicitly enabled
2021-10-19Merge branch 'tb/repack-write-midx'Junio C Hamano
"git repack" has been taught to generate multi-pack reachability bitmaps. * tb/repack-write-midx: test-read-midx: fix leak of bitmap_index struct builtin/repack.c: pass `--refs-snapshot` when writing bitmaps builtin/repack.c: make largest pack preferred builtin/repack.c: support writing a MIDX while repacking builtin/repack.c: extract showing progress to a variable builtin/repack.c: rename variables that deal with non-kept packs builtin/repack.c: keep track of existing packs unconditionally midx: preliminary support for `--refs-snapshot` builtin/multi-pack-index.c: support `--stdin-packs` mode midx: expose `write_midx_file_only()` publicly
2021-10-16fsck: verify multi-pack-index when implictly enabledGlen Choo
Like the previous commit, change fsck to check the "core_multi_pack_index" variable set in "repo-settings.c" instead of reading the "core.multiPackIndex" config variable. This fixes a bug where we wouldn't verify midx if the config key was missing. This bug was introduced in 18e449f86b (midx: enable core.multiPackIndex by default, 2020-09-25) where core.multiPackIndex was turned on by default. Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Glen Choo <chooglen@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-06Merge branch 'tb/commit-graph-usage-fix'Junio C Hamano
Regression in "git commit-graph" command line parsing has been corrected. * tb/commit-graph-usage-fix: builtin/multi-pack-index.c: disable top-level --[no-]progress builtin/commit-graph.c: don't accept common --[no-]progress
2021-09-29builtin/multi-pack-index.c: support `--stdin-packs` modeTaylor Blau
To power a new `--write-midx` mode, `git repack` will want to write a multi-pack index containing a certain set of packs in the repository. This new option will be used by `git repack` to write a MIDX which contains only the packs which will survive after the repack (that is, it will exclude any packs which are about to be deleted). This patch effectively exposes the function implemented in the previous commit via the `git multi-pack-index` builtin. An alternative approach would have been to call that function from the `git repack` builtin directly, but this introduces awkward problems around closing and reopening the object store, so the MIDX will be written out-of-process. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-22builtin/multi-pack-index.c: disable top-level --[no-]progressTaylor Blau
In a similar spirit as the previous patch, let sub-commands which support showing or hiding a progress meter handle parsing the `--progress` or `--no-progress` option, but do not expose it as an option to the top-level `multi-pack-index` builtin. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01t5319: don't write MIDX bitmaps in t5319Taylor Blau
This test is specifically about generating a midx still respecting a pack-based bitmap file. Generating a MIDX bitmap would confuse the test. Let's override the 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' variable to make sure we don't do so. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01midx: avoid opening multiple MIDXs when writingTaylor Blau
Opening multiple instance of the same MIDX can lead to problems like two separate packed_git structures which represent the same pack being added to the repository's object store. The above scenario can happen because prepare_midx_pack() checks if `m->packs[pack_int_id]` is NULL in order to determine if a pack has been opened and installed in the repository before. But a caller can construct two copies of the same MIDX by calling get_multi_pack_index() and load_multi_pack_index() since the former manipulates the object store directly but the latter is a lower-level routine which allocates a new MIDX for each call. So if prepare_midx_pack() is called on multiple MIDXs with the same pack_int_id, then that pack will be installed twice in the object store's packed_git pointer. This can lead to problems in, for e.g., the pack-bitmap code, which does something like the following (in pack-bitmap.c:open_pack_bitmap()): struct bitmap_index *bitmap_git = ...; for (p = get_all_packs(r); p; p = p->next) { if (open_pack_bitmap_1(bitmap_git, p) == 0) ret = 0; } which is a problem if two copies of the same pack exist in the packed_git list because pack-bitmap.c:open_pack_bitmap_1() contains a conditional like the following: if (bitmap_git->pack || bitmap_git->midx) { /* ignore extra bitmap file; we can only handle one */ warning("ignoring extra bitmap file: %s", packfile->pack_name); close(fd); return -1; } Avoid this scenario by not letting write_midx_internal() open a MIDX that isn't also pointed at by the object store. So long as this is the case, other routines should prefer to open MIDXs with get_multi_pack_index() or reprepare_packed_git() instead of creating instances on their own. Because get_multi_pack_index() returns `r->object_store->multi_pack_index` if it is non-NULL, we'll only have one instance of a MIDX open at one time, avoiding these problems. To encourage this, drop the `struct multi_pack_index *` parameter from `write_midx_internal()`, and rely instead on the `object_dir` to find (or initialize) the correct MIDX instance. Likewise, replace the call to `close_midx()` with `close_object_store()`, since we're about to replace the MIDX with a new one and should invalidate the object store's memory of any MIDX that might have existed beforehand. Note that this now forbids passing object directories that don't belong to alternate repositories over `--object-dir`, since before we would have happily opened a MIDX in any directory, but now restrict ourselves to only those reachable by `r->objects->multi_pack_index` (and alternate MIDXs that we can see by walking the `next` pointer). As far as I can tell, supporting arbitrary directories with `--object-dir` was a historical accident, since even the documentation says `<alt>` when referring to the value passed to this option. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01midx: reject empty `--preferred-pack`'sTaylor Blau
The soon-to-be-implemented multi-pack bitmap treats object in the first bit position specially by assuming that all objects in the pack it was selected from are also represented from that pack in the MIDX. In other words, the pack from which the first object was selected must also have all of its other objects selected from that same pack in the MIDX in case of any duplicates. But this assumption relies on the fact that there is at least one object in that pack to begin with; otherwise the object in the first bit position isn't from a preferred pack, in which case we can no longer assume that all objects in that pack were also selected from the same pack. Guard this assumption by checking the number of objects in the given preferred pack, and failing if the given pack is empty. To make sure we can safely perform this check, open any packs which are contained in an existing MIDX via prepare_midx_pack(). The same is done for new packs via the add_pack_to_midx() callback, but packs picked up from a previous MIDX will not yet have these opened. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01midx: fix `*.rev` cleanups with `--object-dir`Taylor Blau
If using --object-dir to point into an object directory which belongs to a different repository than the one in the current working directory, such as: git init repo git -C repo ... # add some objects cd alternate git multi-pack-index --object-dir ../repo/.git/objects write the binary will segfault trying to access the object-dir via the repo it found, but that's not fully initialized. Worse, if we later call clear_midx_files_ext(), we will use `the_repository` and remove files out of the wrong object directory. Fix this by using the given object_dir (or the object directory of `the_repository` if `--object-dir` wasn't given) to properly to clean up the *.rev files, avoiding the crash. Original-patch-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01midx: disallow running outside of a repositoryTaylor Blau
The multi-pack-index command supports working with arbitrary object directories via the `--object-dir` flag. Though this has historically worked in arbitrary repositories (including when the command itself was run outside of a Git repository), this has been somewhat of an accident. For example, running: git multi-pack-index write --object-dir=/path/to/repo/objects outside of a Git repository causes a BUG(). This is because the top-level `cmd_multi_pack_index()` function stops parsing when it sees "write", and then fills in the default object directory (the result of calling `get_object_directory()`) before handing off to `cmd_multi_pack_index_write()`. But there is no repository to initialize, and so calling `get_object_directory()` results in a BUG() (indicating that the current repository is not initialized). Another case where this doesn't quite work as expected is when operating in a SHA-256 repository. To see the failure, try this in your shell: git init --object-format=sha256 repo git -C repo commit --allow-empty base git -C repo repack -d git multi-pack-index --object-dir=$(pwd)/repo/.git/objects write and observe that we cannot open the `.idx` file in "repo", because the outermost process assumes that any repository that it works in also uses the default value of `the_hash_algo` (at the time of writing, SHA-1). There may be compelling reasons for trying to work around these bugs, but working in arbitrary `--object-dir`'s is non-standard enough (and likewise, these bugs prevalent enough) that I don't think any workflows would be broken by abandoning this behavior. Accordingly, restrict the `multi-pack-index` builtin to only work when inside of a Git repository (i.e., its main utility becomes selecting which alternate to operate in), which avoids both of the bugs above. (Note that you can still trigger a bug when writing a MIDX in an alternate which does not use the same object format as the repository which it is an alternate of, but that is an unrelated bug to this one). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-28Merge branch 'tb/reverse-midx'Junio C Hamano
The code that gives an error message in "git multi-pack-index" when no subcommand is given tried to print a NULL pointer as a strong, which has been corrected. * tb/reverse-midx: multi-pack-index: fix potential segfault without sub-command
2021-07-20multi-pack-index: fix potential segfault without sub-commandTaylor Blau
Since cd57bc41bb (builtin/multi-pack-index.c: display usage on unrecognized command, 2021-03-30) we have used a "usage" label to avoid having two separate callers of usage_with_options (one when no arguments are given, and another for unrecognized sub-commands). But the first caller has been broken since cd57bc41bb, since it will happily jump to usage without arguments, and then pass argv[0] to the "unrecognized subcommand" error. Many compilers will save us from a segfault here, but the end result is ugly, since it mentions an unrecognized subcommand when we didn't even pass one, and (on GCC) includes "(null)" in its output. Move the "usage" label down past the error about unrecognized subcommands so that it is only triggered when it should be. While we're at it, bulk up our test coverage in this area, too. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-29midx: report checksum mismatches during 'verify'Taylor Blau
'git multi-pack-index verify' inspects the data in an existing MIDX for correctness by checking that the recorded object offsets are correct, and so on. But it does not check that the file's trailing checksum matches the data that it records. So, if an on-disk corruption happened to occur in the final few bytes (and all other data was recorded correctly), we would: - get a clean result from 'git multi-pack-index verify', but - be unable to reuse the existing MIDX when writing a new one (since we now check for checksum mismatches before reusing a MIDX) Teach the 'verify' sub-command to recognize corruption in the checksum by calling midx_checksum_valid(). Suggested-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-29midx: don't reuse corrupt MIDXs when writingTaylor Blau
When writing a new multi-pack index, Git tries to reuse as much of the data from an existing MIDX as possible, like object offsets. This is done to avoid re-opening a bunch of *.idx files unnecessarily, but can lead to problems if the data we are reusing is corrupt. That's because we'll blindly reuse data from an existing MIDX without checking its trailing checksum for validity. So if there is memory corruption while writing a MIDX, or disk corruption in the intervening period between writing and reuse, we'll blindly propagate those bad values forward. Suppose we experience a memory corruption while writing a MIDX such that we write an incorrect object offset (or alternatively, the disk corrupts the data after being written, but before being reused). Then when we go to write a new MIDX, we'll reuse the bad object offset without checking its validity. This means that the MIDX we just wrote is broken, but its trailing checksum is in-tact, since we never bothered to look at the values before writing. In the above, a "git multi-pack-index verify" would have caught the problem before writing, but writing a new MIDX wouldn't have noticed anything wrong, blindly carrying forward the corrupt offset. Individual pack indexes check their validity by verifying the crc32 attached to each entry when carrying data forward during a repack. We could solve this problem for MIDXs in the same way, but individual crc32's don't make much sense, since their entries are so small. Likewise, checking the whole file on every read may be prohibitively expensive if a repository has a lot of objects, packs, or both. But we can check the trailing checksum when reusing an existing MIDX when writing a new one. And a corrupt MIDX need not stop us from writing a new one, since we can just avoid reusing the existing one at all and pretend as if we are writing a new MIDX from scratch. Suggested-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01midx: allow marking a pack as preferredTaylor Blau
When multiple packs in the multi-pack index contain the same object, the MIDX machinery must make a choice about which pack it associates with that object. Prior to this patch, the lowest-ordered[1] pack was always selected. Pack selection for duplicate objects is relatively unimportant today, but it will become important for multi-pack bitmaps. This is because we can only invoke the pack-reuse mechanism when all of the bits for reused objects come from the reuse pack (in order to ensure that all reused deltas can find their base objects in the same pack). To encourage the pack selection process to prefer one pack over another (the pack to be preferred is the one a caller would like to later use as a reuse pack), introduce the concept of a "preferred pack". When provided, the MIDX code will always prefer an object found in a preferred pack over any other. No format changes are required to store the preferred pack, since it will be able to be inferred with a corresponding MIDX bitmap, by looking up the pack associated with the object in the first bit position (this ordering is described in detail in a subsequent commit). [1]: the ordering is specified by MIDX internals; for our purposes we can consider the "lowest ordered" pack to be "the one with the most-recent mtime. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-25Merge branch 'ds/chunked-file-api' into tb/reverse-midxJunio C Hamano
* ds/chunked-file-api: commit-graph.c: display correct number of chunks when writing chunk-format: add technical docs chunk-format: restore duplicate chunk checks midx: use 64-bit multiplication for chunk sizes midx: use chunk-format read API commit-graph: use chunk-format read API chunk-format: create read chunk API midx: use chunk-format API in write_midx_internal() midx: drop chunk progress during write midx: return success/failure in chunk write methods midx: add num_large_offsets to write_midx_context midx: add pack_perm to write_midx_context midx: add entries to write_midx_context midx: use context in write_midx_pack_names() midx: rename pack_info to write_midx_context commit-graph: use chunk-format write API chunk-format: create chunk format write API commit-graph: anonymize data in chunk_write_fn
2021-02-19midx: use chunk-format read APIDerrick Stolee
Instead of parsing the table of contents directly, use the chunk-format API methods read_table_of_contents() and pair_chunk(). In particular, we can use the return value of pair_chunk() to generate an error when a required chunk is missing. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-26t: prepare for GIT_TEST_WRITE_REV_INDEXTaylor Blau
In the next patch, we'll add support for unconditionally enabling the 'pack.writeReverseIndex' setting with a new GIT_TEST_WRITE_REV_INDEX environment variable. This causes a little bit of fallout with tests that, for example, compare the list of files in the pack directory being unprepared to see .rev files in its output. Those locations can be cleaned up to look for specific file extensions, rather than take everything in the pack directory (for instance) and then grep out unwanted items. Once the pack.writeReverseIndex option has been thoroughly tested, we will default it to 'true', removing GIT_TEST_WRITE_REV_INDEX, and making it possible to revert this patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-09Merge branch 'tb/idx-midx-race-fix'Junio C Hamano
Processes that access packdata while the .idx file gets removed (e.g. while repacking) did not fail or fall back gracefully as they could. * tb/idx-midx-race-fix: midx.c: protect against disappearing packs packfile.c: protect against disappearing indexes
2020-11-26midx.c: protect against disappearing packsTaylor Blau
When a packed object is stored in a multi-pack index, but that pack has racily gone away, the MIDX code simply calls die(), when it could be returning an error to the caller, which would in turn lead to re-scanning the pack directory. A pack can racily disappear, for example, due to a simultaneous 'git repack -ad', You can also reproduce this with two terminals, where one is running: git init while true; do git commit -q --allow-empty -m foo git repack -ad git multi-pack-index write done (in effect, constantly writing new MIDXs), and the other is running: obj=$(git rev-parse HEAD) while true; do echo $obj | git cat-file --batch-check='%(objectsize:disk)' || break done That will sometimes hit the error preparing packfile from multi-pack-index message, which this patch fixes. Right now, that path to discovering a missing pack looks something like 'find_pack_entry()' calling 'fill_midx_entry()' and eventually making its way to call 'nth_midxed_pack_entry()'. 'nth_midxed_pack_entry()' already checks 'is_pack_valid()' and propagates an error if the pack is invalid. So, this works if the pack has gone away between calling 'prepare_midx_pack()' and before calling 'is_pack_valid()', but not if it disappears before then. Catch the case where the pack has already disappeared before 'prepare_midx_pack()' by returning an error in that case, too. Co-authored-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-26packfile.c: protect against disappearing indexesTaylor Blau
In 17c35c8969 (packfile: skip loading index if in multi-pack-index, 2018-07-12) we stopped loading the .idx file for packs that are contained within a multi-pack index. This saves us the effort of loading an .idx and doing some lightweight validity checks by way of 'packfile.c:load_idx()', but introduces a race between processes that need to load the index (e.g., to generate a reverse index) and processes that can delete the index. For example, running the following in your shell: $ git init repo && cd repo $ git commit --allow-empty -m 'base' $ git repack -ad && git multi-pack-index write followed by: $ rm -f .git/objects/pack/pack-*.idx $ git rev-parse HEAD | git cat-file --batch-check='%(objectsize:disk)' will result in a segfault prior to this patch. What's happening here is that we notice that the pack is in the multi-pack index, and so don't check that it still has a .idx. When we then try and load that index to generate a reverse index, we don't have it, so the call to 'find_pack_revindex()' in 'packfile.c:packed_object_info()' returns NULL, and then dereferencing it causes a segfault. Of course, we don't ever expect someone to remove the index file by hand, or to be in a state where we never wrote it to begin with (yet find that pack in the multi-pack-index). But, this can happen in a timing race with 'git repack -ad', which removes all existing packs after writing a new pack containing all of their objects. Avoid this by reverting the hunk of 17c35c8969 which stops loading the index when the pack is contained in a MIDX. This makes the latter half of 17c35c8969 useless, since we'll always have a non-NULL 'p->index_data', in which case that if statement isn't guarding anything. These two together effectively revert 17c35c8969, and avoid the race explained above. Co-authored-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-28Merge branch 'ds/maintenance-part-2'Junio C Hamano
"git maintenance", an extended big brother of "git gc", continues to evolve. * ds/maintenance-part-2: maintenance: add incremental-repack auto condition maintenance: auto-size incremental-repack batch maintenance: add incremental-repack task midx: use start_delayed_progress() midx: enable core.multiPackIndex by default maintenance: create auto condition for loose-objects maintenance: add loose-objects task maintenance: add prefetch task
2020-09-25maintenance: add incremental-repack taskDerrick Stolee
The previous change cleaned up loose objects using the 'loose-objects' that can be run safely in the background. Add a similar job that performs similar cleanups for pack-files. One issue with running 'git repack' is that it is designed to repack all pack-files into a single pack-file. While this is the most space-efficient way to store object data, it is not time or memory efficient. This becomes extremely important if the repo is so large that a user struggles to store two copies of the pack on their disk. Instead, perform an "incremental" repack by collecting a few small pack-files into a new pack-file. The multi-pack-index facilitates this process ever since 'git multi-pack-index expire' was added in 19575c7 (multi-pack-index: implement 'expire' subcommand, 2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1 (midx: implement midx_repack(), 2019-06-10). The 'incremental-repack' task runs the following steps: 1. 'git multi-pack-index write' creates a multi-pack-index file if one did not exist, and otherwise will update the multi-pack-index with any new pack-files that appeared since the last write. This is particularly relevant with the background fetch job. When the multi-pack-index sees two copies of the same object, it stores the offset data into the newer pack-file. This means that some old pack-files could become "unreferenced" which I will use to mean "a pack-file that is in the pack-file list of the multi-pack-index but none of the objects in the multi-pack-index reference a location inside that pack-file." 2. 'git multi-pack-index expire' deletes any unreferenced pack-files and updaes the multi-pack-index to drop those pack-files from the list. This is safe to do as concurrent Git processes will see the multi-pack-index and not open those packs when looking for object contents. (Similar to the 'loose-objects' job, there are some Git commands that open pack-files regardless of the multi-pack-index, but they are rarely used. Further, a user that self-selects to use background operations would likely refrain from using those commands.) 3. 'git multi-pack-index repack --bacth-size=<size>' collects a set of pack-files that are listed in the multi-pack-index and creates a new pack-file containing the objects whose offsets are listed by the multi-pack-index to be in those objects. The set of pack- files is selected greedily by sorting the pack-files by modified time and adding a pack-file to the set if its "expected size" is smaller than the batch size until the total expected size of the selected pack-files is at least the batch size. The "expected size" is calculated by taking the size of the pack-file divided by the number of objects in the pack-file and multiplied by the number of objects from the multi-pack-index with offset in that pack-file. The expected size approximates how much data from that pack-file will contribute to the resulting pack-file size. The intention is that the resulting pack-file will be close in size to the provided batch size. The next run of the incremental-repack task will delete these repacked pack-files during the 'expire' step. In this version, the batch size is set to "0" which ignores the size restrictions when selecting the pack-files. It instead selects all pack-files and repacks all packed objects into a single pack-file. This will be updated in the next change, but it requires doing some calculations that are better isolated to a separate change. These steps are based on a similar background maintenance step in Scalar (and VFS for Git) [1]. This was incredibly effective for users of the Windows OS repository. After using the same VFS for Git repository for over a year, some users had _thousands_ of pack-files that combined to up to 250 GB of data. We noticed a few users were running into the open file descriptor limits (due in part to a bug in the multi-pack-index fixed by af96fe3 (midx: add packs to packed_git linked list, 2019-04-29). These pack-files were mostly small since they contained the commits and trees that were pushed to the origin in a given hour. The GVFS protocol includes a "prefetch" step that asks for pre-computed pack- files containing commits and trees by timestamp. These pack-files were grouped into "daily" pack-files once a day for up to 30 days. If a user did not request prefetch packs for over 30 days, then they would get the entire history of commits and trees in a new, large pack-file. This led to a large number of pack-files that had poor delta compression. By running this pack-file maintenance step once per day, these repos with thousands of packs spanning 200+ GB dropped to dozens of pack- files spanning 30-50 GB. This was done all without removing objects from the system and using a constant batch size of two gigabytes. Once the work was done to reduce the pack-files to small sizes, the batch size of two gigabytes means that not every run triggers a repack operation, so the following run will not expire a pack-file. This has kept these repos in a "clean" state. [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25midx: use start_delayed_progress()Derrick Stolee
Now that the multi-pack-index may be written as part of auto maintenance at the end of a command, reduce the progress output when the operations are quick. Use start_delayed_progress() instead of start_progress(). Update t5319-multi-pack-index.sh to use GIT_PROGRESS_DELAY=0 now that the progress indicators are conditional. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09Merge branch 'tb/repack-clearing-midx'Junio C Hamano
When a packfile is removed by "git repack", multi-pack-index gets cleared; the code was taught to do so less aggressively by first checking if the midx actually refers to a pack that no longer exists. * tb/repack-clearing-midx: midx: traverse the local MIDX first builtin/repack.c: invalidate MIDX only when necessary
2020-08-26builtin/repack.c: invalidate MIDX only when necessaryTaylor Blau
In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack' learned to remove a multi-pack-index file if it added or removed a pack from the object store. This mechanism is a little over-eager, since it is only necessary to drop a MIDX if 'git repack' removes a pack that the MIDX references. Adding a pack outside of the MIDX does not require invalidating the MIDX, and likewise for removing a pack the MIDX does not know about. Teach 'git repack' to check for this by loading the MIDX, and checking whether the to-be-removed pack is known to the MIDX. This requires a slightly odd alternation to a test in t5319, which is explained with a comment. A new test is added to show that the MIDX is left alone when both packs known to it are marked as .keep, but two packs unknown to it are removed and combined into one new pack. Helped-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-25Merge branch 'ds/midx-repack-to-batch-size'Junio C Hamano
The "--batch-size" option of "git multi-pack-index repack" command is now used to specify that very small packfiles are collected into one until the total size roughly exceeds it. * ds/midx-repack-to-batch-size: multi-pack-index: repack batches below --batch-size
2020-08-18multi-pack-index: use hash version byteDerrick Stolee
Similar to the commit-graph format, the multi-pack-index format has a byte in the header intended to track the hash version used to write the file. This allows one to interpret the hash length without having the context of the repository config specifying the hash length. This was not modified as part of the SHA-256 work because the hash length was automatically up-shifted due to that config. Since we have this byte available, we can make the file formats more obviously incompatible instead of relying on other context from the repository. Add a new oid_version() method in midx.c similar to the one in commit-graph.c. This is specifically made separate from that implementation to avoid artificially linking the formats. The test impact requires a few more things than the corresponding change in the commit-graph format. Specifically, 'test-tool read-midx' was not writing anything about this header value to output. Since the value available in 'struct multi_pack_index' is hash_len instead of a version value, we output "20" or "32" instead of "1" or "2". Since we want a user to not have their Git commands fail if their multi-pack-index has the incorrect hash version compared to the repository's hash version, we relax the die() to an error() in load_multi_pack_index(). This has some effect on 'git multi-pack-index verify' as we need to check that a failed parse of a file that exists is actually a verify error. For that test that checks the hash version matches, we change the corrupted byte from "2" to "3" to ensure the test fails for both hash algorithms. Helped-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-12multi-pack-index: repack batches below --batch-sizeDerrick Stolee
The --batch-size=<size> option of 'git multi-pack-index repack' is intended to limit the amount of work done by the repack. In the case of a large repository, this command should repack a number of small pack-files but leave the large pack-files alone. Most often, the repository has one large pack-file from a 'git clone' operation and number of smaller pack-files from incremental 'git fetch' operations. The issue with '--batch-size' is that it also _prevents_ the repack from happening if the expected size of the resulting pack-file is too small. This was intended as a way to avoid frequent churn of small pack-files, but it has mostly caused confusion when a repository is of "medium" size. That is, not enormous like the Windows OS repository, but also not so small that this incremental repack isn't valuable. The solution presented here is to collect pack-files for repack if their expected size is smaller than the batch-size parameter until either the total expected size exceeds the batch-size or all pack-files are considered. If there are at least two pack-files, then these are combined to a new pack-file whose size should not be too much larger than the batch-size. This new strategy should succeed in keeping the number of pack-files small in these "medium" size repositories. The concern about churn is likely not interesting, as the real control over that is the frequency in which the repack command is run. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-30t: remove test_oid_init in testsbrian m. carlson
Now that we call test_oid_init in the setup for all test scripts, there's no point in calling it individually. Remove all of the places where we've done so to help keep tests tidy. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Reviewed-by: Eric Sunshine <sunshine@sunshineco.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-10multi-pack-index: respect repack.packKeptObjects=falseDerrick Stolee
When selecting a batch of pack-files to repack in the "git multi-pack-index repack" command, Git should respect the repack.packKeptObjects config option. When false, this option says that the pack-files with an associated ".keep" file should not be repacked. This config value is "false" by default. There are two cases for selecting a batch of objects. The first is the case where the input batch-size is zero, which specifies "repack everything". The second is with a non-zero batch size, which selects pack-files using a greedy selection criteria. Both of these cases are updated and tested. Reported-by: Son Luong Ngoc <sluongng@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-29Merge branch 'ds/t5319-touch-fix'Junio C Hamano
Tests update to use "test-chmtime" instead of "touch -t". * ds/t5319-touch-fix: t5319: replace 'touch -m' with 'test-tool chmtime'
2020-04-02t5319: replace 'touch -m' with 'test-tool chmtime'Derrick Stolee
The use of 'touch -m' to modify a file's mtime is slightly less portable than using our own 'test-tool chmtime'. The important thing is that these pack-files are ordered in a special way to ensure the multi-pack-index selects some as the "newer" pack-files when resolving duplicate objects. Reported-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-29midx.c: fix an integer underflowDamien Robert
When verifying a midx index with 0 objects, the m->num_objects - 1 underflows and wraps around to 4294967295. Fix this both by checking that the midx contains at least one oid, and also that we don't write any midx when there is no packfiles. Update the tests to check that `git multi-pack-index write` does not write an midx when there is no objects, and another to check that `git multi-pack-index verify` warns when it verifies an midx with no objects. For this last test, use t5319/no-objects.midx which was generated by an older version of git. Signed-off-by: Damien Robert <damien.olivier.robert+git@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-16t5319: make test work with SHA-256brian m. carlson
This test corrupts various locations in a multi-pack index to test various error responses. However, these offsets differ between SHA-1 indexes and SHA-256 indexes due to differences in object length. Use test_oid to look up the correct offsets based on the algorithm. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-16t5319: change invalid offset for SHA-256 compatibilitybrian m. carlson
When using SHA-1, the existing value of the byte we use is 0x13, so writing the byte 0x07 serves to corrupt the test and verify that we detect corruption. However, when we use SHA-256, the value at that offset is already 0x07, so our "corruption" doesn't work and the test fails to detect it. To provide a value that is truly out of range, let's use 0xff, which is not likely to be a valid value as the high byte of a two-byte offset in a multi-pack index this small. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-23multi-pack-index: add [--[no-]progress] option.William Baker
Add the --[no-]progress option to git multi-pack-index. Pass the MIDX_PROGRESS flag to the subcommand functions when progress should be displayed by multi-pack-index. The progress feature was added to 'verify' in 144d703 ("multi-pack-index: report progress during 'verify'", 2018-09-13) but some subcommands were not updated to display progress, and the ability to opt-out was overlooked. Signed-off-by: William Baker <William.Baker@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-07-01t5319: use 'test-tool path-utils' instead of 'ls -l'Derrick Stolee
Using 'ls -l' and parsing the columns to find file sizes is problematic when the platform could report the owner as a name with spaces. Instead, use the 'test-tool path-utils file-size' command to list only the sizes. Reported-by: Johannes Sixt <j6t@kdbg.org> Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-11t5319-multi-pack-index.sh: test batch size zeroDerrick Stolee
The 'git multi-pack-index repack' command can take a batch size of zero, which creates a new pack-file containing all objects in the multi-pack-index. The first 'repack' command will create one new pack-file, and an 'expire' command after that will delete the old pack-files, as they no longer contain any referenced objects in the multi-pack-index. We must remove the .keep file that was added in the previous test in order to expire that pack-file. Also test that a 'repack' will do nothing if there is only one pack-file. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-11midx: add test that 'expire' respects .keep filesDerrick Stolee
The 'git multi-pack-index expire' subcommand may delete packs that are not needed from the perspective of the multi-pack-index. If a pack has a .keep file, then we should not delete that pack. Add a test that ensures we preserve a pack that would otherwise be expired. First, create a new pack that contains every object in the repo, then add it to the multi-pack-index. Then create a .keep file for a pack starting with "a-pack" that was added in the previous test. Finally, expire and verify that the pack remains and the other packs were expired. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-11multi-pack-index: test expire while adding packsDerrick Stolee
During development of the multi-pack-index expire subcommand, a version went out that improperly computed the pack order if a new pack was introduced while other packs were being removed. Part of the subtlety of the bug involved the new pack being placed before other packs that already existed in the multi-pack-index. Add a test to t5319-multi-pack-index.sh that catches this issue. The test adds new packs that cause another pack to be expired, and creates new packs that are lexicographically sorted before and after the existing packs. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-11midx: implement midx_repack()Derrick Stolee
To repack with a non-zero batch-size, first sort all pack-files by their modified time. Second, walk those pack-files from oldest to newest, compute their expected size, and add the packs to a list if they are smaller than the given batch-size. Stop when the total expected size is at least the batch size. If the batch size is zero, select all packs in the multi-pack-index. Finally, collect the objects from the multi-pack-index that are in the selected packs and send them to 'git pack-objects'. Write a new multi-pack-index that includes the new pack. Using a batch size of zero is very similar to a standard 'git repack' command, except that we do not delete the old packs and instead rely on the new multi-pack-index to prevent new processes from reading the old packs. This does not disrupt other Git processes that are currently reading the old packs based on the old multi-pack-index. While first designing a 'git multi-pack-index repack' operation, I started by collecting the batches based on the actual size of the objects instead of the size of the pack-files. This allows repacking a large pack-file that has very few referencd objects. However, this came at a significant cost of parsing pack-files instead of simply reading the multi-pack-index and getting the file information for the pack-files. The "expected size" version provides similar behavior, but could skip a pack-file if the average object size is much larger than the actual size of the referenced objects, or can create a large pack if the actual size of the referenced objects is larger than the expected size. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>