git.99rst.org Git - git.git/log

rebase: extract create_autostash()

In a future commit, we will lib-ify this code. In preparation for
this, extract the code into the create_autostash() function so that it
can be cleaned up before it is finally lib-ified.

This patch is best viewed with `--color-moved` and
`--color-moved-ws=allow-indentation-change`.

Signed-off-by: Denton Liu <redacted>
Signed-off-by: Junio C Hamano <redacted>

reset: extract reset_head() from rebase

Continue the process of lib-ifying the autostash code. In a future
commit, this will be used to implement `--autostash` in other builtins.

This patch is best viewed with `--color-moved`.

Signed-off-by: Denton Liu <redacted>
Signed-off-by: Junio C Hamano <redacted>

rebase: generify reset_head()

In the future, we plan on lib-ifying reset_head() so we need it to
be more generic. Make it more generic by making it accept a
`struct repository` argument instead of implicitly using the non-repo
functions. Also, make it accept a `const char *default_reflog_action`
argument so that the default action of "rebase" isn't hardcoded in.

Signed-off-by: Denton Liu <redacted>
Signed-off-by: Junio C Hamano <redacted>

rebase: use apply_autostash() from sequencer.c

The apply_autostash() function in builtin/rebase.c is similar enough to
the apply_autostash() function in sequencer.c that they are almost
interchangeable, except for the type of arg they accept. Make the
sequencer.c version extern and use it in rebase.

The rebase version was introduced in 6defce2b02 (builtin rebase: support
`--autostash` option, 2018-09-04) as part of the shell to C conversion.
It opted to duplicate the function because, at the time, there was
another in-progress project converting interactive rebase from shell to
C as well and they did not want to clash with them by refactoring
sequencer.c version of apply_autostash(). Since both efforts are long
done, we can freely combine them together now.

Signed-off-by: Denton Liu <redacted>
Signed-off-by: Junio C Hamano <redacted>

sequencer: rename stash_sha1 to stash_oid

The preferred terminology is to refer to object identifiers as "OIDs".
Rename the `stash_sha1` variable to `stash_oid` in order to conform to
this.

Signed-off-by: Denton Liu <redacted>
Signed-off-by: Junio C Hamano <redacted>

sequencer: make apply_autostash() accept a path

In order to make apply_autostash() more generic for future extraction, make
it accept a `path` argument so that the location from where to read the
reference to the autostash commit can be customized. Remove the `opts`
argument since it was unused before anyway.

Signed-off-by: Denton Liu <redacted>
Signed-off-by: Junio C Hamano <redacted>

rebase: use read_oneliner()

Since in sequencer.c, read_one() basically duplicates the functionality
of read_oneliner(), reduce code duplication by replacing read_one() with
read_oneliner().

This was done with the following Coccinelle script

@@
expression a, b;
@@
- read_one(a, b)
+ !read_oneliner(b, a, READ_ONELINER_WARN_MISSING)

and long lines were manually broken up.

Signed-off-by: Denton Liu <redacted>
Signed-off-by: Junio C Hamano <redacted>

repository: mark the "refs" pointer as private

The "refs" pointer in a struct repository starts life as NULL, but then
is lazily initialized when it is accessed via get_main_ref_store().
However, it's easy for calling code to forget this and access it
directly, leading to code which works _some_ of the time, but fails if
it is called before anybody else accesses the refs.

This was the cause of the bug fixed by 5ff4b920eb (sha1-name: do not
assume that the ref store is initialized, 2020-04-09). In order to
prevent similar bugs, let's more clearly mark the "refs" field as
private.

In addition to helping future code, the name change will help us audit
any existing direct uses. Besides get_main_ref_store() itself, it turns
out there is only one. But we know it's OK as it is on the line directly
after the fix from 5ff4b920eb, which will have initialized the pointer.
However it's still a good idea for it to model the proper use of the
accessing function, so we'll convert it.

Signed-off-by: Jeff King <redacted>
Signed-off-by: Junio C Hamano <redacted>

builtin/receive-pack: use constant-time comparison for HMAC value

When we're comparing a push cert nonce, we currently do so using strcmp.
Most implementations of strcmp short-circuit and exit as soon as they
know whether two values are equal.  This, however, is a problem when
we're comparing the output of HMAC, as it leaks information in the time
taken about how much of the two values match if they do indeed differ.

In our case, the nonce is used to prevent replay attacks against our
server via the embedded timestamp and replay attacks using requests from
a different server via the HMAC.  Push certs, which contain the nonces,
are signed, so an attacker cannot tamper with the nonces without
breaking validation of the signature.  They can, of course, create their
own signatures with invalid nonces, but they can also create their own
signatures with valid nonces, so there's nothing to be gained.  Thus,
there is no security problem.

Even though it doesn't appear that there are any negative consequences
from the current technique, for safety and to encourage good practices,
let's use a constant time comparison function for nonce verification.
POSIX does not provide one, but they are easy to write.

The technique we use here is also used in NaCl and the Go standard
library and relies on the fact that bitwise or and xor are constant time
on all known architectures.

We need not be concerned about exiting early if the actual and expected
lengths differ, since the standard cryptographic assumption is that
everyone, including an attacker, knows the format of and algorithm used
in our nonces (and in any event, they have the source code and can
determine it easily).  As a result, we assume everyone knows how long
our nonces should be.  This philosophy is also taken by the Go standard
library and other cryptographic libraries when performing constant time
comparisons on HMAC values.

Signed-off-by: brian m. carlson <redacted>
Signed-off-by: Junio C Hamano <redacted>

sha1-name: do not assume that the ref store is initialized

c931ba4e (sha1-name.c: remove the_repo from handle_one_ref(),
2019-04-16) replaced the use of for_each_ref() helper, which works
with the main ref store of the default repository instance, with
refs_for_each_ref(), which can work on any ref store instance, by
assuming that the repository instance the function is given has its
ref store already initialized.

But it is possible that nobody has initialized it, in which case,
the code ends up dereferencing a NULL pointer.

Reported-by: Érico Rolim <redacted>
Signed-off-by: Junio C Hamano <redacted>

bloom: ignore renames when computing changed paths

The changed-path Bloom filters record an entry in the filter for
every path that was changed. This includes every add and delete,
regardless of whether a rename was detected. Detecting renames
causes significant performance issues, but also will trigger
downloading missing blobs in partial clone.

The simple fix is to disable rename detection when computing a
changed-path Bloom filter. This should already be disabled by
default, but it is good to explicitly enforce the intended
behavior.

Signed-off-by: Derrick Stolee <redacted>
Signed-off-by: Junio C Hamano <redacted>

config.txt: move closing "----" to cover entire listing

Commit 1925fe0c8a ("Documentation: wrap config listings in "----"",
2019-09-07) wrapped this fairly large block of example config directives
in "----". The closing "----" ended up a few lines too early though.
Make sure to include the trailing "IncludeIf.onbranch:..." example, too.

Signed-off-by: Martin Ågren <redacted>
Signed-off-by: Junio C Hamano <redacted>

stash -p: (partially) fix bug concerning split hunks

When trying to stash part of the worktree changes by splitting a hunk
and then only partially accepting the split bits and pieces, the user
is presented with a rather cryptic error:

error: patch failed: <file>:<line>
error: test: patch does not apply
Cannot remove worktree changes

and the command would fail to stash the desired parts of the worktree
changes (even if the `stash` ref was actually updated correctly).

We even have a test case demonstrating that failure, carrying it for
four years already.

The explanation: when splitting a hunk, the changed lines are no longer
separated by more than 3 lines (which is the amount of context lines
Git's diffs use by default), but less than that. So when staging only
part of the diff hunk for stashing, the resulting diff that we want to
apply to the worktree in reverse will contain those changes to be
dropped surrounded by three context lines, but since the diff is
relative to HEAD rather than to the worktree, these context lines will
not match.

Example time. Let's assume that the file README contains these lines:

We
the
people

and the worktree added some lines so that it contains these lines
instead:

We
are
the
kind
people

and the user tries to stash the line containing "are", then the command
will internally stage this line to a temporary index file and try to
revert the diff between HEAD and that index file. The diff hunk that
`git stash` tries to revert will look somewhat like this:

@@ -1776,3 +1776,4
We
+are
the
people

It is obvious, now, that the trailing context lines overlap with the
part of the original diff hunk that the user did *not* want to stash.

Keeping in mind that context lines in diffs serve the primary purpose of
finding the exact location when the diff does not apply precisely (but
when the exact line number in the file to be patched differs from the
line number indicated in the diff), we work around this by reducing the
amount of context lines: the diff was just generated.

Note: this is not a *full* fix for the issue. Just as demonstrated in
t3701's 'add -p works with pathological context lines' test case, there
are ambiguities in the diff format. It is very rare in practice, of
course, to encounter such repeated lines.

The full solution for such cases would be to replace the approach of
generating a diff from the stash and then applying it in reverse by
emulating `git revert` (i.e. doing a 3-way merge). However, in `git
stash -p` it would not apply to `HEAD` but instead to the worktree,
which makes this non-trivial to implement as long as we also maintain a
scripted version of `add -i`.

Signed-off-by: Johannes Schindelin <redacted>
Signed-off-by: Junio C Hamano <redacted>

t3904: fix incorrect demonstration of a bug

In 7e9e048661 (stash -p: demonstrate failure of split with mixed y/n,
2015-04-16), a regression test for a known breakage that was added to
the test script `t3904-stash-patch.sh` that demonstrated that splitting
a hunk and trying to stash only part of that split hunk fails (but
shouldn't).

As expected, it still fails, but for the wrong reason: once the bug is
fixed, we would expect stderr to show nothing, yet the regression test
expects stderr to show something.

Let's fix that by telling that regression test case to expect nothing to
be printed to stderr.

While at it, also drop the obvious left-over from debugging where the
regression test did not mind `git stash -p` to return a non-zero exit
status.

Of course, the regression test still fails, but this time for the
correct reason.

Signed-off-by: Johannes Schindelin <redacted>
Signed-off-by: Junio C Hamano <redacted>

mingw: do not treat `COM0` as a reserved file name

In 4dc42c6c186 (mingw: refuse paths containing reserved names,
2019-12-21), we started disallowing file names that are reserved, e.g.
`NUL`, `CONOUT$`, etc.

This included `COM<n>` where `<n>` is a digit. Unfortunately, this
includes `COM0` but only `COM1`, ..., `COM9` are reserved, according to
the official documentation, `COM0` is mentioned in the "NT Namespaces"
section but it is explicitly _omitted_ from the list of reserved names:
https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions

Tests corroborate this: it is totally possible to write a file called
`com0.c` on Windows 10, but not `com1.c`.

So let's tighten the code to disallow only the reserved `COM<n>` file
names, but to allow `COM0` again.

This fixes https://github.com/git-for-windows/git/issues/2470.

Signed-off-by: Johannes Schindelin <redacted>
Signed-off-by: Junio C Hamano <redacted>

mingw: use modern strftime implementation if possible

Microsoft introduced a new "Universal C Runtime Library" (UCRT) with
Visual Studio 2015. The UCRT comes with a new strftime() implementation
that supports more date formats. We link git against the older
"Microsoft Visual C Runtime Library" (MSVCRT), so to use the UCRT
strftime() we need to load it from ucrtbase.dll using
DECLARE_PROC_ADDR()/INIT_PROC_ADDR().

Most supported Windows systems should have recieved the UCRT via Windows
update, but in some cases only MSVCRT might be available. In that case
we fall back to using that implementation.

With this change, it is possible to use e.g. the `%g` and `%V` date
format specifiers, e.g.

git show -s --format=%cd --date=format:‘%g.%V’ HEAD

Without this change, the user would see this error message on Windows:

fatal: invalid strftime format: '‘%g.%V’'

This fixes https://github.com/git-for-windows/git/issues/2495

Signed-off-by: Matthias Aßhauer <redacted>
Signed-off-by: Johannes Schindelin <redacted>
Signed-off-by: Junio C Hamano <redacted>

subtree: fix build with AsciiDoctor 2

This is a (late) companion for f6461b82b93 (Documentation: fix build
with Asciidoctor 2, 2019-09-15).

Signed-off-by: Johannes Schindelin <redacted>
Signed-off-by: Junio C Hamano <redacted>

format-patch: teach --no-encode-email-headers

When commit subjects or authors have non-ASCII characters, git
format-patch Q-encodes them so they can be safely sent over email.
However, if the patch transfer method is something other than email (web
review tools, sneakernet), this only serves to make the patch metadata
harder to read without first applying it (unless you can decode RFC 2047
in your head). git am as well as some email software supports
non-Q-encoded mail as described in RFC 6531.

Add --[no-]encode-email-headers and format.encodeEmailHeaders to let the
user control this behavior.

Signed-off-by: Emma Brooks <redacted>
Signed-off-by: Junio C Hamano <redacted>

ci: fix the `jobname` of the `GETTEXT_POISON` job

In 6cdccfce1e0f (i18n: make GETTEXT_POISON a runtime option,
2018-11-08), the `jobname` was adjusted to have the `GIT_TEST_` prefix,
but that prefix makes no sense in this context.

Co-authored-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Johannes Schindelin <redacted>
Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

ci/lib: set TERM environment variable if not exist

GitHub Action doesn't set TERM environment variable, which is required
by "tput".

Fallback to dumb if it's not set.

Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

ci/lib: allow running in GitHub Actions

For each CI system we support, we need a specific arm in that if/else
construct in ci/lib.sh. Let's add one for GitHub Actions.

Signed-off-by: Johannes Schindelin <redacted>
Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

ci/lib: if CI type is unknown, show the environment variables

This should help with adding new CI-specific if-else arms.

Signed-off-by: Johannes Schindelin <redacted>
Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

Merge branch 'dd/ci-musl-libc' into HEAD

* dd/ci-musl-libc:
  travis: build and test on Linux with musl libc and busybox
  ci/linux32: libify install-dependencies step
  ci: refactor docker runner script
  ci/linux32: parameterise command to switch arch
  ci/lib-docker: preserve required environment variables
  ci: make MAKEFLAGS available inside the Docker container in the Linux32 job

Merge branch 'dd/test-with-busybox' into HEAD

* dd/test-with-busybox:
  t5703: feed raw data into test-tool unpack-sideband
  t4124: tweak test so that non-compliant diff(1) can also be used
  t7063: drop non-POSIX argument "-ls" from find(1)
  t5616: use rev-parse instead to get HEAD's object_id
  t5003: skip conversion test if unzip -a is unavailable
  t5003: drop the subshell in test_lazy_prereq
  test-lib-functions: test_cmp: eval $GIT_TEST_CMP
  t4061: use POSIX compliant regex(7)

sequencer: make read_oneliner() extern

The function read_oneliner() is a generally useful util function.
Instead of hiding it as a static function within sequencer.c, extern it
so that it can be reused by others.

This patch is best viewed with --color-moved.

Signed-off-by: Denton Liu <redacted>
Signed-off-by: Junio C Hamano <redacted>

sequencer: configurably warn on non-existent files

In the future, we plan on externing read_oneliner(). Future users of
read_oneliner() will want the ability to output warnings in the event
that the `path` doesn't exist. Introduce the
`READ_ONELINER_WARN_MISSING` flag which, if active, would issue a
warning when a file doesn't exist by always executing warning_errno()
in the case where strbuf_read_file() fails.

Signed-off-by: Denton Liu <redacted>
Signed-off-by: Junio C Hamano <redacted>

sequencer: make read_oneliner() accept flags

In a future commit, we will need read_oneliner() to accept flags other
than just `skip_if_empty`. Instead of having an argument for each flag,
teach read_oneliner() to accept the bitfield `flags` instead. For now,
only recognize the `READ_ONELINER_SKIP_IF_EMPTY` flag. More flags will
be added in a future commit.

The result of this is that parallel topics which introduce invocations
of read_oneliner() will still be compatible with this new function
signature since, instead of passing 1 or 0 for `skip_if_empty`, they'll
be passing 1 or 0 to `flags`, which gives equivalent behavior.

Mechanically fix up invocations of read_oneliner() with the following
spatch

@@
expression a, b;
@@
read_oneliner(a, b,
- 1
+ READ_ONELINER_SKIP_IF_EMPTY
)

and manually break up long lines in the result.

Signed-off-by: Denton Liu <redacted>
Signed-off-by: Junio C Hamano <redacted>

sequencer: make file exists check more efficient

We currently check whether a file exists and return early before reading
the file. Instead of accessing the file twice, always read the file and
check `errno` to see if the file doesn't exist.

Signed-off-by: Denton Liu <redacted>
Signed-off-by: Junio C Hamano <redacted>

diff: restrict when prefetching occurs

Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.

diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
     diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
     detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
     string)

Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.

Helped-by: Jeff King <redacted>
Signed-off-by: Jonathan Tan <redacted>
Signed-off-by: Junio C Hamano <redacted>

diff: refactor object read

Refactor the object reads in diff_populate_filespec() to have the first
object read not be in an if/else branch, because in a future patch, a
retry will be added to that first object read.

Signed-off-by: Jonathan Tan <redacted>
Signed-off-by: Junio C Hamano <redacted>

diff: make diff_populate_filespec_options struct

The behavior of diff_populate_filespec() currently can be customized
through a bitflag, but a subsequent patch requires it to support a
non-boolean option. Replace the bitflag with an options struct.

Signed-off-by: Jonathan Tan <redacted>
Signed-off-by: Junio C Hamano <redacted>

sequencer: honor GIT_REFLOG_ACTION

There is a lot of code to honor GIT_REFLOG_ACTION throughout git,
including some in sequencer.c; unfortunately, reflog_message() and its
callers ignored it. Instruct reflog_message() to check the existing
environment variable, and use it when present as an override to
action_name().

Also restructure pick_commits() to only temporarily modify
GIT_REFLOG_ACTION for a short duration and then restore the old value,
so that when we do this setting within a loop we do not keep adding "
(pick)" substrings and end up with a reflog message of the form
rebase (pick) (pick) (pick) (finish): returning to refs/heads/master

Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

travis: build and test on Linux with musl libc and busybox

Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

ci/linux32: libify install-dependencies step

In a later patch, we will add new Travis Job for linux-musl.
Most of other code in this file could be reuse for that job.

Move the code to install dependencies to a common script.
Should we add new CI system that can run directly in container,
we can reuse this script for installation step.

Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

ci: refactor docker runner script

We will support alpine check in docker later in this series.

While we're at it, tell people to run as root in podman,
if podman is used as drop-in replacement for docker,
because podman will map host-user to container's root,
therefore, mapping their permission.

Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

ci/linux32: parameterise command to switch arch

In a later patch, the remaining of this command will be re-used for the
CI job for linux with musl libc.

Allow customisation of the emulator, now.

Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

ci/lib-docker: preserve required environment variables

We're using "su -m" to preserve environment variables in the shell run
by "su". But, that options will be ignored while "-l" (aka "--login") is
specified in util-linux and busybox's su.

In a later patch this script will be reused for checking Git for Linux
with musl libc on Alpine Linux, Alpine Linux uses "su" from busybox.

Since we don't have interest in all environment variables,
pass only those necessary variables to the inner script.

Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

doc: --recurse-submodules mostly applies to active submodules

The documentation refers to "initialized" or "populated" submodules,
to explain which submodules are affected by '--recurse-submodules', but
the real terminology here is 'active' submodules. Update the
documentation accordingly.

Some terminology:
- Active is defined in gitsubmodules(7), it only involves the
  configuration variables 'submodule.active', 'submodule.<name>.active'
  and 'submodule.<name>.url'. The function
  submodule.c::is_submodule_active checks that a submodule is active.
- Populated means that the submodule's working tree is present (and the
  gitfile correctly points to the submodule repository), i.e. either the
  superproject was cloned with ` --recurse-submodules`, or the user ran
  `git submodule update --init`, or `git submodule init [<path>]` and
  `git submodule update [<path>]` separately which populated the
  submodule working tree. This does not involve the 3 configuration
  variables above.
- Initialized (at least in the context of the man pages involved in this
  patch) means both "populated" and "active" as defined above, i.e. what
  `git submodule update --init` does.

The --recurse-submodules option mostly affects active submodules. An
exception is `git fetch` where the option affects populated submodules.
As a consequence, in `git pull --recurse-submodules` the fetch affects
populated submodules, but the resulting working tree update only affects
active submodules.

In the documentation of `git-pull`, let's distinguish between the
fetching part which affects populated submodules, and the updating of
worktrees, which only affects active submodules.

Signed-off-by: Damien Robert <redacted>
Helped-by: Philippe Blain <redacted>
Signed-off-by: Junio C Hamano <redacted>

doc: be more precise on (fetch|push).recurseSubmodules

The default value also depends on the value of submodule.recurse.
Use this opportunity to correct some grammar mistakes in
Documentation/config/fetch.txt signaled by Robert P. J. Day.

Also mention `fetch.recurseSubmodules` in fetch-options.txt. In
git-push.txt, `push.recurseSubmodules` is implicitly mentioned (by
explaining how to disable it), so no need to add it there.

Lastly add a link to `git-fetch` in `git-pull.txt` to explain the
meaning of `--recurse-submodules` there.

Signed-off-by: Damien Robert <redacted>
Signed-off-by: Junio C Hamano <redacted>

doc: explain how to deactivate submodule.recurse completely

Signed-off-by: Damien Robert <redacted>
Signed-off-by: Junio C Hamano <redacted>

doc: document --recurse-submodules for reset and restore

Also unify the formulation about --no-recurse-submodules for checkout
and switch, which we reuse for restore.

And correct the formulation about submodules' HEAD in read-tree, which
we reuse in reset.

Signed-off-by: Damien Robert <redacted>
Signed-off-by: Junio C Hamano <redacted>

doc: list all commands affected by submodule.recurse

Note that `ls-files` is not affected, even though it has a
`--recurse-submodules` option, so list it as an exception too.

Signed-off-by: Damien Robert <redacted>
Signed-off-by: Junio C Hamano <redacted>

fast-import: replace custom hash with hashmap.c

We use a custom hash in fast-import to store the set of objects we've
imported so far. It has a fixed set of 2^16 buckets and chains any
collisions with a linked list. As the number of objects grows larger
than that, the load factor increases and we degrade to O(n) lookups and
O(n^2) insertions.

We can scale better by using our hashmap.c implementation, which will
resize the bucket count as we grow. This does incur an extra memory cost
of 8 bytes per object, as hashmap stores the integer hash value for each
entry in its hashmap_entry struct (which we really don't care about
here, because we're just reusing the embedded object hash). But I think
the numbers below justify this (and our per-object memory cost is
already much higher).

I also looked at using khash, but it seemed to perform slightly worse
than hashmap at all sizes, and worse even than the existing code for
small sizes. It's also awkward to use here, because we want to look up a
"struct object_entry" from a "struct object_id", and it doesn't handle
mismatched keys as well. Making a mapping of object_id to object_entry
would be more natural, but that would require pulling the embedded oid
out of the object_entry or incurring an extra 32 bytes per object.

In a synthetic test creating as many cheap, tiny objects as possible

  perl -e '
      my $bits = shift;
      my $nr = 2**$bits;

      for (my $i = 0; $i < $nr; $i++) {
              print "blob\n";
              print "data 4\n";
              print pack("N", $i);
      }
  ' $bits | git fast-import

I got these results:

  nr_objects   master       khash      hashmap
  2^20         0m4.317s     0m5.109s   0m3.890s
  2^21         0m10.204s    0m9.702s   0m7.933s
  2^22         0m27.159s    0m17.911s  0m16.751s
  2^23         1m19.038s    0m35.080s  0m31.963s
  2^24         4m18.766s    1m10.233s  1m6.793s

which points to hashmap as the winner. We didn't have any perf tests for
fast-export or fast-import, so I added one as a more real-world case.
It uses an export without blobs since that's significantly cheaper than
a full one, but still is an interesting case people might use (e.g., for
rewriting history). It will emphasize this change in some ways (as a
percentage we spend more time making objects and less shuffling blob
bytes around) and less in others (the total object count is lower).

Here are the results for linux.git:

  Test                        HEAD^                 HEAD
  ----------------------------------------------------------------------------
  9300.1: export (no-blobs)   67.64(66.96+0.67)     67.81(67.06+0.75) +0.3%
  9300.2: import (no-blobs)   284.04(283.34+0.69)   198.09(196.01+0.92) -30.3%

It only has ~5.2M commits and trees, so this is a larger effect than I
expected (the 2^23 case above only improved by 50s or so, but here we
gained almost 90s). This is probably due to actually performing more
object lookups in a real import with trees and commits, as opposed to
just dumping a bunch of blobs into a pack.

Signed-off-by: Jeff King <redacted>
Signed-off-by: Junio C Hamano <redacted>

commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag

Add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag to the test setup suite
in order to toggle writing Bloom filters when running any of the git tests.
If set to true, we will compute and write Bloom filters every time a test
calls `git commit-graph write`, as if the `--changed-paths` option was
passed in.

The test suite passes when GIT_TEST_COMMIT_GRAPH and
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS are enabled.

Helped-by: Derrick Stolee <redacted>
Signed-off-by: Garima Singh <redacted>
Signed-off-by: Junio C Hamano <redacted>

t4216: add end to end tests for git log with Bloom filters

These tests exercises writing commit graph with Bloom filters
and exercises 'git log -- path' with all the applicable
options. They check that the output is the same with and
without Bloom filters, confirm Bloom filters were used by
checking if trace2 statistics were logged correctly.

Also confirms cases where Bloom filters are not used:
1. Multiple path specs,
2. --walk-reflogs (see patch titled 'revision.c: use Bloom filters...'
for details,
3. If the latest commit graph does not have Bloom filters

Signed-off-by: Garima Singh <redacted>
Signed-off-by: Junio C Hamano <redacted>

revision.c: add trace2 stats around Bloom filter usage

Add trace2 statistics around Bloom filter usage and behavior
for 'git log -- path' commands that are hoping to benefit from
the presence of computed changed paths Bloom filters.

These statistics are great for performance analysis work and
for formal testing, which we will see in the commit following
this one.

Helped-by: Derrick Stolee <dstolee@microsoft.com
Helped-by: SZEDER Gábor <redacted>
Helped-by: Jonathan Tan <redacted>
Signed-off-by: Garima Singh <redacted>
Signed-off-by: Junio C Hamano <redacted>

revision.c: use Bloom filters to speed up path based revision walks

Revision walk will now use Bloom filters for commits to speed up
revision walks for a particular path (for computing history for
that path), if they are present in the commit-graph file.

We load the Bloom filters during the prepare_revision_walk step,
currently only when dealing with a single pathspec. Extending
it to work with multiple pathspecs can be explored and built on
top of this series in the future.

While comparing trees in rev_compare_trees(), if the Bloom filter
says that the file is not different between the two trees, we don't
need to compute the expensive diff. This is where we get our
performance gains. The other response of the Bloom filter is '`:maybe',
in which case we fall back to the full diff calculation to determine
if the path was changed in the commit.

We do not try to use Bloom filters when the '--walk-reflogs' option
is specified. The '--walk-reflogs' option does not walk the commit
ancestry chain like the rest of the options. Incorporating the
performance gains when walking reflog entries would add more
complexity, and can be explored in a later series.

Performance Gains:
We tested the performance of `git log -- <path>` on the git repo, the linux
and some internal large repos, with a variety of paths of varying depths.

On the git and linux repos:
- we observed a 2x to 5x speed up.

On a large internal repo with files seated 6-10 levels deep in the tree:
- we observed 10x to 20x speed ups, with some paths going up to 28 times
faster.

Helped-by: Derrick Stolee <dstolee@microsoft.com
Helped-by: SZEDER Gábor <redacted>
Helped-by: Jonathan Tan <redacted>
Signed-off-by: Garima Singh <redacted>
Signed-off-by: Junio C Hamano <redacted>

commit-graph: add --changed-paths option to write subcommand

Add --changed-paths option to git commit-graph write. This option will
allow users to compute information about the paths that have changed
between a commit and its first parent, and write it into the commit graph
file. If the option is passed to the write subcommand we set the
COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
commit-graph logic.

Helped-by: Derrick Stolee <redacted>
Signed-off-by: Garima Singh <redacted>
Signed-off-by: Junio C Hamano <redacted>

commit-graph: reuse existing Bloom filters during write

Add logic to
a) parse Bloom filter information from the commit graph file and,
b) re-use existing Bloom filters.

See Documentation/technical/commit-graph-format for the format in which
the Bloom filter information is written to the commit graph file.

To read Bloom filter for a given commit with lexicographic position
'i' we need to:
1. Read BIDX[i] which essentially gives us the starting index in BDAT for
   filter of commit i+1. It is essentially the index past the end
   of the filter of commit i. It is called end_index in the code.

2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
   for filter of commit i. It is called the start_index in the code.
   For the first commit, where i = 0, Bloom filter data starts at the
   beginning, just past the header in the BDAT chunk. Hence, start_index
   will be 0.

3. The length of the filter will be end_index - start_index, because
   BIDX[i] gives the cumulative 8-byte words including the ith
   commit's filter.

We toggle whether Bloom filters should be recomputed based on the
compute_if_not_present flag.

Helped-by: Derrick Stolee <redacted>
Signed-off-by: Garima Singh <redacted>
Signed-off-by: Junio C Hamano <redacted>

commit-graph: write Bloom filters to commit graph file

Update the technical documentation for commit-graph-format with
the formats for the Bloom filter index (BIDX) and Bloom filter
data (BDAT) chunks. Write the computed Bloom filters information
to the commit graph file using this format.

Helped-by: Derrick Stolee <redacted>
Signed-off-by: Garima Singh <redacted>
Signed-off-by: Junio C Hamano <redacted>

pull doc: correct outdated description of an example

Since f269048754 (fetch: opportunistically update tracking refs,
2013-05-11), the underlying `git fetch` in `git pull <remote> <branch>`
updates the configured remote-tracking branch for <branch>.

However, an example in the 'Examples' section of the `git pull`
documentation still states that this is not the case.

Correct the description of this example.

Signed-off-by: Philippe Blain <redacted>
Signed-off-by: Junio C Hamano <redacted>

pull doc: refer to a specific section in 'fetch' doc

The documentation for the `<refspec>` parameter in the `git fetch`
documentation refers to the section "CONFIGURED REMOTE-TRACKING
BRANCHES" in this same documentation page.

In the `git pull` documentation, let's also refer specifically to this
section instead of just linking to the `git fetch` documentation.

Signed-off-by: Philippe Blain <redacted>
Signed-off-by: Junio C Hamano <redacted>

t0007: fix a typo

Signed-off-by: Johannes Schindelin <redacted>
Signed-off-by: Junio C Hamano <redacted>

Makefile: avoid running curl-config unnecessarily

Commit 94a88e2524 (Makefile: avoid running curl-config multiple times,
2020-03-26) put the call to $(CURL_CONFIG) into a "simple" variable
which is expanded immediately, rather than expanding it each time it's
needed. However, that also means that we expand it whenever the Makefile
is parsed, whether we need it or not.

This is wasteful, but also breaks the ci/test-documentation.sh job, as
it does not have curl at all and complains about the extra messages to
stderr. An easy way to see it is just:

  $ make CURL_CONFIG=does-not-work check-builtins
  make: does-not-work: Command not found
  make: does-not-work: Command not found
  GIT_VERSION = 2.26.0.108.gb3f3f45f29
  make: does-not-work: Command not found
  make: does-not-work: Command not found
  ./check-builtins.sh

We can get the best of both worlds if we're willing to accept a little
Makefile hackery. Courtesy of the article at:

  http://make.mad-scientist.net/deferred-simple-variable-expansion/

this patch uses a lazily-evaluated recursive variable which replaces its
contents with an immediately assigned simple one on first use.

Reported-by: Johannes Schindelin <redacted>
Signed-off-by: Jeff King <redacted>
Signed-off-by: Junio C Hamano <redacted>

git-rebase.txt: add another hook to the hooks section, and explain more

For more discussion about these hooks, their history relative to rebase,
and logical consistency between different types of operations, see
https://lore.kernel.org/git/CABPp-BG0bFKUage5cN_2yr2DkmS04W2Z9Pg5WcROqHznV3XBdw@mail.gmail.com/
and the links to some threads referenced therein.

Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

sequencer: stop leaking buf

In read_populate_opts(), we call read_oneliner() _after_ calling
strbuf_release(). This means that `buf` is leaked at the end of the
function.

Always clean up the strbuf by going to `done_rebase_i` whether or not
we return an error.

Signed-off-by: Denton Liu <redacted>
Signed-off-by: Junio C Hamano <redacted>

Documentation: document merge option --no-gpg-sign

Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

Documentation: merge commit-tree --[no-]gpg-sign

Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

Documentation: reword commit --no-gpg-sign

Merge with --gpg-sign option, and clarify that --no-gpg-sign also
override earlier --gpg-sign.

Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

Documentation: document am --no-gpg-sign

Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

cherry-pick/revert: honour --no-gpg-sign in all case

{cherry-pick,revert} --edit hasn't honoured --no-gpg-sign yet.

Pass this option down to git-commit to honour it.

Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

rebase.c: honour --no-gpg-sign

Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

promisor-remote: accept 0 as oid_nr in function

There are 3 callers to promisor_remote_get_direct() that first check if
the number of objects to be fetched is equal to 0. Fold that check into
promisor_remote_get_direct(), and in doing so, be explicit as to what
promisor_remote_get_direct() does if oid_nr is 0 (it returns 0, success,
immediately).

Signed-off-by: Jonathan Tan <redacted>
Signed-off-by: Junio C Hamano <redacted>

git-submodule.sh: setup uninitialized variables

We have an environment variable `jobs=16` defined in our CI system, and
this environment makes our build job failed with the following message:

error: pathspec '16' did not match any file(s) known to git

The pathspec '16' for Git command is from the environment variable
"jobs".

This is because "git-submodule" command is implemented in shell script,
and environment variables may change its behavior. Set values for
uninitialized variables, such as "jobs" and "recommend_shallow" will
fix this issue.

Helped-by: Jiang Xin <redacted>
Signed-off-by: Li Xuejiang <redacted>
Signed-off-by: Junio C Hamano <redacted>

update-ref: implement interactive transaction handling

The git-update-ref(1) command can only handle queueing transactions
right now via its "--stdin" parameter, but there is no way for users to
handle the transaction itself in a more explicit way. E.g. in a
replicated scenario, one may imagine a coordinator that spawns
git-update-ref(1) for multiple repositories and only if all agree that
an update is possible will the coordinator send a commit. Such a
transactional session could look like

    > start
    < start: ok
    > update refs/heads/master $OLD $NEW
    > prepare
    < prepare: ok
    # All nodes have returned "ok"
    > commit
    < commit: ok

or

    > start
    < start: ok
    > create refs/heads/master $OLD $NEW
    > prepare
    < fatal: cannot lock ref 'refs/heads/master': reference already exists
    # On all other nodes:
    > abort
    < abort: ok

In order to allow for such transactional sessions, this commit
introduces four new commands for git-update-ref(1), which matches those
we have internally already with the exception of "start":

    - start: start a new transaction

    - prepare: prepare the transaction, that is try to lock all
               references and verify their current value matches the
               expected one

    - commit: explicitly commit a session, that is update references to
              match their new expected state

    - abort: abort a session and roll back all changes

By design, git-update-ref(1) will commit as soon as standard input is
being closed. While fine in a non-transactional world, it is definitely
unexpected in a transactional world. Because of this, as soon as any of
the new transactional commands is used, the default will change to
aborting without an explicit "commit". To avoid a race between queueing
updates and the first "prepare" that starts a transaction, the "start"
command has been added to start an explicit transaction.

Add some tests to exercise this new functionality.

Signed-off-by: Patrick Steinhardt <redacted>
Signed-off-by: Junio C Hamano <redacted>

update-ref: read commands in a line-wise fashion

The git-update-ref(1) supports a `--stdin` mode that allows it to read
all reference updates from standard input. This is mainly used to allow
for atomic reference updates that are all or nothing, so that either all
references will get updated or none.

Currently, git-update-ref(1) reads all commands as a single block of up
to 1000 characters and only starts processing after stdin gets closed.
This is less flexible than one might wish for, as it doesn't really
allow for longer-lived transactions and doesn't allow any verification
without committing everything. E.g. one may imagine the following
exchange:

    > start
    < start: ok
    > update refs/heads/master $NEWOID1 $OLDOID1
    > update refs/heads/branch $NEWOID2 $OLDOID2
    > prepare
    < prepare: ok
    > commit
    < commit: ok

When reading all input as a whole block, the above interactive protocol
is obviously impossible to achieve. But by converting the command to
read commands linewise, we can make it more interactive than before.

Obviously, the linewise interface is only a first step in making
git-update-ref(1) work in a more transaction-oriented way. Missing is
most importantly support for transactional commands that manage the
current transaction.

Signed-off-by: Patrick Steinhardt <redacted>
Signed-off-by: Junio C Hamano <redacted>

update-ref: move transaction handling into `update_refs_stdin()`

While the actual logic to update the transaction is handled in
`update_refs_stdin()`, the transaction itself is started and committed
in `cmd_update_ref()` itself. This makes it hard to handle transaction
abortion and commits as part of `update_refs_stdin()` itself, which is
required in order to introduce transaction handling features to `git
update-refs --stdin`.

Refactor the code to move all transaction handling into
`update_refs_stdin()` to prepare for transaction handling features.

Signed-off-by: Patrick Steinhardt <redacted>
Signed-off-by: Junio C Hamano <redacted>

update-ref: pass end pointer instead of strbuf

We currently pass both an `strbuf` containing the current command line
as well as the `next` pointer pointing to the first argument to
commands. This is both confusing and makes code more intertwined.
Convert this to use a simple pointer as well as a pointer pointing to
the end of the input as a preparatory step to line-wise reading of
stdin.

Signed-off-by: Patrick Steinhardt <redacted>
Signed-off-by: Junio C Hamano <redacted>

update-ref: drop unused argument for `parse_refname`

The `parse_refname` function accepts a `struct strbuf *input` argument
that isn't used at all. As we're about to convert commands to not use a
strbuf anymore but instead an end pointer, let's drop this argument now
to make the converting commit easier to review.

Signed-off-by: Patrick Steinhardt <redacted>
Signed-off-by: Junio C Hamano <redacted>

update-ref: organize commands in an array

We currently manually wire up all commands known to `git-update-ref
--stdin`, making it harder than necessary to preprocess arguments after
the command is determined. To make this more extensible, let's refactor
the code to use an array of known commands instead. While this doesn't
add a lot of value now, it is a preparatory step to implement line-wise
reading of commands.

As we're going to introduce commands without trailing spaces, this
commit also moves whitespace parsing into the respective commands.

Signed-off-by: Patrick Steinhardt <redacted>
Signed-off-by: Junio C Hamano <redacted>

ci: make MAKEFLAGS available inside the Docker container in the Linux32 job

Once upon a time we ran 'make --jobs=2 ...' to build Git, its
documentation, or to apply Coccinelle semantic patches.  Then commit
eaa62291ff (ci: inherit --jobs via MAKEFLAGS in run-build-and-tests,
2019-01-27) came along, and started using the MAKEFLAGS environment
variable to centralize setting the number of parallel jobs in
'ci/libs.sh'.  Alas, it forgot to update 'ci/run-linux32-docker.sh' to
make MAKEFLAGS available inside the Docker container running the 32
bit Linux job, and, consequently, since then that job builds Git
sequentially, and it ignores any Makefile knobs that we might set in
MAKEFLAGS (though we don't set any for the 32 bit Linux job at the
moment).

So update the 'docker run' invocation in 'ci/run-linux32-docker.sh' to
make MAKEFLAGS available inside the Docker container as well.  Set
CC=gcc for the 32 bit Linux job, because that's the compiler installed
in the 32 bit Linux Docker image that we use (Travis CI nowadays sets
CC=clang by default, but clang is not installed in this image).

Signed-off-by: SZEDER Gábor <redacted>
Signed-off-by: Đoàn Trần Công Danh <redacted>
Signed-off-by: Junio C Hamano <redacted>

t5319: replace 'touch -m' with 'test-tool chmtime'

The use of 'touch -m' to modify a file's mtime is slightly less
portable than using our own 'test-tool chmtime'. The important
thing is that these pack-files are ordered in a special way to
ensure the multi-pack-index selects some as the "newer" pack-files
when resolving duplicate objects.

Reported-by: Jeff King <redacted>
Signed-off-by: Derrick Stolee <redacted>
Signed-off-by: Junio C Hamano <redacted>

commit-graph: fix buggy --expire-time option

The commit-graph builtin has an --expire-time option that takes a
datetime using OPT_EXPIRY_DATE(). However, the implementation inside
expire_commit_graphs() was treating a non-zero value as a number of
seconds to subtract from "now".

Update t5323-split-commit-graph.sh to demonstrate the correct value
of the --expire-time option by actually creating a crud .graph file
with mtime earlier than the expire time. Instead of using a super-
early time (1980) we use an explicit, and recent, time. Using
test-tool chmtime to create two files on either end of an exact
second, we create a test that catches this failure no matter the
current time. Using a fixed date is more portable than trying to
format a relative date string into the --expiry-date input.

I noticed this when inspecting some Scalar repos that had an excess
number of commit-graph files. In Scalar, we were using this second
interpretation by using "--expire-time=3600" to mean "delete graphs
older than one hour ago" to avoid deleting a commit-graph that a
foreground process may be trying to load.

Also I noticed that the help text was copied from the --max-commits
option. Fix that help text.

Signed-off-by: Derrick Stolee <redacted>
Signed-off-by: Junio C Hamano <redacted>

completion: fix 'git add' on paths under an untracked directory

As reported on the git mailing list, since git-2.25,
    git add untracked-dir/
has been tab completing to
    git add untracked-dir/./

The cause for this was that with commit b9670c1f5e (dir: fix checks on
common prefix directory, 2019-12-19),
    git ls-files -o --directory untracked-dir/
(or the equivalent `git -C untracked-dir ls-files -o --directory`) began
reporting
    untracked-dir/
instead of listing paths underneath that directory.  It may also be
worth noting that the real command in question was
    git -C untracked-dir ls-files -o --directory '*'
which is equivalent to
    git ls-files -o --directory 'untracked-dir/*'
which behaves the same for the purposes of this issue (the '*' can match
the empty string), but becomes relevant for the proposed fix.

At first, based on the report, I decided to try to view this as a
regression and tried to find a way to recover the old behavior without
breaking other stuff, or at least breaking as little as possible.
However, in the end, I couldn't figure out a way to do it that wouldn't
just cause lots more problems than it solved.  The old behavior was a
bug:
  * Although older git would avoid cleaning anything with `git clean -f
    .git`, it would wipe out everything under that direcotry with `git
    clean -f .git/`.  Despite the difference in command used, this is
    relevant because the exact same change that fixed clean changed the
    behavior of ls-files.
  * Older git would report different results based solely on presence or
    absence of a trailing slash for $SUBDIR in the command `git ls-files
    -o --directory $SUBDIR`.
  * Older git violated the documented behavior of not recursing into
    directories that matched the pathspec when --directory was
    specified.
  * And, after all, commit b9670c1f5e (dir: fix checks on common prefix
    directory, 2019-12-19) didn't overlook this issue; it explicitly
    stated that the behavior of the command was being changed to bring
    it inline with the docs.

(Also, if it helps, despite that commit being merged during the 2.25
series, this bug was not reported during the 2.25 cycle, nor even during
most of the 2.26 cycle -- it was reported a day before 2.26 was
released.  So the impact of the change is at least somewhat small.)

Instead of relying on a bug of ls-files in reporting the wrong content,
change the invocation of ls-files used by git-completion to make it grab
paths one depth deeper.  Do this by changing '$DIR/*' (match $DIR/ plus
0 or more characters) into '$DIR/?*' (match $DIR/ plus 1 or more
characters).  Note that the '?' character should not be added when
trying to complete a filename (e.g. 'git ls-files -o --directory
"merge.c?*"' would not correctly return "merge.c" when such a file
exists), so we have to make sure to add the '?' character only in cases
where the path specified so far is a directory.

Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

Fix error-prone fill_directory() API; make it only return matches

Traditionally, the expected calling convention for the dir.c API was:

    fill_directory(&dir, ..., pathspec)
    foreach entry in dir->entries:
        if (dir_path_match(entry, pathspec))
            process_or_display(entry)

This may have made sense once upon a time, because the fill_directory() call
could use cheap checks to avoid doing full pathspec matching, and an external
caller may have wanted to do other post-processing of the results anyway.
However:

    * this structure makes it easy for users of the API to get it wrong

    * this structure actually makes it harder to understand
      fill_directory() and the functions it uses internally.  It has
      tripped me up several times while trying to fix bugs and
      restructure things.

    * relying on post-filtering was already found to produce wrong
      results; pathspec matching had to be added internally for multiple
      cases in order to get the right results (see commits 404ebceda01c
      (dir: also check directories for matching pathspecs, 2019-09-17)
      and 89a1f4aaf765 (dir: if our pathspec might match files under a
      dir, recurse into it, 2019-09-17))

    * it's bad for performance: fill_directory() already has to do lots
      of checks and knows the subset of cases where it still needs to do
      more checks.  Forcing external callers to do full pathspec
      matching means they must re-check _every_ path.

So, add the pathspec matching within the fill_directory() internals, and
remove it from external callers.

Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

dir: replace double pathspec matching with single in treat_directory()

treat_directory() had a call to both do_match_pathspec() and
match_pathspec(). These calls have migrated through the code somewhat
since their introduction, but we don't actually need both. Replace the
two calls with one, and while at it, move the check earlier in order to
reduce the need for callers of fill_directory() to do post-filtering of
results.

The next patch will address post-filtering more forcefully and provide
more relevant history and context.

Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

dir: include DIR_KEEP_UNTRACKED_CONTENTS handling in treat_directory()

Handling DIR_KEEP_UNTRACKED_CONTENTS within treat_directory() instead of
as a post-processing step in read_directory():
  * allows us to directly access and remove the relevant entries instead
    of needing to calculate which ones need to be removed
  * keeps the logic for directory handling in one location (and puts it
    closer the the logic for stripping out extra ignored entries, which
    seems logical).

Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

dir: replace exponential algorithm with a linear one

dir's read_directory_recursive() naturally operates recursively in order
to walk the directory tree.  Treating of directories is sometimes weird
because there are so many different permutations about how to handle
directories.  Some examples:

   * 'git ls-files -o --directory' only needs to know that a directory
     itself is untracked; it doesn't need to recurse into it to see what
     is underneath.

   * 'git status' needs to recurse into an untracked directory, but only
     to determine whether or not it is empty.  If there are no files
     underneath, the directory itself will be omitted from the output.
     If it is not empty, only the directory will be listed.

   * 'git status --ignored' needs to recurse into untracked directories
     and report all the ignored entries and then report the directory as
     untracked -- UNLESS all the entries under the directory are
     ignored, in which case we don't print any of the entries under the
     directory and just report the directory itself as ignored.  (Note
     that although this forces us to walk all untracked files underneath
     the directory as well, we strip them from the output, except for
     users like 'git clean' who also set DIR_KEEP_TRACKED_CONTENTS.)

   * For 'git clean', we may need to recurse into a directory that
     doesn't match any specified pathspecs, if it's possible that there
     is an entry underneath the directory that can match one of the
     pathspecs.  In such a case, we need to be careful to omit the
     directory itself from the list of paths (see commit 404ebceda01c
     ("dir: also check directories for matching pathspecs", 2019-09-17))

Part of the tension noted above is that the treatment of a directory can
change based on the files within it, and based on the various settings
in dir->flags.  Trying to keep this in mind while reading over the code,
it is easy to think in terms of "treat_directory() tells us what to do
with a directory, and read_directory_recursive() is the thing that
recurses".  Since we need to look into a directory to know how to treat
it, though, it is quite easy to decide to (also) recurse into the
directory from treat_directory() by adding a read_directory_recursive()
call.  Adding such a call is actually fine, IF we make sure that
read_directory_recursive() does not also recurse into that same
directory.

Unfortunately, commit df5bcdf83aeb ("dir: recurse into untracked dirs
for ignored files", 2017-05-18), added exactly such a case to the code,
meaning we'd have two calls to read_directory_recursive() for an
untracked directory.  So, if we had a file named
   one/two/three/four/five/somefile.txt
and nothing in one/ was tracked, then 'git status --ignored' would
call read_directory_recursive() twice on the directory 'one/', and
each of those would call read_directory_recursive() twice on the
directory 'one/two/', and so on until read_directory_recursive() was
called 2^5 times for 'one/two/three/four/five/'.

Avoid calling read_directory_recursive() twice per level by moving a
lot of the special logic into treat_directory().

Since dir.c is somewhat complex, extra cruft built up around this over
time.  While trying to unravel it, I noticed several instances where the
first call to read_directory_recursive() would return e.g.
path_untracked for some directory and a later one would return e.g.
path_none, despite the fact that the directory clearly should have been
considered untracked.  The code happened to work due to the side-effect
from the first invocation of adding untracked entries to dir->entries;
this allowed it to get the correct output despite the supposed override
in return value by the later call.

I am somewhat concerned that there are still bugs and maybe even
testcases with the wrong expectation.  I have tried to carefully
document treat_directory() since it becomes more complex after this
change (though much of this complexity came from elsewhere that probably
deserved better comments to begin with).  However, much of my work felt
more like a game of whackamole while attempting to make the code match
the existing regression tests than an attempt to create an
implementation that matched some clear design.  That seems wrong to me,
but the rules of existing behavior had so many special cases that I had
a hard time coming up with some overarching rules about what correct
behavior is for all cases, forcing me to hope that the regression tests
are correct and sufficient.  Such a hope seems likely to be ill-founded,
given my experience with dir.c-related testcases in the last few months:

  Examples where the documentation was hard to parse or even just wrong:
   * 3aca58045f4f (git-clean.txt: do not claim we will delete files with
                   -n/--dry-run, 2019-09-17)
   * 09487f2cbad3 (clean: avoid removing untracked files in a nested git
                   repository, 2019-09-17)
   * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17)
  Examples where testcases were declared wrong and changed:
   * 09487f2cbad3 (clean: avoid removing untracked files in a nested git
                   repository, 2019-09-17)
   * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17)
   * a2b13367fe55 (Revert "dir.c: make 'git-status --ignored' work within
                   leading directories", 2019-12-10)
  Examples where testcases were clearly inadequate:
   * 502c386ff944 (t7300-clean: demonstrate deleting nested repo with an
                   ignored file breakage, 2019-08-25)
   * 7541cc530239 (t7300: add testcases showing failure to clean specified
                   pathspecs, 2019-09-17)
   * a5e916c7453b (dir: fix off-by-one error in match_pathspec_item,
                   2019-09-17)
   * 404ebceda01c (dir: also check directories for matching pathspecs,
                   2019-09-17)
   * 09487f2cbad3 (clean: avoid removing untracked files in a nested git
                   repository, 2019-09-17)
   * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17)
   * 452efd11fbf6 (t3011: demonstrate directory traversal failures,
                   2019-12-10)
   * b9670c1f5e6b (dir: fix checks on common prefix directory, 2019-12-19)
  Examples where "correct behavior" was unclear to everyone:
    https://lore.kernel.org/git/20190905154735.29784-1-newren@gmail.com/
  Other commits of note:
   * 902b90cf42bc (clean: fix theoretical path corruption, 2019-09-17)

However, on the positive side, it does make the code much faster.  For
the following simple shell loop in an empty repository:

  for depth in $(seq 10 25)
  do
    dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
    rm -rf dir
    mkdir -p $dirs
    >$dirs/untracked-file
    /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
  done

I saw the following timings, in seconds (note that the numbers are a
little noisy from run-to-run, but the trend is very clear with every
run):

    10: 0.03
    11: 0.05
    12: 0.08
    13: 0.19
    14: 0.29
    15: 0.50
    16: 1.05
    17: 2.11
    18: 4.11
    19: 8.60
    20: 17.55
    21: 33.87
    22: 68.71
    23: 140.05
    24: 274.45
    25: 551.15

For the above run, using strace I can look for the number of untracked
directories opened and can verify that it matches the expected
2^($depth+1)-2 (the sum of 2^1 + 2^2 + 2^3 + ... + 2^$depth).

After this fix, with strace I can verify that the number of untracked
directories that are opened drops to just $depth, and the timings all
drop to 0.00.  In fact, it isn't until a depth of 190 nested directories
that it sometimes starts reporting a time of 0.01 seconds and doesn't
consistently report 0.01 seconds until there are 240 nested directories.
The previous code would have taken
  17.55 * 2^220 / (60*60*24*365) = 9.4 * 10^59 YEARS
to have completed the 240 nested directories case.  It's not often
that you get to speed something up by a factor of 3*10^69.

Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

dir: refactor treat_directory to clarify control flow

The logic in treat_directory() is handled by a multi-case
switch statement, but this switch is very asymmetrical, as
the first two cases are simple but the third is more
complicated than the rest of the method. In fact, the third
case includes a "break" statement that leads to the block
of code outside the switch statement. That is the only way
to reach that block, as the switch handles all possible
values from directory_exists_in_index();

Extract the switch statement into a series of "if" statements.
This simplifies the trivial cases, while clarifying how to
reach the "show_other_directories" case. This is particularly
important as the "show_other_directories" case will expand
in a later change.

Helped-by: Elijah Newren <redacted>
Signed-off-by: Derrick Stolee <redacted>
Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

dir: fix confusion based on variable tense

Despite having contributed several fixes in this area, I have for months
(years?) assumed that the "exclude" variable was a directive; this
caused me to think of it as a different mode we operate in and left me
confused as I tried to build up a mental model around why we'd need such
a directive.  I mostly tried to ignore it while focusing on the pieces I
was trying to understand.

Then I finally traced this variable all back to a call to is_excluded(),
meaning it was actually functioning as an adjective.  In particular, it
was a checked property ("Does this path match a rule in .gitignore?"),
rather than a mode passed in from the caller.  Change the variable name
to match the part of speech used by the function called to define it,
which will hopefully make these bits of code slightly clearer to the
next reader.

Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

dir: fix broken comment

Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

dir: consolidate treat_path() and treat_one_path()

Commit 16e2cfa90993 ("read_directory(): further split treat_path()",
2010-01-08) split treat_one_path() out of treat_path(), because
treat_leading_path() would not have access to a dirent but wanted to
re-use as much of treat_path() as possible. Not re-using all of
treat_path() caused other bugs, as noted in commit b9670c1f5e6b ("dir:
fix checks on common prefix directory", 2019-12-19). Finally, in commit
ad6f2157f951 ("dir: restructure in a way to avoid passing around a
struct dirent", 2020-01-16), dirents were removed from treat_path() and
other functions entirely.

Since the only reason for splitting these functions was the lack of a
dirent -- which no longer applies to either function -- and since the
split caused problems in the past resulting in us not using
treat_one_path() separately anymore, just undo the split.

Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

dir: fix simple typo in comment

Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

t3000: add more testcases testing a variety of ls-files issues

This adds seven new ls-files tests. While currently all seven test
pass, my earlier rounds of restructuring dir.c to replace an exponential
algorithm with a linear one passed all the tests in the testsuite but
failed six of these seven new tests. Add these tests to increase our
case coverage.

Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

t7063: more thorough status checking

It turns out the t7063 has some testcases that even without using the
untracked cache cover situations that nothing else in the testsuite
handles. Checking the results of
git status --porcelain
both with and without the untracked cache, and comparing both against
our expected results helped uncover a critical bug in some dir.c
restructuring.

Unfortunately, it's not easy to run status and tell it to ignore the
untracked cache; the only knob we have is core.untrackedCache=false,
which is used to instruct git to *delete* the untracked cache (which
might also ignore the untracked cache when it operates, but that isn't
specified in the docs).

Create a simple helper that will create a clone of the index that is
missing the untracked cache bits, and use it to compare that the results
with the untracked cache match the results we get without the untracked
cache.

Signed-off-by: Elijah Newren <redacted>
Signed-off-by: Junio C Hamano <redacted>

clone: use "quick" lookup while following tags

When cloning with --single-branch, we implement git-fetch's usual
tag-following behavior, grabbing any tag objects that point to objects
we have locally.

When we're a partial clone, though, our has_object_file() check will
actually lazy-fetch each tag. That not only defeats the purpose of
--single-branch, but it does it incredibly slowly, potentially kicking
off a new fetch for each tag. This is even worse for a shallow clone,
which implies --single-branch, because even tags which are supersets of
each other will be fetched individually.

We can fix this by passing OBJECT_INFO_SKIP_FETCH_OBJECT to the call,
which is what git-fetch does in this case.

Likewise, let's include OBJECT_INFO_QUICK, as that's what git-fetch
does. The rationale is discussed in 5827a03545 (fetch: use "quick"
has_sha1_file for tag following, 2016-10-13), but here the tradeoff
would apply even more so because clone is very unlikely to be racing
with another process repacking our newly-created repository.

This may provide a very small speedup even in the non-partial case case,
as we'd avoid calling reprepare_packed_git() for each tag (though in
practice, we'd only have a single packfile, so that reprepare should be
quite cheap).

Signed-off-by: Jeff King <redacted>
Signed-off-by: Junio C Hamano <redacted>

user-manual.conf: don't specify [listingblock]

This is the config file we use when we build the user manual with
AsciiDoc. The comment at the top of this chunk that we're removing says
the following:

"unbreak" docbook-xsl v1.68 for manpages (sic!). v1.69 works with or
without this.

This comes from d19fbc3c17 ("Documentation: add git user's manual",
2007-01-07), where it looks like this conf file in general and this
snippet in particular was copy-pasted from asciidoc.conf.

This chunk is very similar to something we just got rid of for the
manpages, and because this appears to be aimed at v1.68 -- which we no
longer support for the manpages as of a few commits ago --, it's
tempting to get rid of this. That reveals an interesting aspect of
"works with or without this": it turns out it actually works /better/
without!

Dropping this makes us render code snippets and shell listings using
<screen> rather than <literallayout>, just like Asciidoctor does. In
user-manual.pdf, this puts the contents into dimmed-background,
easy-to-distinguish-from-the-surrounding-text boxes, as opposed to
white-background (transparent) boxes.

Signed-off-by: Martin Ågren <redacted>
Signed-off-by: Junio C Hamano <redacted>

t3432: test `--merge' with `rebase.abbreviateCommands = true', too

When fast forwarding, `git --merge' should act the same whether
`rebase.abbreviateCommands' is set or not, but so far it was not the
case. This duplicates the tests ensuring that `--merge' works when fast
forwarding to check if it also works with abbreviated commands.

Signed-off-by: Alban Gruin <redacted>
Signed-off-by: Junio C Hamano <redacted>

sequencer: don't abbreviate a command if it doesn't have a short form

When the sequencer is requested to abbreviate commands, it will replace
those that do not have a short form (eg. `noop') by a comment mark.
`noop' serves no purpose, except when fast-forwarding (ie. by running
`git rebase'). Removing it will break this command when
`rebase.abbreviateCommands' is set to true.

Teach todo_list_to_strbuf() to check if a command has an actual
short form, and to ignore it if not.

Signed-off-by: Alban Gruin <redacted>
Signed-off-by: Junio C Hamano <redacted>

MyFirstObjectWalk: remove unnecessary conditional statement

In the given example, `commit` cannot be `NULL` (because this is the
loop condition: if it was `NULL`, the loop body would not be entered at
all). It took this developer a moment or two to see that this is
therefore dead code.

Let's remove it, to avoid puzzling future readers.

Signed-off-by: Johannes Schindelin <redacted>
Reviewed-by: Emily Shaffer <redacted>
Signed-off-by: Junio C Hamano <redacted>

oidset: stop referring to sha1-array

Ths has been oid_array for some time, though the source only recently
moved from sha1-array.[ch] to oid-array.[ch]. In either case, we should
say "oid-array" here.

Signed-off-by: Jeff King <redacted>
Signed-off-by: Junio C Hamano <redacted>

ref-filter: stop referring to "sha1 array"

A comment refers to the "sha1s in the given sha1 array". But this became
an oid_array along with everywhere else in 910650d2f8 (Rename sha1_array
to oid_array, 2017-03-31). Plus there's an extra line of leftover
editing cruft we can drop.

Signed-off-by: Jeff King <redacted>
Signed-off-by: Junio C Hamano <redacted>

bisect: stop referring to sha1_array

Our join_sha1_array_hex() function long ago switched to using an
oid_array; let's change the name to match.

Signed-off-by: Jeff King <redacted>
Signed-off-by: Junio C Hamano <redacted>

test-tool: rename sha1-array to oid-array

This matches the actual data structure name, as well as the source file
that contains the code we're testing. The test scripts need updating to
use the new name, as well.

Signed-off-by: Jeff King <redacted>
Signed-off-by: Junio C Hamano <redacted>

oid_array: rename source file from sha1-array

We renamed the actual data structure in 910650d2f8 (Rename sha1_array to
oid_array, 2017-03-31), but the file is still called sha1-array. Besides
being slightly confusing, it makes it more annoying to grep for leftover
occurrences of "sha1" in various files, because the header is included
in so many places.

Let's complete the transition by renaming the source and header files
(and fixing up a few comment references).

I kept the "-" in the name, as that seems to be our style; cf.
fc1395f4a4 (sha1_file.c: rename to use dash in file name, 2018-04-10).
We also have oidmap.h and oidset.h without any punctuation, but those
are "struct oidmap" and "struct oidset" in the code. We _could_ make
this "oidarray" to match, but somehow it looks uglier to me because of
the length of "array" (plus it would be a very invasive patch for little
gain).

Signed-off-by: Jeff King <redacted>
Signed-off-by: Junio C Hamano <redacted>

oid_array: use size_t for iteration

The previous commit started using size_t for our allocations. There are
some iterations that use int or unsigned, though. These aren't dangerous
with respect to memory, but they could produce incorrect results.

Signed-off-by: Jeff King <redacted>
Signed-off-by: Junio C Hamano <redacted>

oid_array: use size_t for count and allocation

The oid_array object uses an "int" to store the number of items and the
allocated size. It's rather unlikely for somebody to have more than 2^31
objects in a repository (the sha1's alone would be 40GB!), but if they
do, we'd overflow our alloc variable.

You can reproduce this case with something like:

  git init repo
  cd repo

  # make a pack with 2^24 objects
  perl -e '
    my $nr = 2**24;

    for (my $i = 0; $i < $nr; $i++) {
    print "blob\n";
    print "data 4\n";
    print pack("N", $i);
    }
  ' | git fast-import

  # now make 256 copies of it; most of these objects will be duplicates,
  # but oid_array doesn't de-dup until all values are read and it can
  # sort the result.
  cd .git/objects/pack/
  pack=$(echo *.pack)
  idx=$(echo *.idx)
  for i in $(seq 0 255); do
    # no need to waste disk space
    ln "$pack" "pack-extra-$i.pack"
    ln "$idx" "pack-extra-$i.idx"
  done

  # and now force an oid_array to store all of it
  git cat-file --batch-all-objects --batch-check

which results in:

  fatal: size_t overflow: 32 * 18446744071562067968

So the good news is that st_mult() sees the problem (the large number is
because our int wraps negative, and then that gets cast to a size_t),
doing the job it was meant to: bailing in crazy situations rather than
causing an undersized buffer.

But we should avoid hitting this case at all, and instead limit
ourselves based on what malloc() is willing to give us. We can easily do
that by switching to size_t.

The cat-file process above made it to ~120GB virtual set size before the
integer overflow (our internal hash storage is 32-bytes now in
preparation for sha256, so we'd expect ~128GB total needed, plus
potentially more to copy from one realloc'd block to another)). After
this patch (and about 130GB of RAM+swap), it does eventually read in the
whole set. No test for obvious reasons.

Signed-off-by: Jeff King <redacted>
Signed-off-by: Junio C Hamano <redacted>

docs: add a FAQ

Git is an enormously flexible and powerful piece of software.  However,
it can be intimidating for many users and there are a set of common
questions that users often ask.  While we already have some new user
documentation, it's worth adding a FAQ to address common questions that
users often have.  Even though some of this is addressed elsewhere in
the documentation, experience has shown that it is difficult for users
to find, so a centralized location is helpful.

Add such a FAQ and fill it with some common questions and answers.
While there are few entries now, we can expand it in the future to cover
more things as we find new questions that users have.  Let's also add
section markers so that people answering questions can directly link
users to the proper answer.

The FAQ also addresses common configuration questions that apply not
only to Git as an independent piece of software but also the ecosystem
of CI tools and hosting providers that people use, since these are the
source of common questions.  An attempt has been made to avoid
mentioning any particular provider or tool, but to nevertheless cover
common configurations that apply to a wide variety of such tools.

Note that the long lines for certain questions are required, since
Asciidoctor does not permit broken lines there.

Signed-off-by: brian m. carlson <redacted>
Signed-off-by: Junio C Hamano <redacted>

strbuf: provide function to append whole lines

While the strbuf interface already provides functions to read a line
into it that completely replaces its current contents, we do not have an
interface that allows for appending lines without discarding current
contents.

Add a new function `strbuf_appendwholeline` that reads a line including
its terminating character into a strbuf non-destructively. This is a
preparatory step for git-update-ref(1) reading standard input line-wise
instead of as a block.

Signed-off-by: Patrick Steinhardt <redacted>
Signed-off-by: Junio C Hamano <redacted>

git-update-ref.txt: add missing word

The description for the "verify" command is lacking a single word "is",
which this commit corrects.

Signed-off-by: Patrick Steinhardt <redacted>
Signed-off-by: Junio C Hamano <redacted>