Revision - b16a827 - bloom/diff: properly short-circuit on max_changes

Revision b16a8277644e4b1d21c08d97de757105039dc7ae authored by Derrick Stolee on 16 September 2020, 18:07:52 UTC, committed by Junio C Hamano on 17 September 2020, 16:31:25 UTC

bloom/diff: properly short-circuit on max_changes

Commit e3696980 (diff: halt tree-diff early after max_changes,
2020-03-30) intended to create a mechanism to short-circuit a diff
calculation after a certain number of paths were modified. By
incrementing a "num_changes" counter throughout the recursive
ll_diff_tree_paths(), this was supposed to match the number of changes
that would be written into the changed-path Bloom filters.
Unfortunately, this was not implemented correctly and instead misses
simple cases like file modifications. This then does not stop very
large changed-path filters from being written (unless they add or remove
many files).

To start, change the implementation in ll_diff_tree_paths() to instead
use the global diff_queue_diff struct's 'nr' member as the count. This
is a way to simplify the logic instead of making more mistakes in the
complicated diff code.

This has a drawback: the diff_queue_diff struct only lists the paths
corresponding to blob changes, not their leading directories. Thus,
get_or_compute_bloom_filter() needs an additional check to see if the
hashmap with the leading directories becomes too large.

One reason why this was not caught by test cases was that the test in
t4216-log-bloom.sh that was supposed to check this "too many changes"
condition only checked this on the initial commit of a repository. The
old logic counted these values correctly. Update this test in a few
ways:

1. Use GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS to reduce the limit,
   allowing smaller commits to engage with this logic.

2. Create several interesting cases of edits, adds, removes, and mode
   changes (in the second commit). By testing both sides of the
   inequality with the *_MAX_CHANGED_PATHS variable, we can see that
   the count is exactly correct, so none of these changes are missed
   or over-counted.

3. Use the trace2 data value filter_found_large to verify that these
   commits are on the correct side of the limit.

Another way to verify the behavior is correct is through performance
tests. By testing on my local copies of the Git repository and the Linux
kernel repository, I could measure the effect of these short-circuits
when computing a fresh commit-graph file with changed-path Bloom filters
using the command

  GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=N time \
    git commit-graph write --reachable --changed-paths

and reporting the wall time and resulting commit-graph size.

For Git, the results are

|        |      N=1       |       N=10     |      N=512     |
|--------|----------------|----------------|----------------|
| HEAD~1 | 10.90s  9.18MB | 11.11s  9.34MB | 11.31s  9.35MB |
| HEAD   |  9.21s  8.62MB | 11.11s  9.29MB | 11.29s  9.34MB |

For Linux, the results are

|        |       N=1      |     N=20      |     N=512     |
|--------|----------------|---------------|---------------|
| HEAD~1 | 61.28s  64.3MB | 76.9s  72.6MB | 77.6s  72.6MB |
| HEAD   | 49.44s  56.3MB | 68.7s  65.9MB | 69.2s  65.9MB |

Naturally, the improvement becomes much less as the limit grows, as
fewer commits satisfy the short-circuit.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

1 parent 9a7a9ed

Files
Changes

Permalinks

trace.h

#ifndef TRACE_H
#define TRACE_H

#include "git-compat-util.h"
#include "strbuf.h"

/**
 * The trace API can be used to print debug messages to stderr or a file. Trace
 * code is inactive unless explicitly enabled by setting `GIT_TRACE*` environment
 * variables.
 *
 * The trace implementation automatically adds `timestamp file:line ... \n` to
 * all trace messages. E.g.:
 *
 * ------------
 * 23:59:59.123456 git.c:312               trace: built-in: git 'foo'
 * 00:00:00.000001 builtin/foo.c:99        foo: some message
 * ------------
 *
 * Bugs & Caveats
 * --------------
 *
 * GIT_TRACE_* environment variables can be used to tell Git to show
 * trace output to its standard error stream. Git can often spawn a pager
 * internally to run its subcommand and send its standard output and
 * standard error to it.
 *
 * Because GIT_TRACE_PERFORMANCE trace is generated only at the very end
 * of the program with atexit(), which happens after the pager exits, it
 * would not work well if you send its log to the standard error output
 * and let Git spawn the pager at the same time.
 *
 * As a work around, you can for example use '--no-pager', or set
 * GIT_TRACE_PERFORMANCE to another file descriptor which is redirected
 * to stderr, or set GIT_TRACE_PERFORMANCE to a file specified by its
 * absolute path.
 *
 * For example instead of the following command which by default may not
 * print any performance information:
 *
 * ------------
 * GIT_TRACE_PERFORMANCE=2 git log -1
 * ------------
 *
 * you may want to use:
 *
 * ------------
 * GIT_TRACE_PERFORMANCE=2 git --no-pager log -1
 * ------------
 *
 * or:
 *
 * ------------
 * GIT_TRACE_PERFORMANCE=3 3>&2 git log -1
 * ------------
 *
 * or:
 *
 * ------------
 * GIT_TRACE_PERFORMANCE=/path/to/log/file git log -1
 * ------------
 *
 */

/**
 * Defines a trace key (or category). The default (for API functions that
 * don't take a key) is `GIT_TRACE`.
 *
 * E.g. to define a trace key controlled by environment variable `GIT_TRACE_FOO`:
 *
 * ------------
 * static struct trace_key trace_foo = TRACE_KEY_INIT(FOO);
 *
 * static void trace_print_foo(const char *message)
 * {
 * 	trace_printf_key(&trace_foo, "%s", message);
 * }
 * ------------
 *
 * Note: don't use `const` as the trace implementation stores internal state in
 * the `trace_key` structure.
 */
struct trace_key {
	const char * const key;
	int fd;
	unsigned int initialized : 1;
	unsigned int  need_close : 1;
};

extern struct trace_key trace_default_key;

#define TRACE_KEY_INIT(name) { "GIT_TRACE_" #name, 0, 0, 0 }
extern struct trace_key trace_perf_key;
extern struct trace_key trace_setup_key;

void trace_repo_setup(const char *prefix);

/**
 * Checks whether the trace key is enabled. Used to prevent expensive
 * string formatting before calling one of the printing APIs.
 */
int trace_want(struct trace_key *key);

/**
 * Enables or disables tracing for the specified key, as if the environment
 * variable was set to the given value.
 */
void trace_override_envvar(struct trace_key *key, const char *value);

/**
 * Disables tracing for the specified key, even if the environment variable
 * was set.
 */
void trace_disable(struct trace_key *key);

/**
 * Returns nanoseconds since the epoch (01/01/1970), typically used
 * for performance measurements.
 * Currently there are high precision timer implementations for Linux (using
 * `clock_gettime(CLOCK_MONOTONIC)`) and Windows (`QueryPerformanceCounter`).
 * Other platforms use `gettimeofday` as time source.
 */
uint64_t getnanotime(void);

void trace_command_performance(const char **argv);
void trace_verbatim(struct trace_key *key, const void *buf, unsigned len);
uint64_t trace_performance_enter(void);

#ifndef HAVE_VARIADIC_MACROS

/**
 * Prints a formatted message, similar to printf.
 */
__attribute__((format (printf, 1, 2)))
void trace_printf(const char *format, ...);

__attribute__((format (printf, 2, 3)))
void trace_printf_key(struct trace_key *key, const char *format, ...);

/**
 * Prints a formatted message, followed by a quoted list of arguments.
 */
__attribute__((format (printf, 2, 3)))
void trace_argv_printf(const char **argv, const char *format, ...);

/**
 * Prints the strbuf, without additional formatting (i.e. doesn't
 * choke on `%` or even `\0`).
 */
void trace_strbuf(struct trace_key *key, const struct strbuf *data);

/**
 * Prints elapsed time (in nanoseconds) if GIT_TRACE_PERFORMANCE is enabled.
 *
 * Example:
 * ------------
 * uint64_t t = 0;
 * for (;;) {
 * 	// ignore
 * t -= getnanotime();
 * // code section to measure
 * t += getnanotime();
 * // ignore
 * }
 * trace_performance(t, "frotz");
 * ------------
 */
__attribute__((format (printf, 2, 3)))
void trace_performance(uint64_t nanos, const char *format, ...);

/**
 * Prints elapsed time since 'start' if GIT_TRACE_PERFORMANCE is enabled.
 *
 * Example:
 * ------------
 * uint64_t start = getnanotime();
 * // code section to measure
 * trace_performance_since(start, "foobar");
 * ------------
 */
__attribute__((format (printf, 2, 3)))
void trace_performance_since(uint64_t start, const char *format, ...);

__attribute__((format (printf, 1, 2)))
void trace_performance_leave(const char *format, ...);

#else

/*
 * Macros to add file:line - see above for C-style declarations of how these
 * should be used.
 */

/*
 * TRACE_CONTEXT may be set to __FUNCTION__ if the compiler supports it. The
 * default is __FILE__, as it is consistent with assert(), and static function
 * names are not necessarily unique.
 *
 * __FILE__ ":" __FUNCTION__ doesn't work with GNUC, as __FILE__ is supplied
 * by the preprocessor as a string literal, and __FUNCTION__ is filled in by
 * the compiler as a string constant.
 */
#ifndef TRACE_CONTEXT
# define TRACE_CONTEXT __FILE__
#endif

/*
 * Note: with C99 variadic macros, __VA_ARGS__ must include the last fixed
 * parameter ('format' in this case). Otherwise, a call without variable
 * arguments will have a surplus ','. E.g.:
 *
 *  #define foo(format, ...) bar(format, __VA_ARGS__)
 *  foo("test");
 *
 * will expand to
 *
 *  bar("test",);
 *
 * which is invalid (note the ',)'). With GNUC, '##__VA_ARGS__' drops the
 * comma, but this is non-standard.
 */

#define trace_printf_key(key, ...)					    \
	do {								    \
		if (trace_pass_fl(key))					    \
			trace_printf_key_fl(TRACE_CONTEXT, __LINE__, key,   \
					    __VA_ARGS__);		    \
	} while (0)

#define trace_printf(...) trace_printf_key(&trace_default_key, __VA_ARGS__)

#define trace_argv_printf(argv, ...)					    \
	do {								    \
		if (trace_pass_fl(&trace_default_key))			    \
			trace_argv_printf_fl(TRACE_CONTEXT, __LINE__,	    \
					    argv, __VA_ARGS__);		    \
	} while (0)

#define trace_strbuf(key, data)						    \
	do {								    \
		if (trace_pass_fl(key))					    \
			trace_strbuf_fl(TRACE_CONTEXT, __LINE__, key, data);\
	} while (0)

#define trace_performance(nanos, ...)					    \
	do {								    \
		if (trace_pass_fl(&trace_perf_key))			    \
			trace_performance_fl(TRACE_CONTEXT, __LINE__, nanos,\
					     __VA_ARGS__);		    \
	} while (0)

#define trace_performance_since(start, ...)				    \
	do {								    \
		if (trace_pass_fl(&trace_perf_key))			    \
			trace_performance_fl(TRACE_CONTEXT, __LINE__,       \
					     getnanotime() - (start),	    \
					     __VA_ARGS__);		    \
	} while (0)

#define trace_performance_leave(...)					    \
	do {								    \
		if (trace_pass_fl(&trace_perf_key))			    \
			trace_performance_leave_fl(TRACE_CONTEXT, __LINE__, \
						   getnanotime(),	    \
						   __VA_ARGS__);	    \
	} while (0)

/* backend functions, use non-*fl macros instead */
__attribute__((format (printf, 4, 5)))
void trace_printf_key_fl(const char *file, int line, struct trace_key *key,
			 const char *format, ...);
__attribute__((format (printf, 4, 5)))
void trace_argv_printf_fl(const char *file, int line, const char **argv,
			  const char *format, ...);
void trace_strbuf_fl(const char *file, int line, struct trace_key *key,
		     const struct strbuf *data);
__attribute__((format (printf, 4, 5)))
void trace_performance_fl(const char *file, int line,
			  uint64_t nanos, const char *fmt, ...);
__attribute__((format (printf, 4, 5)))
void trace_performance_leave_fl(const char *file, int line,
				uint64_t nanos, const char *fmt, ...);
static inline int trace_pass_fl(struct trace_key *key)
{
	return key->fd || !key->initialized;
}

#endif /* HAVE_VARIADIC_MACROS */

#endif /* TRACE_H */

Showing with 0 additions and 0 deletions (0 / 0 diffs computed)

Computing file changes ...