Revision - 4f22b10 - do not stream large files to pack when filters are in use

Revision 4f22b1015d4203ccdf2b66f27ee5946504342ace authored by Jeff King on 24 February 2012, 22:10:17 UTC, committed by Junio C Hamano on 24 February 2012, 22:18:20 UTC

do not stream large files to pack when filters are in use

Because git's object format requires us to specify the
number of bytes in the object in its header, we must know
the size before streaming a blob into the object database.
This is not a problem when adding a regular file, as we can
get the size from stat(). However, when filters are in use
(such as autocrlf, or the ident, filter, or eol
gitattributes), we have no idea what the ultimate size will
be.
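
To illustrate why the size has to be known up front: a blob's object name is the SHA-1 of a header of the form "blob <size>" plus a NUL byte, followed by the content. The sketch below assumes a POSIX shell, a SHA-1 repository, and the coreutils `sha1sum` utility; `blob_id` is a hypothetical helper name, not part of git.

```shell
#!/bin/sh
# blob_id is a hypothetical helper (not part of git).
# It prints the SHA-1 object name that a file's raw content would
# get as a blob, assuming no filters or conversions apply.
blob_id () {
	# The size goes into the header, so it must be known before
	# any content is hashed or written.
	size=$(wc -c <"$1" | tr -d ' ')
	{ printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}
```

For unfiltered content the result should match `git hash-object <file>`. Once a filter rewrites the content, the size from stat() no longer describes what is stored, which is the problem this commit message describes.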

The current code just punts on the whole issue and ignores
filter configuration entirely for files larger than
core.bigfilethreshold. This can generate confusing results
if you use filters for large binary files, as the filter
will suddenly stop working as the file goes over a certain
size.  Rather than try to handle unknown input sizes with
streaming, this patch just turns off the streaming
optimization when filters are in use.

This has a slight performance regression in a very specific
case: if you have autocrlf on, but no gitattributes, a large
binary file will avoid the streaming code path because we
don't know beforehand whether it will need conversion or
not. But if you are handling large binary files, you should
be marking them as such via attributes (or at least not
using autocrlf, and instead marking your text files as
such). And the flip side is that if you have a large
_non_-binary file, there is a correctness improvement;
before we did not apply the conversion at all.
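
The advice above about marking files via attributes could look like the following minimal sketch. The patterns `*.iso` and `*.txt` and the helper name `write_attributes` are hypothetical and chosen only for illustration; `binary` is git's built-in macro attribute and `text` is the standard attribute for line-ending conversion.

```shell
#!/bin/sh
# write_attributes is a hypothetical helper (not part of git).
# $1 - top-level directory of a work tree
write_attributes () {
	cat >"$1/.gitattributes" <<'EOF'
# Large images are binary: no end-of-line conversion should apply.
*.iso binary
# Text files are marked explicitly rather than relying on autocrlf.
*.txt text
EOF
}
```

With paths marked this way, git should not have to guess whether a large file needs conversion, so the performance regression described above is expected to be avoided for those paths.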

The first half of the new t1051 script covers these failures
on input. The second half tests the matching output code
paths. These already work correctly, and do not need any
adjustment.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

1 parent 4c3b57b


git-merge-one-file.sh

#!/bin/sh
#
# Copyright (c) Linus Torvalds, 2005
#
# This is the git per-file merge script, called with
#
#   $1 - original file SHA1 (or empty)
#   $2 - file in branch1 SHA1 (or empty)
#   $3 - file in branch2 SHA1 (or empty)
#   $4 - pathname in repository
#   $5 - original file mode (or empty)
#   $6 - file in branch1 mode (or empty)
#   $7 - file in branch2 mode (or empty)
#
# Handle some trivial cases. The _really_ trivial cases have
# been handled already by git read-tree, but that one doesn't
# do any merges that might change the tree layout.

USAGE='<orig blob> <our blob> <their blob> <path>'
USAGE="$USAGE <orig mode> <our mode> <their mode>"
LONG_USAGE="Usage: git merge-one-file $USAGE

Blob ids and modes should be empty for missing files."

SUBDIRECTORY_OK=Yes
. git-sh-setup
cd_to_toplevel
require_work_tree

if ! test "$#" -eq 7
then
	echo "$LONG_USAGE"
	exit 1
fi

case "${1:-.}${2:-.}${3:-.}" in
#
# Deleted in both or deleted in one and unchanged in the other
#
"$1.." | "$1.$1" | "$1$1.")
	if [ "$2" ]; then
		echo "Removing $4"
	else
		# read-tree checked that index matches HEAD already,
		# so we know we do not have this path tracked.
		# there may be an unrelated working tree file here,
		# which we should just leave unmolested.  Make sure
		# we do not have it in the index, though.
		exec git update-index --remove -- "$4"
	fi
	if test -f "$4"; then
		rm -f -- "$4" &&
		rmdir -p "$(expr "z$4" : 'z\(.*\)/')" 2>/dev/null || :
	fi &&
		exec git update-index --remove -- "$4"
	;;

#
# Added in one.
#
".$2.")
	# the other side did not add and we added so there is nothing
	# to be done, except making the path merged.
	exec git update-index --add --cacheinfo "$6" "$2" "$4"
	;;
"..$3")
	echo "Adding $4"
	if test -f "$4"
	then
		echo "ERROR: untracked $4 is overwritten by the merge."
		exit 1
	fi
	git update-index --add --cacheinfo "$7" "$3" "$4" &&
		exec git checkout-index -u -f -- "$4"
	;;

#
# Added in both, identically (check for same permissions).
#
".$3$2")
	if [ "$6" != "$7" ]; then
		echo "ERROR: File $4 added identically in both branches,"
		echo "ERROR: but permissions conflict $6->$7."
		exit 1
	fi
	echo "Adding $4"
	git update-index --add --cacheinfo "$6" "$2" "$4" &&
		exec git checkout-index -u -f -- "$4"
	;;

#
# Modified in both, but differently.
#
"$1$2$3" | ".$2$3")

	case ",$6,$7," in
	*,120000,*)
		echo "ERROR: $4: Not merging symbolic link changes."
		exit 1
		;;
	*,160000,*)
		echo "ERROR: $4: Not merging conflicting submodule changes."
		exit 1
		;;
	esac

	src2=`git-unpack-file $3`
	case "$1" in
	'')
		echo "Added $4 in both, but differently."
		# This extracts OUR file in $orig, and uses git apply to
		# remove lines that are unique to ours.
		orig=`git-unpack-file $2`
		sz0=`wc -c <"$orig"`
		@@DIFF@@ -u -La/$orig -Lb/$orig $orig $src2 | git apply --no-add
		sz1=`wc -c <"$orig"`

		# If we do not have enough common material, it is not
		# worth trying two-file merge using common subsections.
		expr $sz0 \< $sz1 \* 2 >/dev/null || : >$orig
		;;
	*)
		echo "Auto-merging $4"
		orig=`git-unpack-file $1`
		;;
	esac

	# Be careful with funny filenames such as "-L" in "$4", which
	# would confuse "merge" greatly.
	src1=`git-unpack-file $2`
	git merge-file "$src1" "$orig" "$src2"
	ret=$?
	msg=
	if [ $ret -ne 0 ]; then
		msg='content conflict'
	fi

	# Create the working tree file, using "our tree" version from the
	# index, and then store the result of the merge.
	git checkout-index -f --stage=2 -- "$4" && cat "$src1" >"$4" || exit 1
	rm -f -- "$orig" "$src1" "$src2"

	if [ "$6" != "$7" ]; then
		if [ -n "$msg" ]; then
			msg="$msg, "
		fi
		msg="${msg}permissions conflict: $5->$6,$7"
		ret=1
	fi
	if [ "$1" = '' ]; then
		ret=1
	fi

	if [ $ret -ne 0 ]; then
		echo "ERROR: $msg in $4"
		exit 1
	fi
	exec git update-index -- "$4"
	;;

*)
	echo "ERROR: $4: Not handling case $1 -> $2 -> $3"
	;;
esac
exit 1
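
The dispatch on "${1:-.}${2:-.}${3:-.}" in the script above can be tried in isolation. The following is a sketch that copies only the case patterns and replaces each branch with a label; the function name `classify` and the labels are hypothetical, and no git commands are run.

```shell
#!/bin/sh
# classify is a hypothetical helper that mirrors the case patterns
# of git-merge-one-file.sh without touching the index or work tree.
#   $1 - original blob id (or empty)
#   $2 - blob id in branch1 (or empty)
#   $3 - blob id in branch2 (or empty)
classify () {
	# An empty id becomes "." so the three fields stay distinguishable.
	case "${1:-.}${2:-.}${3:-.}" in
	"$1.." | "$1.$1" | "$1$1.")
		echo "deleted" ;;
	".$2.")
		echo "added in branch1" ;;
	"..$3")
		echo "added in branch2" ;;
	".$3$2")
		echo "added in both, identically" ;;
	"$1$2$3" | ".$2$3")
		echo "modified in both" ;;
	*)
		echo "not handled" ;;
	esac
}
```

For example, `classify aaa "" aaa` takes the first branch (deleted in one side, unchanged in the other), while `classify aaa bbb ""` falls through to the final branch, matching the "Not handling case" error in the script. The real script is normally not called by hand; it is typically run through `git merge-index -o git-merge-one-file -a`.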
