Revision e2093926a098a8ccf0f1d10f6df8dad452cb28d3 authored by Ross Zwisler on 02 June 2017, 21:46:37 UTC, committed by Linus Torvalds on 02 June 2017, 22:07:37 UTC
We currently have two related PMD vs PTE races in the DAX code.  These
can both be easily triggered by having two threads reading and writing
simultaneously to the same private mapping, with the key being that
private mapping reads can be handled with PMDs but private mapping
writes are always handled with PTEs so that we can COW.

Here is the first race:

  CPU 0					CPU 1

  (private mapping write)
  __handle_mm_fault()
    create_huge_pmd() - FALLBACK
    handle_pte_fault()
      passes check for pmd_devmap()

					(private mapping read)
					__handle_mm_fault()
					  create_huge_pmd()
					    dax_iomap_pmd_fault() inserts PMD

      dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
      installed in our page tables at this spot.

Here's the second race:

  CPU 0					CPU 1

  (private mapping read)
  __handle_mm_fault()
    passes check for pmd_none()
    create_huge_pmd()
      dax_iomap_pmd_fault() inserts PMD

  (private mapping write)
  __handle_mm_fault()
    create_huge_pmd() - FALLBACK
					(private mapping read)
					__handle_mm_fault()
					  passes check for pmd_none()
					  create_huge_pmd()

    handle_pte_fault()
      dax_iomap_pte_fault() inserts PTE
					    dax_iomap_pmd_fault() inserts PMD,
					       but we already have a PTE at
					       this spot.

The core of the issue is that while there is isolation between faults to
the same range in the DAX fault handlers via our DAX entry locking,
there is no isolation between faults in the code in mm/memory.c.  This
means for instance that this code in __handle_mm_fault() can run:

	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
		ret = create_huge_pmd(&vmf);

But by the time we actually get to run the fault handler called by
create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
fault has installed a normal PMD here as a parent.  This is the cause of
the 2nd race.  The first race is similar - there is the following check
in handle_pte_fault():

	} else {
		/* See comment in pte_alloc_one_map() */
		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
			return 0;

So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
will bail and retry the fault.  This is correct, but there is nothing
preventing the PMD from being installed after this check but before we
actually get to the DAX PTE fault handlers.

In my testing these races result in the following types of errors:

  BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
  BUG: non-zero nr_ptes on freeing mm: 15

Fix this issue by having the DAX fault handlers verify that it is safe
to continue their fault after they have taken an entry lock to block
other racing faults.

[ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
  Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pawel Lebioda <pawel.lebioda@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Xiong Zhou <xzhou@redhat.com>
Cc: Eryu Guan <eguan@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1 parent d0f0931
faddr2line
#!/bin/bash
#
# Translate stack dump function offsets.
#
# addr2line doesn't work with KASLR addresses.  This works similarly to
# addr2line, but instead takes the 'func+0x123' format as input:
#
#   $ ./scripts/faddr2line ~/k/vmlinux meminfo_proc_show+0x5/0x568
#   meminfo_proc_show+0x5/0x568:
#   meminfo_proc_show at fs/proc/meminfo.c:27
#
# If the address is part of an inlined function, the full inline call chain is
# printed:
#
#   $ ./scripts/faddr2line ~/k/vmlinux native_write_msr+0x6/0x27
#   native_write_msr+0x6/0x27:
#   arch_static_branch at arch/x86/include/asm/msr.h:121
#    (inlined by) static_key_false at include/linux/jump_label.h:125
#    (inlined by) native_write_msr at arch/x86/include/asm/msr.h:125
#
# The function size after the '/' in the input is optional, but recommended.
# It's used to help disambiguate any duplicate symbol names, which can occur
# rarely.  If the size is omitted for a duplicate symbol then it's possible for
# multiple code sites to be printed:
#
#   $ ./scripts/faddr2line ~/k/vmlinux raw_ioctl+0x5
#   raw_ioctl+0x5/0x20:
#   raw_ioctl at drivers/char/raw.c:122
#
#   raw_ioctl+0x5/0xb1:
#   raw_ioctl at net/ipv4/raw.c:876
#
# Multiple addresses can be specified on a single command line:
#
#   $ ./scripts/faddr2line ~/k/vmlinux type_show+0x10/45 free_reserved_area+0x90
#   type_show+0x10/0x2d:
#   type_show at drivers/video/backlight/backlight.c:213
#
#   free_reserved_area+0x90/0x123:
#   free_reserved_area at mm/page_alloc.c:6429 (discriminator 2)


set -o errexit
set -o nounset

usage() {
	echo "usage: faddr2line <object file> <func+offset> <func+offset>..." >&2
	exit 1
}

warn() {
	echo "$1" >&2
}

die() {
	echo "ERROR: $1" >&2
	exit 1
}

# These checks call die(), so they must come after its definition.
command -v awk >/dev/null 2>&1 || die "awk isn't installed"
command -v readelf >/dev/null 2>&1 || die "readelf isn't installed"
command -v addr2line >/dev/null 2>&1 || die "addr2line isn't installed"

# Try to figure out the source directory prefix so we can remove it from the
# addr2line output.  HACK ALERT: This assumes that start_kernel() is in
# init/main.c!  This only works for vmlinux.  Otherwise it falls back to
# printing the absolute path.
find_dir_prefix() {
	local objfile=$1

	local start_kernel_addr=$(readelf -sW $objfile | awk '$8 == "start_kernel" {printf "0x%s", $2}')
	[[ -z $start_kernel_addr ]] && return

	local file_line=$(addr2line -e $objfile $start_kernel_addr)
	[[ -z $file_line ]] && return

	local prefix=${file_line%init/main.c:*}
	if [[ -z $prefix ]] || [[ $prefix = $file_line ]]; then
		return
	fi

	DIR_PREFIX=$prefix
	return 0
}

__faddr2line() {
	local objfile=$1
	local func_addr=$2
	local dir_prefix=$3
	local print_warnings=$4

	local func=${func_addr%+*}
	local offset=${func_addr#*+}
	offset=${offset%/*}
	local size=
	[[ $func_addr =~ "/" ]] && size=${func_addr#*/}

	if [[ -z $func ]] || [[ -z $offset ]] || [[ $func = $func_addr ]]; then
		warn "bad func+offset $func_addr"
		DONE=1
		return
	fi

	# Go through each of the object's symbols which match the func name.
	# In rare cases there might be duplicates.
	while read symbol; do
		local fields=($symbol)
		local sym_base=0x${fields[0]}
		local sym_type=${fields[1]}
		local sym_end=0x${fields[3]}

		# calculate the size
		local sym_size=$(($sym_end - $sym_base))
		if [[ -z $sym_size ]] || [[ $sym_size -le 0 ]]; then
			warn "bad symbol size: base: $sym_base end: $sym_end"
			DONE=1
			return
		fi
		sym_size=0x$(printf %x $sym_size)

		# calculate the address
		local addr=$(($sym_base + $offset))
		if [[ -z $addr ]] || [[ $addr = 0 ]]; then
			warn "bad address: $sym_base + $offset"
			DONE=1
			return
		fi
		addr=0x$(printf %x $addr)

		# weed out non-function symbols
		if [[ $sym_type != t ]] && [[ $sym_type != T ]]; then
			[[ $print_warnings = 1 ]] &&
				echo "skipping $func address at $addr due to non-function symbol of type '$sym_type'"
			continue
		fi

		# if the user provided a size, make sure it matches the symbol's size
		if [[ -n $size ]] && [[ $size -ne $sym_size ]]; then
			[[ $print_warnings = 1 ]] &&
				echo "skipping $func address at $addr due to size mismatch ($size != $sym_size)"
			continue;
		fi

		# make sure the provided offset is within the symbol's range
		if [[ $offset -gt $sym_size ]]; then
			[[ $print_warnings = 1 ]] &&
				echo "skipping $func address at $addr due to offset out of range ($offset > $sym_size)"
			continue
		fi

		# separate multiple entries with a blank line
		[[ $FIRST = 0 ]] && echo
		FIRST=0

		# pass real address to addr2line
		echo "$func+$offset/$sym_size:"
		addr2line -fpie $objfile $addr | sed "s; $dir_prefix\(\./\)*; ;"
		DONE=1

	done < <(nm -n $objfile | awk -v fn=$func '$3 == fn { found=1; line=$0; start=$1; next } found == 1 { found=0; print line, $1 }')
}

[[ $# -lt 2 ]] && usage

objfile=$1
[[ ! -f $objfile ]] && die "can't find objfile $objfile"
shift

DIR_PREFIX=supercalifragilisticexpialidocious
find_dir_prefix $objfile

FIRST=1
while [[ $# -gt 0 ]]; do
	func_addr=$1
	shift

	# print any matches found
	DONE=0
	__faddr2line $objfile $func_addr $DIR_PREFIX 0

	# if no match was found, print warnings
	if [[ $DONE = 0 ]]; then
		__faddr2line $objfile $func_addr $DIR_PREFIX 1
		warn "no match for $func_addr"
	fi
done
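
The lookup performed by __faddr2line can be tried in isolation.  The sketch
below is only an illustration, not part of the script: the helper names
(parse_func_addr, sample_nm, lookup_symbol, resolve_addr) and the symbol
addresses are made up, and canned text stands in for real 'nm -n' output.  It
splits a 'func+offset/size' argument with the same parameter expansions, pairs
the matching symbol with the address of the symbol that follows it, and
derives the absolute address and symbol size from that pair.

```shell
# Hypothetical helper: split 'func+offset[/size]' into its three parts.
parse_func_addr() {
	local func_addr=$1
	local func=${func_addr%+*}      # text before the '+'
	local offset=${func_addr#*+}    # text after the '+'
	offset=${offset%/*}             # drop the optional '/size' suffix
	local size=none
	case $func_addr in
	*/*) size=${func_addr#*/} ;;
	esac
	echo "$func $offset $size"
}

# Canned 'nm -n'-style lines, sorted by address.  All values are made up.
sample_nm() {
	cat <<'EOF'
0000000000001000 T meminfo_proc_show
0000000000001568 T next_symbol
0000000000002000 t raw_ioctl
0000000000002020 T after_raw_ioctl
EOF
}

# Print the matching symbol's line followed by the next symbol's address,
# giving the fields: base, type, name, end.
lookup_symbol() {
	sample_nm | awk -v fn="$1" \
		'$3 == fn { found=1; line=$0; next } found == 1 { found=0; print line, $1 }'
}

# Compute "address size" from hex base/end (no 0x prefix) and an offset.
resolve_addr() {
	local sym_base=0x$1
	local sym_end=0x$2
	local offset=$3
	printf '0x%x 0x%x\n' $(($sym_base + $offset)) $(($sym_end - $sym_base))
}

parse_func_addr meminfo_proc_show+0x5/0x568
lookup_symbol meminfo_proc_show
resolve_addr 0000000000001000 0000000000001568 0x5
```

The size is derived from the distance to the next symbol rather than read
from the symbol table, which is why the symbol list has to be sorted by
address first.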