Revision - e209392 - dax: fix race between colliding PMD & PTE entries

Revision e2093926a098a8ccf0f1d10f6df8dad452cb28d3 authored by Ross Zwisler on 02 June 2017, 21:46:37 UTC, committed by Linus Torvalds on 02 June 2017, 22:07:37 UTC

dax: fix race between colliding PMD & PTE entries

We currently have two related PMD vs PTE races in the DAX code.  These
can both be easily triggered by having two threads reading and writing
simultaneously to the same private mapping, with the key being that
private mapping reads can be handled with PMDs but private mapping
writes are always handled with PTEs so that we can COW.

Here is the first race:

  CPU 0					CPU 1

  (private mapping write)
  __handle_mm_fault()
    create_huge_pmd() - FALLBACK
    handle_pte_fault()
      passes check for pmd_devmap()

					(private mapping read)
					__handle_mm_fault()
					  create_huge_pmd()
					    dax_iomap_pmd_fault() inserts PMD

      dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
      installed in our page tables at this spot.

Here's the second race:

  CPU 0					CPU 1

  (private mapping read)
  __handle_mm_fault()
    passes check for pmd_none()
    create_huge_pmd()
      dax_iomap_pmd_fault() inserts PMD

  (private mapping write)
  __handle_mm_fault()
    create_huge_pmd() - FALLBACK
					(private mapping read)
					__handle_mm_fault()
					  passes check for pmd_none()
					  create_huge_pmd()

    handle_pte_fault()
      dax_iomap_pte_fault() inserts PTE
					    dax_iomap_pmd_fault() inserts PMD,
					       but we already have a PTE at
					       this spot.

The core of the issue is that while there is isolation between faults to
the same range in the DAX fault handlers via our DAX entry locking,
there is no isolation between faults in the code in mm/memory.c.  This
means for instance that this code in __handle_mm_fault() can run:

	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
		ret = create_huge_pmd(&vmf);

But by the time we actually get to run the fault handler called by
create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
fault has installed a normal PMD here as a parent.  This is the cause of
the second race.  The first race is similar: there is the following check
in handle_pte_fault():

	} else {
		/* See comment in pte_alloc_one_map() */
		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
			return 0;

So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
will bail and retry the fault.  This is correct, but there is nothing
preventing the PMD from being installed after this check but before we
actually get to the DAX PTE fault handlers.

In my testing these races result in the following types of errors:

  BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
  BUG: non-zero nr_ptes on freeing mm: 15

Fix this issue by having the DAX fault handlers verify that it is safe
to continue their fault after they have taken an entry lock to block
other racing faults.

[ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
  Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pawel Lebioda <pawel.lebioda@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Xiong Zhou <xzhou@redhat.com>
Cc: Eryu Guan <eguan@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1 parent d0f0931

config

#!/bin/bash
# Manipulate options in a .config file from the command line

myname=${0##*/}

# If no prefix forced, use the default CONFIG_
CONFIG_="${CONFIG_-CONFIG_}"

usage() {
	cat >&2 <<EOL
Manipulate options in a .config file from the command line.
Usage:
$myname options command ...
commands:
	--enable|-e option   Enable option
	--disable|-d option  Disable option
	--module|-m option   Turn option into a module
	--set-str option string
	                     Set option to "string"
	--set-val option value
	                     Set option to value
	--undefine|-u option Undefine option
	--state|-s option    Print state of option (n,y,m,undef)

	--enable-after|-E beforeopt option
                             Enable option directly after other option
	--disable-after|-D beforeopt option
                             Disable option directly after other option
	--module-after|-M beforeopt option
                             Turn option into module directly after other option

	commands can be repeated multiple times

options:
	--file config-file   .config file to change (default .config)
	--keep-case|-k       Keep next symbols' case (don't upper-case them)

$myname doesn't check the validity of the .config file. This is done at next
make time.

By default, $myname will upper-case the given symbol. Use --keep-case to keep
the case of all following symbols unchanged.

$myname uses 'CONFIG_' as the default symbol prefix. Set the environment
variable CONFIG_ to the prefix to use. E.g.: CONFIG_="FOO_" $myname ...
EOL
	exit 1
}

checkarg() {
	ARG="$1"
	if [ "$ARG" = "" ] ; then
		usage
	fi
	case "$ARG" in
	${CONFIG_}*)
		ARG="${ARG/${CONFIG_}/}"
		;;
	esac
	if [ "$MUNGE_CASE" = "yes" ] ; then
		ARG="$(echo "$ARG" | tr 'a-z' 'A-Z')"
	fi
}

txt_append() {
	local anchor="$1"
	local insert="$2"
	local infile="$3"
	local tmpfile="$infile.swp"

	# sed append cmd: 'a\' + newline + text + newline
	cmd="$(printf "a\\%b$insert" "\n")"

	sed -e "/$anchor/$cmd" "$infile" >"$tmpfile"
	# replace original file with the edited one
	mv "$tmpfile" "$infile"
}

txt_subst() {
	local before="$1"
	local after="$2"
	local infile="$3"
	local tmpfile="$infile.swp"

	sed -e "s:$before:$after:" "$infile" >"$tmpfile"
	# replace original file with the edited one
	mv "$tmpfile" "$infile"
}

txt_delete() {
	local text="$1"
	local infile="$2"
	local tmpfile="$infile.swp"

	sed -e "/$text/d" "$infile" >"$tmpfile"
	# replace original file with the edited one
	mv "$tmpfile" "$infile"
}

set_var() {
	local name=$1 new=$2 before=$3

	name_re="^($name=|# $name is not set)"
	before_re="^($before=|# $before is not set)"
	if test -n "$before" && grep -Eq "$before_re" "$FN"; then
		txt_append "^$before=" "$new" "$FN"
		txt_append "^# $before is not set" "$new" "$FN"
	elif grep -Eq "$name_re" "$FN"; then
		txt_subst "^$name=.*" "$new" "$FN"
		txt_subst "^# $name is not set" "$new" "$FN"
	else
		echo "$new" >>"$FN"
	fi
}

undef_var() {
	local name=$1

	txt_delete "^$name=" "$FN"
	txt_delete "^# $name is not set" "$FN"
}

if [ "$1" = "--file" ]; then
	FN="$2"
	if [ "$FN" = "" ] ; then
		usage
	fi
	shift 2
else
	FN=.config
fi

if [ "$1" = "" ] ; then
	usage
fi

MUNGE_CASE=yes
while [ "$1" != "" ] ; do
	CMD="$1"
	shift
	case "$CMD" in
	--keep-case|-k)
		MUNGE_CASE=no
		continue
		;;
	--refresh)
		;;
	--*-after|-E|-D|-M)
		checkarg "$1"
		A=$ARG
		checkarg "$2"
		B=$ARG
		shift 2
		;;
	-*)
		checkarg "$1"
		shift
		;;
	esac
	case "$CMD" in
	--enable|-e)
		set_var "${CONFIG_}$ARG" "${CONFIG_}$ARG=y"
		;;

	--disable|-d)
		set_var "${CONFIG_}$ARG" "# ${CONFIG_}$ARG is not set"
		;;

	--module|-m)
		set_var "${CONFIG_}$ARG" "${CONFIG_}$ARG=m"
		;;

	--set-str)
		# sed swallows one level of escaping, so we need double-escaping
		set_var "${CONFIG_}$ARG" "${CONFIG_}$ARG=\"${1//\"/\\\\\"}\""
		shift
		;;

	--set-val)
		set_var "${CONFIG_}$ARG" "${CONFIG_}$ARG=$1"
		shift
		;;
	--undefine|-u)
		undef_var "${CONFIG_}$ARG"
		;;

	--state|-s)
		if grep -q "# ${CONFIG_}$ARG is not set" "$FN" ; then
			echo n
		else
			V="$(grep "^${CONFIG_}$ARG=" "$FN")"
			if [ $? != 0 ] ; then
				echo undef
			else
				V="${V/#${CONFIG_}$ARG=/}"
				V="${V/#\"/}"
				V="${V/%\"/}"
				V="${V//\\\"/\"}"
				echo "${V}"
			fi
		fi
		;;

	--enable-after|-E)
		set_var "${CONFIG_}$B" "${CONFIG_}$B=y" "${CONFIG_}$A"
		;;

	--disable-after|-D)
		set_var "${CONFIG_}$B" "# ${CONFIG_}$B is not set" "${CONFIG_}$A"
		;;

	--module-after|-M)
		set_var "${CONFIG_}$B" "${CONFIG_}$B=m" "${CONFIG_}$A"
		;;

	# undocumented because it ignores --file (fixme)
	--refresh)
		yes "" | make oldconfig
		;;

	*)
		usage
		;;
	esac
done
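
The replace-or-append logic that set_var implements above can be tried in isolation. The following is a minimal sketch, not part of the script: the names set_option and demo_cfg are hypothetical, the -after variants are left out, and it works on a throwaway temporary file rather than a real .config. Like txt_subst, it uses ':' as the sed delimiter, so values containing ':' would need escaping.

```shell
#!/bin/bash
# Illustrative sketch only: set_option and demo_cfg are hypothetical names,
# not part of the config script.

# Replace either form of an option line ("NAME=..." or "# NAME is not set")
# with the new line, or append the new line when the option is absent.
set_option() {
	local name="$1" new="$2" file="$3"
	if grep -Eq "^($name=|# $name is not set)" "$file"; then
		sed -e "s:^$name=.*:$new:" \
		    -e "s:^# $name is not set:$new:" "$file" >"$file.tmp"
		# replace original file with the edited one
		mv "$file.tmp" "$file"
	else
		echo "$new" >>"$file"
	fi
}

demo_cfg="$(mktemp)"
printf '%s\n' 'CONFIG_FOO=y' '# CONFIG_BAR is not set' >"$demo_cfg"

set_option CONFIG_BAR 'CONFIG_BAR=m' "$demo_cfg"            # disabled -> module
set_option CONFIG_BAZ 'CONFIG_BAZ="hello"' "$demo_cfg"      # absent -> appended
set_option CONFIG_FOO '# CONFIG_FOO is not set' "$demo_cfg" # enabled -> disabled

cat "$demo_cfg"
rm -f "$demo_cfg"
```

Run as written, this prints three lines: "# CONFIG_FOO is not set", "CONFIG_BAR=m", and "CONFIG_BAZ="hello"". As with the real script, nothing here checks that the result is a valid configuration. That is left to the next make.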
