Revision e2093926a098a8ccf0f1d10f6df8dad452cb28d3 authored by Ross Zwisler on 02 June 2017, 21:46:37 UTC, committed by Linus Torvalds on 02 June 2017, 22:07:37 UTC
We currently have two related PMD vs PTE races in the DAX code.  These
can both be easily triggered by having two threads reading and writing
simultaneously to the same private mapping, with the key being that
private mapping reads can be handled with PMDs but private mapping
writes are always handled with PTEs so that we can COW.

Here is the first race:

  CPU 0					CPU 1

  (private mapping write)
  __handle_mm_fault()
    create_huge_pmd() - FALLBACK
    handle_pte_fault()
      passes check for pmd_devmap()

					(private mapping read)
					__handle_mm_fault()
					  create_huge_pmd()
					    dax_iomap_pmd_fault() inserts PMD

      dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
      			  installed in our page tables at this spot.

Here's the second race:

  CPU 0					CPU 1

  (private mapping read)
  __handle_mm_fault()
    passes check for pmd_none()
    create_huge_pmd()
      dax_iomap_pmd_fault() inserts PMD

  (private mapping write)
  __handle_mm_fault()
    create_huge_pmd() - FALLBACK
					(private mapping read)
					__handle_mm_fault()
					  passes check for pmd_none()
					  create_huge_pmd()

    handle_pte_fault()
      dax_iomap_pte_fault() inserts PTE
					    dax_iomap_pmd_fault() inserts PMD,
					       but we already have a PTE at
					       this spot.

The core of the issue is that while there is isolation between faults to
the same range in the DAX fault handlers via our DAX entry locking,
there is no isolation between faults in the code in mm/memory.c.  This
means for instance that this code in __handle_mm_fault() can run:

	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
		ret = create_huge_pmd(&vmf);

But by the time we actually get to run the fault handler called by
create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
fault has installed a normal PMD here as a parent.  This is the cause of
the 2nd race.  The first race is similar - there is the following check
in handle_pte_fault():

	} else {
		/* See comment in pte_alloc_one_map() */
		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
			return 0;

So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
will bail and retry the fault.  This is correct, but there is nothing
preventing the PMD from being installed after this check but before we
actually get to the DAX PTE fault handlers.

In my testing these races result in the following types of errors:

  BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
  BUG: non-zero nr_ptes on freeing mm: 15

Fix this issue by having the DAX fault handlers verify that it is safe
to continue their fault after they have taken an entry lock to block
other racing faults.

[ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
  Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pawel Lebioda <pawel.lebioda@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Xiong Zhou <xzhou@redhat.com>
Cc: Eryu Guan <eguan@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1 parent d0f0931
Raw File
Makefile.modpost
# ===========================================================================
# Module versions
# ===========================================================================
#
# Stage one of module building created the following:
# a) The individual .o files used for the module
# b) A <module>.o file which is the .o files above linked together
# c) A <module>.mod file in $(MODVERDIR)/, listing the name of the
#    the preliminary <module>.o file, plus all .o files

# Stage 2 is handled by this file and does the following
# 1) Find all modules from the files listed in $(MODVERDIR)/
# 2) modpost is then used to
# 3)  create one <module>.mod.c file pr. module
# 4)  create one Module.symvers file with CRC for all exported symbols
# 5) compile all <module>.mod.c files
# 6) final link of the module to a <module.ko> file

# Step 3 is used to place certain information in the module's ELF
# section, including information such as:
#   Version magic (see include/linux/vermagic.h for full details)
#     - Kernel release
#     - SMP is CONFIG_SMP
#     - PREEMPT is CONFIG_PREEMPT
#     - GCC Version
#   Module info
#     - Module version (MODULE_VERSION)
#     - Module alias'es (MODULE_ALIAS)
#     - Module license (MODULE_LICENSE)
#     - See include/linux/module.h for more details

# Step 4 is solely used to allow module versioning in external modules,
# where the CRC of each module is retrieved from the Module.symvers file.

# KBUILD_MODPOST_WARN can be set to avoid error out in case of undefined
# symbols in the final module linking stage
# KBUILD_MODPOST_NOFINAL can be set to skip the final link of modules.
# This is solely useful to speed up test compiles
PHONY := _modpost
_modpost: __modpost

include include/config/auto.conf
include scripts/Kbuild.include

# When building external modules load the Kbuild file to retrieve EXTRA_SYMBOLS info
ifneq ($(KBUILD_EXTMOD),)

# set src + obj - they may be used when building the .mod.c file
obj := $(KBUILD_EXTMOD)
src := $(obj)

# Include the module's Makefile to find KBUILD_EXTRA_SYMBOLS
include $(if $(wildcard $(KBUILD_EXTMOD)/Kbuild), \
             $(KBUILD_EXTMOD)/Kbuild, $(KBUILD_EXTMOD)/Makefile)
endif

include scripts/Makefile.lib

kernelsymfile := $(objtree)/Module.symvers
modulesymfile := $(firstword $(KBUILD_EXTMOD))/Module.symvers

# Step 1), find all modules listed in $(MODVERDIR)/
MODLISTCMD := find $(MODVERDIR) -name '*.mod' | xargs -r grep -h '\.ko$$' | sort -u
__modules := $(shell $(MODLISTCMD))
modules   := $(patsubst %.o,%.ko, $(wildcard $(__modules:.ko=.o)))

# Stop after building .o files if NOFINAL is set. Makes compile tests quicker
_modpost: $(if $(KBUILD_MODPOST_NOFINAL), $(modules:.ko:.o),$(modules))

# Step 2), invoke modpost
#  Includes step 3,4
modpost = scripts/mod/modpost                    \
 $(if $(CONFIG_MODVERSIONS),-m)                  \
 $(if $(CONFIG_MODULE_SRCVERSION_ALL),-a,)       \
 $(if $(KBUILD_EXTMOD),-i,-o) $(kernelsymfile)   \
 $(if $(KBUILD_EXTMOD),-I $(modulesymfile))      \
 $(if $(KBUILD_EXTRA_SYMBOLS), $(patsubst %, -e %,$(KBUILD_EXTRA_SYMBOLS))) \
 $(if $(KBUILD_EXTMOD),-o $(modulesymfile))      \
 $(if $(CONFIG_DEBUG_SECTION_MISMATCH),,-S)      \
 $(if $(CONFIG_SECTION_MISMATCH_WARN_ONLY),,-E)  \
 $(if $(KBUILD_EXTMOD)$(KBUILD_MODPOST_WARN),-w)

MODPOST_OPT=$(subst -i,-n,$(filter -i,$(MAKEFLAGS)))

# We can go over command line length here, so be careful.
quiet_cmd_modpost = MODPOST $(words $(filter-out vmlinux FORCE, $^)) modules
      cmd_modpost = $(MODLISTCMD) | sed 's/\.ko$$/.o/' | $(modpost) $(MODPOST_OPT) -s -T -

PHONY += __modpost
__modpost: $(modules:.ko=.o) FORCE
	$(call cmd,modpost) $(wildcard vmlinux)

quiet_cmd_kernel-mod = MODPOST $@
      cmd_kernel-mod = $(modpost) $@

vmlinux.o: FORCE
	$(call cmd,kernel-mod)

# Declare generated files as targets for modpost
$(symverfile):         __modpost ;
$(modules:.ko=.mod.c): __modpost ;


# Step 5), compile all *.mod.c files

# modname is set to make c_flags define KBUILD_MODNAME
modname = $(notdir $(@:.mod.o=))

quiet_cmd_cc_o_c = CC      $@
      cmd_cc_o_c = $(CC) $(c_flags) $(KBUILD_CFLAGS_MODULE) $(CFLAGS_MODULE) \
		   -c -o $@ $<

$(modules:.ko=.mod.o): %.mod.o: %.mod.c FORCE
	$(call if_changed_dep,cc_o_c)

targets += $(modules:.ko=.mod.o)

ARCH_POSTLINK := $(wildcard $(srctree)/arch/$(SRCARCH)/Makefile.postlink)

# Step 6), final link of the modules with optional arch pass after final link
quiet_cmd_ld_ko_o = LD [M]  $@
      cmd_ld_ko_o =                                                     \
	$(LD) -r $(LDFLAGS)                                             \
                 $(KBUILD_LDFLAGS_MODULE) $(LDFLAGS_MODULE)             \
                 -o $@ $(filter-out FORCE,$^) ;                         \
	$(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) $@, true)

$(modules): %.ko :%.o %.mod.o FORCE
	+$(call if_changed,ld_ko_o)

targets += $(modules)


# Add FORCE to the prequisites of a target to force it to be always rebuilt.
# ---------------------------------------------------------------------------

PHONY += FORCE

FORCE:

# Read all saved command lines and dependencies for the $(targets) we
# may be building above, using $(if_changed{,_dep}). As an
# optimization, we don't need to read them if the target does not
# exist, we will rebuild anyway in that case.

targets := $(wildcard $(sort $(targets)))
cmd_files := $(wildcard $(foreach f,$(targets),$(dir $(f)).$(notdir $(f)).cmd))

ifneq ($(cmd_files),)
  include $(cmd_files)
endif


# Declare the contents of the .PHONY variable as phony.  We keep that
# information in a variable se we can use it in if_changed and friends.

.PHONY: $(PHONY)
back to top