Revision e2093926a098a8ccf0f1d10f6df8dad452cb28d3 authored by Ross Zwisler on 02 June 2017, 21:46:37 UTC, committed by Linus Torvalds on 02 June 2017, 22:07:37 UTC
We currently have two related PMD vs PTE races in the DAX code.  These
can both be easily triggered by having two threads reading and writing
simultaneously to the same private mapping, with the key being that
private mapping reads can be handled with PMDs but private mapping
writes are always handled with PTEs so that we can COW.

Here is the first race:

  CPU 0					CPU 1

  (private mapping write)
  __handle_mm_fault()
    create_huge_pmd() - FALLBACK
    handle_pte_fault()
      passes check for pmd_devmap()

					(private mapping read)
					__handle_mm_fault()
					  create_huge_pmd()
					    dax_iomap_pmd_fault() inserts PMD

      dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
        installed in our page tables at this spot.

Here's the second race:

  CPU 0					CPU 1

  (private mapping write)
  __handle_mm_fault()
    create_huge_pmd() - FALLBACK
					(private mapping read)
					__handle_mm_fault()
					  passes check for pmd_none()
					  create_huge_pmd()

    handle_pte_fault()
      dax_iomap_pte_fault() inserts PTE
					    dax_iomap_pmd_fault() inserts PMD,
					       but we already have a PTE at
					       this spot.

The core of the issue is that while there is isolation between faults to
the same range in the DAX fault handlers via our DAX entry locking,
there is no isolation between faults in the code in mm/memory.c.  This
means for instance that this code in __handle_mm_fault() can run:

	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
		ret = create_huge_pmd(&vmf);

But by the time we actually get to run the fault handler called by
create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
fault has installed a normal PMD here as a parent.  This is the cause of
the 2nd race.  The first race is similar - there is the following check
in handle_pte_fault():

	} else {
		/* See comment in pte_alloc_one_map() */
		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
			return 0;

So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
will bail and retry the fault.  This is correct, but there is nothing
preventing the PMD from being installed after this check but before we
actually get to the DAX PTE fault handlers.
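
To make the window concrete, here is a small userspace model of the two
interleavings. It is only a sketch: enum pmd_state, pmd_path_check(),
pte_path_check() and the replay_*() helpers are made-up names for
illustration and are not kernel code. pte_path_check() models only the
pmd_devmap() half of the real check. Each replay function returns true
when the fault handler would act on a check result that is already stale.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the state of one PMD slot; not a kernel type. */
enum pmd_state {
	PMD_NONE,	/* nothing installed */
	PMD_TABLE,	/* normal PMD pointing at a page of PTEs */
	PMD_DAX_HUGE,	/* huge DAX PMD, i.e. pmd_devmap() would be true */
};

/* Models the pmd_none() test made before create_huge_pmd(). */
static bool pmd_path_check(enum pmd_state s)
{
	return s == PMD_NONE;
}

/* Models only the pmd_devmap() part of the test in handle_pte_fault(). */
static bool pte_path_check(enum pmd_state s)
{
	return s != PMD_DAX_HUGE;
}

/* First race: the PTE path checks, then a DAX PMD appears underneath it. */
bool replay_first_race(void)
{
	enum pmd_state slot = PMD_NONE;
	bool cpu0_may_proceed = pte_path_check(slot);	/* CPU 0 check passes */

	slot = PMD_DAX_HUGE;				/* CPU 1 inserts PMD */

	/* CPU 0 now runs its PTE fault although a DAX PMD is installed. */
	return cpu0_may_proceed && slot == PMD_DAX_HUGE;
}

/* Second race: the PMD path checks, then a PTE table appears underneath it. */
bool replay_second_race(void)
{
	enum pmd_state slot = PMD_NONE;
	bool cpu1_may_proceed = pmd_path_check(slot);	/* CPU 1 check passes */

	slot = PMD_TABLE;				/* CPU 0 inserts PTE */

	/* CPU 1 now runs its PMD fault although PTEs are installed. */
	return cpu1_may_proceed && slot == PMD_TABLE;
}
```

Both functions return true in this model, because nothing forces the
check to be repeated between the time it is made and the time its result
is used.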

In my testing these races result in the following types of errors:

  BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
  BUG: non-zero nr_ptes on freeing mm: 15

Fix this issue by having the DAX fault handlers verify that it is safe
to continue their fault after they have taken an entry lock to block
other racing faults.
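
The idea of the fix can be sketched in userspace as follows. This is a
model under stated assumptions, not the code in fs/dax.c: enum pmd_state,
enum fault_result, pte_fault_locked() and pmd_fault_locked() are
hypothetical names, and the DAX entry lock is represented only by the
fact that the two handlers are called one after the other. The point it
illustrates is that each handler looks at the PMD slot again once it is
serialized against the other, and backs off instead of installing a
conflicting entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the state of one PMD slot; not a kernel type. */
enum pmd_state { PMD_NONE, PMD_TABLE, PMD_DAX_HUGE };

/* Hypothetical result codes for this model only. */
enum fault_result { FAULT_INSTALLED, FAULT_BACKOFF };

/* PTE handler body, assumed to run with the entry lock already held. */
enum fault_result pte_fault_locked(enum pmd_state *slot)
{
	/* Recheck: a racing PMD fault may have won since the caller's check. */
	if (*slot == PMD_DAX_HUGE)
		return FAULT_BACKOFF;

	*slot = PMD_TABLE;	/* safe to set up the PTE */
	return FAULT_INSTALLED;
}

/* PMD handler body, assumed to run with the entry lock already held. */
enum fault_result pmd_fault_locked(enum pmd_state *slot)
{
	/* Recheck: a racing PTE fault may have installed a page table. */
	if (*slot == PMD_TABLE)
		return FAULT_BACKOFF;

	*slot = PMD_DAX_HUGE;	/* safe to install the huge PMD */
	return FAULT_INSTALLED;
}
```

Whichever handler runs second sees the first one's entry and backs off,
so a slot in this model never ends up with a PTE table and a huge DAX PMD
competing for the same spot.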

[ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
  Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pawel Lebioda <pawel.lebioda@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Xiong Zhou <xzhou@redhat.com>
Cc: Eryu Guan <eguan@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1 parent d0f0931
File Mode Size
bitops
4level-fixup.h -rw-r--r-- 1.0 KB
5level-fixup.h -rw-r--r-- 1.1 KB
asm-offsets.h -rw-r--r-- 35 bytes
asm-prototypes.h -rw-r--r-- 468 bytes
atomic-long.h -rw-r--r-- 6.8 KB
atomic.h -rw-r--r-- 5.1 KB
atomic64.h -rw-r--r-- 2.2 KB
audit_change_attr.h -rw-r--r-- 445 bytes
audit_dir_write.h -rw-r--r-- 416 bytes
audit_read.h -rw-r--r-- 202 bytes
audit_signal.h -rw-r--r-- 36 bytes
audit_write.h -rw-r--r-- 377 bytes
barrier.h -rw-r--r-- 5.6 KB
bitops.h -rw-r--r-- 1.1 KB
bitsperlong.h -rw-r--r-- 553 bytes
bug.h -rw-r--r-- 6.4 KB
bugs.h -rw-r--r-- 228 bytes
cache.h -rw-r--r-- 345 bytes
cacheflush.h -rw-r--r-- 1.3 KB
checksum.h -rw-r--r-- 2.2 KB
clkdev.h -rw-r--r-- 706 bytes
cmpxchg-local.h -rw-r--r-- 1.4 KB
cmpxchg.h -rw-r--r-- 2.2 KB
current.h -rw-r--r-- 217 bytes
delay.h -rw-r--r-- 1.1 KB
device.h -rw-r--r-- 245 bytes
div64.h -rw-r--r-- 6.9 KB
dma-contiguous.h -rw-r--r-- 199 bytes
dma.h -rw-r--r-- 514 bytes
early_ioremap.h -rw-r--r-- 1.5 KB
emergency-restart.h -rw-r--r-- 209 bytes
exec.h -rw-r--r-- 697 bytes
export.h -rw-r--r-- 2.2 KB
extable.h -rw-r--r-- 763 bytes
fb.h -rw-r--r-- 232 bytes
fixmap.h -rw-r--r-- 2.8 KB
ftrace.h -rw-r--r-- 460 bytes
futex.h -rw-r--r-- 3.7 KB
getorder.h -rw-r--r-- 1.4 KB
gpio.h -rw-r--r-- 4.4 KB
hardirq.h -rw-r--r-- 493 bytes
hugetlb.h -rw-r--r-- 758 bytes
hw_irq.h -rw-r--r-- 270 bytes
ide_iops.h -rw-r--r-- 752 bytes
int-ll64.h -rw-r--r-- 893 bytes
io.h -rw-r--r-- 19.6 KB
ioctl.h -rw-r--r-- 467 bytes
iomap.h -rw-r--r-- 3.1 KB
irq.h -rw-r--r-- 364 bytes
irq_regs.h -rw-r--r-- 980 bytes
irq_work.h -rw-r--r-- 155 bytes
irqflags.h -rw-r--r-- 1.5 KB
kdebug.h -rw-r--r-- 143 bytes
kmap_types.h -rw-r--r-- 159 bytes
kprobes.h -rw-r--r-- 829 bytes
kvm_para.h -rw-r--r-- 441 bytes
linkage.h -rw-r--r-- 225 bytes
local.h -rw-r--r-- 2.2 KB
local64.h -rw-r--r-- 3.8 KB
mcs_spinlock.h -rw-r--r-- 260 bytes
memory_model.h -rw-r--r-- 2.1 KB
mm-arch-hooks.h -rw-r--r-- 388 bytes
mm_hooks.h -rw-r--r-- 837 bytes
mmu.h -rw-r--r-- 410 bytes
mmu_context.h -rw-r--r-- 842 bytes
module.h -rw-r--r-- 1.1 KB
msi.h -rw-r--r-- 799 bytes
page.h -rw-r--r-- 2.4 KB
param.h -rw-r--r-- 328 bytes
parport.h -rw-r--r-- 565 bytes
pci.h -rw-r--r-- 542 bytes
pci_iomap.h -rw-r--r-- 2.0 KB
percpu.h -rw-r--r-- 12.2 KB
pgalloc.h -rw-r--r-- 303 bytes
pgtable-nop4d-hack.h -rw-r--r-- 1.8 KB
pgtable-nop4d.h -rw-r--r-- 1.6 KB
pgtable-nopmd.h -rw-r--r-- 1.9 KB
pgtable-nopud.h -rw-r--r-- 1.9 KB
pgtable.h -rw-r--r-- 26.2 KB
preempt.h -rw-r--r-- 1.9 KB
ptrace.h -rw-r--r-- 1.6 KB
qrwlock.h -rw-r--r-- 5.0 KB
qrwlock_types.h -rw-r--r-- 431 bytes
qspinlock.h -rw-r--r-- 4.2 KB
qspinlock_types.h -rw-r--r-- 2.3 KB
resource.h -rw-r--r-- 1.0 KB
rwsem.h -rw-r--r-- 3.0 KB
seccomp.h -rw-r--r-- 1.3 KB
sections.h -rw-r--r-- 4.5 KB
segment.h -rw-r--r-- 249 bytes
serial.h -rw-r--r-- 306 bytes
set_memory.h -rw-r--r-- 323 bytes
siginfo.h -rw-r--r-- 568 bytes
signal.h -rw-r--r-- 269 bytes
simd.h -rw-r--r-- 397 bytes
sizes.h -rw-r--r-- 78 bytes
spinlock.h -rw-r--r-- 290 bytes
statfs.h -rw-r--r-- 130 bytes
string.h -rw-r--r-- 281 bytes
switch_to.h -rw-r--r-- 992 bytes
syscall.h -rw-r--r-- 6.2 KB
syscalls.h -rw-r--r-- 700 bytes
termios-base.h -rw-r--r-- 2.1 KB
termios.h -rw-r--r-- 2.8 KB
timex.h -rw-r--r-- 469 bytes
tlb.h -rw-r--r-- 9.1 KB
tlbflush.h -rw-r--r-- 446 bytes
topology.h -rw-r--r-- 2.1 KB
trace_clock.h -rw-r--r-- 352 bytes
uaccess-unaligned.h -rw-r--r-- 733 bytes
uaccess.h -rw-r--r-- 5.2 KB
unaligned.h -rw-r--r-- 1.0 KB
unistd.h -rw-r--r-- 279 bytes
user.h -rw-r--r-- 242 bytes
vga.h -rw-r--r-- 548 bytes
vmlinux.lds.h -rw-r--r-- 27.0 KB
vtime.h -rw-r--r-- 52 bytes
word-at-a-time.h -rw-r--r-- 2.7 KB
xor.h -rw-r--r-- 13.6 KB