Revision - 8aef188 - VFS: Fix vfsmount overput on simultaneous automount

Revision 8aef18845266f5c05904c610088f2d1ed58f6be3 authored by Al Viro on 16 June 2011, 14:10:06 UTC, committed by Al Viro on 16 June 2011, 15:28:16 UTC

VFS: Fix vfsmount overput on simultaneous automount

[Kudos to dhowells for tracking that crap down]

If two processes attempt to cause automounting on the same mountpoint at the
same time, the vfsmount holding the mountpoint will be left with one too few
references on it, causing a BUG when the kernel tries to clean up.

The problem is that lock_mount() drops the caller's reference to the
mountpoint's vfsmount in the case where it finds something already mounted on
the mountpoint as it transits to the mounted filesystem and replaces path->mnt
with the new mountpoint vfsmount.

During a pathwalk, however, we don't take a reference on the vfsmount if it is
the same as the one in the nameidata struct, but do_add_mount() doesn't know
this.

The fix is to make sure we have a ref on the vfsmount of the mountpoint before
calling do_add_mount().  However, if lock_mount() doesn't transit, we're then
left with an extra ref on the mountpoint vfsmount which needs releasing.
We can handle that in follow_managed() by not making assumptions about what
we can and what we cannot get from lookup_mnt() as the current code does.

The callers of follow_managed() expect that reference to path->mnt will be
grabbed iff path->mnt has been changed.  follow_managed() and follow_automount()
keep track of whether such reference has been grabbed and assume that it'll
happen in those and only those cases that'll have us return with changed
path->mnt.  That assumption is almost correct - it breaks in case of
racing automounts and in even harder to hit race between following a mountpoint
and a couple of mount --move.  The thing is, we don't need to make that
assumption at all - after the end of loop in follow_manage() we can check
if path->mnt has ended up unchanged and do mntput() if needed.

The BUG can be reproduced with the following test program:

	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <unistd.h>
	#include <sys/wait.h>
	int main(int argc, char **argv)
	{
		int pid, ws;
		struct stat buf;
		pid = fork();
		stat(argv[1], &buf);
		if (pid > 0) wait(&ws);
		return 0;
	}

and the following procedure:

 (1) Mount an NFS volume that on the server has something else mounted on a
     subdirectory.  For instance, I can mount / from my server:

	mount warthog:/ /mnt -t nfs4 -r

     On the server /data has another filesystem mounted on it, so NFS will see
     a change in FSID as it walks down the path, and will mark /mnt/data as
     being a mountpoint.  This will cause the automount code to be triggered.

     !!! Do not look inside the mounted fs at this point !!!

 (2) Run the above program on a file within the submount to generate two
     simultaneous automount requests:

	/tmp/forkstat /mnt/data/testfile

 (3) Unmount the automounted submount:

	umount /mnt/data

 (4) Unmount the original mount:

	umount /mnt

     At this point the kernel should throw a BUG with something like the
     following:

	BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12]

Note that the bug appears on the root dentry of the original mount, not the
mountpoint and not the submount because sys_umount() hasn't got to its final
mntput_no_expire() yet, but this isn't so obvious from the call trace:

 [<ffffffff8117cd82>] shrink_dcache_for_umount+0x69/0x82
 [<ffffffff8116160e>] generic_shutdown_super+0x37/0x15b
 [<ffffffffa00fae56>] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs]
 [<ffffffff811617f3>] kill_anon_super+0x1d/0x7e
 [<ffffffffa00d0be1>] nfs4_kill_super+0x60/0xb6 [nfs]
 [<ffffffff81161c17>] deactivate_locked_super+0x34/0x83
 [<ffffffff811629ff>] deactivate_super+0x6f/0x7b
 [<ffffffff81186261>] mntput_no_expire+0x18d/0x199
 [<ffffffff811862a8>] mntput+0x3b/0x44
 [<ffffffff81186d87>] release_mounts+0xa2/0xbf
 [<ffffffff811876af>] sys_umount+0x47a/0x4ba
 [<ffffffff8109e1ca>] ? trace_hardirqs_on_caller+0x1fd/0x22f
 [<ffffffff816ea86b>] system_call_fastpath+0x16/0x1b

as do_umount() is inlined.  However, you can see release_mounts() in there.

Note also that it may be necessary to have multiple CPU cores to be able to
trigger this bug.

Tested-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

1 parent 50338b8

Files
Changes

Permalinks

SubmitChecklist

Linux Kernel patch submission checklist
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here are some basic things that developers should do if they want to see their
kernel patch submissions accepted more quickly.

These are all above and beyond the documentation that is provided in
Documentation/SubmittingPatches and elsewhere regarding submitting Linux
kernel patches.


1: If you use a facility then #include the file that defines/declares
   that facility.  Don't depend on other header files pulling in ones
   that you use.

2: Builds cleanly with applicable or modified CONFIG options =y, =m, and
   =n.  No gcc warnings/errors, no linker warnings/errors.

2b: Passes allnoconfig, allmodconfig

2c: Builds successfully when using O=builddir

3: Builds on multiple CPU architectures by using local cross-compile tools
   or some other build farm.

4: ppc64 is a good architecture for cross-compilation checking because it
   tends to use `unsigned long' for 64-bit quantities.

5: Check your patch for general style as detailed in
   Documentation/CodingStyle.  Check for trivial violations with the
   patch style checker prior to submission (scripts/checkpatch.pl).
   You should be able to justify all violations that remain in
   your patch.

6: Any new or modified CONFIG options don't muck up the config menu.

7: All new Kconfig options have help text.

8: Has been carefully reviewed with respect to relevant Kconfig
   combinations.  This is very hard to get right with testing -- brainpower
   pays off here.

9: Check cleanly with sparse.

10: Use 'make checkstack' and 'make namespacecheck' and fix any problems
    that they find.  Note: checkstack does not point out problems explicitly,
    but any one function that uses more than 512 bytes on the stack is a
    candidate for change.

11: Include kernel-doc to document global kernel APIs.  (Not required for
    static functions, but OK there also.) Use 'make htmldocs' or 'make
    mandocs' to check the kernel-doc and fix any issues.

12: Has been tested with CONFIG_PREEMPT, CONFIG_DEBUG_PREEMPT,
    CONFIG_DEBUG_SLAB, CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_MUTEXES,
    CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_SPINLOCK_SLEEP all simultaneously
    enabled.

13: Has been build- and runtime tested with and without CONFIG_SMP and
    CONFIG_PREEMPT.

14: If the patch affects IO/Disk, etc: has been tested with and without
    CONFIG_LBDAF.

15: All codepaths have been exercised with all lockdep features enabled.

16: All new /proc entries are documented under Documentation/

17: All new kernel boot parameters are documented in
    Documentation/kernel-parameters.txt.

18: All new module parameters are documented with MODULE_PARM_DESC()

19: All new userspace interfaces are documented in Documentation/ABI/.
    See Documentation/ABI/README for more information.
    Patches that change userspace interfaces should be CCed to
    linux-api@vger.kernel.org.

20: Check that it all passes `make headers_check'.

21: Has been checked with injection of at least slab and page-allocation
    failures.  See Documentation/fault-injection/.

    If the new code is substantial, addition of subsystem-specific fault
    injection might be appropriate.

22: Newly-added code has been compiled with `gcc -W' (use "make
    EXTRA_CFLAGS=-W").  This will generate lots of noise, but is good for
    finding bugs like "warning: comparison between signed and unsigned".

23: Tested after it has been merged into the -mm patchset to make sure
    that it still works with all of the other queued patches and various
    changes in the VM, VFS, and other subsystems.

24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment in the
    source code that explains the logic of what they are doing and why.

25: If any ioctl's are added by the patch, then also update
    Documentation/ioctl/ioctl-number.txt.

26: If your modified source code depends on or uses any of the kernel
    APIs or features that are related to the following kconfig symbols,
    then test multiple builds with the related kconfig symbols disabled
    and/or =m (if that option is available) [not all of these at the
    same time, just various/random combinations of them]:

    CONFIG_SMP, CONFIG_SYSFS, CONFIG_PROC_FS, CONFIG_INPUT, CONFIG_PCI,
    CONFIG_BLOCK, CONFIG_PM, CONFIG_HOTPLUG, CONFIG_MAGIC_SYSRQ,
    CONFIG_NET, CONFIG_INET=n (but latter with CONFIG_NET=y)

Showing with 0 additions and 0 deletions (0 / 0 diffs computed)

Computing file changes ...