Revision - 8aef188 - VFS: Fix vfsmount overput on simultaneous automount

Revision 8aef18845266f5c05904c610088f2d1ed58f6be3 authored by Al Viro on 16 June 2011, 14:10:06 UTC, committed by Al Viro on 16 June 2011, 15:28:16 UTC

VFS: Fix vfsmount overput on simultaneous automount

[Kudos to dhowells for tracking that crap down]

If two processes attempt to cause automounting on the same mountpoint at the
same time, the vfsmount holding the mountpoint will be left with one too few
references on it, causing a BUG when the kernel tries to clean up.

The problem is that lock_mount() drops the caller's reference to the
mountpoint's vfsmount when it finds something already mounted on the
mountpoint: it transits to the mounted filesystem and replaces path->mnt
with the new mountpoint vfsmount.

During a pathwalk, however, we don't take a reference on the vfsmount if it is
the same as the one in the nameidata struct, but do_add_mount() doesn't know
this.

The fix is to make sure we have a ref on the vfsmount of the mountpoint before
calling do_add_mount().  However, if lock_mount() doesn't transit, we're then
left with an extra ref on the mountpoint vfsmount which needs releasing.
We can handle that in follow_managed() by not making assumptions about what
we can and what we cannot get from lookup_mnt() as the current code does.

The callers of follow_managed() expect that a reference to path->mnt will be
grabbed iff path->mnt has been changed.  follow_managed() and follow_automount()
keep track of whether such a reference has been grabbed and assume that it'll
happen in those and only those cases that'll have us return with a changed
path->mnt.  That assumption is almost correct - it breaks in the case of
racing automounts and in an even harder to hit race between following a
mountpoint and a couple of mount --move.  The thing is, we don't need to make
that assumption at all - after the end of the loop in follow_managed() we can
check if path->mnt has ended up unchanged and do mntput() if needed.
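
The check after the loop can be modelled in userspace.  What follows is a
minimal sketch and not the kernel code: toy_mount, toy_path, toy_mntget(),
toy_mntput() and toy_follow() are hypothetical stand-ins for struct vfsmount,
struct path, mntget(), mntput() and the tail of follow_managed(), and the
reference count is a plain int with no locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vfsmount: only a reference count. */
struct toy_mount {
	int refs;
};

/* Hypothetical stand-in for struct path: only the mount pointer. */
struct toy_path {
	struct toy_mount *mnt;
};

static void toy_mntget(struct toy_mount *m) { m->refs++; }
static void toy_mntput(struct toy_mount *m) { m->refs--; }

/*
 * Models one walk step.  'target' is the mount we end up on, or NULL if
 * we stay where we are.  'took_ref' says whether a reference on
 * path->mnt was grabbed on the way (as is done before an automount
 * attempt).  On return the caller owns a reference on path->mnt if and
 * only if path->mnt differs from the mount it started with.
 */
static void toy_follow(struct toy_path *path, struct toy_mount *target,
		       bool took_ref)
{
	struct toy_mount *start;
	bool need_mntput = false;

	assert(path && path->mnt);
	start = path->mnt;

	if (took_ref) {
		toy_mntget(path->mnt);
		need_mntput = true;
	}

	if (target) {
		/* Transit: take a ref on the new mount, drop ours on the old. */
		toy_mntget(target);
		if (need_mntput)
			toy_mntput(path->mnt);
		path->mnt = target;
		need_mntput = true;
	}

	/* Unchanged path->mnt means the reference we hold is surplus. */
	if (need_mntput && path->mnt == start)
		toy_mntput(path->mnt);
}
```

In this model a step that grabs a reference but does not transit leaves the
count where it started, while a step that transits leaves exactly one extra
reference on the new mount.  No assumption is made about which of the two
happens.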

The BUG can be reproduced with the following test program:

	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <unistd.h>
	#include <sys/wait.h>
	int main(int argc, char **argv)
	{
		int pid, ws;
		struct stat buf;
		pid = fork();
		stat(argv[1], &buf);
		if (pid > 0) wait(&ws);
		return 0;
	}

and the following procedure:

 (1) Mount an NFS volume that on the server has something else mounted on a
     subdirectory.  For instance, I can mount / from my server:

	mount warthog:/ /mnt -t nfs4 -r

     On the server /data has another filesystem mounted on it, so NFS will see
     a change in FSID as it walks down the path, and will mark /mnt/data as
     being a mountpoint.  This will cause the automount code to be triggered.

     !!! Do not look inside the mounted fs at this point !!!

 (2) Run the above program on a file within the submount to generate two
     simultaneous automount requests:

	/tmp/forkstat /mnt/data/testfile

 (3) Unmount the automounted submount:

	umount /mnt/data

 (4) Unmount the original mount:

	umount /mnt

     At this point the kernel should throw a BUG with something like the
     following:

	BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12]

Note that the bug appears on the root dentry of the original mount, not the
mountpoint and not the submount because sys_umount() hasn't got to its final
mntput_no_expire() yet, but this isn't so obvious from the call trace:

 [<ffffffff8117cd82>] shrink_dcache_for_umount+0x69/0x82
 [<ffffffff8116160e>] generic_shutdown_super+0x37/0x15b
 [<ffffffffa00fae56>] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs]
 [<ffffffff811617f3>] kill_anon_super+0x1d/0x7e
 [<ffffffffa00d0be1>] nfs4_kill_super+0x60/0xb6 [nfs]
 [<ffffffff81161c17>] deactivate_locked_super+0x34/0x83
 [<ffffffff811629ff>] deactivate_super+0x6f/0x7b
 [<ffffffff81186261>] mntput_no_expire+0x18d/0x199
 [<ffffffff811862a8>] mntput+0x3b/0x44
 [<ffffffff81186d87>] release_mounts+0xa2/0xbf
 [<ffffffff811876af>] sys_umount+0x47a/0x4ba
 [<ffffffff8109e1ca>] ? trace_hardirqs_on_caller+0x1fd/0x22f
 [<ffffffff816ea86b>] system_call_fastpath+0x16/0x1b

as do_umount() is inlined.  However, you can see release_mounts() in there.

Note also that it may be necessary to have multiple CPU cores to be able to
trigger this bug.

Tested-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

1 parent 50338b8

api-intro.txt


                    Scatterlist Cryptographic API
                   
INTRODUCTION

The Scatterlist Crypto API takes page vectors (scatterlists) as
arguments, and works directly on pages.  In some cases (e.g. ECB
mode ciphers), this will allow for pages to be encrypted in-place
with no copying.

One of the initial goals of this design was to readily support IPsec,
so that processing can be applied to paged skb's without the need
for linearization.


DETAILS

At the lowest level are algorithms, which register dynamically with the
API.

'Transforms' are user-instantiated objects, which maintain state, handle all
of the implementation logic (e.g. manipulating page vectors) and provide an 
abstraction to the underlying algorithms.  However, at the user 
level they are very simple.

Conceptually, the API layering looks like this:

  [transform api]  (user interface)
  [transform ops]  (per-type logic glue e.g. cipher.c, compress.c)
  [algorithm api]  (for registering algorithms)
  
The idea is to make the user interface and algorithm registration API
very simple, while hiding the core logic from both.  Many good ideas
from existing APIs such as Cryptoapi and Nettle have been adapted for this.
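
The three layers can be illustrated with a small userspace model.  This is
a sketch only, under the assumption that a linked list of algorithms and a
name lookup are enough to show the idea; none of the toy_* names exist in
the kernel, and "toysum" is an additive checksum, not a real hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* [algorithm api]: what an algorithm supplies when it registers. */
struct toy_alg {
	const char *name;
	void (*init)(unsigned int *state);
	void (*update)(unsigned int *state, const unsigned char *data,
		       size_t len);
	struct toy_alg *next;
};

static struct toy_alg *toy_alg_list;

static void toy_register_alg(struct toy_alg *alg)
{
	assert(alg && alg->name);
	alg->next = toy_alg_list;
	toy_alg_list = alg;
}

/* [transform api]: a user-instantiated object holding per-use state. */
struct toy_tfm {
	const struct toy_alg *alg;
	unsigned int state;
};

/* Looks the algorithm up by name; returns NULL if it is not registered. */
static struct toy_tfm *toy_alloc_tfm(const char *name)
{
	struct toy_alg *alg;
	struct toy_tfm *tfm;

	for (alg = toy_alg_list; alg; alg = alg->next)
		if (strcmp(alg->name, name) == 0)
			break;
	if (!alg)
		return NULL;

	tfm = malloc(sizeof(*tfm));
	if (tfm) {
		tfm->alg = alg;
		tfm->state = 0;
	}
	return tfm;
}

/* [transform ops]: glue that walks the data segments for the algorithm. */
struct toy_seg {
	const unsigned char *data;
	size_t len;
};

static unsigned int toy_tfm_digest(struct toy_tfm *tfm,
				   const struct toy_seg *segs, size_t nsegs)
{
	size_t i;

	tfm->alg->init(&tfm->state);
	for (i = 0; i < nsegs; i++)
		tfm->alg->update(&tfm->state, segs[i].data, segs[i].len);
	return tfm->state;
}

/* A deliberately trivial algorithm: sum of all input bytes. */
static void toy_sum_init(unsigned int *state) { *state = 0; }

static void toy_sum_update(unsigned int *state, const unsigned char *data,
			   size_t len)
{
	while (len--)
		*state += *data++;
}

static struct toy_alg toy_sum_alg = {
	"toysum", toy_sum_init, toy_sum_update, NULL
};
```

The user only calls toy_alloc_tfm() and toy_tfm_digest(), and the algorithm
only fills in struct toy_alg; the segment walking sits between them and is
visible to neither.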

The API currently supports five main types of transforms: AEAD (Authenticated
Encryption with Associated Data), Block Ciphers, Ciphers, Compressors and
Hashes.

Please note that Block Ciphers is somewhat of a misnomer.  It is in fact
meant to support all ciphers including stream ciphers.  The difference
between Block Ciphers and Ciphers is that the latter operates on exactly
one block while the former can operate on an arbitrary amount of data,
subject to block size requirements (i.e., non-stream ciphers can only
process multiples of blocks).
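
The distinction can be sketched in userspace as follows.  The names
toy_encrypt_one() and toy_encrypt_many() and the 8-byte TOY_BLOCK_SIZE are
assumptions made for illustration, and the XOR "cipher" is a placeholder
with no security value.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLOCK_SIZE 8

/* "Cipher" level: transforms exactly one block in place. */
static void toy_encrypt_one(unsigned char *block, unsigned char key)
{
	size_t i;

	for (i = 0; i < TOY_BLOCK_SIZE; i++)
		block[i] ^= key;
}

/*
 * "Block Cipher" level: processes any whole number of blocks in place,
 * ECB-style.  Returns 0 on success, -1 if len is not a multiple of the
 * block size.
 */
static int toy_encrypt_many(unsigned char *buf, size_t len,
			    unsigned char key)
{
	size_t off;

	assert(buf);
	if (len % TOY_BLOCK_SIZE)
		return -1;
	for (off = 0; off < len; off += TOY_BLOCK_SIZE)
		toy_encrypt_one(buf + off, key);
	return 0;
}
```

A stream cipher would fit the same multi-block interface with a block size
of 1, in which case the length check never fails.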

Support for hardware crypto devices via an asynchronous interface is
under development.

Here's an example of how to use the API:

	#include <linux/crypto.h>
	#include <linux/err.h>
	#include <linux/scatterlist.h>
	
	struct scatterlist sg[2];
	char result[128];
	struct crypto_hash *tfm;
	struct hash_desc desc;
	
	/* type 0 with mask CRYPTO_ALG_ASYNC asks for a synchronous hash */
	tfm = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
	if (IS_ERR(tfm))
		fail();
		
	/* ... set up the scatterlists ... */

	desc.tfm = tfm;
	desc.flags = 0;
	
	/* the third argument is the number of bytes to hash */
	if (crypto_hash_digest(&desc, sg, 2, result))
		fail();
	
	crypto_free_hash(tfm);

    
Many real examples are available in the regression test module (tcrypt.c).


DEVELOPER NOTES

Transforms may only be allocated in user context, and cryptographic
methods may only be called from softirq and user contexts.  Where a
transform has a setkey method, that method should likewise be called
only from user context.

When using the API for ciphers, performance will be optimal if each
scatterlist entry holds data whose length is a multiple of the cipher's
block size (typically 8 bytes).  This prevents having to do any copying
across non-aligned page fragment boundaries.
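
A caller can check this before submitting data.  The helper below is a
userspace sketch; toy_count_unaligned() is a hypothetical name, and the
segment lengths are passed as a plain array rather than read from real
scatterlist entries.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Counts the segments whose length is not a multiple of block_size.
 * A non-zero result usually means that some block straddles a segment
 * boundary and would have to be copied to be processed.
 */
static size_t toy_count_unaligned(const size_t *seg_len, size_t nsegs,
				  size_t block_size)
{
	size_t i, n = 0;

	assert(block_size > 0);
	for (i = 0; i < nsegs; i++)
		if (seg_len[i] % block_size)
			n++;
	return n;
}
```

With an 8-byte block size, segments of 4096, 4096 and 64 bytes are all
aligned, whereas 4093, 4099 and 64 bytes carry the same total but contain
two unaligned segments.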


ADDING NEW ALGORITHMS

When submitting a new algorithm for inclusion, a mandatory requirement
is that at least a few test vectors from known sources (preferably
standards) be included.

Converting existing well known code is preferred, as it is more likely
to have been reviewed and widely tested.  If submitting code from LGPL
sources, please consider changing the license to GPL (see section 3 of
the LGPL).

Algorithms submitted must also be generally patent-free (e.g. IDEA
will not be included in the mainline until around 2011), and be based
on a recognized standard and/or have been subjected to appropriate
peer review.

Also check for any RFCs which may relate to the use of specific algorithms,
as well as general application notes such as RFC2451 ("The ESP CBC-Mode
Cipher Algorithms").

It's a good idea to avoid using lots of macros and use inlined functions
instead, as gcc does a good job with inlining, while excessive use of
macros can cause compilation problems on some platforms.

Also check the TODO list at the web site listed below to see what people
might already be working on.


BUGS

Send bug reports to:
linux-crypto@vger.kernel.org
Cc: Herbert Xu <herbert@gondor.apana.org.au>,
    David S. Miller <davem@redhat.com>


FURTHER INFORMATION

For further patches and various updates, including the current TODO
list, see:
http://gondor.apana.org.au/~herbert/crypto/


AUTHORS

James Morris
David S. Miller
Herbert Xu


CREDITS

The following people provided invaluable feedback during the development
of the API:

  Alexey Kuznetzov
  Rusty Russell
  Herbert Valerio Riedel
  Jeff Garzik
  Michael Richardson
  Andrew Morton
  Ingo Oeser
  Christoph Hellwig

Portions of this API were derived from the following projects:
  
  Kerneli Cryptoapi (http://www.kerneli.org/)
    Alexander Kjeldaas
    Herbert Valerio Riedel
    Kyle McMartin
    Jean-Luc Cooke
    David Bryson
    Clemens Fruhwirth
    Tobias Ringstrom
    Harald Welte

and:
  
  Nettle (http://www.lysator.liu.se/~nisse/nettle/)
    Niels Möller

Original developers of the crypto algorithms:

  Dana L. How (DES)
  Andrew Tridgell and Steve French (MD4)
  Colin Plumb (MD5)
  Steve Reid (SHA1)
  Jean-Luc Cooke (SHA256, SHA384, SHA512)
  Kazunori Miyazawa / USAGI (HMAC)
  Matthew Skala (Twofish)
  Dag Arne Osvik (Serpent)
  Brian Gladman (AES)
  Kartikey Mahendra Bhatt (CAST6)
  Jon Oberheide (ARC4)
  Jouni Malinen (Michael MIC)
  NTT(Nippon Telegraph and Telephone Corporation) (Camellia)

SHA1 algorithm contributors:
  Jean-Francois Dive
  
DES algorithm contributors:
  Raimar Falke
  Gisle Sælensminde
  Niels Möller

Blowfish algorithm contributors:
  Herbert Valerio Riedel
  Kyle McMartin

Twofish algorithm contributors:
  Werner Koch
  Marc Mutz

SHA256/384/512 algorithm contributors:
  Andrew McDonald
  Kyle McMartin
  Herbert Valerio Riedel
  
AES algorithm contributors:
  Alexander Kjeldaas
  Herbert Valerio Riedel
  Kyle McMartin
  Adam J. Richter
  Fruhwirth Clemens (i586)
  Linus Torvalds (i586)

CAST5 algorithm contributors:
  Kartikey Mahendra Bhatt (original developers unknown, FSF copyright).

TEA/XTEA algorithm contributors:
  Aaron Grothe
  Michael Ringe

Khazad algorithm contributors:
  Aaron Grothe

Whirlpool algorithm contributors:
  Aaron Grothe
  Jean-Luc Cooke

Anubis algorithm contributors:
  Aaron Grothe

Tiger algorithm contributors:
  Aaron Grothe

VIA PadLock contributors:
  Michal Ludvig

Camellia algorithm contributors:
  NTT(Nippon Telegraph and Telephone Corporation) (Camellia)

Generic scatterwalk code by Adam J. Richter <adam@yggdrasil.com>

Please send any credits updates or corrections to:
Herbert Xu <herbert@gondor.apana.org.au>
