Revision e91467ecd1ef381377fd327c0ded922835ec52ab authored by Christian Borntraeger on 05 August 2006, 19:13:52 UTC, committed by Linus Torvalds on 06 August 2006, 15:57:46 UTC
This patch adds a barrier() in futex unqueue_me to avoid aliasing of two
pointers.

On my s390x system I saw the following oops:

Unable to handle kernel pointer dereference at virtual kernel address
0000000000000000
Oops: 0004 [#1]
CPU:    0    Not tainted
Process mytool (pid: 13613, task: 000000003ecb6ac0, ksp: 00000000366bdbd8)
Krnl PSW : 0704d00180000000 00000000003c9ac2 (_spin_lock+0xe/0x30)
Krnl GPRS: 00000000ffffffff 000000003ecb6ac0 0000000000000000 0700000000000000
           0000000000000000 0000000000000000 000001fe00002028 00000000000c091f
           000001fe00002054 000001fe00002054 0000000000000000 00000000366bddc0
           00000000005ef8c0 00000000003d00e8 0000000000144f91 00000000366bdcb8
Krnl Code: ba 4e 20 00 12 44 b9 16 00 3e a7 84 00 08 e3 e0 f0 88 00 04
Call Trace:
([<0000000000144f90>] unqueue_me+0x40/0xe4)
 [<0000000000145a0c>] do_futex+0x33c/0xc40
 [<000000000014643e>] sys_futex+0x12e/0x144
 [<000000000010bb00>] sysc_noemu+0x10/0x16
 [<000002000003741c>] 0x2000003741c

The code in question is:

static int unqueue_me(struct futex_q *q)
{
        int ret = 0;
        spinlock_t *lock_ptr;

        /* In the common case we don't take the spinlock, which is nice. */
 retry:
        lock_ptr = q->lock_ptr;
        if (lock_ptr != 0) {
                spin_lock(lock_ptr);
		/*
                 * q->lock_ptr can change between reading it and
                 * spin_lock(), causing us to take the wrong lock.  This
                 * corrects the race condition.
[...]

and my compiler (gcc 4.1.0) makes the following out of it:

00000000000003c8 <unqueue_me>:
     3c8:       eb bf f0 70 00 24       stmg    %r11,%r15,112(%r15)
     3ce:       c0 d0 00 00 00 00       larl    %r13,3ce <unqueue_me+0x6>
                        3d0: R_390_PC32DBL      .rodata+0x2a
     3d4:       a7 f1 1e 00             tml     %r15,7680
     3d8:       a7 84 00 01             je      3da <unqueue_me+0x12>
     3dc:       b9 04 00 ef             lgr     %r14,%r15
     3e0:       a7 fb ff d0             aghi    %r15,-48
     3e4:       b9 04 00 b2             lgr     %r11,%r2
     3e8:       e3 e0 f0 98 00 24       stg     %r14,152(%r15)
     3ee:       e3 c0 b0 28 00 04       lg      %r12,40(%r11)
		/* write q->lock_ptr in r12 */
     3f4:       b9 02 00 cc             ltgr    %r12,%r12
     3f8:       a7 84 00 4b             je      48e <unqueue_me+0xc6>
		/* if r12 is zero then jump over the code.... */
     3fc:       e3 20 b0 28 00 04       lg      %r2,40(%r11)
		/* write q->lock_ptr in r2 */
     402:       c0 e5 00 00 00 00       brasl   %r14,402 <unqueue_me+0x3a>
                        404: R_390_PC32DBL      _spin_lock+0x2
		/* use r2 as parameter for spin_lock */

So the code becomes more or less:
if (q->lock_ptr != 0) spin_lock(q->lock_ptr)
instead of
if (lock_ptr != 0) spin_lock(lock_ptr)

Which caused the oops from above.
After adding a barrier gcc creates code without this problem:
[...] (the same)
     3ee:       e3 c0 b0 28 00 04       lg      %r12,40(%r11)
     3f4:       b9 02 00 cc             ltgr    %r12,%r12
     3f8:       b9 04 00 2c             lgr     %r2,%r12
     3fc:       a7 84 00 48             je      48c <unqueue_me+0xc4>
     400:       c0 e5 00 00 00 00       brasl   %r14,400 <unqueue_me+0x38>
                        402: R_390_PC32DBL      _spin_lock+0x2

As a general note, this code of unqueue_me seems a bit fishy. The retry logic
of unqueue_me only works if we can guarantee, that the original value of
q->lock_ptr is always a spinlock (Otherwise we overwrite kernel memory). We
know that q->lock_ptr can change. I dont know what happens with the original
spinlock, as I am not an expert with the futex code.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@timesys.com>
Signed-off-by: Christian Borntraeger <borntrae@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1 parent 72f0b4e
Raw File
README
[LICENSING] 

ReiserFS is hereby licensed under the GNU General
Public License version 2.

Source code files that contain the phrase "licensing governed by
reiserfs/README" are "governed files" throughout this file.  Governed
files are licensed under the GPL.  The portions of them owned by Hans
Reiser, or authorized to be licensed by him, have been in the past,
and likely will be in the future, licensed to other parties under
other licenses.  If you add your code to governed files, and don't
want it to be owned by Hans Reiser, put your copyright label on that
code so the poor blight and his customers can keep things straight.
All portions of governed files not labeled otherwise are owned by Hans
Reiser, and by adding your code to it, widely distributing it to
others or sending us a patch, and leaving the sentence in stating that
licensing is governed by the statement in this file, you accept this.
It will be a kindness if you identify whether Hans Reiser is allowed
to license code labeled as owned by you on your behalf other than
under the GPL, because he wants to know if it is okay to do so and put
a check in the mail to you (for non-trivial improvements) when he
makes his next sale.  He makes no guarantees as to the amount if any,
though he feels motivated to motivate contributors, and you can surely
discuss this with him before or after contributing.  You have the
right to decline to allow him to license your code contribution other
than under the GPL.

Further licensing options are available for commercial and/or other
interests directly from Hans Reiser: hans@reiser.to.  If you interpret
the GPL as not allowing those additional licensing options, you read
it wrongly, and Richard Stallman agrees with me, when carefully read
you can see that those restrictions on additional terms do not apply
to the owner of the copyright, and my interpretation of this shall
govern for this license.  

Finally, nothing in this license shall be interpreted to allow you to
fail to fairly credit me, or to remove my credits, without my
permission, unless you are an end user not redistributing to others.
If you have doubts about how to properly do that, or about what is
fair, ask.  (Last I spoke with him Richard was contemplating how best
to address the fair crediting issue in the next GPL version.)

[END LICENSING]

Reiserfs is a file system based on balanced tree algorithms, which is
described at http://devlinux.com/namesys.

Stop reading here.  Go there, then return.

Send bug reports to yura@namesys.botik.ru.

mkreiserfs and other utilities are in reiserfs/utils, or wherever your
Linux provider put them.  There is some disagreement about how useful
it is for users to get their fsck and mkreiserfs out of sync with the
version of reiserfs that is in their kernel, with many important
distributors wanting them out of sync.:-) Please try to remember to
recompile and reinstall fsck and mkreiserfs with every update of
reiserfs, this is a common source of confusion.  Note that some of the
utilities cannot be compiled without accessing the balancing code
which is in the kernel code, and relocating the utilities may require
you to specify where that code can be found.

Yes, if you update your reiserfs kernel module you do have to
recompile your kernel, most of the time.  The errors you get will be
quite cryptic if your forget to do so.

Real users, as opposed to folks who want to hack and then understand
what went wrong, will want REISERFS_CHECK off.

Hideous Commercial Pitch: Spread your development costs across other OS
vendors.  Select from the best in the world, not the best in your
building, by buying from third party OS component suppliers.  Leverage
the software component development power of the internet.  Be the most
aggressive in taking advantage of the commercial possibilities of
decentralized internet development, and add value through your branded
integration that you sell as an operating system.  Let your competitors
be the ones to compete against the entire internet by themselves.  Be
hip, get with the new economic trend, before your competitors do.  Send
email to hans@reiser.to.

To understand the code, after reading the website, start reading the
code by reading reiserfs_fs.h first.

Hans Reiser was the project initiator, primary architect, source of all
funding for the first 5.5 years, and one of the programmers.  He owns
the copyright.

Vladimir Saveljev was one of the programmers, and he worked long hours
writing the cleanest code.  He always made the effort to be the best he
could be, and to make his code the best that it could be.  What resulted
was quite remarkable. I don't think that money can ever motivate someone
to work the way he did, he is one of the most selfless men I know.

Yura helps with benchmarking, coding hashes, and block pre-allocation
code.

Anatoly Pinchuk is a former member of our team who worked closely with
Vladimir throughout the project's development.  He wrote a quite
substantial portion of the total code.  He realized that there was a
space problem with packing tails of files for files larger than a node
that start on a node aligned boundary (there are reasons to want to node
align files), and he invented and implemented indirect items and
unformatted nodes as the solution.

Konstantin Shvachko, with the help of the Russian version of a VC,
tried to put me in a position where I was forced into giving control
of the project to him.  (Fortunately, as the person paying the money
for all salaries from my dayjob I owned all copyrights, and you can't
really force takeovers of sole proprietorships.)  This was something
curious, because he never really understood the value of our project,
why we should do what we do, or why innovation was possible in
general, but he was sure that he ought to be controlling it.  Every
innovation had to be forced past him while he was with us.  He added
two years to the time required to complete reiserfs, and was a net
loss for me.  Mikhail Gilula was a brilliant innovator who also left
in a destructive way that erased the value of his contributions, and
that he was shown much generosity just makes it more painful.

Grigory Zaigralin was an extremely effective system administrator for
our group.

Igor Krasheninnikov was wonderful at hardware procurement, repair, and
network installation.

Jeremy Fitzhardinge wrote the teahash.c code, and he gives credit to a
textbook he got the algorithm from in the code.  Note that his analysis
of how we could use the hashing code in making 32 bit NFS cookies work
was probably more important than the actual algorithm.  Colin Plumb also
contributed to it.

Chris Mason dived right into our code, and in just a few months produced
the journaling code that dramatically increased the value of ReiserFS.
He is just an amazing programmer.

Igor Zagorovsky is writing much of the new item handler and extent code
for our next major release.

Alexander Zarochentcev (sometimes known as zam, or sasha), wrote the
resizer, and is hard at work on implementing allocate on flush.  SGI
implemented allocate on flush before us for XFS, and generously took
the time to convince me we should do it also.  They are great people,
and a great company.

Yuri Shevchuk and Nikita Danilov are doing squid cache optimization.

Vitaly Fertman is doing fsck.

Jeff Mahoney, of SuSE, contributed a few cleanup fixes, most notably
the endian safe patches which allow ReiserFS to run on any platform
supported by the Linux kernel.

SuSE, IntegratedLinux.com, Ecila, MP3.com, bigstorage.com, and the
Alpha PC Company made it possible for me to not have a day job
anymore, and to dramatically increase our staffing.  Ecila funded
hypertext feature development, MP3.com funded journaling, SuSE funded
core development, IntegratedLinux.com funded squid web cache
appliances, bigstorage.com funded HSM, and the alpha PC company funded
the alpha port.  Many of these tasks were helped by sponsors other
than the ones just named.  SuSE has helped in much more than just
funding....

back to top