Revision 5b77e95dd7790ff6c8fbf1cd8d0104ebed818a03 authored by Alexander Potapenko on 02 April 2019, 11:28:13 UTC, committed by Ingo Molnar on 06 April 2019, 07:52:02 UTC
There's a number of problems with how arch/x86/include/asm/bitops.h
is currently using assembly constraints for the memory region
bitops are modifying:

1) Use memory clobber in bitops that touch arbitrary memory

Certain bit operations that read/write bits take a base pointer and an
arbitrarily large offset to address the bit relative to that base.
Inline assembly constraints aren't expressive enough to tell the
compiler that the assembly directive is going to touch a specific memory
location of unknown size, therefore we have to use the "memory" clobber
to indicate that the assembly is going to access memory locations other
than those listed in the inputs/outputs.

To indicate that BTR/BTS instructions don't necessarily touch the first
sizeof(long) bytes of the argument, we also move the address to assembly
inputs.

This particular change leads to size increase of 124 kernel functions in
a defconfig build. For some of them the diff is in NOP operations, other
end up re-reading values from memory and may potentially slow down the
execution. But without these clobbers the compiler is free to cache
the contents of the bitmaps and use them as if they weren't changed by
the inline assembly.

2) Use byte-sized arguments for operations touching single bytes.

Passing a long value to ANDB/ORB/XORB instructions makes the compiler
treat sizeof(long) bytes as being clobbered, which isn't the case. This
may theoretically lead to worse code in the case of heavy optimization.

Practical impact:

I've built a defconfig kernel and looked through some of the functions
generated by GCC 7.3.0 with and without this clobber, and didn't spot
any miscompilations.

However there is a (trivial) theoretical case where this code leads to
miscompilation:

  https://lkml.org/lkml/2019/3/28/393

using just GCC 8.3.0 with -O2.  It isn't hard to imagine someone writes
such a function in the kernel someday.

So the primary motivation is to fix an existing misuse of the asm
directive, which happens to work in certain configurations now, but
isn't guaranteed to work under different circumstances.

[ --mingo: Added -stable tag because defconfig only builds a fraction
  of the kernel and the trivial testcase looks normal enough to
  be used in existing or in-development code. ]

Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: James Y Knight <jyknight@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20190402112813.193378-1-glider@google.com
[ Edited the changelog, tidied up one of the defines. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1 parent faa3604
Raw File
cpu-load.txt
========
CPU load
========

Linux exports various bits of information via ``/proc/stat`` and
``/proc/uptime`` that userland tools, such as top(1), use to calculate
the average time system spent in a particular state, for example::

    $ iostat
    Linux 2.6.18.3-exp (linmac)     02/20/2007

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              10.01    0.00    2.92    5.44    0.00   81.63

    ...

Here the system thinks that over the default sampling period the
system spent 10.01% of the time doing work in user space, 2.92% in the
kernel, and was overall 81.63% of the time idle.

In most cases the ``/proc/stat``	 information reflects the reality quite
closely, however due to the nature of how/when the kernel collects
this data sometimes it can not be trusted at all.

So how is this information collected?  Whenever timer interrupt is
signalled the kernel looks what kind of task was running at this
moment and increments the counter that corresponds to this tasks
kind/state.  The problem with this is that the system could have
switched between various states multiple times between two timer
interrupts yet the counter is incremented only for the last state.


Example
-------

If we imagine the system with one task that periodically burns cycles
in the following manner::

     time line between two timer interrupts
    |--------------------------------------|
     ^                                    ^
     |_ something begins working          |
                                          |_ something goes to sleep
                                         (only to be awaken quite soon)

In the above situation the system will be 0% loaded according to the
``/proc/stat`` (since the timer interrupt will always happen when the
system is executing the idle handler), but in reality the load is
closer to 99%.

One can imagine many more situations where this behavior of the kernel
will lead to quite erratic information inside ``/proc/stat``::


	/* gcc -o hog smallhog.c */
	#include <time.h>
	#include <limits.h>
	#include <signal.h>
	#include <sys/time.h>
	#define HIST 10

	static volatile sig_atomic_t stop;

	static void sighandler (int signr)
	{
	(void) signr;
	stop = 1;
	}
	static unsigned long hog (unsigned long niters)
	{
	stop = 0;
	while (!stop && --niters);
	return niters;
	}
	int main (void)
	{
	int i;
	struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 },
				.it_value = { .tv_sec = 0, .tv_usec = 1 } };
	sigset_t set;
	unsigned long v[HIST];
	double tmp = 0.0;
	unsigned long n;
	signal (SIGALRM, &sighandler);
	setitimer (ITIMER_REAL, &it, NULL);

	hog (ULONG_MAX);
	for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX);
	for (i = 0; i < HIST; ++i) tmp += v[i];
	tmp /= HIST;
	n = tmp - (tmp / 3.0);

	sigemptyset (&set);
	sigaddset (&set, SIGALRM);

	for (;;) {
		hog (n);
		sigwait (&set, &i);
	}
	return 0;
	}


References
----------

- http://lkml.org/lkml/2007/2/12/6
- Documentation/filesystems/proc.txt (1.8)


Thanks
------

Con Kolivas, Pavel Machek
back to top