Revision 6e2df0581f569038719cf2bc2b3baa3fcc83cab4 authored by Peter Zijlstra on 08 November 2019, 10:11:52 UTC, committed by Peter Zijlstra on 08 November 2019, 21:34:14 UTC
Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
inadvertently introduced a race because it changed a previously
unexplored dependency between dropping the rq->lock and
sched_class::put_prev_task().

The comments about dropping rq->lock, in for example
newidle_balance(), only mention the task being current and ->on_cpu
being set. But when we look at the 'change' pattern (in for example
sched_setnuma()):

	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */

	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);

	/* change task properties */

	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);

It becomes obvious that if we do this after put_prev_task() has
already been called on @p, things go sideways. This is exactly what
the commit in question allows to happen when it does:

	prev->sched_class->put_prev_task(rq, prev, rf);
	if (!rq->nr_running)
		newidle_balance(rq, rf);

The newidle_balance() call will drop rq->lock after we've called
put_prev_task() and that allows the above 'change' pattern to
interleave and mess up the state.

Furthermore, it turns out we lost the RT-pull when we put the last DL
task.

Fix both problems by extracting the balancing from put_prev_task() and
doing a multi-class balance() pass before put_prev_task().

Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
Reported-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Quentin Perret <qperret@google.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
cpuid.rst
.. SPDX-License-Identifier: GPL-2.0

==============
KVM CPUID bits
==============

:Author: Glauber Costa <glommer@gmail.com>

A guest running on a KVM host can check some of its features using
cpuid. This is not always guaranteed to work, since userspace can
mask out some, or even all, KVM-related cpuid features before launching
a guest.

KVM cpuid functions are:

function: KVM_CPUID_SIGNATURE (0x40000000)

returns::

   eax = 0x40000001
   ebx = 0x4b4d564b
   ecx = 0x564b4d56
   edx = 0x4d

This function queries the presence of KVM cpuid leaves.
Note that the values in ebx, ecx and edx correspond to the string "KVMKVMKVM".
The value in eax is the maximum cpuid function present in this leaf,
and will be updated if more functions are added in the future.
Note also that old hosts set the eax value to 0x0. This should
be interpreted as if the value was 0x40000001.
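
As a minimal sketch of how a guest might interpret these values, the helpers
below rebuild the signature string from ebx, ecx and edx and apply the
"eax == 0 means 0x40000001" rule for old hosts. The function names
(``kvm_signature_from_regs``, ``is_kvm_signature``, ``kvm_max_leaf``) are
hypothetical and not part of any kernel API; the register values are assumed
to have been read with the cpuid instruction beforehand.

```c
#include <stdint.h>
#include <string.h>

#define KVM_CPUID_SIGNATURE 0x40000000u
#define KVM_CPUID_FEATURES  0x40000001u

/*
 * Hypothetical helper: unpack ebx, ecx, edx (in that order) into a
 * 12-byte signature followed by a terminating NUL. The low byte of
 * each register holds the first character of its four-byte group.
 */
static void kvm_signature_from_regs(uint32_t ebx, uint32_t ecx,
                                    uint32_t edx, char out[13])
{
    uint32_t regs[3] = { ebx, ecx, edx };

    for (int i = 0; i < 3; i++)
        for (int b = 0; b < 4; b++)
            out[4 * i + b] = (char)((regs[i] >> (8 * b)) & 0xff);
    out[12] = '\0';
}

/* Nonzero if the registers spell "KVMKVMKVM" padded with zero bytes. */
static int is_kvm_signature(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
    char sig[13];

    kvm_signature_from_regs(ebx, ecx, edx, sig);
    return memcmp(sig, "KVMKVMKVM\0\0\0", 12) == 0;
}

/* Old hosts report eax == 0; treat that as KVM_CPUID_FEATURES. */
static uint32_t kvm_max_leaf(uint32_t eax)
{
    return eax == 0 ? KVM_CPUID_FEATURES : eax;
}
```

With the values listed above, ``is_kvm_signature(0x4b4d564b, 0x564b4d56, 0x4d)``
is true. How the registers are obtained is left open here; with GCC or Clang,
the ``__cpuid`` macro from ``<cpuid.h>`` is one possible way to query leaf
``KVM_CPUID_SIGNATURE``.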

function: KVM_CPUID_FEATURES (0x40000001)

returns::

          ebx, ecx
          eax = an OR'ed group of (1 << flag)

where ``flag`` is defined as below:

================================== =========== ================================
flag                               value       meaning
================================== =========== ================================
KVM_FEATURE_CLOCKSOURCE            0           kvmclock available at msrs
                                               0x11 and 0x12

KVM_FEATURE_NOP_IO_DELAY           1           not necessary to perform delays
                                               on PIO operations

KVM_FEATURE_MMU_OP                 2           deprecated

KVM_FEATURE_CLOCKSOURCE2           3           kvmclock available at msrs
                                               0x4b564d00 and 0x4b564d01

KVM_FEATURE_ASYNC_PF               4           async pf can be enabled by
                                               writing to msr 0x4b564d02

KVM_FEATURE_STEAL_TIME             5           steal time can be enabled by
                                               writing to msr 0x4b564d03

KVM_FEATURE_PV_EOI                 6           paravirtualized end of interrupt
                                               handler can be enabled by
                                               writing to msr 0x4b564d04

KVM_FEATURE_PV_UNHALT              7           guest checks this feature bit
                                               before enabling paravirtualized
                                               spinlock support

KVM_FEATURE_PV_TLB_FLUSH           9           guest checks this feature bit
                                               before enabling paravirtualized
                                               tlb flush

KVM_FEATURE_ASYNC_PF_VMEXIT        10          paravirtualized async PF VM EXIT
                                               can be enabled by setting bit 2
                                               when writing to msr 0x4b564d02

KVM_FEATURE_PV_SEND_IPI            11          guest checks this feature bit
                                               before enabling paravirtualized
                                               send IPIs

KVM_FEATURE_PV_POLL_CONTROL        12          host-side polling on HLT can
                                               be disabled by writing
                                               to msr 0x4b564d05.

KVM_FEATURE_PV_SCHED_YIELD         13          guest checks this feature bit
                                               before using paravirtualized
                                               sched yield.

KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24          host will warn if no guest-side
                                               per-cpu warps are expected in
                                               kvmclock
================================== =========== ================================

::

      edx = an OR'ed group of (1 << flag)

Where ``flag`` here is defined as below:

================== ============ =================================
flag               value        meaning
================== ============ =================================
KVM_HINTS_REALTIME 0            guest checks this feature bit to
                                determine that vCPUs are never
                                preempted for an unlimited time
                                allowing optimizations
================== ============ =================================
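
Both eax and edx use the same ``(1 << flag)`` encoding, so a single bit test
covers the feature flags and the hints. The following is a small sketch under
that assumption: the flag numbers are transcribed from the tables above
(only a subset is shown), and ``kvm_flag_set`` is a hypothetical helper name,
not an existing kernel function.

```c
#include <stdint.h>

/*
 * Flag numbers transcribed from the tables above (subset only).
 * They are redefined locally to keep the sketch self-contained.
 */
enum {
    KVM_FEATURE_CLOCKSOURCE  = 0,
    KVM_FEATURE_NOP_IO_DELAY = 1,
    KVM_FEATURE_CLOCKSOURCE2 = 3,
    KVM_FEATURE_ASYNC_PF     = 4,
    KVM_FEATURE_STEAL_TIME   = 5,
    KVM_HINTS_REALTIME       = 0, /* lives in edx, not eax */
};

/*
 * Hypothetical helper: nonzero if bit `flag` is set in `reg`.
 * `reg` is eax for KVM_FEATURE_* flags and edx for KVM_HINTS_* flags.
 * Out-of-range flag numbers are reported as not set.
 */
static int kvm_flag_set(uint32_t reg, unsigned int flag)
{
    if (flag >= 32)
        return 0;
    return (int)((reg >> flag) & 1u);
}
```

For example, a guest that read ``eax`` and ``edx`` from
KVM_CPUID_FEATURES could call ``kvm_flag_set(eax, KVM_FEATURE_STEAL_TIME)``
before touching msr 0x4b564d03, and ``kvm_flag_set(edx, KVM_HINTS_REALTIME)``
before applying realtime-specific optimizations.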