Revision - efad4e4 - mm, memory_hotplug: is_mem_section_removable do not pass the end [...]

Revision efad4e475c312456edb3c789d0996d12ed744c13 authored by Michal Hocko on 01 February 2019, 22:20:34 UTC, committed by Linus Torvalds on 01 February 2019, 23:46:23 UTC

mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone

Patch series "mm, memory_hotplug: fix uninitialized pages fallouts", v2.

Mikhail Zaslonko has posted fixes for the two bugs quite some time ago
[1].  I have pushed back on those fixes because I believed that it is
much better to plug the problem at the initialization time rather than
play whack-a-mole all over the hotplug code and find all the places
which expect the full memory section to be initialized.

We have ended up with commit 2830bf6f05fb ("mm, memory_hotplug:
initialize struct pages for the full memory section") merged and cause a
regression [2][3].  The reason is that there might be memory layouts
when two NUMA nodes share the same memory section so the merged fix is
simply incorrect.

In order to plug this hole we really have to be zone range aware in
those handlers.  I have split up the original patch into two.  One is
unchanged (patch 2) and I took a different approach for `removable'
crash.

[1] http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1666948
[3] http://lkml.kernel.org/r/20190125163938.GA20411@dhcp22.suse.cz

This patch (of 2):

Mikhail has reported the following VM_BUG_ON triggered when reading sysfs
removable state of a memory block:

 page:000003d08300c000 is uninitialized and poisoned
 page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
 Call Trace:
   is_mem_section_removable+0xb4/0x190
   show_mem_removable+0x9a/0xd8
   dev_attr_show+0x34/0x70
   sysfs_kf_seq_show+0xc8/0x148
   seq_read+0x204/0x480
   __vfs_read+0x32/0x178
   vfs_read+0x82/0x138
   ksys_read+0x5a/0xb0
   system_call+0xdc/0x2d8
 Last Breaking-Event-Address:
   is_mem_section_removable+0xb4/0x190
 Kernel panic - not syncing: Fatal exception: panic_on_oops

The reason is that the memory block spans the zone boundary and we are
stumbling over an unitialized struct page.  Fix this by enforcing zone
range in is_mem_section_removable so that we never run away from a zone.

Link: http://lkml.kernel.org/r/20190128144506.15603-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Debugged-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1 parent 9bcdeb5

Files
Changes

Permalinks

speculation.txt

This document explains potential effects of speculation, and how undesirable
effects can be mitigated portably using common APIs.

===========
Speculation
===========

To improve performance and minimize average latencies, many contemporary CPUs
employ speculative execution techniques such as branch prediction, performing
work which may be discarded at a later stage.

Typically speculative execution cannot be observed from architectural state,
such as the contents of registers. However, in some cases it is possible to
observe its impact on microarchitectural state, such as the presence or
absence of data in caches. Such state may form side-channels which can be
observed to extract secret information.

For example, in the presence of branch prediction, it is possible for bounds
checks to be ignored by code which is speculatively executed. Consider the
following code:

	int load_array(int *array, unsigned int index)
	{
		if (index >= MAX_ARRAY_ELEMS)
			return 0;
		else
			return array[index];
	}

Which, on arm64, may be compiled to an assembly sequence such as:

	CMP	<index>, #MAX_ARRAY_ELEMS
	B.LT	less
	MOV	<returnval>, #0
	RET
  less:
	LDR	<returnval>, [<array>, <index>]
	RET

It is possible that a CPU mis-predicts the conditional branch, and
speculatively loads array[index], even if index >= MAX_ARRAY_ELEMS. This
value will subsequently be discarded, but the speculated load may affect
microarchitectural state which can be subsequently measured.

More complex sequences involving multiple dependent memory accesses may
result in sensitive information being leaked. Consider the following
code, building on the prior example:

	int load_dependent_arrays(int *arr1, int *arr2, int index)
	{
		int val1, val2,

		val1 = load_array(arr1, index);
		val2 = load_array(arr2, val1);

		return val2;
	}

Under speculation, the first call to load_array() may return the value
of an out-of-bounds address, while the second call will influence
microarchitectural state dependent on this value. This may provide an
arbitrary read primitive.

====================================
Mitigating speculation side-channels
====================================

The kernel provides a generic API to ensure that bounds checks are
respected even under speculation. Architectures which are affected by
speculation-based side-channels are expected to implement these
primitives.

The array_index_nospec() helper in <linux/nospec.h> can be used to
prevent information from being leaked via side-channels.

A call to array_index_nospec(index, size) returns a sanitized index
value that is bounded to [0, size) even under cpu speculation
conditions.

This can be used to protect the earlier load_array() example:

	int load_array(int *array, unsigned int index)
	{
		if (index >= MAX_ARRAY_ELEMS)
			return 0;
		else {
			index = array_index_nospec(index, MAX_ARRAY_ELEMS);
			return array[index];
		}
	}

Showing with 0 additions and 0 deletions (0 / 0 diffs computed)

Computing file changes ...