Revision 0f640dca08330dfc7820d610578e5935b5e654b2 authored by Mike Snitzer on 31 January 2013, 14:11:14 UTC, committed by Alasdair G Kergon on 31 January 2013, 14:11:14 UTC
thin_io_hints() is blindly copying the queue limits from the thin-pool
which can lead to incorrect limits being set.  The fix here simply
deletes the thin_io_hints() hook which leaves the existing stacking
infrastructure to set the limits correctly.

When a thin-pool uses an MD device for the data device, a thin device
from the thin-pool must respect MD's constraints about disallowing a bio
from spanning multiple chunks.  Otherwise MD fails the bio with an I/O
error.  If the raid0 chunksize is 1152K and thin-pool chunksize is 256K
I see the following md/raid0 error (with extra debug tracing added to
thin_endio) when mkfs.xfs is executed against the thin device:

md/raid0:md99: make_request bug: can't convert block across chunks or bigger than 1152k 6688 127
device-mapper: thin: bio sector=2080 err=-5 bi_size=130560 bi_rw=17 bi_vcnt=32 bi_idx=0

This extra DM debugging shows that the failing bio is spanning across
the first and second logical 1152K chunk (sector 2080 + 255 takes the
bio beyond the first chunk's boundary of sector 2304).  So the bio
splitting that DM is doing clearly isn't respecting the MD limits.
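
The boundary arithmetic above can be checked with a small userspace sketch. This is not kernel code: the helper name bio_spans_chunks() and its parameters are made up for illustration, and it assumes 512-byte sectors. It converts the chunk size to sectors (1152K is 2304 sectors) and compares the chunk index of the bio's first and last sector.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* assumes 512-byte sectors */

/*
 * Hypothetical helper: returns true if a bio starting at start_sector
 * with size_bytes of payload touches more than one chunk of chunk_kb KiB.
 */
static bool bio_spans_chunks(uint64_t start_sector, uint32_t size_bytes,
                             uint32_t chunk_kb)
{
    uint64_t chunk_sectors = ((uint64_t)chunk_kb * 1024) >> SECTOR_SHIFT;
    uint64_t nr_sectors = size_bytes >> SECTOR_SHIFT;
    uint64_t last_sector;

    if (nr_sectors == 0 || chunk_sectors == 0)
        return false;

    /* Plain division works for non-power-of-2 chunk sizes such as 1152K. */
    last_sector = start_sector + nr_sectors - 1;
    return start_sector / chunk_sectors != last_sector / chunk_sectors;
}
```

With the values from the trace, bio_spans_chunks(2080, 130560, 1152) is true: the bio covers sectors 2080 through 2334, and sector 2304 starts the second chunk.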

max_hw_sectors_kb is 127 for both the thin-pool and thin device
(queue_max_hw_sectors returns 255 so we'll excuse sysfs's lack of
precision).  So this explains why bi_size is 130560.
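
The unit conversions behind these numbers can be sketched as follows. The helper names are hypothetical, and 512-byte sectors are assumed; the shift by one mirrors how a sector count is reported in KB, which is where the half-KB of precision is lost.

```c
/* Hypothetical helpers; assumes 512-byte sectors. */

/* Sector count as reported in KB: the odd sector is truncated. */
static unsigned int sectors_to_kb(unsigned int sectors)
{
    return sectors >> 1;
}

/* Sector count as a byte count, comparable to bi_size. */
static unsigned int sectors_to_bytes(unsigned int sectors)
{
    return sectors << 9;
}
```

Here sectors_to_kb(255) gives 127 and sectors_to_bytes(255) gives 130560, matching the sysfs value and the bi_size in the trace. A 4096-byte page is 8 sectors, which the same conversion reports as 4.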

But the thin device's max_hw_sectors_kb should be 4 (PAGE_SIZE) given
that it doesn't have a .merge function (for bio_add_page to consult
indirectly via dm_merge_bvec) yet the thin-pool does sit above an MD
device that has a compulsory merge_bvec_fn.  This scenario is exactly
why DM must resort to sending single PAGE_SIZE bios to the underlying
layer. Some additional context for this is available in the header for
commit 8cbeb67a ("dm: avoid unsupported spanning of md stripe boundaries").

Long story short, the reason a thin device doesn't properly get
configured to have a max_hw_sectors_kb of 4 (PAGE_SIZE) is that
thin_io_hints() is blindly copying the queue limits from the thin-pool
device directly to the thin device's queue limits.

Fix this by eliminating thin_io_hints().  Doing so is safe because the
block layer's queue limits stacking already enables the upper level thin
device to inherit the thin-pool device's discard, minimum_io_size and
optimal_io_size limits that get set in pool_io_hints().  But avoiding the
queue limits copy allows the thin and thin-pool limits to be different
where it is important, namely max_hw_sectors_kb.
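
The difference between copying and stacking can be illustrated with a toy userspace model. Nothing here is the kernel's API: struct toy_limits, toy_copy_from_pool() and toy_stack() are invented for illustration, the page size is assumed to be 4096 bytes, and the model only captures the one rule described in this message (a target without a merge function above a device with a compulsory merge function is limited to one page).

```c
#include <stdbool.h>

#define TOY_PAGE_SECTORS 8u /* assumed 4096-byte page / 512-byte sectors */

/* Toy stand-in for the handful of queue limits discussed here. */
struct toy_limits {
    unsigned int max_hw_sectors;
    unsigned int io_min; /* bytes */
    unsigned int io_opt; /* bytes */
};

/* Models the removed hook: the thin device takes the pool's limits as-is. */
static struct toy_limits toy_copy_from_pool(struct toy_limits pool)
{
    return pool;
}

/*
 * Models stacking: io_min and io_opt are inherited, but max_hw_sectors
 * is clamped to one page when the lower device needs a merge function
 * and the upper target does not provide one.
 */
static struct toy_limits toy_stack(struct toy_limits pool,
                                   bool lower_needs_merge,
                                   bool target_has_merge)
{
    struct toy_limits thin = pool;

    if (lower_needs_merge && !target_has_merge &&
        thin.max_hw_sectors > TOY_PAGE_SECTORS)
        thin.max_hw_sectors = TOY_PAGE_SECTORS;
    return thin;
}
```

In this model a pool with max_hw_sectors of 255 yields a thin device with 255 when copied and 8 (reported as 4 KB) when stacked, while io_min and io_opt are the same either way.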

Reported-by: Daniel Browning <db@kavod.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
1 parent 949db15
File Mode Size
Kconfig -rw-r--r-- 15.0 KB
Kconfig.debug -rw-r--r-- 1015 bytes
Makefile -rw-r--r-- 2.0 KB
backing-dev.c -rw-r--r-- 21.2 KB
balloon_compaction.c -rw-r--r-- 9.6 KB
bootmem.c -rw-r--r-- 21.1 KB
bounce.c -rw-r--r-- 6.6 KB
cleancache.c -rw-r--r-- 6.5 KB
compaction.c -rw-r--r-- 32.4 KB
debug-pagealloc.c -rw-r--r-- 2.1 KB
dmapool.c -rw-r--r-- 13.1 KB
fadvise.c -rw-r--r-- 3.6 KB
failslab.c -rw-r--r-- 1.3 KB
filemap.c -rw-r--r-- 67.3 KB
filemap_xip.c -rw-r--r-- 11.3 KB
fremap.c -rw-r--r-- 6.8 KB
frontswap.c -rw-r--r-- 10.3 KB
highmem.c -rw-r--r-- 9.9 KB
huge_memory.c -rw-r--r-- 73.7 KB
hugetlb.c -rw-r--r-- 82.4 KB
hugetlb_cgroup.c -rw-r--r-- 10.7 KB
hwpoison-inject.c -rw-r--r-- 3.3 KB
init-mm.c -rw-r--r-- 619 bytes
internal.h -rw-r--r-- 11.1 KB
interval_tree.c -rw-r--r-- 3.2 KB
kmemcheck.c -rw-r--r-- 2.8 KB
kmemleak-test.c -rw-r--r-- 3.3 KB
kmemleak.c -rw-r--r-- 52.5 KB
ksm.c -rw-r--r-- 55.2 KB
maccess.c -rw-r--r-- 1.6 KB
madvise.c -rw-r--r-- 11.9 KB
memblock.c -rw-r--r-- 29.1 KB
memcontrol.c -rw-r--r-- 178.6 KB
memory-failure.c -rw-r--r-- 42.3 KB
memory.c -rw-r--r-- 113.9 KB
memory_hotplug.c -rw-r--r-- 35.7 KB
mempolicy.c -rw-r--r-- 70.9 KB
mempool.c -rw-r--r-- 10.5 KB
migrate.c -rw-r--r-- 44.3 KB
mincore.c -rw-r--r-- 7.8 KB
mlock.c -rw-r--r-- 15.5 KB
mm_init.c -rw-r--r-- 3.7 KB
mmap.c -rw-r--r-- 80.6 KB
mmu_context.c -rw-r--r-- 1.4 KB
mmu_notifier.c -rw-r--r-- 9.4 KB
mmzone.c -rw-r--r-- 1.9 KB
mprotect.c -rw-r--r-- 10.2 KB
mremap.c -rw-r--r-- 14.3 KB
msync.c -rw-r--r-- 2.4 KB
nobootmem.c -rw-r--r-- 11.2 KB
nommu.c -rw-r--r-- 51.3 KB
oom_kill.c -rw-r--r-- 19.4 KB
page-writeback.c -rw-r--r-- 69.1 KB
page_alloc.c -rw-r--r-- 169.5 KB
page_cgroup.c -rw-r--r-- 11.9 KB
page_io.c -rw-r--r-- 6.8 KB
page_isolation.c -rw-r--r-- 7.0 KB
pagewalk.c -rw-r--r-- 5.7 KB
percpu-km.c -rw-r--r-- 2.8 KB
percpu-vm.c -rw-r--r-- 12.9 KB
percpu.c -rw-r--r-- 57.1 KB
pgtable-generic.c -rw-r--r-- 4.6 KB
process_vm_access.c -rw-r--r-- 13.3 KB
quicklist.c -rw-r--r-- 2.4 KB
readahead.c -rw-r--r-- 16.1 KB
rmap.c -rw-r--r-- 51.6 KB
shmem.c -rw-r--r-- 76.8 KB
slab.c -rw-r--r-- 117.7 KB
slab.h -rw-r--r-- 6.2 KB
slab_common.c -rw-r--r-- 11.1 KB
slob.c -rw-r--r-- 15.3 KB
slub.c -rw-r--r-- 129.0 KB
sparse-vmemmap.c -rw-r--r-- 5.9 KB
sparse.c -rw-r--r-- 20.7 KB
swap.c -rw-r--r-- 23.1 KB
swap_state.c -rw-r--r-- 10.3 KB
swapfile.c -rw-r--r-- 63.1 KB
truncate.c -rw-r--r-- 18.3 KB
util.c -rw-r--r-- 9.1 KB
vmalloc.c -rw-r--r-- 66.0 KB
vmscan.c -rw-r--r-- 99.7 KB
vmstat.c -rw-r--r-- 33.9 KB