sort by:
Revision Author Date Message Commit Date
a59db67 dm cache: fix problematic dual use of a single migration count variable Introduce a new variable to count the number of allocated migration structures. The existing variable cache->nr_migrations became overloaded. It was used to: i) track of the number of migrations in flight for the purposes of quiescing during suspend. ii) to estimate the amount of background IO occuring. Recent discard changes meant that REQ_DISCARD bios are processed with a migration. Discards are not background IO so nr_migrations was not incremented. However this could cause quiescing to complete early. (i) is now handled with a new variable cache->nr_allocated_migrations. cache->nr_migrations has been renamed cache->nr_io_migrations. cleanup_migration() is now called free_io_migration(), since it decrements that variable. Also, remove the unused cache->next_migration variable that got replaced with with prealloc_structs a while ago. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org 23 January 2015, 16:06:08 UTC
9b1cc9f dm cache: share cache-metadata object across inactive and active DM tables If a DM table is reloaded with an inactive table when the device is not suspended (normal procedure for LVM2), then there will be two dm-bufio objects that can diverge. This can lead to a situation where the inactive table uses bufio to read metadata at the same time the active table writes metadata -- resulting in the inactive table having stale metadata buffers once it is promoted to the active table slot. Fix this by using reference counting and a global list of cache metadata objects to ensure there is only one metadata object per metadata device. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org 23 January 2015, 15:57:15 UTC
5164bec dm: fix missed error code if .end_io isn't implemented by target_type In bio-based DM's clone_endio(), when target_type doesn't implement .end_io (e.g. linear) r will be always be initialized 0. So if a WRITE SAME bio fails WRITE SAME will not be disabled as intended. Fix this by initializing r to error, rather than 0, in clone_endio(). Signed-off-by: Alex Chen <alex.chen@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: 7eee4ae2db ("dm: disable WRITE SAME if it fails") Cc: stable@vger.kernel.org 17 December 2014, 17:31:13 UTC
2b94e89 dm thin: fix crash by initializing thin device's refcount and completion earlier Commit 80e96c5484be ("dm thin: do not allow thin device activation while pool is suspended") delayed the initialization of a new thin device's refcount and completion until after this new thin was added to the pool's active_thins list and the pool lock is released. This opens a race with a worker thread that walks the list and calls thin_get/put, noticing that the refcount goes to 0 and calling complete, freezing up the system and giving the oops below: kernel: BUG: unable to handle kernel NULL pointer dereference at (null) kernel: IP: [<ffffffff810d360b>] __wake_up_common+0x2b/0x90 kernel: Call Trace: kernel: [<ffffffff810d3683>] __wake_up_locked+0x13/0x20 kernel: [<ffffffff810d3dc7>] complete+0x37/0x50 kernel: [<ffffffffa0595c50>] thin_put+0x20/0x30 [dm_thin_pool] kernel: [<ffffffffa059aab7>] do_worker+0x667/0x870 [dm_thin_pool] kernel: [<ffffffff816a8a4c>] ? __schedule+0x3ac/0x9a0 kernel: [<ffffffff810b1aef>] process_one_work+0x14f/0x400 kernel: [<ffffffff810b206b>] worker_thread+0x6b/0x490 kernel: [<ffffffff810b2000>] ? rescuer_thread+0x260/0x260 kernel: [<ffffffff810b6a7b>] kthread+0xdb/0x100 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170 kernel: [<ffffffff816ad7ec>] ret_from_fork+0x7c/0xb0 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170 Set the thin device's initial refcount and initialize the completion before adding it to the pool's active_thins list in thin_ctr(). Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> 17 December 2014, 17:06:25 UTC
2c43fd2 dm thin: fix missing out-of-data-space to write mode transition if blocks are released Discard bios and thin device deletion have the potential to release data blocks. If the thin-pool is in out-of-data-space mode, and blocks were released, transition the thin-pool back to full write mode. The correct time to do this is just after the thin-pool metadata commit. It cannot be done before the commit because the space maps will not allow immediate reuse of the data blocks in case there's a rollback following power failure. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org 17 December 2014, 16:59:36 UTC
45ec9bd dm thin: fix inability to discard blocks when in out-of-data-space mode When the pool was in PM_OUT_OF_SPACE mode its process_prepared_discard function pointer was incorrectly being set to process_prepared_discard_passdown rather than process_prepared_discard. This incorrect function pointer meant the discard was being passed down, but not effecting the mapping. As such any discard that was issued, in an attempt to reclaim blocks, would not successfully free data space. Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org 17 December 2014, 16:59:36 UTC
9b8ec91 Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull core block fix from Jens Axboe: "Jan reported a problem this morning with a crash in blk-mq, and after looking over the recent changes, it's obvious that the blk-mq-tag waitqueue handling change is buggy. We could end up _not_ doing finish_wait() before switching to a new waitqueue, thus corrupting the wait task list" * 'for-linus' of git://git.kernel.dk/linux-block: Revert "blk-mq: Micro-optimize bt_get()" 16 December 2014, 01:25:20 UTC
27cb882 Merge tag 'rpmsg-3.19-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ohad/rpmsg Pull rpmsg update from Ohad Ben-Cohen: "A single patch from Suman Anna which makes rpmsg use less buffers when small vrings are being used" * tag 'rpmsg-3.19-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ohad/rpmsg: rpmsg: use less buffers when vrings are small 16 December 2014, 01:07:58 UTC
988adfd Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux Pull drm updates from Dave Airlie: "Highlights: - AMD KFD driver merge This is the AMD HSA interface for exposing a lowlevel interface for GPGPU use. They have an open source userspace built on top of this interface, and the code looks as good as it was going to get out of tree. - Initial atomic modesetting work The need for an atomic modesetting interface to allow userspace to try and send a complete set of modesetting state to the driver has arisen, and been suffering from neglect this past year. No more, the start of the common code and changes for msm driver to use it are in this tree. Ongoing work to get the userspace ioctl finished and the code clean will probably wait until next kernel. - DisplayID 1.3 and tiled monitor exposed to userspace. Tiled monitor property is now exposed for userspace to make use of. - Rockchip drm driver merged. - imx gpu driver moved out of staging Other stuff: - core: panel - MIPI DSI + new panels. expose suggested x/y properties for virtual GPUs - i915: Initial Skylake (SKL) support gen3/4 reset work start of dri1/ums removal infoframe tracking fixes for lots of things. - nouveau: tegra k1 voltage support GM204 modesetting support GT21x memory reclocking work - radeon: CI dpm fixes GPUVM improvements Initial DPM fan control - rcar-du: HDMI support added removed some support for old boards slave encoder driver for Analog Devices adv7511 - exynos: Exynos4415 SoC support - msm: a4xx gpu support atomic helper conversion - tegra: iommu support universal plane support ganged-mode DSI support - sti: HDMI i2c improvements - vmwgfx: some late fixes. - qxl: use suggested x/y properties" * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits) drm: sti: fix module compilation issue drm/i915: save/restore GMBUS freq across suspend/resume on gen4 drm: sti: correctly cleanup CRTC and planes drm: sti: add HQVDP plane drm: sti: add cursor plane drm: sti: enable auxiliary CRTC drm: sti: fix delay in VTG programming drm: sti: prepare sti_tvout to support auxiliary crtc drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off} drm: sti: fix hdmi avi infoframe drm: sti: remove event lock while disabling vblank drm: sti: simplify gdp code drm: sti: clear all mixer control drm: sti: remove gpio for HDMI hot plug detection drm: sti: allow to change hdmi ddc i2c adapter drm/doc: Document drm_add_modes_noedid() usage drm/i915: Remove '& 0xffff' from the mask given to WA_REG() drm/i915: Invert the mask and val arguments in wa_add() and WA_REG() drm: Zero out DRM object memory upon cleanup drm/i915/bdw: Fix the write setting up the WIZ hashing mode ... 15 December 2014, 23:52:01 UTC
26178ec x86: mm: consolidate VM_FAULT_RETRY handling The VM_FAULT_RETRY handling was confusing and incorrect for the case of returning to kernel mode. We need to handle the exception table fixup if we return to kernel mode due to a fatal signal - it will basically look to the kernel user mode access like the access failed due to the VM going away from udner it. Which is correct - the process is dying - and avoids the whole "repeat endless kernel page faults" case. Handling the VM_FAULT_RETRY early and in just one place also simplifies the mmap_sem handling, since once we've taken care of VM_FAULT_RETRY we know that we can just drop the lock. The remaining accounting and possible error handling is thread-local and does not need the mmap_sem. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 15 December 2014, 23:07:33 UTC
7fb08ec x86: mm: move mmap_sem unlock from mm_fault_error() to caller This replaces four copies in various stages of mm_fault_error() handling with just a single one. It will also allow for more natural placement of the unlocking after some further cleanup. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 15 December 2014, 22:46:06 UTC
35d37c6 Revert "blk-mq: Micro-optimize bt_get()" This reverts commit 52f7eb945f2ba62b324bb9ae16d945326a961dcf. The optimization is only really safe for a single queue, otherwise 'bs' and 'bt' can indeed change, and if we don't do a finish_wait() for each loop, we'll potentially change the wait structure and corrupt task wait list. Reported-by: Jan Kara <jack@suse.cz> 15 December 2014, 15:30:26 UTC
4e0cd68 drm: sti: fix module compilation issue When compiling in module some symbol aren't missing, export them correctly. Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org> Signed-off-by: Dave Airlie <airlied@redhat.com> 15 December 2014, 07:07:57 UTC
67e2c38 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security layer updates from James Morris: "In terms of changes, there's general maintenance to the Smack, SELinux, and integrity code. The IMA code adds a new kconfig option, IMA_APPRAISE_SIGNED_INIT, which allows IMA appraisal to require signatures. Support for reading keys from rootfs before init is call is also added" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (23 commits) selinux: Remove security_ops extern security: smack: fix out-of-bounds access in smk_parse_smack() VFS: refactor vfs_read() ima: require signature based appraisal integrity: provide a hook to load keys when rootfs is ready ima: load x509 certificate from the kernel integrity: provide a function to load x509 certificate from the kernel integrity: define a new function integrity_read_file() Security: smack: replace kzalloc with kmem_cache for inode_smack Smack: Lock mode for the floor and hat labels ima: added support for new kernel cmdline parameter ima_template_fmt ima: allocate field pointers array on demand in template_desc_init_fields() ima: don't allocate a copy of template_fmt in template_desc_init_fields() ima: display template format in meas. list if template name length is zero ima: added error messages to template-related functions ima: use atomic bit operations to protect policy update interface ima: ignore empty and with whitespaces policy lines ima: no need to allocate entry for comment ima: report policy load status ima: use path names cache ... 15 December 2014, 04:36:37 UTC
6ae840e Merge tag 'char-misc-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here's the big char/misc driver update for 3.19-rc1 Lots of little things all over the place in different drivers, and a new subsystem, "coresight" has been added. Full details are in the shortlog" * tag 'char-misc-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (73 commits) parport: parport_pc, do not remove parent devices early spmi: Remove shutdown/suspend/resume kernel-doc carma-fpga-program: drop videobuf dependency carma-fpga: drop videobuf dependency carma-fpga-program.c: fix compile errors i8k: Fix temperature bug handling in i8k_get_temp() cxl: Name interrupts in /proc/interrupt CXL: Return error to PSL if IRQ demultiplexing fails & print clearer warning coresight-replicator: remove .owner field for driver coresight: fixed comments in coresight.h coresight: fix typo in comment in coresight-priv.h coresight: bindings for coresight drivers coresight: Adding ABI documentation w1: support auto-load of w1_bq27000 module. w1: avoid potential u16 overflow cn: verify msg->len before making callback mei: export fw status registers through sysfs mei: read and print all six FW status registers mei: txe: add cherrytrail device id mei: kill cached host and me csr values ... 15 December 2014, 00:43:47 UTC
e6b5be2 Merge tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core update from Greg KH: "Here's the set of driver core patches for 3.19-rc1. They are dominated by the removal of the .owner field in platform drivers. They touch a lot of files, but they are "simple" changes, just removing a line in a structure. Other than that, a few minor driver core and debugfs changes. There are some ath9k patches coming in through this tree that have been acked by the wireless maintainers as they relied on the debugfs changes. Everything has been in linux-next for a while" * tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (324 commits) Revert "ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries" fs: debugfs: add forward declaration for struct device type firmware class: Deletion of an unnecessary check before the function call "vunmap" firmware loader: fix hung task warning dump devcoredump: provide a one-way disable function device: Add dev_<level>_once variants ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries ath: use seq_file api for ath9k debugfs files debugfs: add helper function to create device related seq_file drivers/base: cacheinfo: remove noisy error boot message Revert "core: platform: add warning if driver has no owner" drivers: base: support cpu cache information interface to userspace via sysfs drivers: base: add cpu_device_create to support per-cpu devices topology: replace custom attribute macros with standard DEVICE_ATTR* cpumask: factor out show_cpumap into separate helper function driver core: Fix unbalanced device reference in drivers_probe driver core: fix race with userland in device_add() sysfs/kernfs: make read requests on pre-alloc files use the buffer. sysfs/kernfs: allow attributes to request write buffer be pre-allocated. fs: sysfs: return EGBIG on write if offset is larger than file size ... 15 December 2014, 00:10:09 UTC
37da7bb Merge tag 'tty-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial driver updates from Greg KH: "Here's the big tty/serial driver update for 3.19-rc1. There are a number of TTY core changes/fixes in here from Peter Hurley that have all been teted in linux-next for a long time now. There are also the normal serial driver updates as well, full details in the changelog below" * tag 'tty-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (219 commits) serial: pxa: hold port.lock when reporting modem line changes tty-hvsi_lib: Deletion of an unnecessary check before the function call "tty_kref_put" tty: Deletion of unnecessary checks before two function calls n_tty: Fix read_buf race condition, increment read_head after pushing data serial: of-serial: add PM suspend/resume support Revert "serial: of-serial: add PM suspend/resume support" Revert "serial: of-serial: fix up PM ops on no_console_suspend and port type" serial: 8250: don't attempt a trylock if in sysrq serial: core: Add big-endian iotype serial: samsung: use port->fifosize instead of hardcoded values serial: samsung: prefer to use fifosize from driver data serial: samsung: fix style problems serial: samsung: wait for transfer completion before clock disable serial: icom: fix error return code serial: tegra: clean up tty-flag assignments serial: Fix io address assign flow with Fintek PCI-to-UART Product serial: mxs-auart: fix tx_empty against shift register serial: mxs-auart: fix gpio change detection on interrupt serial: mxs-auart: Fix mxs_auart_set_ldisc() serial: 8250_dw: Use 64-bit access for OCTEON. ... 14 December 2014, 23:23:32 UTC
e7cf773 Merge tag 'usb-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB updates from Greg KH: "Here's the big set of USB and PHY patches for 3.19-rc1. The normal churn in the USB gadget area is in here, as well as xhci and other individual USB driver updates. The PHY tree is also in here, as there were dependancies on the USB tree. All of these have been in linux-next" * tag 'usb-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (351 commits) arm: omap3: twl: remove usb phy init data usbip: fix error handling in stub_probe() usb: gadget: udc: missing curly braces USB: mos7720: delete some unneeded code wusb: replace memset by memzero_explicit usbip: remove unneeded structure usb: xhci: fix comment for PORT_DEV_REMOVE xhci: don't use the same variable for stopped and halted rings current TD xhci: clear extra bits from slot context when setting max exit latency xhci: cleanup finish_td function USB: adutux: NULL dereferences on disconnect usb: chipidea: fix platform_no_drv_owner.cocci warnings usb: chipidea: Fixed a few typos in comments Documentation: bindings: add doc for the USB2 ChipIdea USB driver usb: chipidea: add a usb2 driver for ci13xxx usb: chipidea: fix phy handling usb: chipidea: remove duplicate dev_set_drvdata for host_start usb: chipidea: parameter 'mode' isn't needed for hw_device_reset usb: chipidea: add controller reset API usb: chipidea: remove flag CI_HDRC_REQUIRE_TRANSCEIVER ... 14 December 2014, 22:57:16 UTC
7a02d08 Merge tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next Pull squashfs update from Phillip Lougher: "These patches optionally add LZ4 compression support to Squashfs. LZ4 is a lightweight compression algorithm which can be used on embedded systems to reduce CPU and memory overhead (in comparison to the standard zlib compression). These patches add the wrapper code to allow Squashfs to use the existing LZ4 decompression code, and the necessary configuration option" * tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next: Squashfs: Add LZ4 compression configuration option Squashfs: add LZ4 compression support 14 December 2014, 22:42:53 UTC
980f3c3 Merge tag 'gpio-v3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull take two of the GPIO updates: "Same stuff as last time, now with a fixup patch for the previous compile error plus I ran a few extra rounds of compile-testing. This is the bulk of GPIO changes for the v3.19 series: - A new API that allows setting more than one GPIO at the time. This is implemented for the new descriptor-based API only and makes it possible to e.g. toggle a clock and data line at the same time, if the hardware can do this with a single register write. Both consumers and drivers need new calls, and the core will fall back to driving individual lines where needed. Implemented for the MPC8xxx driver initially - Patched the mdio-mux-gpio and the serial mctrl driver that drives modems to use the new multiple-setting API to set several signals simultaneously - Get rid of the global GPIO descriptor array, and instead allocate descriptors dynamically for each GPIO on a certain GPIO chip. This moves us closer to getting rid of the limitation of using the global, static GPIO numberspace - New driver and device tree bindings for 74xx ICs - New driver and device tree bindings for the VF610 Vybrid - Support the RCAR r8a7793 and r8a7794 - Guidelines for GPIO device tree bindings trying to get things a bit more strict with the advent of combined device properties - Suspend/resume support for the MVEBU driver - A slew of minor fixes and improvements" * tag 'gpio-v3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (33 commits) gpio: mcp23s08: fix up compilation error gpio: pl061: document gpio-ranges property for bindings file gpio: pl061: hook request if gpio-ranges avaiable gpio: mcp23s08: Add option to configure IRQ output polarity as active high gpio: fix deferred probe detection for legacy API serial: mctrl_gpio: use gpiod_set_array function mdio-mux-gpio: Use GPIO descriptor interface and new gpiod_set_array function gpio: remove const modifier from gpiod_get_direction() gpio: remove gpio_descs global array gpio: mxs: implement get_direction callback gpio: em: Use dynamic allocation of GPIOs gpio: Check if base is positive before calling gpio_is_valid() gpio: mcp23s08: Add simple IRQ support for SPI devices gpio: mcp23s08: request a shared interrupt gpio: mcp23s08: Do not free unrequested interrupt gpio: rcar: Add r8a7793 and r8a7794 support gpio-mpc8xxx: add mpc8xxx_gpio_set_multiple function gpiolib: allow simultaneous setting of multiple GPIO outputs gpio: mvebu: add suspend/resume support gpio: gpio-davinci: remove duplicate check on resource .. 14 December 2014, 22:05:05 UTC
7d22286 Merge git://git.kvack.org/~bcrl/aio-next Pull aio updates from Benjamin LaHaise. * git://git.kvack.org/~bcrl/aio-next: aio: Skip timer for io_getevents if timeout=0 aio: Make it possible to remap aio ring 14 December 2014, 21:36:57 UTC
9689519 Merge branch 'i2c/for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c updates from Wolfram Sang: "For 3.19, the I2C subsystem has to offer special candy this time. Right in time for Christmas :) - I2C slave framework: finally, a generic mechanism for Linux being an I2C slave (if the bus driver supports that). Docs are still missing but will come later this cycle, the code is good enough to go. - I2C muxes represent their topology in sysfs much more detailed. This will help users to navigate around much easier. - irq population of i2c clients is now done at probe time, not device creation time, to have better support for deferred probing. - new drivers for Imagination SCB, Amlogic Meson - DMA support added for Freescale IMX, Renesas SHMobile - slightly bigger driver updates to OMAP, i801, AT91, and rk3x (mostly quirk handling, timing updates, and using better kernel interfaces) - eeprom driver can now write with byte-access (very slow, but OK to have) - and the bunch of smaller fixes, cleanups, ID updates..." * 'i2c/for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (56 commits) i2c: sh_mobile: remove unneeded DMA mask i2c: rcar: add slave support i2c: slave-eeprom: add eeprom simulator driver i2c: core changes for slave support MAINTAINERS: add I2C dt bindings also to I2C realm i2c: designware: Fix falling time bindings doc i2c: davinci: switch to use platform_get_irq Documentation: i2c: Use PM ops instead of legacy suspend/resume i2c: sh_mobile: optimize irq entry i2c: pxa: add support for SCCB devices omap: i2c: don't check bus state IP rev3.3 and earlier i2c: s3c2410: Handle i2c sys_cfg register in i2c driver i2c: rk3x: add Kconfig dependency on COMMON_CLK i2c: omap: add notes related to i2c multimaster mode i2c: omap: don't reset controller if Arbitration Lost detected i2c: omap: implement workaround for handling invalid BB-bit values i2c: omap: cleanup register definitions i2c: rk3x: handle dynamic clock rate changes correctly i2c: at91: enable probe deferring on dma channel request i2c: at91: remove legacy DMA support ... 14 December 2014, 20:54:40 UTC
8fd9589 Merge tag 'md/3.19' of git://neil.brown.name/md Pull md updates from Neil Brown: "Three fixes for md. I did have a largish set of locking changes queued, but late testing showed they weren't quite as stable as I thought and while I fixed what I found, I decided it safer to delay them a release ... particularly as I'll be AFK for a few weeks. So expect a larger batch next time :-)" * tag 'md/3.19' of git://neil.brown.name/md: md: Check MD_RECOVERY_RUNNING as well as ->sync_thread. md: fix semicolon.cocci warnings md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants. 14 December 2014, 20:13:05 UTC
536e89e Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes (mainly Andy's TLS fixes), plus a cleanup" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tls: Disallow unusual TLS segments x86/tls: Validate TLS entries to protect espfix MAINTAINERS: Add me as x86 VDSO submaintainer x86/asm: Unify segment selector defines x86/asm: Guard against building the 32/64-bit versions of the asm-offsets*.c file directly x86_64, switch_to(): Load TLS descriptors before switching DS and ES x86/mm: Use min() instead of min_t() in the e820 printout code x86/mm: Fix zone ranges boot printout x86/doc: Update documentation after file shuffling 14 December 2014, 19:51:50 UTC
0e58af4 x86/tls: Disallow unusual TLS segments Users have no business installing custom code segments into the GDT, and segments that are not present but are otherwise valid are a historical source of interesting attacks. For completeness, block attempts to set the L bit. (Prior to this patch, the L bit would have been silently dropped.) This is an ABI break. I've checked glibc, musl, and Wine, and none of them look like they'll have any trouble. Note to stable maintainers: this is a hardening patch that fixes no known bugs. Given the possibility of ABI issues, this probably shouldn't be backported quickly. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: stable@vger.kernel.org # optional Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: security@kernel.org <security@kernel.org> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Ingo Molnar <mingo@kernel.org> 14 December 2014, 07:50:31 UTC
41bdc78 x86/tls: Validate TLS entries to protect espfix Installing a 16-bit RW data segment into the GDT defeats espfix. AFAICT this will not affect glibc, Wine, or dosemu at all. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: stable@vger.kernel.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: security@kernel.org <security@kernel.org> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Ingo Molnar <mingo@kernel.org> 14 December 2014, 07:50:31 UTC
f0905c5 MAINTAINERS: Add me as x86 VDSO submaintainer Here goes... :) Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1042001e502f8e0deb0edfeeac209b68378650cf.1418430292.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org> 14 December 2014, 07:47:22 UTC
5f785de aio: Skip timer for io_getevents if timeout=0 In this case, it is basically a polling. Let's not involve timer at all because that would hurt performance for application event loops. In an arbitrary test I've done, io_getevents syscall elapsed time reduces from 50000+ nanoseconds to a few hundereds. Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> 13 December 2014, 22:50:20 UTC
e4a0d3e aio: Make it possible to remap aio ring There are actually two issues this patch addresses. Let me start with the one I tried to solve in the beginning. So, in the checkpoint-restore project (criu) we try to dump tasks' state and restore one back exactly as it was. One of the tasks' state bits is rings set up with io_setup() call. There's (almost) no problems in dumping them, there's a problem restoring them -- if I dump a task with aio ring originally mapped at address A, I want to restore one back at exactly the same address A. Unfortunately, the io_setup() does not allow for that -- it mmaps the ring at whatever place mm finds appropriate (it calls do_mmap_pgoff() with zero address and without the MAP_FIXED flag). To make restore possible I'm going to mremap() the freshly created ring into the address A (under which it was seen before dump). The problem is that the ring's virtual address is passed back to the user-space as the context ID and this ID is then used as search key by all the other io_foo() calls. Reworking this ID to be just some integer doesn't seem to work, as this value is already used by libaio as a pointer using which this library accesses memory for aio meta-data. So, to make restore work we need to make sure that a) ring is mapped at desired virtual address b) kioctx->user_id matches this value Having said that, the patch makes mremap() on aio region update the kioctx's user_id and mmap_base values. Here appears the 2nd issue I mentioned in the beginning of this mail. If (regardless of the C/R dances I do) someone creates an io context with io_setup(), then mremap()-s the ring and then destroys the context, the kill_ioctx() routine will call munmap() on wrong (old) address. This will result in a) aio ring remaining in memory and b) some other vma get unexpectedly unmapped. What do you think? Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> 13 December 2014, 22:49:50 UTC
9ea18f8 Merge branch 'for-3.19/drivers' of git://git.kernel.dk/linux-block Pull block layer driver updates from Jens Axboe: - NVMe updates: - The blk-mq conversion from Matias (and others) - A stack of NVMe bug fixes from the nvme tree, mostly from Keith. - Various bug fixes from me, fixing issues in both the blk-mq conversion and generic bugs. - Abort and CPU online fix from Sam. - Hot add/remove fix from Indraneel. - A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp) - With the generic IO stat accounting from 3.19/core, converting md, bcache, and rsxx to use those. From Gu Zheng. - Boundary check for queue/irq mode for null_blk from Matias. Fixes cases where invalid values could be given, causing the device to hang. - The xen blkfront pull request, with two bug fixes from Vitaly. * 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits) NVMe: fix race condition in nvme_submit_sync_cmd() NVMe: fix retry/error logic in nvme_queue_rq() NVMe: Fix FS mount issue (hot-remove followed by hot-add) NVMe: fix error return checking from blk_mq_alloc_request() NVMe: fix freeing of wrong request in abort path xen/blkfront: remove redundant flush_op xen/blkfront: improve protection against issuing unsupported REQ_FUA NVMe: Fix command setup on IO retry null_blk: boundary check queue_mode and irqmode block/rsxx: use generic io stats accounting functions to simplify io stat accounting md: use generic io stats accounting functions to simplify io stat accounting drbd: use generic io stats accounting functions to simplify io stat accounting md/bcache: use generic io stats accounting functions to simplify io stat accounting NVMe: Update module version major number NVMe: fail pci initialization if the device doesn't have any BARs NVMe: add ->exit_hctx() hook NVMe: make setup work for devices that don't do INTx NVMe: enable IO stats by default NVMe: nvme_submit_async_admin_req() must use atomic rq allocation NVMe: replace blk_put_request() with blk_mq_free_request() ... 13 December 2014, 22:22:26 UTC
caf292a Merge branch 'for-3.19/core' of git://git.kernel.dk/linux-block Pull block driver core update from Jens Axboe: "This is the pull request for the core block IO changes for 3.19. Not a huge round this time, mostly lots of little good fixes: - Fix a bug in sysfs blktrace interface causing a NULL pointer dereference, when enabled/disabled through that API. From Arianna Avanzini. - Various updates/fixes/improvements for blk-mq: - A set of updates from Bart, mostly fixing buts in the tag handling. - Cleanup/code consolidation from Christoph. - Extend queue_rq API to be able to handle batching issues of IO requests. NVMe will utilize this shortly. From me. - A few tag and request handling updates from me. - Cleanup of the preempt handling for running queues from Paolo. - Prevent running of unmapped hardware queues from Ming Lei. - Move the kdump memory limiting check to be in the correct location, from Shaohua. - Initialize all software queues at init time from Takashi. This prevents a kobject warning when CPUs are brought online that weren't online when a queue was registered. - Single writeback fix for I_DIRTY clearing from Tejun. Queued with the core IO changes, since it's just a single fix. - Version X of the __bio_add_page() segment addition retry from Maurizio. Hope the Xth time is the charm. - Documentation fixup for IO scheduler merging from Jan. - Introduce (and use) generic IO stat accounting helpers for non-rq drivers, from Gu Zheng. - Kill off artificial limiting of max sectors in a request from Christoph" * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits) bio: modify __bio_add_page() to accept pages that don't start a new segment blk-mq: Fix uninitialized kobject at CPU hotplugging blktrace: don't let the sysfs interface remove trace from running list blk-mq: Use all available hardware queues blk-mq: Micro-optimize bt_get() blk-mq: Fix a race between bt_clear_tag() and bt_get() blk-mq: Avoid that __bt_get_word() wraps multiple times blk-mq: Fix a use-after-free blk-mq: prevent unmapped hw queue from being scheduled blk-mq: re-check for available tags after running the hardware queue blk-mq: fix hang in bt_get() blk-mq: move the kdump check to blk_mq_alloc_tag_set blk-mq: cleanup tag free handling blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map blk: introduce generic io stat accounting help function blk-mq: handle the single queue case in blk_mq_hctx_next_cpu genhd: check for int overflow in disk_expand_part_tbl() blk-mq: add blk_mq_free_hctx_request() blk-mq: export blk_mq_free_request() blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable ... 13 December 2014, 22:14:23 UTC
8f4385d Merge tag 'trace-seq-buf-3.19-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixlet from Steven Rostedt: "Remove unnecessary preempt_disable in printk()" * tag 'trace-seq-buf-3.19-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: printk: Do not disable preemption for accessing printk_func 13 December 2014, 22:04:41 UTC
52bb452 Merge tag 'trace-fixes-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Here's two fixes: 1) Discovered by Fengguang Wu's tests. I changed the parameters to the function graph x86 prepare_ftrace_return call but forgot to update the call from entry_32 (i386 version). This patch corrects that. 2) I was tracing some code and found that the sched_switch tracepoint was showing tasks in the INTERRUPTIBLE state as RUNNING. This was due to the updates to convert preempt_count into a per_cpu variable. The tracepoint logic was made to use the tasks saved_preempt_count which could hold a stale "PREEMPT_ACTIVE", instead of using the current preempt_count() call" * tag 'trace-fixes-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/sched: Check preempt_count() for current when reading task->state ftrace/x86: Update i386 call to prepare_ftrace_return() 13 December 2014, 22:01:11 UTC
a99abce Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit Pull audit updates from Paul Moore: "Two small patches from the audit next branch; only one of which has any real significant code changes, the other is simply a MAINTAINERS update for audit. The single code patch is pretty small and rather straightforward, it changes the audit "version" number reported to userspace from an integer to a bitmap which is used to indicate the functionality of the running kernel. This really doesn't have much impact on the kernel, but it will make life easier for the audit userspace folks. Thankfully we were still on a version number which allowed us to do this without breaking userspace" * 'upstream' of git://git.infradead.org/users/pcmoore/audit: audit: convert status version to a feature bitmap audit: add Paul Moore to the MAINTAINERS entry 13 December 2014, 21:41:28 UTC
e3aa91a Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto update from Herbert Xu: - The crypto API is now documented :) - Disallow arbitrary module loading through crypto API. - Allow get request with empty driver name through crypto_user. - Allow speed testing of arbitrary hash functions. - Add caam support for ctr(aes), gcm(aes) and their derivatives. - nx now supports concurrent hashing properly. - Add sahara support for SHA1/256. - Add ARM64 version of CRC32. - Misc fixes. * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (77 commits) crypto: tcrypt - Allow speed testing of arbitrary hash functions crypto: af_alg - add user space interface for AEAD crypto: qat - fix problem with coalescing enable logic crypto: sahara - add support for SHA1/256 crypto: sahara - replace tasklets with kthread crypto: sahara - add support for i.MX53 crypto: sahara - fix spinlock initialization crypto: arm - replace memset by memzero_explicit crypto: powerpc - replace memset by memzero_explicit crypto: sha - replace memset by memzero_explicit crypto: sparc - replace memset by memzero_explicit crypto: algif_skcipher - initialize upon init request crypto: algif_skcipher - removed unneeded code crypto: algif_skcipher - Fixed blocking recvmsg crypto: drbg - use memzero_explicit() for clearing sensitive data crypto: drbg - use MODULE_ALIAS_CRYPTO crypto: include crypto- module prefix in template crypto: user - add MODULE_ALIAS crypto: sha-mb - remove a bogus NULL check crytpo: qat - Fix 64 bytes requests ... 13 December 2014, 21:33:26 UTC
78a45c6 Merge branch 'akpm' (second patch-bomb from Andrew) Merge second patchbomb from Andrew Morton: - the rest of MM - misc fs fixes - add execveat() syscall - new ratelimit feature for fault-injection - decompressor updates - ipc/ updates - fallocate feature creep - fsnotify cleanups - a few other misc things * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (99 commits) cgroups: Documentation: fix trivial typos and wrong paragraph numberings parisc: percpu: update comments referring to __get_cpu_var percpu: update local_ops.txt to reflect this_cpu operations percpu: remove __get_cpu_var and __raw_get_cpu_var macros fsnotify: remove destroy_list from fsnotify_mark fsnotify: unify inode and mount marks handling fallocate: create FAN_MODIFY and IN_MODIFY events mm/cma: make kmemleak ignore CMA regions slub: fix cpuset check in get_any_partial slab: fix cpuset check in fallback_alloc shmdt: use i_size_read() instead of ->i_size ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments ipc/msg: increase MSGMNI, remove scaling ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM ipc/sem.c: change memory barrier in sem_lock() to smp_rmb() lib/decompress.c: consistency of compress formats for kernel image decompress_bunzip2: off by one in get_next_block() usr/Kconfig: make initrd compression algorithm selection not expert fault-inject: add ratelimit option ratelimit: add initialization macro ... 13 December 2014, 21:00:36 UTC
29d293b cgroups: Documentation: fix trivial typos and wrong paragraph numberings Signed-off-by: SeongJae Park <sj38.park@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:53 UTC
6ddb798 parisc: percpu: update comments referring to __get_cpu_var __get_cpu_var was removed. Update comments to refer to this_cpu_ptr() instead. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:53 UTC
7d94a82 percpu: update local_ops.txt to reflect this_cpu operations Update the documentation to reflect changes due to the availability of this_cpu operations. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:53 UTC
6c51ec4 percpu: remove __get_cpu_var and __raw_get_cpu_var macros No user is left in the kernel source tree. Therefore we can drop the definitions. This is the final merge of the transition away from __get_cpu_var. After this patch the kernel will not build if anyone uses __get_cpu_var. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:53 UTC
37d469e fsnotify: remove destroy_list from fsnotify_mark destroy_list is used to track marks which still need waiting for srcu period end before they can be freed. However by the time mark is added to destroy_list it isn't in group's list of marks anymore and thus we can reuse fsnotify_mark->g_list for queueing into destroy_list. This saves two pointers for each fsnotify_mark. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Eric Paris <eparis@redhat.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:53 UTC
0809ab6 fsnotify: unify inode and mount marks handling There's a lot of common code in inode and mount marks handling. Factor it out to a common helper function. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Eric Paris <eparis@redhat.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:53 UTC
820c12d fallocate: create FAN_MODIFY and IN_MODIFY events The fanotify and the inotify API can be used to monitor changes of the file system. System call fallocate() modifies files. Hence it should trigger the corresponding fanotify (FAN_MODIFY) and inotify (IN_MODIFY) events. The most interesting case is FALLOC_FL_COLLAPSE_RANGE because this value allows to create arbitrary file content from random data. This patch adds the missing call to fsnotify_modify(). The FAN_MODIFY and IN_MODIFY event will be created when fallocate() succeeds. It will even be created if the file length remains unchanged, e.g. when calling fanotify with flag FALLOC_FL_KEEP_SIZE. This logic was primarily chosen to keep the coding simple. It resembles the logic of the write() system call. When we call write() we always create a FAN_MODIFY event, even in the case of overwriting with identical data. Events FAN_MODIFY and IN_MODIFY do not provide any guarantee that data was actually changed. Furthermore even if if the filesize remains unchanged, fallocate() may influence whether a subsequent write() will succeed and hence the fallocate() call may be considered a modification. The fallocate(2) man page teaches: After a successful call, subsequent writes into the range specified by offset and len are guaranteed not to fail because of lack of disk space. So calling fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) may result in different outcomes of a subsequent write depending on the values of offset and len. Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@parisplace.org> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:53 UTC
620951e mm/cma: make kmemleak ignore CMA regions kmemleak will add allocations as objects to a pool. The memory allocated for each object in this pool is periodically searched for pointers to other allocated objects. This only works for memory that is mapped into the kernel's virtual address space, which happens not to be the case for most CMA regions. Furthermore, CMA regions are typically used to store data transferred to or from a device and therefore don't contain pointers to other objects. Without this, the kernel crashes on the first execution of the scan_gray_list() because it tries to access highmem. Perhaps a more appropriate fix would be to reject any object that can't map to a kernel virtual address? [akpm@linux-foundation.org: add comment] [akpm@linux-foundation.org: fix comment, per Catalin] [sfr@canb.auug.org.au: include linux/io.h for phys_to_virt()] Signed-off-by: Thierry Reding <treding@nvidia.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:53 UTC
dee2f8a slub: fix cpuset check in get_any_partial If we fail to allocate from the current node's stock, we look for free objects on other nodes before calling the page allocator (see get_any_partial). While checking other nodes we respect cpuset constraints by calling cpuset_zone_allowed. We enforce hardwall check. As a result, we will fallback to the page allocator even if there are some pages cached on other nodes, but the current cpuset doesn't have them set. However, the page allocator uses softwall check for kernel allocations, so it may allocate from one of the other nodes in this case. Therefore we should use softwall cpuset check in get_any_partial to conform with the cpuset check in the page allocator. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Zefan Li <lizefan@huawei.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:53 UTC
061d707 slab: fix cpuset check in fallback_alloc fallback_alloc is called on kmalloc if the preferred node doesn't have free or partial slabs and there's no pages on the node's free list (GFP_THISNODE allocations fail). Before invoking the reclaimer it tries to locate a free or partial slab on other allowed nodes' lists. While iterating over the preferred node's zonelist it skips those zones which hardwall cpuset check returns false for. That means that for a task bound to a specific node using cpusets fallback_alloc will always ignore free slabs on other nodes and go directly to the reclaimer, which, however, may allocate from other nodes if cpuset.mem_hardwall is unset (default). As a result, we may get lists of free slabs grow without bounds on other nodes, which is bad, because inactive slabs are only evicted by cache_reap at a very slow rate and cannot be dropped forcefully. To reproduce the issue, run a process that will walk over a directory tree with lots of files inside a cpuset bound to a node that constantly experiences memory pressure. Look at num_slabs vs active_slabs growth as reported by /proc/slabinfo. To avoid this we should use softwall cpuset check in fallback_alloc. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Zefan Li <lizefan@huawei.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:53 UTC
07a46ed shmdt: use i_size_read() instead of ->i_size Andrew Morton noted http://lkml.kernel.org/r/20141104142027.a7a0d010772d84560b445f59@linux-foundation.org that the shmdt uses inode->i_size outside of i_mutex being held. There is one more case in shm.c in shm_destroy(). This converts both users over to use i_size_read(). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:52 UTC
d3c9790 ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments This is a highly-contrived scenario. But, a single shmdt() call can be induced in to unmapping memory from mulitple shm segments. Example code is here: http://www.sr71.net/~dave/intel/shmfun.c The fix is pretty simple: Record the 'struct file' for the first VMA we encounter and then stick to it. Decline to unmap anything not from the same file and thus the same segment. I found this by inspection and the odds of anyone hitting this in practice are pretty darn small. Lightly tested, but it's a pretty small patch. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:52 UTC
0050ee0 ipc/msg: increase MSGMNI, remove scaling SysV can be abused to allocate locked kernel memory. For most systems, a small limit doesn't make sense, see the discussion with regards to SHMMAX. Therefore: increase MSGMNI to the maximum supported. And: If we ignore the risk of locking too much memory, then an automatic scaling of MSGMNI doesn't make sense. Therefore the logic can be removed. The code preserves auto_msgmni to avoid breaking any user space applications that expect that the value exists. Notes: 1) If an administrator must limit the memory allocations, then he can set MSGMNI as necessary. Or he can disable sysv entirely (as e.g. done by Android). 2) MSGMAX and MSGMNB are intentionally not increased, as these values are used to control latency vs. throughput: If MSGMNB is large, then msgsnd() just returns and more messages can be queued before a task switch to a task that calls msgrcv() is forced. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:52 UTC
e843e7d ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM a) SysV can be abused to allocate locked kernel memory. For most systems, a small limit doesn't make sense, see the discussion with regards to SHMMAX. Therefore: Increase the sysv sem limits so that all known applications will work with these defaults. b) With regards to the maximum supported: Some of the specified hard limits are not correct anymore, therefore the patch updates the documentation. - SEMMNI must stay below IPCMNI, which is 32768. As for SHMMAX: Stay a bit below this limit. - SEMMSL was limited to 8k, to ensure that the kmalloc for the kernel array was limited to 16 kB (order=2) This doesn't apply anymore: - the allocation size isn't sizeof(short)*nsems anymore. - ipc_alloc falls back to vmalloc - SEMOPM should stay below 1000, to limit the kmalloc in semtimedop() to an order=1 allocation. Therefore: Leave it at 500 (order=0 allocation). Note: If an administrator must limit the memory allocations, then he can set the values as necessary. Or he can disable sysv entirely (as e.g. done by Android). Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:52 UTC
2e094ab ipc/sem.c: change memory barrier in sem_lock() to smp_rmb() When I fixed bugs in the sem_lock() logic, I was more conservative than necessary. Therefore it is safe to replace the smp_mb() with smp_rmb(). And: With smp_rmb(), semop() syscalls are up to 10% faster. The race we must protect against is: sem->lock is free sma->complex_count = 0 sma->sem_perm.lock held by thread B thread A: A: spin_lock(&sem->lock) B: sma->complex_count++; (now 1) B: spin_unlock(&sma->sem_perm.lock); A: spin_is_locked(&sma->sem_perm.lock); A: XXXXX memory barrier A: if (sma->complex_count == 0) Thread A must read the increased complex_count value, i.e. the read must not be reordered with the read of sem_perm.lock done by spin_is_locked(). Since it's about ordering of reads, smp_rmb() is sufficient. [akpm@linux-foundation.org: update sem_lock() comment, from Davidlohr] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:52 UTC
a060bfe lib/decompress.c: consistency of compress formats for kernel image Magic number of compress formats for kernel image is defined by two bytes. These numbers are written in hexadecimal number, nevertheless magic number for only gunzip is written in octal number. The formats should be consistent for readability. Therefore, magic numbers for gunzip are also defined by hexadecimal number. Signed-off-by: Haesung Kim <matia.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:52 UTC
b5c8afe decompress_bunzip2: off by one in get_next_block() "origPtr" is used as an offset into the bd->dbuf[] array. That array is allocated in start_bunzip() and has "bd->dbufSize" number of elements so the test here should be >= instead of >. Later we check "origPtr" again before using it as an offset so I don't know if this bug can be triggered in real life. Fixes: bc22c17e12c1 ('bzip2/lzma: library support for gzip, bzip2 and lzma decompression') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Alain Knaff <alain@knaff.lu> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:52 UTC
ec72c66 usr/Kconfig: make initrd compression algorithm selection not expert The kernel has support for (nearly) every compression algorithm known to man, each to handle some particular microscopic niche. Unfortunately all of these always get compiled in if you want to support INITRDs, and can be only disabled when CONFIG_EXPERT is set. I don't see why I need to set EXPERT just to properly configure the initrd compression algorithms, and not always include every possible algorithm Usually the initrd is just compressed with gzip anyways, at least that's true on all distributions I use. Remove the dependencies for initrd compression on CONFIG_EXPERT. Make the various options just default y, which should be good enough to not break any previous configuration. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:52 UTC
6adc4a2 fault-inject: add ratelimit option Current debug levels are not optimal. Especially if one want to provoke big numbers of faults(broken device simulator) then any verbose level will produce giant numbers of identical logging messages. Let's add ratelimit parameter for that purpose. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Acked-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:52 UTC
89e3f90 ratelimit: add initialization macro Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:52 UTC
92cab82 fs/affs/file.c: remove obsolete pagesize check linux kernel doesn't manage page sizes below 4kb. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:52 UTC
9abb408 fs/affs/file.c: add support to O_DIRECT Based on ext2_direct_IO Tested with O_DIRECT file open and sysbench/mariadb with 1% written queries improvement (update_non_index test) on a volume created with mkaffs. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:51 UTC
1ee54b0 fs/affs/amigaffs.c: use va_format instead of buffer/vnsprintf -Remove ErrorBuffer and use %pV -Add __printf to enable argument mistmatch warnings Original patch by Joe Perches. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:51 UTC
7633978 fs/affs/file.c: forward declaration clean-up -Move file_operations to avoid forward declarations. -Remove unused declarations. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:51 UTC
957e3fa gcov: enable GCOV_PROFILE_ALL from ARCH Kconfigs Following the suggestions from Andrew Morton and Stephen Rothwell, Dont expand the ARCH list in kernel/gcov/Kconfig. Instead, define a ARCH_HAS_GCOV_PROFILE_ALL bool which architectures can enable. set ARCH_HAS_GCOV_PROFILE_ALL on Architectures where it was previously allowed + ARM64 which I tested. Signed-off-by: Riku Voipio <riku.voipio@linaro.org> Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:51 UTC
d539395 kexec: remove unnecessary KERN_ERR from kexec.c Remove unnecessary KERN_ERR from pr_err() within kexec.c. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:51 UTC
38351a3 sparc: hook up execveat system call Signed-off-by: David Drysdale <drysdale@google.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:51 UTC
c9b26b8 syscalls: add selftest for execveat(2) Signed-off-by: David Drysdale <drysdale@google.com> Cc: Meredydd Luff <meredydd@senatehouse.org> Cc: Shuah Khan <shuah.kh@samsung.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rich Felker <dalias@aerifal.cx> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:51 UTC
27d6ec7 x86: hook up execveat system call Hook up x86-64, i386 and x32 ABIs. Signed-off-by: David Drysdale <drysdale@google.com> Cc: Meredydd Luff <meredydd@senatehouse.org> Cc: Shuah Khan <shuah.kh@samsung.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rich Felker <dalias@aerifal.cx> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:51 UTC
51f39a1 syscalls: implement execveat() system call This patchset adds execveat(2) for x86, and is derived from Meredydd Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528). The primary aim of adding an execveat syscall is to allow an implementation of fexecve(3) that does not rely on the /proc filesystem, at least for executables (rather than scripts). The current glibc version of fexecve(3) is implemented via /proc, which causes problems in sandboxed or otherwise restricted environments. Given the desire for a /proc-free fexecve() implementation, HPA suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be an appropriate generalization. Also, having a new syscall means that it can take a flags argument without back-compatibility concerns. The current implementation just defines the AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be added in future -- for example, flags for new namespaces (as suggested at https://lkml.org/lkml/2006/7/11/474). Related history: - https://lkml.org/lkml/2006/12/27/123 is an example of someone realizing that fexecve() is likely to fail in a chroot environment. - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered documenting the /proc requirement of fexecve(3) in its manpage, to "prevent other people from wasting their time". - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a problem where a process that did setuid() could not fexecve() because it no longer had access to /proc/self/fd; this has since been fixed. This patch (of 4): Add a new execveat(2) system call. execveat() is to execve() as openat() is to open(): it takes a file descriptor that refers to a directory, and resolves the filename relative to that. In addition, if the filename is empty and AT_EMPTY_PATH is specified, execveat() executes the file to which the file descriptor refers. This replicates the functionality of fexecve(), which is a system call in other UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and so relies on /proc being mounted). The filename fed to the executed program as argv[0] (or the name of the script fed to a script interpreter) will be of the form "/dev/fd/<fd>" (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively reflecting how the executable was found. This does however mean that execution of a script in a /proc-less environment won't work; also, script execution via an O_CLOEXEC file descriptor fails (as the file will not be accessible after exec). Based on patches by Meredydd Luff. Signed-off-by: David Drysdale <drysdale@google.com> Cc: Meredydd Luff <meredydd@senatehouse.org> Cc: Shuah Khan <shuah.kh@samsung.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rich Felker <dalias@aerifal.cx> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:51 UTC
c0ef0cc fat: fix data past EOF resulting from fsx testsuite When running FSX with direct I/O mode, fsx resulted in DATA past EOF issues. fsx ./file2 -Z -r 4096 -w 4096 ... .. truncating to largest ever: 0x907c fallocating to largest ever: 0x11137 truncating to largest ever: 0x2c6fe truncating to largest ever: 0x2cfdf fallocating to largest ever: 0x40000 Mapped Read: non-zero data past EOF (0x18628) page offset 0x629 is 0x2a4e ... .. The reason being, it is doing a truncate down, but the zeroing does not happen on the last block boundary when offset is not aligned. Even though it calls truncate_setsize()->truncate_inode_pages()-> truncate_inode_pages_range() and considers the partial zeroout but it retrieves the page using find_lock_page() - which only looks the page in the cache. So, zeroing out does not happen in case of direct IO. Make a truncate page based around block_truncate_page for FAT filesystem and invoke that helper to zerout in case the offset is not aligned with the blocksize. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:51 UTC
f441ada befs: remove dead code Coverity id: 1042674 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:51 UTC
1dd61aa mm/zbud: init user ops only when it is needed When zbud is initialized through the zpool wrapper, pool->ops which points to user-defined operations is always set regardless of whether it is specified from the upper layer. This causes zbud_reclaim_page() to iterate its loop for evicting pool pages out without any gain. This patch sets the user-defined ops only when it is needed, so that zbud_reclaim_page() can bail out the reclamation loop earlier if there is no user-defined operations specified. Signed-off-by: Heesub Shin <heesub.shin@samsung.com> Acked-by: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Sunae Seo <sunae.seo@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:51 UTC
442cc43 mm/zswap: delete unnecessary check before calling free_percpu() free_percpu() tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Cc: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:50 UTC
dd01d7d mm/zswap: add __init to some functions in zswap zswap_cpu_init/zswap_comp_exit/zswap_entry_cache_create is only called by __init init_zswap() Signed-off-by: Mahendran Ganesh <opensource.ganesh@gmail.com> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:50 UTC
083914e zram: use DEVICE_ATTR_[RW|RO|WO] to define zram sys device attribute In current zram, we use DEVICE_ATTR() to define sys device attributes. SO, we need to set (S_IRUGO | S_IWUSR) permission and other arguments manually. Linux already provids the macro DEVICE_ATTR_[RW|RO|WO] to define sys device attribute. It is simple and readable. This patch uses kernel defined macro DEVICE_ATTR_[RW|RO|WO] to define zram device attribute. Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:50 UTC
1813665 mm/zsmalloc: allocate exactly size of struct zs_pool In zs_create_pool(), we allocate memory more then sizeof(struct zs_pool) ovhd_size = roundup(sizeof(*pool), PAGE_SIZE); This patch allocate memory of exactly needed size. Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:50 UTC
df8b5bb mm/zsmalloc: avoid duplicate assignment of prev_class In zs_create_pool(), prev_class is assigned (ZS_SIZE_CLASSES - 1) times. And the prev_class only references to the previous size_class. So we do not need unnecessary assignement. This patch assigns *prev_class* when a new size_class structure is allocated and uses prev_class to check whether the first class has been allocated. [akpm@linux-foundation.org: remove now-unused ZS_SIZE_CLASSES] Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Reviewed-by: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:50 UTC
d49b1c2 mm/zram: correct ZRAM_ZERO flag bit position In struct zram_table_entry, the element *value* contains obj size and obj zram flags. Bit 0 to bit (ZRAM_FLAG_SHIFT - 1) represent obj size, and bit ZRAM_FLAG_SHIFT to the highest bit of unsigned long represent obj zram_flags. So the first zram flag(ZRAM_ZERO) should be from ZRAM_FLAG_SHIFT instead of (ZRAM_FLAG_SHIFT + 1). This patch fixes this cosmetic issue. Also fix a typo, "page in now accessed" -> "page is now accessed" Signed-off-by: Mahendran Ganesh <opensource.ganesh@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Weijie Yang <weijie.yang@samsung.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:50 UTC
40f9fb8 mm/zsmalloc: support allocating obj with size of ZS_MAX_ALLOC_SIZE I sent a patch [1] for unnecessary check in zsmalloc. And Minchan Kim found zsmalloc even does not support allocating an obj with the size of ZS_MAX_ALLOC_SIZE in some situations. For example: In system with 64KB PAGE_SIZE and 32 bit of physical addr. Then: ZS_MIN_ALLOC_SIZE is 32 bytes which is calculated by: MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS)) ZS_MAX_ALLOC_SIZE is 64KB(in current code, is PAGE_SIZE) ZS_SIZE_CLASS_DELTA is 256 bytes So, ZS_SIZE_CLASSES = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / ZS_SIZE_CLASS_DELTA + 1 = 256 In zs_create_pool(), the max size obj which can be allocated will be: ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA = 32 + 255*256 = 65312 We can see that 65312 < 65536 (ZS_MAX_ALLOC_SIZE). So we can NOT allocate objs with size ZS_MAX_ALLOC_SIZE(65536) which we promise upper users we can do. [1] http://lkml.iu.edu/hypermail/linux/kernel/1411.2/03835.html [2] http://lkml.iu.edu/hypermail/linux/kernel/1411.2/04534.html This patch fixes this issue by dynamiclly calculating zs_size_classes when module is loaded, allocates buffer with size ZS_MAX_ALLOC_SIZE. Then the max obj(size is ZS_MAX_ALLOC_SIZE) can be stored in it. [akpm@linux-foundation.org: restore ZS_SIZE_CLASSES to fix bisectability] Signed-off-by: Mahendran Ganesh <opensource.ganesh@gmail.com> Suggested-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:50 UTC
af4ee5e zsmalloc: correct fragile [kmap|kunmap]_atomic use The kunmap_atomic should use virtual address getting by kmap_atomic. However, some pieces of code in zsmalloc uses modified address, not the one got by kmap_atomic for kunmap_atomic. It's okay for working because zsmalloc modifies the address inner PAGE_SIZE bounday so it works with current kmap_atomic's implementation. But it's still fragile with potential changing of kmap_atomic so let's correct it. I got a subtle bug when I implemented a new feature of zsmalloc (compaction) due to a link's mishandling (the link was over page boundary). Although it was totally my mistake, it took a while to find the cause because an unpredictable kmapped address was unmapped causing an almost random crash. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:50 UTC
b1b00a5 zsmalloc: fix zs_init cpu notifier error handling Mahendran Ganesh reported that zpool-enabled zsmalloc should not call zpool_unregister_driver() from zs_init() if cpu notifier registration has failed, because error handling is performed before we register the driver via zpool_register_driver() call. Factor out cpu notifier registration and unregistration code and fix zs_init() error handling. link: http://lkml.iu.edu//hypermail/linux/kernel/1411.1/04156.html [akpm@linux-foundation.org: squash bogus gcc warning] [akpm@linux-foundation.org: use __init and __exit] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Mahendran Ganesh <opensource.ganesh@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:50 UTC
8c7f010 zram: implement rw_page operation of zram This patch implements rw_page operation for zram block device. I implemented the feature in zram and tested it. Test bed was the G2, LG electronic mobile device, whtich has msm8974 processor and 2GB memory. With a memory allocation test program consuming memory, the system generates swap. Operating time of swap_write_page() was measured. -------------------------------------------------- | | operating time | improvement | | | (20 runs average) | | -------------------------------------------------- |with patch | 1061.15 us | +2.4% | -------------------------------------------------- |without patch| 1087.35 us | | -------------------------------------------------- Each test(with paged_io,with BIO) result set shows normal distribution and has equal variance. I mean the two values are valid result to compare. I can say operation with paged I/O(without BIO) is faster 2.4% with confidence level 95%. [minchan@kernel.org: make rw_page opeartion return 0] [minchan@kernel.org: rely on the bi_end_io for zram_rw_page fails] [sergey.senozhatsky@gmail.com: code cleanup] [minchan@kernel.org: add comment] Signed-off-by: karam.lee <karam.lee@lge.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: <seungho1.park@lge.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:50 UTC
54850e7 zram: change parameter from vaild_io_request() This patch changes parameter of valid_io_request for common usage. The purpose of valid_io_request() is to determine if bio request is valid or not. This patch use I/O start address and size instead of a BIO parameter for common usage. Signed-off-by: karam.lee <karam.lee@lge.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: <seungho1.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:50 UTC
b627cff zram: remove bio parameter from zram_bvec_rw() Recently rw_page block device operation has been added. This patchset implements rw_page operation for zram block device and does some clean-up. This patch (of 3): Remove an unnecessary parameter(bio) from zram_bvec_rw() and zram_bvec_read(). zram_bvec_read() doesn't use a bio parameter, so remove it. zram_bvec_rw() calls a read/write operation not using bio, so a rw parameter replaces a bio parameter. Signed-off-by: karam.lee <karam.lee@lge.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: <seungho1.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:49 UTC
9eec4cd zsmalloc: merge size_class to reduce fragmentation zsmalloc has many size_classes to reduce fragmentation and they are in 16 bytes unit, for example, 16, 32, 48, etc., if PAGE_SIZE is 4096. And, zsmalloc has constraint that each zspage has 4 pages at maximum. In this situation, we can see interesting aspect. Let's think about size_class for 1488, 1472, ..., 1376. To prevent external fragmentation, they uses 4 pages per zspage and so all they can contain 11 objects at maximum. 16384 (4096 * 4) = 1488 * 11 + remains 16384 (4096 * 4) = 1472 * 11 + remains 16384 (4096 * 4) = ... 16384 (4096 * 4) = 1376 * 11 + remains It means that they have same characteristics and classification between them isn't needed. If we use one size_class for them, we can reduce fragementation and save some memory since both the 1488 and 1472 sized classes can only fit 11 objects into 4 pages, and an object that's 1472 bytes can fit into an object that's 1488 bytes, merging these classes to always use objects that are 1488 bytes will reduce the total number of size classes. And reducing the total number of size classes reduces overall fragmentation, because a wider range of compressed pages can fit into a single size class, leaving less unused objects in each size class. For this purpose, this patch implement size_class merging. If there is size_class that have same pages_per_zspage and same number of objects per zspage with previous size_class, we don't create new size_class. Instead, we use previous, same characteristic size_class. With this way, above example sizes (1488, 1472, ..., 1376) use just one size_class so we can get much more memory utilization. Below is result of my simple test. TEST ENV: EXT4 on zram, mount with discard option WORKLOAD: untar kernel source code, remove directory in descending order in size. (drivers arch fs sound include net Documentation firmware kernel tools) Each line represents orig_data_size, compr_data_size, mem_used_total, fragmentation overhead (mem_used - compr_data_size) and overhead ratio (overhead to compr_data_size), respectively, after untar and remove operation is executed. * untar-nomerge.out orig_size compr_size used_size overhead overhead_ratio 525.88MB 199.16MB 210.23MB 11.08MB 5.56% 288.32MB 97.43MB 105.63MB 8.20MB 8.41% 177.32MB 61.12MB 69.40MB 8.28MB 13.55% 146.47MB 47.32MB 56.10MB 8.78MB 18.55% 124.16MB 38.85MB 48.41MB 9.55MB 24.58% 103.93MB 31.68MB 40.93MB 9.25MB 29.21% 84.34MB 22.86MB 32.72MB 9.86MB 43.13% 66.87MB 14.83MB 23.83MB 9.00MB 60.70% 60.67MB 11.11MB 18.60MB 7.49MB 67.48% 55.86MB 8.83MB 16.61MB 7.77MB 88.03% 53.32MB 8.01MB 15.32MB 7.31MB 91.24% * untar-merge.out orig_size compr_size used_size overhead overhead_ratio 526.23MB 199.18MB 209.81MB 10.64MB 5.34% 288.68MB 97.45MB 104.08MB 6.63MB 6.80% 177.68MB 61.14MB 66.93MB 5.79MB 9.47% 146.83MB 47.34MB 52.79MB 5.45MB 11.51% 124.52MB 38.87MB 44.30MB 5.43MB 13.96% 104.29MB 31.70MB 36.83MB 5.13MB 16.19% 84.70MB 22.88MB 27.92MB 5.04MB 22.04% 67.11MB 14.83MB 19.26MB 4.43MB 29.86% 60.82MB 11.10MB 14.90MB 3.79MB 34.17% 55.90MB 8.82MB 12.61MB 3.79MB 42.97% 53.32MB 8.01MB 11.73MB 3.73MB 46.53% As you can see above result, merged one has better utilization (overhead ratio, 5th column) and uses less memory (mem_used_total, 3rd column). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Dan Streetman <ddstreet@ieee.org> Cc: Luigi Semenzato <semenzato@google.com> Cc: <juno.choi@lge.com> Cc: "seungho1.park" <seungho1.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:49 UTC
70bc068 mm/memcontrol.c: remove unused mem_cgroup_lru_names_not_uptodate() Remove unused mem_cgroup_lru_names_not_uptodate() and move BUILD_BUG_ON() to the beginning of memcg_stat_show(). This was partially found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:49 UTC
8135be5 memcg: fix possible use-after-free in memcg_kmem_get_cache() Suppose task @t that belongs to a memory cgroup @memcg is going to allocate an object from a kmem cache @c. The copy of @c corresponding to @memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory cgroup destruction we can access the memory cgroup's copy of the cache after it was destroyed: CPU0 CPU1 ---- ---- [ current=@t @mc->memcg_params->nr_pages=0 ] kmem_cache_alloc(@c): call memcg_kmem_get_cache(@c); proceed to allocation from @mc: alloc a page for @mc: ... move @t from @memcg destroy @memcg: mem_cgroup_css_offline(@memcg): memcg_unregister_all_caches(@memcg): kmem_cache_destroy(@mc) add page to @mc We could fix this issue by taking a reference to a per-memcg cache, but that would require adding a per-cpu reference counter to per-memcg caches, which would look cumbersome. Instead, let's take a reference to a memory cgroup, which already has a per-cpu reference counter, in the beginning of kmem_cache_alloc to be dropped in the end, and move per memcg caches destruction from css offline to css free. As a side effect, per-memcg caches will be destroyed not one by one, but all at once when the last page accounted to the memory cgroup is freed. This doesn't sound as a high price for code readability though. Note, this patch does add some overhead to the kmem_cache_alloc hot path, but it is pretty negligible - it's just a function call plus a per cpu counter decrement, which is comparable to what we already have in memcg_kmem_get_cache. Besides, it's only relevant if there are memory cgroups with kmem accounting enabled. I don't think we can find a way to handle this race w/o it, because alloc_page called from kmem_cache_alloc may sleep so we can't flush all pending kmallocs w/o reference counting. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:49 UTC
ae6e71d mm/memcontrol.c: fix defined but not used compiler warning test_mem_cgroup_node_reclaimable() is used only when MAX_NUMNODES > 1, so move it into the compiler if statement [akpm@linux-foundation.org: clean up layout] Signed-off-by: Michele Curti <michele.curti@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:49 UTC
441c228 mm: fadvise: document the fadvise(FADV_DONTNEED) behaviour for partial pages A random seek IO benchmark appeared to regress because of a change to readahead but the real problem was the benchmark. To ensure the IO request accesssed disk, it used fadvise(FADV_DONTNEED) on a block boundary (512K) but the hint is ignored by the kernel. This is correct but not necessarily obvious behaviour. As much as I dislike comment patches, the explanation for this behaviour predates current git history. Clarify why it behaves like this in case someone "fixes" fadvise or readahead for the wrong reasons. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:49 UTC
7e5b528 mm/vmalloc.c: fix memory ordering bug Read memory barriers must follow the read operations. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:49 UTC
6a2d567 oom: kill the insufficient and no longer needed PT_TRACE_EXIT check After the previous patch we can remove the PT_TRACE_EXIT check in oom_scan_process_thread(), it was added to handle the case when the coredumping was "frozen" by ptrace, but it doesn't really work. If nothing else, we would need to check all threads which could share the same ->mm to make it more or less correct. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:49 UTC
d003f37 oom: don't assume that a coredumping thread will exit soon oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() "forever" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:49 UTC
ba914f4 mm: remove the highmem zones' memmap in the highmem zone Since 01cefaef40c4 ("mm: provide more accurate estimation of pages occupied by memmap") allocate the pages from lowmem for the highmem zones' memmap. So It is not need to reserver the memmap's for the highmem. A 2G DDR3 for the arm platform: On node 0 totalpages: 524288 free_area_init_node: node 0, pgdat 80ccd380, node_mem_map 80d38000 DMA zone: 3568 pages used for memmap DMA zone: 0 pages reserved DMA zone: 456704 pages, LIFO batch:31 HighMem zone: 528 pages used for memmap HighMem zone: 67584 pages, LIFO batch:15 On node 0 totalpages: 524288 free_area_init_node: node 0, pgdat 80cd6f40, node_mem_map 80d42000 DMA zone: 3568 pages used for memmap DMA zone: 0 pages reserved DMA zone: 456704 pages, LIFO batch:31 HighMem zone: 67584 pages, LIFO batch:15 Signed-off-by: Hongbo Zhong <hongbo.zhong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:49 UTC
2ebba6b mm: unmapped page migration avoid unmap+remap overhead Page migration's __unmap_and_move(), and rmap's try_to_unmap(), were created for use on pages almost certainly mapped into userspace. But nowadays compaction often applies them to unmapped page cache pages: which may exacerbate contention on i_mmap_rwsem quite unnecessarily, since try_to_unmap_file() makes no preliminary page_mapped() check. Now check page_mapped() in __unmap_and_move(); and avoid repeating the same overhead in rmap_walk_file() - don't remove_migration_ptes() when we never inserted any. (The PageAnon(page) comment blocks now look even sillier than before, but clean that up on some other occasion. And note in passing that try_to_unmap_one() does not use a migration entry when PageSwapCache, so remove_migration_ptes() will then not update that swap entry to newpage pte: not a big deal, but something else to clean up later.) Davidlohr remarked in "mm,fs: introduce helpers around the i_mmap_mutex" conversion to i_mmap_rwsem, that "The biggest winner of these changes is migration": a part of the reason might be all of that unnecessary taking of i_mmap_mutex in page migration; and it's rather a shame that I didn't get around to sending this patch in before his - this one is much less useful after Davidlohr's conversion to rwsem, but still good. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:49 UTC
5cec38a fs, seq_file: fallback to vmalloc instead of oom kill processes Since commit 058504edd026 ("fs/seq_file: fallback to vmalloc allocation"), seq_buf_alloc() falls back to vmalloc() when the kmalloc() for contiguous memory fails. This was done to address order-4 slab allocations for reading /proc/stat on large machines and noticed because PAGE_ALLOC_COSTLY_ORDER < 4, so there is no infinite loop in the page allocator when allocating new slab for such high-order allocations. Contiguous memory isn't necessary for caller of seq_buf_alloc(), however. Other GFP_KERNEL high-order allocations that are <= PAGE_ALLOC_COSTLY_ORDER will simply loop forever in the page allocator and oom kill processes as a result. We don't want to kill processes so that we can allocate contiguous memory in situations when contiguous memory isn't necessary. This patch does the kmalloc() allocation with __GFP_NORETRY for high-order allocations. This still utilizes memory compaction and direct reclaim in the allocation path, the only difference is that it will fail immediately instead of oom kill processes when out of memory. [akpm@linux-foundation.org: add comment] Signed-off-by: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:49 UTC
6b4f779 mm: vmscan: invoke slab shrinkers from shrink_zone() The slab shrinkers are currently invoked from the zonelist walkers in kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the eligible LRU pages and assemble a nodemask to pass to NUMA-aware shrinkers, which then again have to walk over the nodemask. This is redundant code, extra runtime work, and fairly inaccurate when it comes to the estimation of actually scannable LRU pages. The code duplication will only get worse when making the shrinkers cgroup-aware and requiring them to have out-of-band cgroup hierarchy walks as well. Instead, invoke the shrinkers from shrink_zone(), which is where all reclaimers end up, to avoid this duplication. Take the count for eligible LRU pages out of get_scan_count(), which considers many more factors than just the availability of swap space, like zone_reclaimable_pages() currently does. Accumulate the number over all visited lruvecs to get the per-zone value. Some nodes have multiple zones due to memory addressing restrictions. To avoid putting too much pressure on the shrinkers, only invoke them once for each such node, using the class zone of the allocation as the pivot zone. For now, this integrates the slab shrinking better into the reclaim logic and gets rid of duplicative invocations from kswapd, direct reclaim, and zone reclaim. It also prepares for cgroup-awareness, allowing memcg-capable shrinkers to be added at the lruvec level without much duplication of both code and runtime work. This changes kswapd behavior, which used to invoke the shrinkers for each zone, but with scan ratios gathered from the entire node, resulting in meaningless pressure quantities on multi-zone nodes. Zone reclaim behavior also changes. It used to shrink slabs until the same amount of pages were shrunk as were reclaimed from the LRUs. Now it merely invokes the shrinkers once with the zone's scan ratio, which makes the shrinkers go easier on caches that implement aging and would prefer feeding back pressure from recently used slab objects to unused LRU pages. [vdavydov@parallels.com: assure class zone is populated] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:48 UTC
f5f302e mm,vmacache: count number of system-wide flushes These flushes deal with sequence number overflows, such as for long lived threads. These are rare, but interesting from a debugging PoV. As such, display the number of flushes when vmacache debugging is enabled. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:48 UTC
16a7ade Documentation: add new page_owner document page owner is for the tracking about who allocated each page. This document explains what is the page owner feature and what is the merit of it. And, simple HOW-TO is also explained. See the document for detailed information. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:48 UTC
61cf5fe mm/page_owner: correct owner information for early allocated pages Extended memory to store page owner information is initialized some time later than that page allocator starts. Until initialization, many pages can be allocated and they have no owner information. This make debugging using page owner harder, so some fixup will be helpful. This patch fixes up this situation by setting fake owner information immediately after page extension is initialized. Information doesn't tell the right owner, but, at least, it can tell whether page is allocated or not, more correctly. On my testing, this patch catches 13343 early allocated pages, although they are mostly allocated from page extension feature. Anyway, after then, there is no page left that it is allocated and has no page owner flag. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:48 UTC
48c96a3 mm/page_owner: keep track of page owners This is the page owner tracking code which is introduced so far ago. It is resident on Andrew's tree, though, nobody tried to upstream so it remain as is. Our company uses this feature actively to debug memory leak or to find a memory hogger so I decide to upstream this feature. This functionality help us to know who allocates the page. When allocating a page, we store some information about allocation in extra memory. Later, if we need to know status of all pages, we can get and analyze it from this stored information. In previous version of this feature, extra memory is statically defined in struct page, but, in this version, extra memory is allocated outside of struct page. It enables us to turn on/off this feature at boottime without considerable memory waste. Although we already have tracepoint for tracing page allocation/free, using it to analyze page owner is rather complex. We need to enlarge the trace buffer for preventing overlapping until userspace program launched. And, launched program continually dump out the trace buffer for later analysis and it would change system behaviour with more possibility rather than just keeping it in memory, so bad for debug. Moreover, we can use page_owner feature further for various purposes. For example, we can use it for fragmentation statistics implemented in this patch. And, I also plan to implement some CMA failure debugging feature using this interface. I'd like to give the credit for all developers contributed this feature, but, it's not easy because I don't know exact history. Sorry about that. Below is people who has "Signed-off-by" in the patches in Andrew's tree. Contributor: Alexander Nyberg <alexn@dsv.su.se> Mel Gorman <mgorman@suse.de> Dave Hansen <dave@linux.vnet.ibm.com> Minchan Kim <minchan@kernel.org> Michal Nazarewicz <mina86@mina86.com> Andrew Morton <akpm@linux-foundation.org> Jungsoo Son <jungsoo.son@lge.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:48 UTC
9a92a6c stacktrace: introduce snprint_stack_trace for buffer output Current stacktrace only have the function for console output. page_owner that will be introduced in following patch needs to print the output of stacktrace into the buffer for our own output format so so new function, snprint_stack_trace(), is needed. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:48 UTC
dbc8358 mm/nommu: use alloc_pages_exact() rather than its own implementation do_mmap_private() in nommu.c try to allocate physically contiguous pages with arbitrary size in some cases and we now have good abstract function to do exactly same thing, alloc_pages_exact(). So, change to use it. There is no functional change. This is the preparation step for support page owner feature accurately. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:48 UTC
031bc57 mm/debug-pagealloc: make debug-pagealloc boottime configurable Now, we have prepared to avoid using debug-pagealloc in boottime. So introduce new kernel-parameter to disable debug-pagealloc in boottime, and makes related functions to be disabled in this case. Only non-intuitive part is change of guard page functions. Because guard page is effective only if debug-pagealloc is enabled, turning off according to debug-pagealloc is reasonable thing to do. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 13 December 2014, 20:42:48 UTC
back to top