Revision history - refs/pull/308/head - origin: https://github.com/torvalds/linux

visit type:

Revision	Author	Date	Message	Commit Date
adf7b85	Jonathan Beaudoin	27 July 2016, 21:47:03 UTC	Grammar mistake	27 July 2016, 21:47:03 UTC
08fd8c1	Linus Torvalds	27 July 2016, 18:35:37 UTC	Merge tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from David Vrabel: "Features and fixes for 4.8-rc0: - ACPI support for guests on ARM platforms. - Generic steal time support for arm and x86. - Support cases where kernel cpu is not Xen VCPU number (e.g., if in-guest kexec is used). - Use the system workqueue instead of a custom workqueue in various places" * tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (47 commits) xen: add static initialization of steal_clock op to xen_time_ops xen/pvhvm: run xen_vcpu_setup() for the boot CPU xen/evtchn: use xen_vcpu_id mapping xen/events: fifo: use xen_vcpu_id mapping xen/events: use xen_vcpu_id mapping in events_base x86/xen: use xen_vcpu_id mapping when pointing vcpu_info to shared_info x86/xen: use xen_vcpu_id mapping for HYPERVISOR_vcpu_op xen: introduce xen_vcpu_id mapping x86/acpi: store ACPI ids from MADT for future usage x86/xen: update cpuid.h from Xen-4.7 xen/evtchn: add IOCTL_EVTCHN_RESTRICT xen-blkback: really don't leak mode property xen-blkback: constify instance of "struct attribute_group" xen-blkfront: prefer xenbus_scanf() over xenbus_gather() xen-blkback: prefer xenbus_scanf() over xenbus_gather() xen: support runqueue steal time on xen arm/xen: add support for vm_assist hypercall xen: update xen headers xen-pciback: drop superfluous variables xen-pciback: short-circuit read path used for merging write values ...	27 July 2016, 18:35:37 UTC
e831101	Linus Torvalds	27 July 2016, 18:16:05 UTC	Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: - Kexec support for arm64 - Kprobes support - Expose MIDR_EL1 and REVIDR_EL1 CPU identification registers to sysfs - Trapping of user space cache maintenance operations and emulation in the kernel (CPU errata workaround) - Clean-up of the early page tables creation (kernel linear mapping, EFI run-time maps) to avoid splitting larger blocks (e.g. pmds) into smaller ones (e.g. ptes) - VDSO support for CLOCK_MONOTONIC_RAW in clock_gettime() - ARCH_HAS_KCOV enabled for arm64 - Optimise IP checksum helpers - SWIOTLB optimisation to only allocate/initialise the buffer if the available RAM is beyond the 32-bit mask - Properly handle the "nosmp" command line argument - Fix for the initialisation of the CPU debug state during early boot - vdso-offsets.h build dependency workaround - Build fix when RANDOMIZE_BASE is enabled with MODULES off * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (64 commits) arm64: arm: Fix-up the removal of the arm64 regs_query_register_name() prototype arm64: Only select ARM64_MODULE_PLTS if MODULES=y arm64: mm: run pgtable_page_ctor() on non-swapper translation table pages arm64: mm: make create_mapping_late() non-allocating arm64: Honor nosmp kernel command line option arm64: Fix incorrect per-cpu usage for boot CPU arm64: kprobes: Add KASAN instrumentation around stack accesses arm64: kprobes: Cleanup jprobe_return arm64: kprobes: Fix overflow when saving stack arm64: kprobes: WARN if attempting to step with PSTATE.D=1 arm64: debug: remove unused local_dbg_{enable, disable} macros arm64: debug: remove redundant spsr manipulation arm64: debug: unmask PSTATE.D earlier arm64: localise Image objcopy flags arm64: ptrace: remove extra define for CPSR's E bit kprobes: Add arm64 case in kprobe example module arm64: Add kernel return probes support (kretprobes) arm64: Add trampoline code for kretprobes arm64: kprobes instruction simulation support arm64: Treat all entry code as non-kprobe-able ...	27 July 2016, 18:16:05 UTC
f9abf53	Linus Torvalds	27 July 2016, 17:54:11 UTC	Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile Pull tile architecture updates from Chris Metcalf: "A few stray changes" * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile: tile: Define AT_VECTOR_SIZE_ARCH for ARCH_DLINFO tile: support gcc 7 optimization to use __multi3 tile 32-bit big-endian: fix bugs in syscall argument order tile: allow disabling CONFIG_EARLY_PRINTK	27 July 2016, 17:54:11 UTC
ba4f678	Linus Torvalds	27 July 2016, 17:47:24 UTC	Merge tag 'dlm-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This set includes two trivial changes, one to use kmemdup and another to control the log level of recovery messages" * tag 'dlm-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: Use kmemdup instead of kmalloc and memcpy dlm: add log_info config option	27 July 2016, 17:47:24 UTC
4fc29c1	Linus Torvalds	27 July 2016, 17:36:31 UTC	Merge tag 'for-f2fs-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "The major change in this version is mitigating cpu overheads on write paths by replacing redundant inode page updates with mark_inode_dirty calls. And we tried to reduce lock contentions as well to improve filesystem scalability. Other feature is setting F2FS automatically when detecting host-managed SMR. Enhancements: - ioctl to move a range of data between files - inject orphan inode errors - avoid flush commands congestion - support lazytime Bug fixes: - return proper results for some dentry operations - fix deadlock in add_link failure - disable extent_cache for fcollapse/finsert" * tag 'for-f2fs-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (68 commits) f2fs: clean up coding style and redundancy f2fs: get victim segment again after new cp f2fs: handle error case with f2fs_bug_on f2fs: avoid data race when deciding checkpoin in f2fs_sync_file f2fs: support an ioctl to move a range of data blocks f2fs: fix to report error number of f2fs_find_entry f2fs: avoid memory allocation failure due to a long length f2fs: reset default idle interval value f2fs: use blk_plug in all the possible paths f2fs: fix to avoid data update racing between GC and DIO f2fs: add maximum prefree segments f2fs: disable extent_cache for fcollapse/finsert inodes f2fs: refactor __exchange_data_block for speed up f2fs: fix ERR_PTR returned by bio f2fs: avoid mark_inode_dirty f2fs: move i_size_write in f2fs_write_end f2fs: fix to avoid redundant discard during fstrim f2fs: avoid mismatching block range for discard f2fs: fix incorrect f_bfree calculation in ->statfs f2fs: use percpu_rw_semaphore ...	27 July 2016, 17:36:31 UTC
0e6acf0	Linus Torvalds	27 July 2016, 16:53:35 UTC	Merge tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs updates from Dave Chinner: "The major addition is the new iomap based block mapping infrastructure. We've been kicking this about locally for years, but there are other filesystems want to use it too (e.g. gfs2). Now it is fully working, reviewed and ready for merge and be used by other filesystems. There are a lot of other fixes and cleanups in the tree, but those are XFS internal things and none are of the scale or visibility of the iomap changes. See below for details. I am likely to send another pull request next week - we're just about ready to merge some new functionality (on disk block->owner reverse mapping infrastructure), but that's a huge chunk of code (74 files changed, 7283 insertions(+), 1114 deletions(-)) so I'm keeping that separate to all the "normal" pull request changes so they don't get lost in the noise. Summary of changes in this update: - generic iomap based IO path infrastructure - generic iomap based fiemap implementation - xfs iomap based Io path implementation - buffer error handling fixes - tracking of in flight buffer IO for unmount serialisation - direct IO and DAX io path separation and simplification - shortform directory format definition changes for wider platform compatibility - various buffer cache fixes - cleanups in preparation for rmap merge - error injection cleanups and fixes - log item format buffer memory allocation restructuring to prevent rare OOM reclaim deadlocks - sparse inode chunks are now fully supported" * tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (53 commits) xfs: remove EXPERIMENTAL tag from sparse inode feature xfs: bufferhead chains are invalid after end_page_writeback xfs: allocate log vector buffers outside CIL context lock libxfs: directory node splitting does not have an extra block xfs: remove dax code from object file when disabled xfs: skip dirty pages in ->releasepage() xfs: remove __arch_pack xfs: kill xfs_dir2_inou_t xfs: kill xfs_dir2_sf_off_t xfs: split direct I/O and DAX path xfs: direct calls in the direct I/O path xfs: stop using generic_file_read_iter for direct I/O xfs: split xfs_file_read_iter into buffered and direct I/O helpers xfs: remove s_maxbytes enforcement in xfs_file_read_iter xfs: kill ioflags xfs: don't pass ioflags around in the ioctl path xfs: track and serialize in-flight async buffers against unmount xfs: exclude never-released buffers from buftarg I/O accounting xfs: don't reset b_retries to 0 on every failure xfs: remove extraneous buffer flag changes ...	27 July 2016, 16:53:35 UTC
fd6380b	Catalin Marinas	27 July 2016, 07:11:10 UTC	arm64: arm: Fix-up the removal of the arm64 regs_query_register_name() prototype Commit 0a8ea52c3eb1 ("arm64: Add HAVE_REGS_AND_STACK_ACCESS_API feature") inadvertently removed the arch/arm prototype instead of the arm64 one introduced by the original patch. There should not be any bisection issues since this function is not called from anywhere else (it could as well be removed from arch/arm at some point). Fixes: 0a8ea52c3eb1 ("arm64: Add HAVE_REGS_AND_STACK_ACCESS_API feature") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	27 July 2016, 07:15:42 UTC
0e06f5c	Linus Torvalds	27 July 2016, 02:55:54 UTC	Merge branch 'akpm' (patches from Andrew) Merge updates from Andrew Morton: - a few misc bits - ocfs2 - most(?) of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits) thp: fix comments of __pmd_trans_huge_lock() cgroup: remove unnecessary 0 check from css_from_id() cgroup: fix idr leak for the first cgroup root mm: memcontrol: fix documentation for compound parameter mm: memcontrol: remove BUG_ON in uncharge_list mm: fix build warnings in <linux/compaction.h> mm, thp: convert from optimistic swapin collapsing to conservative mm, thp: fix comment inconsistency for swapin readahead functions thp: update Documentation/{vm/transhuge,filesystems/proc}.txt shmem: split huge pages beyond i_size under memory pressure thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE khugepaged: add support of collapse for tmpfs/shmem pages shmem: make shmem_inode_info::lock irq-safe khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page() thp: extract khugepaged from mm/huge_memory.c shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings shmem: add huge pages support shmem: get_unmapped_area align huge page shmem: prepare huge= mount option and sysfs knob mm, rmap: account shmem thp pages ...	27 July 2016, 02:55:54 UTC
f7816ad	Linus Torvalds	27 July 2016, 02:49:09 UTC	Merge tag 'for-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply Pull power supply and reset updates from Sebastian Reichel: - introduce reboot mode driver - add DT support to max8903 - add power supply support for axp221 - misc fixes * tag 'for-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: power: reset: add reboot mode driver dt-bindings: power: reset: add document for reboot-mode driver power_supply: fix return value of get_property power: qcom_smbb: Make an extcon for usb cable detection max8903: adds support for initiation via device tree max8903: adds documentation for device tree bindings. max8903: remove unnecessary 'out of memory' error message. max8903: removes non zero validity checks on gpios. max8903: adds requesting of gpios. max8903: cleans up confusing relationship between dc_valid, dok and dcm. max8903: store pointer to pdata instead of copying it. power_supply: bq27xxx_battery: Group register mappings into one table docs: Move brcm,bcm21664-resetmgr.txt power/reset: make syscon_poweroff() static power: axp20x_usb: Add support for usb power-supply on axp22x pmics power_supply: bq27xxx_battery: Index register numbers by enum power_supply: bq27xxx_battery: Fix copy/paste error in header comment MAINTAINERS: Add file patterns for power supply device tree bindings power: reset: keystone: Enable COMPILE_TEST	27 July 2016, 02:49:09 UTC
6097d55	Linus Torvalds	27 July 2016, 02:40:43 UTC	Merge tag 'regulator-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator updates from Mark Brown: "A quiet regulator API release, a few new drivers and some fixes but nothing too notable. There will also be some updates for the PWM regulator coming through the PWM tree which provide much smoother operation when taking over an already running PWM regulator after boot using some new PWM APIs. Summary: - Support for configuration of the initial suspend state from DT. - New drivers for Mediatek MT6323, Ricoh RN5T567 and X-Powers AXP809" * tag 'regulator-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (38 commits) regulator: da9053/52: Fix incorrectly stated minimum and maximum voltage limits regulator: mt6323: Constify struct regulator_ops regulator: mt6323: Fix module description regulator: mt6323: Add support for MT6323 regulator regulator: Add document for MT6323 regulator regulator: da9210: addition of device tree support regulator: act8865: Fix missing of_node_put() in act8865_pdata_from_dt() regulator: qcom_smd: Avoid overlapping linear voltage ranges regulator: s2mps11: Fix the voltage linear range for s2mps15 regulator: pwm: Fix regulator ramp delay for continuous mode regulator: da9211: add descriptions for da9212/da9214 mfd: rn5t618: Register restart handler mfd: rn5t618: Register power off callback optionally regulator: rn5t618: Add RN5T567 PMIC support mfd: rn5t618: Add Ricoh RN5T567 PMIC support ARM: dts: meson: minix-neo-x8: define PMIC as power controller regulator: tps65218: force set power-up/down strobe to 3 for dcdc3 regulator: tps65218: Enable suspend configuration regulator: tps65217: Enable suspend configuration regulator: qcom_spmi: Add support for get_mode/set_mode on switches ...	27 July 2016, 02:40:43 UTC
ae97999	Linus Torvalds	27 July 2016, 02:24:33 UTC	Merge tag 'regmap-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap Pull regmap updates from Mark Brown: "Several small updates and API enhancements: - provide transparent unrolling of bulk writes into individual writes so they can be used with devices without raw formatting. - fix compatibility between I2C controllers supporting block commands and devices with more than 8 bit wide registers. - add some helpers for iopoll-like functionality and workarounds for weird interrupt controllers" * tag 'regmap-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: regmap: add iopoll-like polling macro regmap: Support bulk writes for devices without raw formatting regmap-i2c: Use i2c block command only if register value width is 8 bit regmap: irq: Add support to call client specific pre/post interrupt service regmap: Add file patterns for regmap device tree bindings	27 July 2016, 02:24:33 UTC
1cd04d2	Linus Torvalds	27 July 2016, 02:16:01 UTC	Merge tag 'gpio-v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull GPIO updates from Linus Walleij: "This is the bulk of GPIO changes for the v4.8 kernel cycle. The big news is the completion of the chardev ABI which I'm very happy about and apart from that it's an ordinary, quite busy cycle. The details are below. The patches are tested in linux-next for some time, patches to other subsystem mostly have ACKs. I got overly ambitious with configureing lines as input for IRQ lines but it turns out that some controllers have their interrupt-enable and input-enabling in orthogonal settings so the assumption that all IRQ lines are input lines does not hold. Oh well, revert and back to the drawing board with that. Core changes: - The big item is of course the completion of the character device ABI. It has now replaced and surpassed the former unmaintainable sysfs ABI: we can now hammer (bitbang) individual lines or sets of lines and read individual lines or sets of lines from userspace, and we can also register to listen to GPIO events from userspace. As a tie-in we have two new tools in tools/gpio: gpio-hammer and gpio-event-mon that illustrate the proper use of the new ABI. As someone said: the wild west days of GPIO are now over. - Continued to remove the pointless ARCH_[WANT_OPTIONAL\|REQUIRE]_GPIOLIB Kconfig symbols. I'm patching hexagon, openrisc, powerpc, sh, unicore, ia64 and microblaze. These are either ACKed by their maintainers or patched anyways after a grace period and no response from maintainers. Some archs (ARM) come in from their trees, and others (x86) are still not fixed, so I might send a second pull request to root it out later in this merge window, or just defer to v4.9. - The GPIO tools are moved to the tools build system. New drivers: - New driver for the MAX77620/MAX20024. - New driver for the Intel Merrifield. - Enabled PCA953x for the TI PCA9536. - Enabled PCA953x for the Intel Edison. - Enabled R8A7792 in the RCAR driver. Driver improvements: - The STMPE and F7188x now supports the .get_direction() callback. - The Xilinx driver supports setting multiple lines at once. - ACPI support for the Vulcan GPIO controller. - The MMIO GPIO driver supports device tree probing. - The Acer One 10 is supported through the _DEP ACPI attribute. Cleanups: - A major cleanup of the OF/DT support code. It is way easier to read and understand now, probably this improves performance too. - Drop a few redundant .owner assignments. - Remove CLPS711x boardfile support: we are 100% DT" * tag 'gpio-v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (67 commits) MAINTAINERS: Add INTEL MERRIFIELD GPIO entry gpio: dwapb: add missing fwnode_handle_put() in dwapb_gpio_get_pdata() gpio: merrifield: Protect irq_ack() and gpio_set() by lock gpio: merrifield: Introduce GPIO driver to support Merrifield gpio: intel-mid: Make it depend to X86_INTEL_MID gpio: intel-mid: Sort header block alphabetically gpio: intel-mid: Remove potentially harmful code gpio: rcar: add R8A7792 support gpiolib: remove duplicated include from gpiolib.c Revert "gpio: convince line to become input in irq helper" gpiolib: of_find_gpio(): Don't discard errors gpio: of: Allow overriding the device node gpio: free handles in fringe cases gpio: tps65218: Add platform_device_id table gpio: max77620: get gpio value based on direction gpio: lynxpoint: avoid potential warning on error path tools/gpio: add install section tools/gpio: move to tools buildsystem gpio: intel-mid: switch to devm_gpiochip_add_data() gpio: 74x164: Use spi_write() helper instead of open coding ...	27 July 2016, 02:16:01 UTC
9c1958f	Linus Torvalds	27 July 2016, 01:59:59 UTC	Merge tag 'media/v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media updates from Mauro Carvalho Chehab: - new framework support for HDMI CEC and remote control support - new encoding codec driver for Mediatek SoC - new frontend driver: helene tuner - added support for NetUp almost universal devices, with supports DVB-C/S/S2/T/T2 and ISDB-T - the mn88472 frontend driver got promoted from staging - a new driver for RCar video input - some soc_camera legacy drivers got removed: timb, omap1, mx2, mx3 - lots of driver cleanups, improvements and fixups * tag 'media/v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (377 commits) [media] cec: always check all_device_types and features [media] cec: poll should check if there is room in the tx queue [media] vivid: support monitor all mode [media] cec: fix test for unconfigured adapter in main message loop [media] cec: limit the size of the transmit queue [media] cec: zero unused msg part after msg->len [media] cec: don't set fh to NULL in CEC_TRANSMIT [media] cec: clear all status fields before transmit and always fill in sequence [media] cec: CEC_RECEIVE overwrote the timeout field [media] cxd2841er: Reading SNR for DVB-C added [media] cxd2841er: Reading BER and UCB for DVB-C added [media] cxd2841er: fix switch-case for DVB-C [media] cxd2841er: fix signal strength scale for ISDB-T [media] cxd2841er: adjust the dB scale for DVB-C [media] cxd2841er: provide signal strength for DVB-C [media] cxd2841er: fix BER report via DVBv5 stats API [media] mb86a20s: apply mask to val after checking for read failure [media] airspy: fix error logic during device register [media] s5p-cec/TODO: add TODO item [media] cec/TODO: drop comment about sphinx documentation ...	27 July 2016, 01:59:59 UTC
1b3fc0b	Linus Torvalds	27 July 2016, 01:48:23 UTC	Merge tag 'pstore-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore subsystem updates from Kees Cook: "This expands the supported compressors, fixes some bugs, and finally adds DT bindings" * tag 'pstore-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ram: add Device Tree bindings efi-pstore: implement efivars_pstore_exit() pstore: drop file opened reference count pstore: add lzo/lz4 compression support pstore: Cleanup pstore_dump() pstore: Enable compression on normal path (again) ramoops: Only unregister when registered	27 July 2016, 01:48:23 UTC
d31dcd9	Linus Torvalds	27 July 2016, 01:42:18 UTC	Merge tag 'for-linus-4.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs updates from Mike Mashall: "Orangefs cleanups and enablement of O_DIRECT in open. Cleanups: - remove some unused defines, and also some obfuscatory ones. - remove a redundant xattr handler. - Remove useless xattr prefix arguments. - Be more picky about uid and gid handling WRT namespaces. Our use of current_user_ns() instead of init_user_ns left open the possibility that users could spoof their uids or gids when the server was running in a different namespace in "default security" mode. - Allow open(2) to succeed with O_DIRECT" * tag 'for-linus-4.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: fix namespace handling Orangefs: allow O_DIRECT in open orangefs: Remove useless xattr prefix arguments orangefs: Remove redundant "trusted." xattr handler orangefs: Remove useless defines	27 July 2016, 01:42:18 UTC
396d109	Linus Torvalds	27 July 2016, 01:35:55 UTC	Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The major change this cycle is deleting ext4's copy of the file system encryption code and switching things over to using the copies in fs/crypto. I've updated the MAINTAINERS file to add an entry for fs/crypto listing Jaeguk Kim and myself as the maintainers. There are also a number of bug fixes, most notably for some problems found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum. Also fixed is a writeback deadlock detected by generic/130, and some potential races in the metadata checksum code" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits) ext4: verify extent header depth ext4: short-cut orphan cleanup on error ext4: fix reference counting bug on block allocation error MAINTAINRES: fs-crypto maintainers update ext4 crypto: migrate into vfs's crypto engine ext2: fix filesystem deadlock while reading corrupted xattr block ext4: fix project quota accounting without quota limits enabled ext4: validate s_reserved_gdt_blocks on mount ext4: remove unused page_idx ext4: don't call ext4_should_journal_data() on the journal inode ext4: Fix WARN_ON_ONCE in ext4_commit_super() ext4: fix deadlock during page writeback ext4: correct error value of function verifying dx checksum ext4: avoid modifying checksum fields directly during checksum verification ext4: check for extents that wrap around jbd2: make journal y2038 safe jbd2: track more dependencies on transaction commit jbd2: move lockdep tracking to journal_s jbd2: move lockdep instrumentation for jbd2 handles ext4: respect the nobarrier mount option in nojournal mode ...	27 July 2016, 01:35:55 UTC
59ebc44	Linus Torvalds	27 July 2016, 01:27:20 UTC	Merge tag 'pnp-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull PNP update from Rafael Wysocki: "One simple change to make the PNP core use device_initcall() instead of module_init() to run pnpbios_thread_init() (Paul Gortmaker)" * tag 'pnp-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PNP: make pnpbios core explicitly non-modular	27 July 2016, 01:27:20 UTC
e663107	Linus Torvalds	27 July 2016, 00:56:45 UTC	Merge tag 'acpi-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI updates from Rafael Wysocki: "The new feaures here are the support for ACPI overlays (allowing ACPI tables to be loaded at any time from EFI variables or via configfs) and the LPI (Low-Power Idle) support. Also notable is the ACPI-based NUMA support for ARM64. Apart from that we have two new drivers, for the DPTF (Dynamic Power and Thermal Framework) power participant device and for the Intel Broxton WhiskeyCove PMIC, some more PMIC-related changes, support for the Boot Error Record Table (BERT) in APEI and support for platform-initiated graceful shutdown. Plus two new pieces of documentation and usual assorted fixes and cleanups in quite a few places. Specifics: - Support for ACPI SSDT overlays allowing Secondary System Description Tables (SSDTs) to be loaded at any time from EFI variables or via configfs (Octavian Purdila, Mika Westerberg). - Support for the ACPI LPI (Low-Power Idle) feature introduced in ACPI 6.0 and allowing processor idle states to be represented in ACPI tables in a hierarchical way (with the help of Processor Container objects) and support for ACPI idle states management on ARM64, based on LPI (Sudeep Holla). - General improvements of ACPI support for NUMA and ARM64 support for ACPI-based NUMA (Hanjun Guo, David Daney, Robert Richter). - General improvements of the ACPI table upgrade mechanism and ARM64 support for that feature (Aleksey Makarov, Jon Masters). - Support for the Boot Error Record Table (BERT) in APEI and improvements of kernel messages printed by the error injection code (Huang Ying, Borislav Petkov). - New driver for the Intel Broxton WhiskeyCove PMIC operation region and support for the REGS operation region on Broxton, PMIC code cleanups (Bin Gao, Felipe Balbi, Paul Gortmaker). - New driver for the power participant device which is part of the Dynamic Power and Thermal Framework (DPTF) and DPTF-related code reorganization (Srinivas Pandruvada). - Support for the platform-initiated graceful shutdown feature introduced in ACPI 6.1 (Prashanth Prakash). - ACPI button driver update related to lid input events generated automatically on initialization and system resume that have been problematic for some time (Lv Zheng). - ACPI EC driver cleanups (Lv Zheng). - Documentation of the ACPICA release automation process and the in-kernel ACPI AML debugger (Lv Zheng). - New blacklist entry and two fixes for the ACPI backlight driver (Alex Hung, Arvind Yadav, Ralf Gerbig). - Cleanups of the ACPI pci_slot driver (Joe Perches, Paul Gortmaker). - ACPI CPPC code changes to make it more robust against possible defects in ACPI tables and new symbol definitions for PCC (Hoan Tran). - System reboot code modification to execute the ACPI _PTS (Prepare To Sleep) method in addition to _TTS (Ocean He). - ACPICA-related change to carry out lock ordering checks in ACPICA if ACPICA debug is enabled in the kernel (Lv Zheng). - Assorted minor fixes and cleanups (Andy Shevchenko, Baoquan He, Bhaktipriya Shridhar, Paul Gortmaker, Rafael Wysocki)" * tag 'acpi-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (71 commits) ACPI: enable ACPI_PROCESSOR_IDLE on ARM64 arm64: add support for ACPI Low Power Idle(LPI) drivers: firmware: psci: initialise idle states using ACPI LPI cpuidle: introduce CPU_PM_CPU_IDLE_ENTER macro for ARM{32, 64} arm64: cpuidle: drop __init section marker to arm_cpuidle_init ACPI / processor_idle: Add support for Low Power Idle(LPI) states ACPI / processor_idle: introduce ACPI_PROCESSOR_CSTATE ACPI / DPTF: move int340x_thermal.c to the DPTF folder ACPI / DPTF: Add DPTF power participant driver ACPI / lpat: make it explicitly non-modular ACPI / dock: make dock explicitly non-modular ACPI / PCI: make pci_slot explicitly non-modular ACPI / PMIC: remove modular references from non-modular code ACPICA: Linux: Enable ACPI_MUTEX_DEBUG for Linux kernel ACPI: Rename configfs.c to acpi_configfs.c to prevent link error ACPI / debugger: Add AML debugger documentation ACPI: Add documentation describing ACPICA release automation ACPI: add support for loading SSDTs via configfs ACPI: add support for configfs efi / ACPI: load SSTDs from EFI variables ...	27 July 2016, 00:56:45 UTC
6453dbd	Linus Torvalds	27 July 2016, 00:29:07 UTC	Merge tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "Again, the majority of changes go into the cpufreq subsystem, but there are no big features this time. The cpufreq changes that stand out somewhat are the governor interface rework and improvements related to the handling of frequency tables. Apart from those, there are fixes and new device/CPU IDs in drivers, cleanups and an improvement of the new schedutil governor. Next, there are some changes in the hibernation core, including a fix for a nasty problem related to the MONITOR/MWAIT usage by CPU offline during resume from hibernation, a few core improvements related to memory management during resume, a couple of additional debug features and cleanups. Finally, we have some fixes and cleanups in the devfreq subsystem, generic power domains framework improvements related to system suspend/resume, support for some new chips in intel_idle and in the power capping RAPL driver, a new version of the AnalyzeSuspend utility and some assorted fixes and cleanups. Specifics: - Rework the cpufreq governor interface to make it more straightforward and modify the conservative governor to avoid using transition notifications (Rafael Wysocki). - Rework the handling of frequency tables by the cpufreq core to make it more efficient (Viresh Kumar). - Modify the schedutil governor to reduce the number of wakeups it causes to occur in cases when the CPU frequency doesn't need to be changed (Steve Muckle, Viresh Kumar). - Fix some minor issues and clean up code in the cpufreq core and governors (Rafael Wysocki, Viresh Kumar). - Add Intel Broxton support to the intel_pstate driver (Srinivas Pandruvada). - Fix problems related to the config TDP feature and to the validity of the MSR_HWP_INTERRUPT register in intel_pstate (Jan Kiszka, Srinivas Pandruvada). - Make intel_pstate update the cpu_frequency tracepoint even if the frequency doesn't change to avoid confusing powertop (Rafael Wysocki). - Clean up the usage of __init/__initdata in intel_pstate, mark some of its internal variables as __read_mostly and drop an unused structure element from it (Jisheng Zhang, Carsten Emde). - Clean up the usage of some duplicate MSR symbols in intel_pstate and turbostat (Srinivas Pandruvada). - Update/fix the powernv, s3c24xx and mvebu cpufreq drivers (Akshay Adiga, Viresh Kumar, Ben Dooks). - Fix a regression (introduced during the 4.5 cycle) in the pcc-cpufreq driver by reverting the problematic commit (Andreas Herrmann). - Add support for Intel Denverton to intel_idle, clean up Broxton support in it and make it explicitly non-modular (Jacob Pan, Jan Beulich, Paul Gortmaker). - Add support for Denverton and Ivy Bridge server to the Intel RAPL power capping driver and make it more careful about the handing of MSRs that may not be present (Jacob Pan, Xiaolong Wang). - Fix resume from hibernation on x86-64 by making the CPU offline during resume avoid using MONITOR/MWAIT in the "play dead" loop which may lead to an inadvertent "revival" of a "dead" CPU and a page fault leading to a kernel crash from it (Rafael Wysocki). - Make memory management during resume from hibernation more straightforward (Rafael Wysocki). - Add debug features that should help to detect problems related to hibernation and resume from it (Rafael Wysocki, Chen Yu). - Clean up hibernation core somewhat (Rafael Wysocki). - Prevent KASAN from instrumenting the hibernation core which leads to large numbers of false-positives from it (James Morse). - Prevent PM (hibernate and suspend) notifiers from being called during the cleanup phase if they have not been called during the corresponding preparation phase which is possible if one of the other notifiers returns an error at that time (Lianwei Wang). - Improve suspend-related debug printout in the tasks freezer and clean up suspend-related console handling (Roger Lu, Borislav Petkov). - Update the AnalyzeSuspend script in the kernel sources to version 4.2 (Todd Brandt). - Modify the generic power domains framework to make it handle system suspend/resume better (Ulf Hansson). - Make the runtime PM framework avoid resuming devices synchronously when user space changes the runtime PM settings for them and improve its error reporting (Rafael Wysocki, Linus Walleij). - Fix error paths in devfreq drivers (exynos, exynos-ppmu, exynos-bus) and in the core, make some devfreq code explicitly non-modular and change some of it into tristate (Bartlomiej Zolnierkiewicz, Peter Chen, Paul Gortmaker). - Add DT support to the generic PM clocks management code and make it export some more symbols (Jon Hunter, Paul Gortmaker). - Make the PCI PM core code slightly more robust against possible driver errors (Andy Shevchenko). - Make it possible to change DESTDIR and PREFIX in turbostat (Andy Shevchenko)" * tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (89 commits) Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency" PM / hibernate: Introduce test_resume mode for hibernation cpufreq: export cpufreq_driver_resolve_freq() cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index() PCI / PM: check all fields in pci_set_platform_pm() cpufreq: acpi-cpufreq: use cached frequency mapping when possible cpufreq: schedutil: map raw required frequency to driver frequency cpufreq: add cpufreq_driver_resolve_freq() cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT intel_pstate: Update cpu_frequency tracepoint every time cpufreq: intel_pstate: clean remnant struct element PM / tools: scripts: AnalyzeSuspend v4.2 x86 / hibernate: Use hlt_play_dead() when resuming from hibernation cpufreq: powernv: Replacing pstate_id with frequency table index intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate() PM / hibernate: Image data protection during restoration PM / hibernate: Add missing braces in __register_nosave_region() PM / hibernate: Clean up comments in snapshot.c PM / hibernate: Clean up function headers in snapshot.c PM / hibernate: Add missing braces in hibernate_setup() ...	27 July 2016, 00:29:07 UTC
27b7902	Linus Torvalds	27 July 2016, 00:23:08 UTC	Merge tag 'platform-drivers-x86-v4.8-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86 Pull x8 platform driver updates from Darren Hart: "Several new quirks and tweaks for new platforms to existing laptop drivers. A new ACPI virtual power button driver, similar to the intel-hid driver. A rework of the dell keymap, using a single sparse keymap for all machines. A few fixes and cleanups. Summary: intel-vbtn: - new driver for Intel Virtual Button intel_pmc_core: - Convert to DEFINE_DEBUGFS_ATTRIBUTE fujitsu-laptop: - Rework brightness of eco led asus-wmi: - Add quirk_no_rfkill_wapf4 for the Asus X456UA - Add quirk_no_rfkill_wapf4 for the Asus X456UF - Add quirk_no_rfkill for the Asus Z550MA - Add quirk_no_rfkill for the Asus U303LB - Add quirk_no_rfkill for the Asus N552VW - Create quirk for airplane_mode LED - Add ambient light sensor toggle key asus-wireless: - Toggle airplane mode LED intel_telemetry: - Remove Monitor MWAIT feature dependency intel-hid: - Remove duplicated acpi_remove_notify_handler fujitsu-laptop: - Add support for eco LED - Support touchpad toggle hotkey on Skylake-based models - Remove unused macros - Use module name in debug messages hp-wmi: - Fix wifi cannot be hard-unblocked toshiba_acpi: - Bump driver version and update copyright year - Remove the position sysfs entry - Add IIO interface for accelerometer axis data dell-wmi: - Add a WMI event code for display on/off - Generate one sparse keymap for all machines - Add information about other WMI event codes - Sort WMI event codes and update comments - Ignore WMI event code 0xe045" * tag 'platform-drivers-x86-v4.8-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86: (26 commits) intel-vbtn: new driver for Intel Virtual Button intel_pmc_core: Convert to DEFINE_DEBUGFS_ATTRIBUTE fujitsu-laptop: Rework brightness of eco led asus-wmi: Add quirk_no_rfkill_wapf4 for the Asus X456UA asus-wmi: Add quirk_no_rfkill_wapf4 for the Asus X456UF asus-wmi: Add quirk_no_rfkill for the Asus Z550MA asus-wmi: Add quirk_no_rfkill for the Asus U303LB asus-wmi: Add quirk_no_rfkill for the Asus N552VW asus-wmi: Create quirk for airplane_mode LED asus-wireless: Toggle airplane mode LED intel_telemetry: Remove Monitor MWAIT feature dependency intel-hid: Remove duplicated acpi_remove_notify_handler asus-wmi: Add ambient light sensor toggle key fujitsu-laptop: Add support for eco LED fujitsu-laptop: Support touchpad toggle hotkey on Skylake-based models fujitsu-laptop: Remove unused macros fujitsu-laptop: Use module name in debug messages hp-wmi: Fix wifi cannot be hard-unblocked toshiba_acpi: Bump driver version and update copyright year toshiba_acpi: Remove the position sysfs entry ...	27 July 2016, 00:23:08 UTC
f7e6816	Linus Torvalds	27 July 2016, 00:12:11 UTC	Merge tag 'dm-4.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - initially based on Jens' 'for-4.8/core' (given all the flag churn) and later merged with 'for-4.8/core' to pickup the QUEUE_FLAG_DAX commits that DM depends on to provide its DAX support - clean up the bio-based vs request-based DM core code by moving the request-based DM core code out to dm-rq.[hc] - reinstate bio-based support in the DM multipath target (done with the idea that fast storage like NVMe over Fabrics could benefit) -- while preserving support for request_fn and blk-mq request-based DM mpath - SCSI and DM multipath persistent reservation fixes that were coordinated with Martin Petersen. - the DM raid target saw the most extensive change this cycle; it now provides reshape and takeover support (by layering ontop of the corresponding MD capabilities) - DAX support for DM core and the linear, stripe and error targets - a DM thin-provisioning block discard vs allocation race fix that addresses potential for corruption - a stable fix for DM verity-fec's block calculation during decode - a few cleanups and fixes to DM core and various targets * tag 'dm-4.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (73 commits) dm: allow bio-based table to be upgraded to bio-based with DAX support dm snap: add fake origin_direct_access dm stripe: add DAX support dm error: add DAX support dm linear: add DAX support dm: add infrastructure for DAX support dm thin: fix a race condition between discarding and provisioning a block dm btree: fix a bug in dm_btree_find_next_single() dm raid: fix random optimal_io_size for raid0 dm raid: address checkpatch.pl complaints dm: call PR reserve/unreserve on each underlying device sd: don't use the ALL_TG_PT bit for reservations dm: fix second blk_delay_queue() parameter to be in msec units not jiffies dm raid: change logical functions to actually return bool dm raid: use rdev_for_each in status dm raid: use rs->raid_disks to avoid memory leaks on free dm raid: support delta_disks for raid1, fix table output dm raid: enhance reshape check and factor out reshape setup dm raid: allow resize during recovery dm raid: fix rs_is_recovering() to allow for lvextend ...	27 July 2016, 00:12:11 UTC
b9c220b	Catalin Marinas	26 July 2016, 17:16:55 UTC	arm64: Only select ARM64_MODULE_PLTS if MODULES=y Selecting CONFIG_RANDOMIZE_BASE=y and CONFIG_MODULES=n fails to build the module PLTs support: CC arch/arm64/kernel/module-plts.o /work/Linux/linux-2.6-aarch64/arch/arm64/kernel/module-plts.c: In function ‘module_emit_plt_entry’: /work/Linux/linux-2.6-aarch64/arch/arm64/kernel/module-plts.c:32:49: error: dereferencing pointer to incomplete type ‘struct module’ This patch selects ARM64_MODULE_PLTS conditionally only if MODULES is enabled. Fixes: f80fb3a3d508 ("arm64: add support for kernel ASLR") Cc: <stable@vger.kernel.org> # 4.6+ Reported-by: Jeff Vander Stoep <jeffv@google.com> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	26 July 2016, 23:20:32 UTC
8f19b0c	Huang Ying	26 July 2016, 22:27:04 UTC	thp: fix comments of __pmd_trans_huge_lock() To make the comments consistent with the already changed code. Link: http://lkml.kernel.org/r/1466200004-6196-1-git-send-email-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
cb773df	Johannes Weiner	26 July 2016, 22:27:01 UTC	cgroup: remove unnecessary 0 check from css_from_id() css_idr allocation starts at 1, so index 0 will never point to an item. css_from_id() currently filters that before asking idr_find(), but idr_find() would also just return NULL, so this is not needed. Link: http://lkml.kernel.org/r/20160617162427.GC19084@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Tejun Heo <tj@kernel.org> Cc: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
1fe4d02	Johannes Weiner	26 July 2016, 22:26:58 UTC	cgroup: fix idr leak for the first cgroup root The valid cgroup hierarchy ID range includes 0, so we can't filter for positive numbers when freeing it, or it'll leak the first ID. No big deal, just disruptive when reading the code. The ID is freed during error handling and when the reference count hits zero, so the double-free test is not necessary; remove it. Link: http://lkml.kernel.org/r/20160617162359.GB19084@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Tejun Heo <tj@kernel.org> Cc: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
25843c2	Li RongQing	26 July 2016, 22:26:56 UTC	mm: memcontrol: fix documentation for compound parameter Commit f627c2f53786 ("memcg: adjust to support new THP refcounting") adds a compound parameter for several functions, and change one as compound for mem_cgroup_move_account but it does not change the comments. Link: http://lkml.kernel.org/r/1465368216-9393-1-git-send-email-roy.qing.li@gmail.com Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
17408d7	Li RongQing	26 July 2016, 22:26:53 UTC	mm: memcontrol: remove BUG_ON in uncharge_list When calling uncharge_list, if a page is transparent huge we don't need to BUG_ON about non-transparent huge, since nobody should be able to see the page at this stage and this page cannot be raced against with a THP split. This check became unneeded after 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API"). [mhocko@suse.com: changelog enhancements] Link: http://lkml.kernel.org/r/1465369248-13865-1-git-send-email-roy.qing.li@gmail.com Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
dd4123f	Minchan Kim	26 July 2016, 22:26:50 UTC	mm: fix build warnings in <linux/compaction.h> Randy reported below build error. > In file included from ../include/linux/balloon_compaction.h:48:0, > from ../mm/balloon_compaction.c:11: > ../include/linux/compaction.h:237:51: warning: 'struct node' declared inside parameter list [enabled by default] > static inline int compaction_register_node(struct node node) > ../include/linux/compaction.h:237:51: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default] > ../include/linux/compaction.h:242:54: warning: 'struct node' declared inside parameter list [enabled by default] > static inline void compaction_unregister_node(struct node node) > It was caused by non-lru page migration which needs compaction.h but compaction.h doesn't include any header to be standalone. I think proper header for non-lru page migration is migrate.h rather than compaction.h because migrate.h has already headers needed to work non-lru page migration indirectly like isolate_mode_t, migrate_mode MIGRATEPAGE_SUCCESS. [akpm@linux-foundation.org: revert mm-balloon-use-general-non-lru-movable-page-feature-fix.patch temp fix] Link: http://lkml.kernel.org/r/20160610003304.GE29779@bbox Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Randy Dunlap <rdunlap@infradead.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
0db501f	Ebru Akagunduz	26 July 2016, 22:26:46 UTC	mm, thp: convert from optimistic swapin collapsing to conservative To detect whether khugepaged swapin is worthwhile, this patch checks the amount of young pages. There should be at least half of HPAGE_PMD_NR to swapin. Link: http://lkml.kernel.org/r/1468109451-1615-1-git-send-email-ebru.akagunduz@gmail.com Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Suggested-by: Minchan Kim <minchan@kernel.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
47f863e	Ebru Akagunduz	26 July 2016, 22:26:43 UTC	mm, thp: fix comment inconsistency for swapin readahead functions After fixing swapin issues, comment lines stayed as in old version. This patch updates the comments. Link: http://lkml.kernel.org/r/1468109345-32258-1-git-send-email-ebru.akagunduz@gmail.com Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
1b5946a	Kirill A. Shutemov	26 July 2016, 22:26:40 UTC	thp: update Documentation/{vm/transhuge,filesystems/proc}.txt Add info about tmpfs/shmem with huge pages. Link: http://lkml.kernel.org/r/1466021202-61880-38-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
779750d	Kirill A. Shutemov	26 July 2016, 22:26:38 UTC	shmem: split huge pages beyond i_size under memory pressure Even if user asked to allocate huge pages always (huge=always), we should be able to free up some memory by splitting pages which are partly byound i_size if memory presure comes or once we hit limit on filesystem size (-o size=). In order to do this we maintain per-superblock list of inodes, which potentially have huge pages on the border of file size. Per-fs shrinker can reclaim memory by splitting such pages. If we hit -ENOSPC during shmem_getpage_gfp(), we try to split a page to free up space on the filesystem and retry allocation if it succeed. Link: http://lkml.kernel.org/r/1466021202-61880-37-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
e496cf3	Kirill A. Shutemov	26 July 2016, 22:26:35 UTC	thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE For file mappings, we don't deposit page tables on THP allocation because it's not strictly required to implement split_huge_pmd(): we can just clear pmd and let following page faults to reconstruct the page table. But Power makes use of deposited page table to address MMU quirk. Let's hide THP page cache, including huge tmpfs, under separate config option, so it can be forbidden on Power. We can revert the patch later once solution for Power found. Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
f3f0e1d	Kirill A. Shutemov	26 July 2016, 22:26:32 UTC	khugepaged: add support of collapse for tmpfs/shmem pages This patch extends khugepaged to support collapse of tmpfs/shmem pages. We share fair amount of infrastructure with anon-THP collapse. Few design points: - First we are looking for VMA which can be suitable for mapping huge page; - If the VMA maps shmem file, the rest scan/collapse operations operates on page cache, not on page tables as in anon VMA case. - khugepaged_scan_shmem() finds a range which is suitable for huge page. The scan is lockless and shouldn't disturb system too much. - once the candidate for collapse is found, collapse_shmem() attempts to create a huge page: + scan over radix tree, making the range point to new huge page; + new huge page is not-uptodate, locked and freezed (refcount is 0), so nobody can touch them until we say so. + we swap in pages during the scan. khugepaged_scan_shmem() filters out ranges with more than khugepaged_max_ptes_swap swapped out pages. It's HPAGE_PMD_NR/8 by default. + old pages are isolated, unmapped and put to local list in case to be restored back if collapse failed. - if collapse succeed, we retract pte page tables from VMAs where huge pages mapping is possible. The huge page will be mapped as PMD on next minor fault into the range. Link: http://lkml.kernel.org/r/1466021202-61880-35-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
4595ef8	Kirill A. Shutemov	26 July 2016, 22:26:29 UTC	shmem: make shmem_inode_info::lock irq-safe We are going to need to call shmem_charge() under tree_lock to get accoutning right on collapse of small tmpfs pages into a huge one. The problem is that tree_lock is irq-safe and lockdep is not happy, that we take irq-unsafe lock under irq-safe[1]. Let's convert the lock to irq-safe. [1] https://gist.github.com/kiryl/80c0149e03ed35dfaf26628b8e03cdbc Link: http://lkml.kernel.org/r/1466021202-61880-34-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
988ddb7	Kirill A. Shutemov	26 July 2016, 22:26:26 UTC	khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page() Both variants of khugepaged_alloc_page() do up_read(&mm->mmap_sem) first: no point keep it inside the function. Link: http://lkml.kernel.org/r/1466021202-61880-33-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
b46e756	Kirill A. Shutemov	26 July 2016, 22:26:24 UTC	thp: extract khugepaged from mm/huge_memory.c khugepaged implementation grew to the point when it deserve separate file in source. Let's move it to mm/khugepaged.c. Link: http://lkml.kernel.org/r/1466021202-61880-32-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
657e303	Kirill A. Shutemov	26 July 2016, 22:26:21 UTC	shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings Let's wire up existing madvise() hugepage hints for file mappings. MADV_HUGEPAGE advise shmem to allocate huge page on page fault in the VMA. It only has effect if the filesystem is mounted with huge=advise or huge=within_size. MADV_NOHUGEPAGE prevents hugepage from being allocated on page fault in the VMA. It doesn't prevent a huge page from being allocated by other means, i.e. page fault into different mapping or write(2) into file. Link: http://lkml.kernel.org/r/1466021202-61880-31-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
800d8c6	Kirill A. Shutemov	26 July 2016, 22:26:18 UTC	shmem: add huge pages support Here's basic implementation of huge pages support for shmem/tmpfs. It's all pretty streight-forward: - shmem_getpage() allcoates huge page if it can and try to inserd into radix tree with shmem_add_to_page_cache(); - shmem_add_to_page_cache() puts the page onto radix-tree if there's space for it; - shmem_undo_range() removes huge pages, if it fully within range. Partial truncate of huge pages zero out this part of THP. This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour. As we don't really create hole in this case, lseek(SEEK_HOLE) may have inconsistent results depending what pages happened to be allocated. - no need to change shmem_fault: core-mm will map an compound page as huge if VMA is suitable; Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
c01d5b3	Hugh Dickins	26 July 2016, 22:26:15 UTC	shmem: get_unmapped_area align huge page Provide a shmem_get_unmapped_area method in file_operations, called at mmap time to decide the mapping address. It could be conditional on CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by making it unconditional. shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area (which we treat as a black box, highly dependent on architecture and config and executable layout). Lots of conditions, and in most cases it just goes with the address that chose; but when our huge stars are rightly aligned, yet that did not provide a suitable address, go back to ask for a larger arena, within which to align the mapping suitably. There have to be some direct calls to shmem_get_unmapped_area(), not via the file_operations: because of the way shmem_zero_setup() is called to create a shmem object late in the mmap sequence, when MAP_SHARED is requested with MAP_ANONYMOUS or /dev/zero. Though this only matters when /proc/sys/vm/shmem_huge has been set. Link: http://lkml.kernel.org/r/1466021202-61880-29-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
5a6e75f	Kirill A. Shutemov	26 July 2016, 22:26:13 UTC	shmem: prepare huge= mount option and sysfs knob This patch adds new mount option "huge=". It can have following values: - "always": Attempt to allocate huge pages every time we need a new page; - "never": Do not allocate huge pages; - "within_size": Only allocate huge page if it will be fully within i_size. Also respect fadvise()/madvise() hints; - "advise: Only allocate huge pages if requested with fadvise()/madvise(); Default is "never" for now. "mount -o remount,huge= /mountpoint" works fine after mount: remounting huge=never will not attempt to break up huge pages at all, just stop more from being allocated. No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE, which is the appropriate option to protect those who don't want the new bloat, and with which we shall share some pmd code. Prohibit the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is invalid without CONFIG_NUMA (was hidden in mpol_parse_str(): make it explicit). Allow enabling THP only if the machine has_transparent_hugepage(). But what about Shmem with no user-visible mount? SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. Though unlikely to suit all usages, provide sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled to experiment with huge on those. And allow shmem_enabled two further values: - "deny": For use in emergencies, to force the huge option off from all mounts; - "force": Force the huge option on for all - very useful for testing; Based on patch by Hugh Dickins. Link: http://lkml.kernel.org/r/1466021202-61880-28-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
65c4537	Kirill A. Shutemov	26 July 2016, 22:26:10 UTC	mm, rmap: account shmem thp pages Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and smaps. It indicates how many times we allocate and map shmem THP. NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS. Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
fc127da	Kirill A. Shutemov	26 July 2016, 22:26:07 UTC	truncate: handle file thp For shmem/tmpfs we only need to tweak truncate_inode_page() and invalidate_mapping_pages(). truncate_inode_pages_range() and invalidate_inode_pages2_range() are adjusted to use page_to_pgoff(). Link: http://lkml.kernel.org/r/1466021202-61880-26-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
8392937	Kirill A. Shutemov	26 July 2016, 22:26:04 UTC	filemap: prepare find and delete operations for huge pages For now, we would have HPAGE_PMD_NR entries in radix tree for every huge page. That's suboptimal and it will be changed to use Matthew's multi-order entries later. 'add' operation is not changed, because we don't need it to implement hugetmpfs: shmem uses its own implementation. Link: http://lkml.kernel.org/r/1466021202-61880-25-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
c78c66d	Kirill A. Shutemov	26 July 2016, 22:26:02 UTC	radix-tree: implement radix_tree_maybe_preload_order() The new helper is similar to radix_tree_maybe_preload(), but tries to preload number of nodes required to insert (1 << order) continuous naturally-aligned elements. This is required to push huge pages into pagecache. Link: http://lkml.kernel.org/r/1466021202-61880-24-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
e2f0a0d	Kirill A. Shutemov	26 July 2016, 22:25:59 UTC	page-flags: relax policy for PG_mappedtodisk and PG_reclaim These flags are in use for file THP. Link: http://lkml.kernel.org/r/1466021202-61880-23-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
7751b2d	Kirill A. Shutemov	26 July 2016, 22:25:56 UTC	vmscan: split file huge pages before paging them out This is preparation of vmscan for file huge pages. We cannot write out huge pages, so we need to split them on the way out. Link: http://lkml.kernel.org/r/1466021202-61880-22-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
9a73f61	Kirill A. Shutemov	26 July 2016, 22:25:53 UTC	thp, mlock: do not mlock PTE-mapped file huge pages As with anon THP, we only mlock file huge pages if we can prove that the page is not mapped with PTE. This way we can avoid mlock leak into non-mlocked vma on split. We rely on PageDoubleMap() under lock_page() to check if the the page may be PTE mapped. PG_double_map is set by page_add_file_rmap() when the page mapped with PTEs. Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
baa355f	Kirill A. Shutemov	26 July 2016, 22:25:51 UTC	thp: file pages support for split_huge_page() Basic scheme is the same as for anon THP. Main differences: - File pages are on radix-tree, so we have head->_count offset by HPAGE_PMD_NR. The count got distributed to small pages during split. - mapping->tree_lock prevents non-lockless access to pages under split over radix-tree; - Lockless access is prevented by setting the head->_count to 0 during split; - After split, some pages can be beyond i_size. We drop them from radix-tree. - We don't setup migration entries. Just unmap pages. It helps handling cases when i_size is in the middle of the page: no need handle unmap pages beyond i_size manually. Link: http://lkml.kernel.org/r/1466021202-61880-20-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
37f9f55	Kirill A. Shutemov	26 July 2016, 22:25:48 UTC	thp: run vma_adjust_trans_huge() outside i_mmap_rwsem vma_addjust_trans_huge() splits pmd if it's crossing VMA boundary. During split we munlock the huge page which requires rmap walk. rmap wants to take the lock on its own. Let's move vma_adjust_trans_huge() outside i_mmap_rwsem to fix this. Link: http://lkml.kernel.org/r/1466021202-61880-19-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
b237ade	Kirill A. Shutemov	26 July 2016, 22:25:45 UTC	thp: prepare change_huge_pmd() for file thp change_huge_pmd() has assert which is not relvant for file page. For shared mapping it's perfectly fine to have page table entry writable, without explicit mkwrite. Link: http://lkml.kernel.org/r/1466021202-61880-18-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
628d47c	Kirill A. Shutemov	26 July 2016, 22:25:42 UTC	thp: skip file huge pmd on copy_huge_pmd() copy_page_range() has a check for "Don't copy ptes where a page fault will fill them correctly." It works on VMA level. We still copy all page table entries from private mappings, even if they map page cache. We can simplify copy_huge_pmd() a bit by skipping file PMDs. We don't map file private pages with PMDs, so they only can map page cache. It's safe to skip them as they can be re-faulted later. Link: http://lkml.kernel.org/r/1466021202-61880-17-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
af9e4d5	Kirill A. Shutemov	26 July 2016, 22:25:40 UTC	thp: handle file COW faults File COW for THP is handled on pte level: just split the pmd. It's not clear how benefitial would be allocation of huge pages on COW faults. And it would require some code to make them work. I think at some point we can consider teaching khugepaged to collapse pages in COW mappings, but allocating huge on fault is probably overkill. Link: http://lkml.kernel.org/r/1466021202-61880-16-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
d21b9e5	Kirill A. Shutemov	26 July 2016, 22:25:37 UTC	thp: handle file pages in split_huge_pmd() Splitting THP PMD is simple: just unmap it as in DAX case. This way we can avoid memory overhead on page table allocation to deposit. It's probably a good idea to try to allocation page table with GFP_ATOMIC in __split_huge_pmd_locked() to avoid refaulting the area, but clearing pmd should be good enough for now. Unlike DAX, we also remove the page from rmap and drop reference. pmd_young() is transfered to PageReferenced(). Link: http://lkml.kernel.org/r/1466021202-61880-15-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
b507238	Kirill A. Shutemov	26 July 2016, 22:25:34 UTC	thp: support file pages in zap_huge_pmd() split_huge_pmd() for file mappings (and DAX too) is implemented by just clearing pmd entry as we can re-fill this area from page cache on pte level later. This means we don't need deposit page tables when file THP is mapped. Therefore we shouldn't try to withdraw a page table on zap_huge_pmd() file THP PMD. Link: http://lkml.kernel.org/r/1466021202-61880-14-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
95ecedc	Kirill A. Shutemov	26 July 2016, 22:25:31 UTC	thp, vmstats: add counters for huge file pages THP_FILE_ALLOC: how many times huge page was allocated and put page cache. THP_FILE_MAPPED: how many times file huge page was mapped. Link: http://lkml.kernel.org/r/1466021202-61880-13-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
1010245	Kirill A. Shutemov	26 July 2016, 22:25:29 UTC	mm: introduce do_set_pmd() With postponed page table allocation we have chance to setup huge pages. do_set_pte() calls do_set_pmd() if following criteria met: - page is compound; - pmd entry in pmd_none(); - vma has suitable size and alignment; Link: http://lkml.kernel.org/r/1466021202-61880-12-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
dd78fed	Kirill A. Shutemov	26 July 2016, 22:25:26 UTC	rmap: support file thp Naive approach: on mapping/unmapping the page as compound we update ->_mapcount on each 4k page. That's not efficient, but it's not obvious how we can optimize this. We can look into optimization later. PG_double_map optimization doesn't work for file pages since lifecycle of file pages is different comparing to anon pages: file page can be mapped again at any time. Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
7267ec0	Kirill A. Shutemov	26 July 2016, 22:25:23 UTC	mm: postpone page table allocation until we have page to map The idea (and most of code) is borrowed again: from Hugh's patchset on huge tmpfs[1]. Instead of allocation pte page table upfront, we postpone this until we have page to map in hands. This approach opens possibility to map the page as huge if filesystem supports this. Comparing to Hugh's patch I've pushed page table allocation a bit further: into do_set_pte(). This way we can postpone allocation even in faultaround case without moving do_fault_around() after __do_fault(). do_set_pte() got renamed to alloc_set_pte() as it can allocate page table if required. [1] http://lkml.kernel.org/r/alpine.LSU.2.11.1502202015090.14414@eggly.anvils Link: http://lkml.kernel.org/r/1466021202-61880-10-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
bae473a	Kirill A. Shutemov	26 July 2016, 22:25:20 UTC	mm: introduce fault_env The idea borrowed from Peter's patch from patchset on speculative page faults[1]: Instead of passing around the endless list of function arguments, replace the lot with a single structure so we can change context without endless function signature changes. The changes are mostly mechanical with exception of faultaround code: filemap_map_pages() got reworked a bit. This patch is preparation for the next one. [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
dcddffd	Kirill A. Shutemov	26 July 2016, 22:25:18 UTC	mm: do not pass mm_struct into handle_mm_fault We always have vma->vm_mm around. Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
6fb8ddf	Kirill A. Shutemov	26 July 2016, 22:25:15 UTC	thp, mlock: update unevictable-lru.txt Add description of THP handling into unevictable-lru.txt. Link: http://lkml.kernel.org/r/1466021202-61880-7-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
1f52e67	Kirill A. Shutemov	26 July 2016, 22:25:12 UTC	khugepaged: recheck pmd after mmap_sem re-acquired Vlastimil noted[1] that pmd can be no longer valid after we drop mmap_sem. We need recheck it once mmap_sem taken again. [1] http://lkml.kernel.org/r/12918dcd-a695-c6f4-e06f-69141c5f357f@suse.cz Link: http://lkml.kernel.org/r/1466021202-61880-6-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
8024ee2	Ebru Akagunduz	26 July 2016, 22:25:09 UTC	mm, thp: fix locking inconsistency in collapse_huge_page After creating revalidate vma function, locking inconsistency occured due to directing the code path to wrong label. This patch directs to correct label and fix the inconsistency. Related commit that caused inconsistency: http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0 Link: http://lkml.kernel.org/r/1464956884-4644-1-git-send-email-ebru.akagunduz@gmail.com Link: http://lkml.kernel.org/r/1466021202-61880-4-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
7269586	Ebru Akagunduz	26 July 2016, 22:25:06 UTC	mm, thp: make swapin readahead under down_read of mmap_sem Currently khugepaged makes swapin readahead under down_write. This patch supplies to make swapin readahead under down_read instead of down_write. The patch was tested with a test program that allocates 800MB of memory, writes to it, and then sleeps. The system was forced to swap out all. Afterwards, the test program touches the area by writing, it skips a page in each 20 pages of the area. [akpm@linux-foundation.org: update comment to match new code] [kirill.shutemov@linux.intel.com: passing 'vma' to hugepage_vma_revlidate() is useless] Link: http://lkml.kernel.org/r/20160530095058.GA53044@black.fi.intel.com Link: http://lkml.kernel.org/r/1466021202-61880-3-git-send-email-kirill.shutemov@linux.intel.com Link: http://lkml.kernel.org/r/1464335964-6510-4-git-send-email-ebru.akagunduz@gmail.com Link: http://lkml.kernel.org/r/1466021202-61880-2-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
8a966ed	Ebru Akagunduz	26 July 2016, 22:25:03 UTC	mm: make swapin readahead to improve thp collapse rate This patch makes swapin readahead to improve thp collapse rate. When khugepaged scanned pages, there can be a few of the pages in swap area. With the patch THP can collapse 4kB pages into a THP when there are up to max_ptes_swap swap ptes in a 2MB range. The patch was tested with a test program that allocates 400B of memory, writes to it, and then sleeps. I force the system to swap out all. Afterwards, the test program touches the area by writing, it skips a page in each 20 pages of the area. Without the patch, system did not swap in readahead. THP rate was %65 of the program of the memory, it did not change over time. With this patch, after 10 minutes of waiting khugepaged had collapsed %99 of the program's memory. [kirill.shutemov@linux.intel.com: trivial cleanup of exit path of the function] [kirill.shutemov@linux.intel.com: __collapse_huge_page_swapin(): drop unused 'pte' parameter] [kirill.shutemov@linux.intel.com: do not hold anon_vma lock during swap in] Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
70652f6	Ebru Akagunduz	26 July 2016, 22:24:59 UTC	mm: make optimistic check for swapin readahead Introduce a new sysfs integer knob /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap which makes optimistic check for swapin readahead to increase thp collapse rate. Before getting swapped out pages to memory, checks them and allows up to a certain number. It also prints out using tracepoints amount of unmapped ptes. [vdavydov@parallels.com: fix scan not aborted on SCAN_EXCEED_SWAP_PTE] [sfr@canb.auug.org.au: build fix] Link: http://lkml.kernel.org/r/20160616154503.65806e12@canb.auug.org.au Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
ef3cc4d	nimisolo	26 July 2016, 22:24:56 UTC	mm/memblock.c:memblock_add_range(): if nr_new is 0 just return If nr_new is 0 which means there's no region would be added, so just return to the caller. Signed-off-by: nimisolo <nimisolo@gmail.com> Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Wei Yang <weiyang@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
8a5c743	Michal Hocko	26 July 2016, 22:24:53 UTC	mm, memcg: use consistent gfp flags during readahead Vladimir has noticed that we might declare memcg oom even during readahead because read_pages only uses GFP_KERNEL (with mapping_gfp restriction) while __do_page_cache_readahead uses page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from OOMs. This gfp mask discrepancy is really unfortunate and easily fixable. Drop page_cache_alloc_readahead() which only has one user and outsource the gfp_mask logic into readahead_gfp_mask and propagate this mask from __do_page_cache_readahead down to read_pages. This alone would have only very limited impact as most filesystems are implementing ->readpages and the common implementation mpage_readpages does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to use readahead_gfp_mask instead as this function is called only during readahead as well. The same applies to read_cache_pages. ext4 has its own ext4_mpage_readpages but the path which has pages != NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are doing a very similar pattern to mpage_readpages so the same can be applied to them as well. [akpm@linux-foundation.org: coding-style fixes] [mhocko@suse.com: restrict gfp mask in mpage_alloc] Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Chris Mason <clm@fb.com> Cc: Steve French <sfrench@samba.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Cc: Mike Marshall <hubcap@omnibond.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Changman Lee <cm224.lee@samsung.com> Cc: Chao Yu <yuchao0@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
e5e3f4c	Michal Hocko	26 July 2016, 22:24:50 UTC	mm, oom_reaper: make sure that mmput_async is called only when memory was reaped Tetsuo is worried that mmput_async might still lead to a premature new oom victim selection due to the following race: __oom_reap_task exit_mm find_lock_task_mm atomic_inc(mm->mm_users) # = 2 task_unlock task_lock task->mm = NULL up_read(&mm->mmap_sem) < somebody write locks mmap_sem > task_unlock mmput atomic_dec_and_test # = 1 exit_oom_victim down_read_trylock # failed - no reclaim mmput_async # Takes unpredictable amount of time < new OOM situation > the final __mmput will be executed in the delayed context which might happen far in the future. Such a race is highly unlikely because the write holder of mmap_sem would have to be an external task (all direct holders are already killed or exiting) and it usually have to pin mm_users in order to do anything reasonable. We can, however, make sure that the mmput_async is only called when we do not back off and reap some memory. That would reduce the impact of the delayed __mmput because the real content would be already freed. Pin mm_count to keep it alive after we drop task_lock and before we try to get mmap_sem. If the mmap_sem succeeds we can try to grab mm_users reference and then go on with unmapping the address space. It is not clear whether this race is possible at all but it is better to be more robust and do not pin mm_users unless we are sure we are actually doing some real work during __oom_reap_task. Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
ba6c19f	Chen Gang	26 July 2016, 22:24:47 UTC	include/linux/memblock.h: Clean up code for several trivial details Correct the function parameters alignment, since original code already use both tabs and white spaces together for the incorrect parameters alignment functions. If one line can hold one statement within 80 columns, let it in one line (original code did not consider about the tabs/spaces for 2nd line when a statement is separated into 2 lines). Try to let '' aligned within one macro, since all related lines are short enough. Remove useless statement "idx = 0;", and always assign rgn within the 'for' statement. Link: http://lkml.kernel.org/r/1464904899-1714-1-git-send-email-chengang@emindsoft.com.cn Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
91537fe	Minchan Kim	26 July 2016, 22:24:45 UTC	mm: add NR_ZSMALLOC to vmstat zram is very popular for some of the embedded world (e.g., TV, mobile phones). On those system, zsmalloc's consumed memory size is never trivial (one of example from real product system, total memory: 800M, zsmalloc consumed: 150M), so we have used this out of tree patch to monitor system memory behavior via /proc/vmstat. With zsmalloc in vmstat, it helps in tracking down system behavior due to memory usage. [minchan@kernel.org: zsmalloc: follow up zsmalloc vmstat] Link: http://lkml.kernel.org/r/20160607091737.GC23435@bbox [akpm@linux-foundation.org: fix build with CONFIG_ZSMALLOC=m] Link: http://lkml.kernel.org/r/1464919731-13255-1-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: Chanho Min <chanho.min@lge.com> Cc: Chan Gyun Jeong <chan.jeong@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
8ea1d2a	Vlastimil Babka	26 July 2016, 22:24:42 UTC	mm, frontswap: convert frontswap_enabled to static key I have noticed that frontswap.h first declares "frontswap_enabled" as extern bool variable, and then overrides it with "#define frontswap_enabled (1)" for CONFIG_FRONTSWAP=Y or (0) when disabled. The bool variable isn't actually instantiated anywhere. This all looks like an unfinished attempt to make frontswap_enabled reflect whether a backend is instantiated. But in the current state, all frontswap hooks call unconditionally into frontswap.c just to check if frontswap_ops is non-NULL. This should at least be checked inline, but we can further eliminate the overhead when CONFIG_FRONTSWAP is enabled and no backend registered, using a static key that is initially disabled, and gets enabled only upon first backend registration. Thus, checks for "frontswap_enabled" are replaced with "frontswap_enabled()" wrapping the static key check. There are two exceptions: - xen's selfballoon_process() was testing frontswap_enabled in code guarded by #ifdef CONFIG_FRONTSWAP, which was effectively always true when reachable. The patch just removes this check. Using frontswap_enabled() does not sound correct here, as this can be true even without xen's own backend being registered. - in SYSCALL_DEFINE2(swapon), change the check to IS_ENABLED(CONFIG_FRONTSWAP) as it seems the bitmap allocation cannot currently be postponed until a backend is registered. This means that frontswap will still have some memory overhead by being configured, but without a backend. After the patch, we can expect that some functions in frontswap.c are called only when frontswap_ops is non-NULL. Change the checks there to VM_BUG_ONs. While at it, convert other BUG_ONs to VM_BUG_ONs as frontswap has been stable for some time. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1463152235-9717-1-git-send-email-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Juergen Gross <jgross@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
fbe84a0	Tetsuo Handa	26 July 2016, 22:24:39 UTC	mm,oom: remove unused argument from oom_scan_process_thread(). oom_scan_process_thread() does not use totalpages argument. oom_badness() uses it. Link: http://lkml.kernel.org/r/1463796041-7889-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
3aa9799	Vladimir Davydov	26 July 2016, 22:24:36 UTC	af_unix: charge buffers to kmemcg Unix sockets can consume a significant amount of system memory, hence they should be accounted to kmemcg. Since unix socket buffers are always allocated from process context, all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in sock->sk_allocation mask. Eric asked: > 1) What happens when a buffer, allocated from socket <A> lands in a > different socket <B>, maybe owned by another user/process. > > Who owns it now, in term of kmemcg accounting ? We never move memcg charges. E.g. if two processes from different cgroups are sharing a memory region, each page will be charged to the process which touched it first. Or if two processes are working with the same directory tree, inodes and dentries will be charged to the first user. The same is fair for unix socket buffers - they will be charged to the sender. > 2) Has performance impact been evaluated ? I ran netperf STREAM_STREAM with default options in a kmemcg on a 4 core x2 HT box. The results are below: # clients bandwidth (10^6bits/sec) base patched 1 67643 +- 725 64874 +- 353 - 4.0 % 4 193585 +- 2516 186715 +- 1460 - 3.5 % 8 194820 +- 377 187443 +- 1229 - 3.7 % So the accounting doesn't come for free - it takes ~4% of performance. I believe we could optimize it by using per cpu batching not only on charge, but also on uncharge in memcg core, but that's beyond the scope of this patch set - I'll take a look at this later. Anyway, if performance impact is found to be unacceptable, it is always possible to disable kmem accounting at boot time (cgroup.memory=nokmem) or not use memory cgroups at runtime at all (thanks to jump labels there'll be no overhead even if they are compiled in). Link: http://lkml.kernel.org/r/fcfe6cae27a59fbc5e40145664b3cf085a560c68.1464079538.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
d86133b	Vladimir Davydov	26 July 2016, 22:24:33 UTC	pipe: account to kmemcg Pipes can consume a significant amount of system memory, hence they should be accounted to kmemcg. This patch marks pipe_inode_info and anonymous pipe buffer page allocations as __GFP_ACCOUNT so that they would be charged to kmemcg. Note, since a pipe buffer page can be "stolen" and get reused for other purposes, including mapping to userspace, we clear PageKmemcg thus resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which is introduced by this patch. A note regarding anon_pipe_buf_steal implementation. We allow to steal the page if its ref count equals 1. It looks racy, but it is correct for anonymous pipe buffer pages, because: - We lock out all other pipe users, because ->steal is called with pipe_lock held, so the page can't be spliced to another pipe from under us. - The page is not on LRU and it never was. - Thus a parallel thread can access it only by PFN. Although this is quite possible (e.g. see page_idle_get_page and balloon_page_isolate) this is not dangerous, because all such functions do is increase page ref count, check if the page is the one they are looking for, and decrease ref count if it isn't. Since our page is clean except for PageKmemcg mark, which doesn't conflict with other _mapcount users, the worst that can happen is we see page_count > 2 due to a transient ref, in which case we false-positively abort ->steal, which is still fine, because ->steal is not guaranteed to succeed. Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanza Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
3e79ec7	Vladimir Davydov	26 July 2016, 22:24:30 UTC	arch: x86: charge page tables to kmemcg Page tables can bite a relatively big chunk off system memory and their allocations are easy to trigger from userspace, so they should be accounted to kmemcg. This patch marks page table allocations as __GFP_ACCOUNT for x86. Note we must not charge allocations of kernel page tables, because they can be shared among processes from different cgroups so accounting them to a particular one can pin other cgroups for indefinitely long. So we clear __GFP_ACCOUNT flag if a page table is allocated for the kernel. Link: http://lkml.kernel.org/r/7d5c54f6a2bcbe76f03171689440003d87e6c742.1464079538.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
5e8d35f	Vladimir Davydov	26 July 2016, 22:24:27 UTC	mm: memcontrol: teach uncharge_list to deal with kmem pages Page table pages are batched-freed in release_pages on most architectures. If we want to charge them to kmemcg (this is what is done later in this series), we need to teach mem_cgroup_uncharge_list to handle kmem pages. Link: http://lkml.kernel.org/r/18d5c09e97f80074ed25b97a7d0f32b95d875717.1464079538.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
4949148	Vladimir Davydov	26 July 2016, 22:24:24 UTC	mm: charge/uncharge kmemcg from generic page allocator paths Currently, to charge a non-slab allocation to kmemcg one has to use alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with this helper should finally be freed using free_kmem_pages, otherwise it won't be uncharged. This API suits its current users fine, but it turns out to be impossible to use along with page reference counting, i.e. when an allocation is supposed to be freed with put_page, as it is the case with pipe or unix socket buffers. To overcome this limitation, this patch moves charging/uncharging to generic page allocator paths, i.e. to __alloc_pages_nodemask and free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way, one can use any of the available page allocation functions to get the allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT, just like in case of kmalloc and friends. A charged page will be automatically uncharged on free. To make it possible, we need to mark pages charged to kmemcg somehow. To avoid introducing a new page flag, we make use of page->_mapcount for marking such pages. Since pages charged to kmemcg are not supposed to be mapped to userspace, it should work just fine. There are other (ab)users of page->_mapcount - buddy and balloon pages - but we don't conflict with them. In case kmemcg is compiled out or not used at runtime, this patch introduces no overhead to generic page allocator paths. If kmemcg is used, it will be plus one gfp flags check on alloc and plus one page->_mapcount check on free, which shouldn't hurt performance, because the data accessed are hot. Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
4526477	Vladimir Davydov	26 July 2016, 22:24:21 UTC	mm: memcontrol: cleanup kmem charge functions - Handle memcg_kmem_enabled check out to the caller. This reduces the number of function definitions making the code easier to follow. At the same time it doesn't result in code bloat, because all of these functions are used only in one or two places. - Move __GFP_ACCOUNT check to the caller as well so that one wouldn't have to dive deep into memcg implementation to see which allocations are charged and which are not. - Refresh comments. Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
632c0a1	Vladimir Davydov	26 July 2016, 22:24:18 UTC	mm: clean up non-standard page->_mapcount users - Add a proper comment to page->_mapcount. - Introduce a macro for generating helper functions. - Place all special page->_mapcount values next to each other so that readers can see all possible values and so we don't get duplicates. Link: http://lkml.kernel.org/r/502f49000e0b63e6c62e338fac6b420bf34fb526.1464079537.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
99691ad	Vladimir Davydov	26 July 2016, 22:24:16 UTC	mm: remove pointless struct in struct page definition This patchset implements per kmemcg accounting of page tables (x86-only), pipe buffers, and unix socket buffers. Patches 1-3 are just cleanups that are not supposed to introduce any functional changes. Patches 4 and 5 move charge/uncharge to generic page allocator paths for the sake of accounting pipe and unix socket buffers. Patches 5-7 make x86 page tables, pipe buffers, and unix socket buffers accountable. This patch (of 8): ... to reduce indentation level thus leaving more space for comments. Link: http://lkml.kernel.org/r/f34ffe70fce2b0b9220856437f77972d67c14275.1464079537.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
e77b085	Aneesh Kumar K.V	26 July 2016, 22:24:12 UTC	mm/mmu_gather: track page size with mmu gather and force flush if page size change This allows an arch which needs to do special handing with respect to different page size when flushing tlb to implement the same in mmu gather. Link: http://lkml.kernel.org/r/1465049193-22197-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
e9d55e1	Aneesh Kumar K.V	26 July 2016, 22:24:09 UTC	mm: change the interface for __tlb_remove_page() This updates the generic and arch specific implementation to return true if we need to do a tlb flush. That means if a __tlb_remove_page indicate a flush is needed, the page we try to remove need to be tracked and added again after the flush. We need to track it because we have already update the pte to none and we can't just loop back. This change is done to enable us to do a tlb_flush when we try to flush a range that consists of different page sizes. For architectures like ppc64, we can do a range based tlb flush and we need to track page size for that. When we try to remove a huge page, we will force a tlb flush and starts a new mmu gather. [aneesh.kumar@linux.vnet.ibm.com: mm-change-the-interface-for-__tlb_remove_page-v3] Link: http://lkml.kernel.org/r/1465049193-22197-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1464860389-29019-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
31d49da	Aneesh Kumar K.V	26 July 2016, 22:24:06 UTC	mm/hugetlb: simplify hugetlb unmap For hugetlb like THP (and unlike regular page), we do tlb flush after dropping ptl. Because of the above, we don't need to track force_flush like we do now. Instead we can simply call tlb_remove_page() which will do the flush if needed. No functionality change in this patch. Link: http://lkml.kernel.org/r/1465049193-22197-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
337d9ab	Naoya Horiguchi	26 July 2016, 22:24:03 UTC	mm: thp: check pmd_trans_unstable() after split_huge_pmd() split_huge_pmd() doesn't guarantee that the pmd is normal pmd pointing to pte entries, which can be checked with pmd_trans_unstable(). Some callers make this assertion and some do it differently and some not, so let's do it in a unified manner. Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
e3a2713	Joonsoo Kim	26 July 2016, 22:24:01 UTC	mm/page_isolation: clean up confused code When there is an isolated_page, post_alloc_hook() is called with page but __free_pages() is called with isolated_page. Since they are the same so no problem but it's very confusing. To reduce it, this patch changes isolated_page to boolean type and uses page variable consistently. Link: http://lkml.kernel.org/r/1466150259-27727-10-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
46f24fd	Joonsoo Kim	26 July 2016, 22:23:58 UTC	mm/page_alloc: introduce post allocation processing on page allocator This patch is motivated from Hugh and Vlastimil's concern [1]. There are two ways to get freepage from the allocator. One is using normal memory allocation API and the other is __isolate_free_page() which is internally used for compaction and pageblock isolation. Later usage is rather tricky since it doesn't do whole post allocation processing done by normal API. One problematic thing I already know is that poisoned page would not be checked if it is allocated by __isolate_free_page(). Perhaps, there would be more. We could add more debug logic for allocated page in the future and this separation would cause more problem. I'd like to fix this situation at this time. Solution is simple. This patch commonize some logic for newly allocated page and uses it on all sites. This will solve the problem. [1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E [iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3] Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
f2ca0b5	Joonsoo Kim	26 July 2016, 22:23:55 UTC	mm/page_owner: use stackdepot to store stacktrace Currently, we store each page's allocation stacktrace on corresponding page_ext structure and it requires a lot of memory. This causes the problem that memory tight system doesn't work well if page_owner is enabled. Moreover, even with this large memory consumption, we cannot get full stacktrace because we allocate memory at boot time and just maintain 8 stacktrace slots to balance memory consumption. We could increase it to more but it would make system unusable or change system behaviour. To solve the problem, this patch uses stackdepot to store stacktrace. It obviously provides memory saving but there is a drawback that stackdepot could fail. stackdepot allocates memory at runtime so it could fail if system has not enough memory. But, most of allocation stack are generated at very early time and there are much memory at this time. So, failure would not happen easily. And, one failure means that we miss just one page's allocation stacktrace so it would not be a big problem. In this patch, when memory allocation failure happens, we store special stracktrace handle to the page that is failed to save stacktrace. With it, user can guess memory usage properly even if failure happens. Memory saving looks as following. (4GB memory system with page_owner) (before the patch -> after the patch) static allocation: 92274688 bytes -> 25165824 bytes dynamic allocation after boot + kernel build: 0 bytes -> 327680 bytes total: 92274688 bytes -> 25493504 bytes 72% reduction in total. Note that implementation looks complex than someone would imagine because there is recursion issue. stackdepot uses page allocator and page_owner is called at page allocation. Using stackdepot in page_owner could re-call page allcator and then page_owner. That is a recursion. To detect and avoid it, whenever we obtain stacktrace, recursion is checked and page_owner is set to dummy information if found. Dummy information means that this page is allocated for page_owner feature itself (such as stackdepot) and it's understandable behavior for user. [iamjoonsoo.kim@lge.com: mm-page_owner-use-stackdepot-to-store-stacktrace-v3] Link: http://lkml.kernel.org/r/1464230275-25791-6-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1466150259-27727-7-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1464230275-25791-6-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
3713767	Joonsoo Kim	26 July 2016, 22:23:52 UTC	tools/vm/page_owner: increase temporary buffer size Page owner will be changed to store more deep stacktrace so current temporary buffer size isn't enough. Increase it. Link: http://lkml.kernel.org/r/1464230275-25791-5-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
a9627bc	Joonsoo Kim	26 July 2016, 22:23:49 UTC	mm/page_owner: introduce split_page_owner and replace manual handling split_page() calls set_page_owner() to set up page_owner to each pages. But, it has a drawback that head page and the others have different stacktrace because callsite of set_page_owner() is slightly differnt. To avoid this problem, this patch copies head page's page_owner to the others. It needs to introduce new function, split_page_owner() but it also remove the other function, get_page_owner_gfp() so looks good to do. Link: http://lkml.kernel.org/r/1464230275-25791-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
a8efe1c	Joonsoo Kim	26 July 2016, 22:23:46 UTC	mm/page_owner: copy last_migrate_reason in copy_page_owner() Currently, copy_page_owner() doesn't copy all the owner information. It skips last_migrate_reason because copy_page_owner() is used for migration and it will be properly set soon. But, following patch will use copy_page_owner() and this skip will cause the problem that allocated page has uninitialied last_migrate_reason. To prevent it, this patch also copy last_migrate_reason in copy_page_owner(). Link: http://lkml.kernel.org/r/1464230275-25791-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
83358ec	Joonsoo Kim	26 July 2016, 22:23:43 UTC	mm/page_owner: initialize page owner without holding the zone lock It's not necessary to initialized page_owner with holding the zone lock. It would cause more contention on the zone lock although it's not a big problem since it is just debug feature. But, it is better than before so do it. This is also preparation step to use stackdepot in page owner feature. Stackdepot allocates new pages when there is no reserved space and holding the zone lock in this case will cause deadlock. Link: http://lkml.kernel.org/r/1464230275-25791-2-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
66c6422	Joonsoo Kim	26 July 2016, 22:23:40 UTC	mm/compaction: split freepages without holding the zone lock We don't need to split freepages with holding the zone lock. It will cause more contention on zone lock so not desirable. [rientjes@google.com: if __isolate_free_page() fails, avoid adding to freelist so we don't call map_pages() with it] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211447001.43430@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/1464230275-25791-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
3b1d9ca	Minchan Kim	26 July 2016, 22:23:37 UTC	zsmalloc: use OBJ_TAG_BIT for bit shifter Static check warns using tag as bit shifter. It doesn't break current working but not good for redability. Let's use OBJ_TAG_BIT as bit shifter instead of OBJ_ALLOCATED_TAG. Link: http://lkml.kernel.org/r/20160607045146.GF26230@bbox Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
9bc482d	Minchan Kim	26 July 2016, 22:23:34 UTC	zram: use __GFP_MOVABLE for memory allocation Zsmalloc is ready for page migration so zram can use __GFP_MOVABLE from now on. I did test to see how it helps to make higher order pages. Test scenario is as follows. KVM guest, 1G memory, ext4 formated zram block device, for i in `seq 1 8`; do dd if=/dev/vda1 of=mnt/test$i.txt bs=128M count=1 & done wait `pidof dd` for i in `seq 1 2 8`; do rm -rf mnt/test$i.txt done fstrim -v mnt echo "init" cat /proc/buddyinfo echo "compaction" echo 1 > /proc/sys/vm/compact_memory cat /proc/buddyinfo old: init Node 0, zone DMA 208 120 51 41 11 0 0 0 0 0 0 Node 0, zone DMA32 16380 13777 9184 3805 789 54 3 0 0 0 0 compaction Node 0, zone DMA 132 82 40 39 16 2 1 0 0 0 0 Node 0, zone DMA32 5219 5526 4969 3455 1831 677 139 15 0 0 0 new: init Node 0, zone DMA 379 115 97 19 2 0 0 0 0 0 0 Node 0, zone DMA32 18891 16774 10862 3947 637 21 0 0 0 0 0 compaction Node 0, zone DMA 214 66 87 29 10 3 0 0 0 0 0 Node 0, zone DMA32 1612 3139 3154 2469 1745 990 384 94 7 0 0 As you can see, compaction made so many high-order pages. Yay! Link: http://lkml.kernel.org/r/1464736881-24886-13-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
48b4800	Minchan Kim	26 July 2016, 22:23:31 UTC	zsmalloc: page migration support This patch introduces run-time migration feature for zspage. For migration, VM uses page.lru field so it would be better to not use page.next field which is unified with page.lru for own purpose. For that, firstly, we can get first object offset of the page via runtime calculation instead of using page.index so we can use page.index as link for page chaining instead of page.next. In case of huge object, it stores handle to page.index instead of next link of page chaining because huge object doesn't need to next link for page chaining. So get_next_page need to identify huge object to return NULL. For it, this patch uses PG_owner_priv_1 flag of the page flag. For migration, it supports three functions * zs_page_isolate It isolates a zspage which includes a subpage VM want to migrate from class so anyone cannot allocate new object from the zspage. We could try to isolate a zspage by the number of subpage so subsequent isolation trial of other subpage of the zpsage shouldn't fail. For that, we introduce zspage.isolated count. With that, zs_page_isolate can know whether zspage is already isolated or not for migration so if it is isolated for migration, subsequent isolation trial can be successful without trying further isolation. * zs_page_migrate First of all, it holds write-side zspage->lock to prevent migrate other subpage in zspage. Then, lock all objects in the page VM want to migrate. The reason we should lock all objects in the page is due to race between zs_map_object and zs_page_migrate. zs_map_object zs_page_migrate pin_tag(handle) obj = handle_to_obj(handle) obj_to_location(obj, &page, &obj_idx); write_lock(&zspage->lock) if (!trypin_tag(handle)) goto unpin_object zspage = get_zspage(page); read_lock(&zspage->lock); If zs_page_migrate doesn't do trypin_tag, zs_map_object's page can be stale by migration so it goes crash. If it locks all of objects successfully, it copies content from old page to new one, finally, create new zspage chain with new page. And if it's last isolated subpage in the zspage, put the zspage back to class. * zs_page_putback It returns isolated zspage to right fullness_group list if it fails to migrate a page. If it find a zspage is ZS_EMPTY, it queues zspage freeing to workqueue. See below about async zspage freeing. This patch introduces asynchronous zspage free. The reason to need it is we need page_lock to clear PG_movable but unfortunately, zs_free path should be atomic so the apporach is try to grab page_lock. If it got page_lock of all of pages successfully, it can free zspage immediately. Otherwise, it queues free request and free zspage via workqueue in process context. If zs_free finds the zspage is isolated when it try to free zspage, it delays the freeing until zs_page_putback finds it so it will free free the zspage finally. In this patch, we expand fullness_list from ZS_EMPTY to ZS_FULL. First of all, it will use ZS_EMPTY list for delay freeing. And with adding ZS_FULL list, it makes to identify whether zspage is isolated or not via list_empty(&zspage->list) test. [minchan@kernel.org: zsmalloc: keep first object offset in struct page] Link: http://lkml.kernel.org/r/1465788015-23195-1-git-send-email-minchan@kernel.org [minchan@kernel.org: zsmalloc: zspage sanity check] Link: http://lkml.kernel.org/r/20160603010129.GC3304@bbox Link: http://lkml.kernel.org/r/1464736881-24886-12-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
bfd093f	Minchan Kim	26 July 2016, 22:23:28 UTC	zsmalloc: use freeobj for index Zsmalloc stores first free object's <PFN, obj_idx> position into freeobj in each zspage. If we change it with index from first_page instead of position, it makes page migration simple because we don't need to correct other entries for linked list if a page is migrated out. Link: http://lkml.kernel.org/r/1464736881-24886-11-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC
4aa409c	Minchan Kim	26 July 2016, 22:23:26 UTC	zsmalloc: separate free_zspage from putback_zspage Currently, putback_zspage does free zspage under class->lock if fullness become ZS_EMPTY but it makes trouble to implement locking scheme for new zspage migration. So, this patch is to separate free_zspage from putback_zspage and free zspage out of class->lock which is preparation for zspage migration. Link: http://lkml.kernel.org/r/1464736881-24886-10-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	26 July 2016, 23:19:19 UTC

Newer
Older