https://github.com/torvalds/linux

sort by:
Revision Author Date Message Commit Date
189fc7c ath6kl: Fix the byte alignment rule to avoid loss of bytes in a TCP segment Either first 3 bytes of the first received tcp segment or last one over MTU size file can be loss due to the byte alignment problem. Although ATH6KL_HTC_ALIGN_BYTES was defined for 'extra bytes for htc header alignment' in the patch "Fix buffer alignment for scatter-gather I/O"(1df94a857), there exists the bytes loss issue which means that it will be truncated 3 bytes in the transmitted file contents if a file which has over MTU size is transferred through TCP/IP stack. It doesn't look like TCP/IP stack bug of 3.5 or the latest version of kernel but the byte alignment issue. This patch is to use the roundup() function for the byte alignment rather than the predefined ATH6KL_HTC_ALIGN_BYTES. Signed-off-by: Myoungje Kim <mjei78@gmail.com> 27 February 2013, 05:04:05 UTC
c41b381 Merge tag 'pm+acpi-fixes-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management fixes from Rafael Wysocki: - Fixes for blackfin and microblaze build problems introduced by the removal of global pm_idle. From Lars-Peter Clausen. - OPP core build fix from Shawn Guo. - Error condition check fix for the new imx6q-cpufreq driver from Wei Yongjun. - Fix for an AER driver crash related to the lack of APEI initialization for acpi=off. From Rafael J Wysocki. - Fix for a USB breakage on Thinkpad T430 related to ACPI power resources and PCI wakeup from Rafael J. Wysocki. * tag 'pm+acpi-fixes-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / PM: Take unusual configurations of power resources into account imx6q-cpufreq: fix return value check in imx6q_cpufreq_probe() PM / OPP: fix condition for empty of_init_opp_table() ACPI / APEI: Fix crash in apei_hest_parse() for acpi=off microblaze idle: Fix compile error blackfin idle: Fix compile error 26 February 2013, 05:25:17 UTC
556f12f Merge tag 'pci-v3.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI changes from Bjorn Helgaas: "Host bridge hotplug - Major overhaul of ACPI host bridge add/start (Rafael Wysocki, Yinghai Lu) - Major overhaul of PCI/ACPI binding (Rafael Wysocki, Yinghai Lu) - Split out ACPI host bridge and ACPI PCI device hotplug (Yinghai Lu) - Stop caching _PRT and make independent of bus numbers (Yinghai Lu) PCI device hotplug - Clean up cpqphp dead code (Sasha Levin) - Disable ARI unless device and upstream bridge support it (Yijing Wang) - Initialize all hot-added devices (not functions 0-7) (Yijing Wang) Power management - Don't touch ASPM if disabled (Joe Lawrence) - Fix ASPM link state management (Myron Stowe) Miscellaneous - Fix PCI_EXP_FLAGS accessor (Alex Williamson) - Disable Bus Master in pci_device_shutdown (Konstantin Khlebnikov) - Document hotplug resource and MPS parameters (Yijing Wang) - Add accessor for PCIe capabilities (Myron Stowe) - Drop pciehp suspend/resume messages (Paul Bolle) - Make pci_slot built-in only (not a module) (Jiang Liu) - Remove unused PCI/ACPI bind ops (Jiang Liu) - Removed used pci_root_bus (Bjorn Helgaas)" * tag 'pci-v3.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (51 commits) PCI/ACPI: Don't cache _PRT, and don't associate them with bus numbers PCI: Fix PCI Express Capability accessors for PCI_EXP_FLAGS ACPI / PCI: Make pci_slot built-in only, not a module PCI/PM: Clear state_saved during suspend PCI: Use atomic_inc_return() rather than atomic_add_return() PCI: Catch attempts to disable already-disabled devices PCI: Disable Bus Master unconditionally in pci_device_shutdown() PCI: acpiphp: Remove dead code for PCI host bridge hotplug PCI: acpiphp: Create companion ACPI devices before creating PCI devices PCI: Remove unused "rc" in virtfn_add_bus() PCI: pciehp: Drop suspend/resume ENTRY messages PCI/ASPM: Don't touch ASPM if forcibly disabled PCI/ASPM: Deallocate upstream link state even if device is not PCIe PCI: Document MPS parameters pci=pcie_bus_safe, pci=pcie_bus_perf, etc PCI: Document hpiosize= and hpmemsize= resource reservation parameters PCI: Use PCI Express Capability accessor PCI: Introduce accessor to retrieve PCIe Capabilities Register PCI: Put pci_dev in device tree as early as possible PCI: Skip attaching driver in device_add() PCI: acpiphp: Keep driver loaded even if no slots found ... 26 February 2013, 05:18:18 UTC
fffddfd Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux Pull drm merge from Dave Airlie: "Highlights: - TI LCD controller KMS driver - TI OMAP KMS driver merged from staging - drop gma500 stub driver - the fbcon locking fixes - the vgacon dirty like zebra fix. - open firmware videomode and hdmi common code helpers - major locking rework for kms object handling - pageflip/cursor won't block on polling anymore! - fbcon helper and prime helper cleanups - i915: all over the map, haswell power well enhancements, valleyview macro horrors cleaned up, killing lots of legacy GTT code, - radeon: CS ioctl unification, deprecated UMS support, gpu reset rework, VM fixes - nouveau: reworked thermal code, external dp/tmds encoder support (anx9805), fences sleep instead of polling, - exynos: all over the driver fixes." Lovely conflict in radeon/evergreen_cs.c between commit de0babd60d8d ("drm/radeon: enforce use of radeon_get_ib_value when reading user cmd") and the new changes that modified that evergreen_dma_cs_parse() function. * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (508 commits) drm/tilcdc: only build on arm drm/i915: Revert hdmi HDP pin checks drm/tegra: Add list of framebuffers to debugfs drm/tegra: Fix color expansion drm/tegra: Split DC_CMD_STATE_CONTROL register write drm/tegra: Implement page-flipping support drm/tegra: Implement VBLANK support drm/tegra: Implement .mode_set_base() drm/tegra: Add plane support drm/tegra: Remove bogus tegra_framebuffer structure drm: Add consistency check for page-flipping drm/radeon: Use generic HDMI infoframe helpers drm/tegra: Use generic HDMI infoframe helpers drm: Add EDID helper documentation drm: Add HDMI infoframe helpers video: Add generic HDMI infoframe helpers drm: Add some missing forward declarations drm: Move mode tables to drm_edid.c drm: Remove duplicate drm_mode_cea_vic() gma500: Fix n, m1 and m2 clock limits for sdvo and lvds ... 26 February 2013, 00:46:44 UTC
69086a7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fix from Al Viro: "Fix for 3.8 breakage introduced by "vfs: Allow unprivileged manipulation of the mount namespace" - accessing mnt->mnt_ns is done there without needed locking *and* without any real need. Definite -stable fodder, fortunately not going too far back. This is *not* all - there will be much bigger vfs pull request tomorrow." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: get rid of unprotected dereferencing of mnt->mnt_ns 26 February 2013, 00:17:01 UTC
94f2f14 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace and namespace infrastructure changes from Eric W Biederman: "This set of changes starts with a few small enhnacements to the user namespace. reboot support, allowing more arbitrary mappings, and support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the user namespace root. I do my best to document that if you care about limiting your unprivileged users that when you have the user namespace support enabled you will need to enable memory control groups. There is a minor bug fix to prevent overflowing the stack if someone creates way too many user namespaces. The bulk of the changes are a continuation of the kuid/kgid push down work through the filesystems. These changes make using uids and gids typesafe which ensures that these filesystems are safe to use when multiple user namespaces are in use. The filesystems converted for 3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The changes for these filesystems were a little more involved so I split the changes into smaller hopefully obviously correct changes. XFS is the only filesystem that remains. I was hoping I could get that in this release so that user namespace support would be enabled with an allyesconfig or an allmodconfig but it looks like the xfs changes need another couple of days before it they are ready." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits) cifs: Enable building with user namespaces enabled. cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t cifs: Convert struct cifs_sb_info to use kuids and kgids cifs: Modify struct smb_vol to use kuids and kgids cifs: Convert struct cifsFileInfo to use a kuid cifs: Convert struct cifs_fattr to use kuid and kgids cifs: Convert struct tcon_link to use a kuid. cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t cifs: Convert from a kuid before printing current_fsuid cifs: Use kuids and kgids SID to uid/gid mapping cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc cifs: Use BUILD_BUG_ON to validate uids and gids are the same size cifs: Override unmappable incoming uids and gids nfsd: Enable building with user namespaces enabled. nfsd: Properly compare and initialize kuids and kgids nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids nfsd: Modify nfsd4_cb_sec to use kuids and kgids nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion nfsd: Convert nfsxdr to use kuids and kgids nfsd: Convert nfs3xdr to use kuids and kgids ... 26 February 2013, 00:00:49 UTC
8d168f7 Merge git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM ARM compile fixes from Gleb Natapov: "Fix ARM KVM compilation breakage due to changes from kvm.git" * git://git.kernel.org/pub/scm/virt/kvm/kvm: ARM: KVM: fix compilation after removal of user_alloc from struct kvm_memory_slot ARM: KVM: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS ARM: KVM: fix kvm_arch_{prepare,commit}_memory_region 25 February 2013, 23:59:37 UTC
32dc43e Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto update from Herbert Xu: "Here is the crypto update for 3.9: - Added accelerated implementation of crc32 using pclmulqdq. - Added test vector for fcrypt. - Added support for OMAP4/AM33XX cipher and hash. - Fixed loose crypto_user input checks. - Misc fixes" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (43 commits) crypto: user - ensure user supplied strings are nul-terminated crypto: user - fix empty string test in report API crypto: user - fix info leaks in report API crypto: caam - Added property fsl,sec-era in SEC4.0 device tree binding. crypto: use ERR_CAST crypto: atmel-aes - adjust duplicate test crypto: crc32-pclmul - Kill warning on x86-32 crypto: x86/twofish - assembler clean-ups: use ENTRY/ENDPROC, localize jump labels crypto: x86/sha1 - assembler clean-ups: use ENTRY/ENDPROC crypto: x86/serpent - use ENTRY/ENDPROC for assember functions and localize jump targets crypto: x86/salsa20 - assembler cleanup, use ENTRY/ENDPROC for assember functions and rename ECRYPT_* to salsa20_* crypto: x86/ghash - assembler clean-up: use ENDPROC at end of assember functions crypto: x86/crc32c - assembler clean-up: use ENTRY/ENDPROC crypto: cast6-avx: use ENTRY()/ENDPROC() for assembler functions crypto: cast5-avx: use ENTRY()/ENDPROC() for assembler functions and localize jump targets crypto: camellia-x86_64/aes-ni: use ENTRY()/ENDPROC() for assembler functions and localize jump targets crypto: blowfish-x86_64: use ENTRY()/ENDPROC() for assembler functions and localize jump targets crypto: aesni-intel - add ENDPROC statements for assembler functions crypto: x86/aes - assembler clean-ups: use ENTRY/ENDPROC, localize jump targets crypto: testmgr - add test vector for fcrypt ... 25 February 2013, 23:56:15 UTC
be88298 drm/tilcdc: only build on arm [airlied: hack for now until we fix cma helpers on other OF platforms] Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Dave Airlie <airlied@linux.ie> 25 February 2013, 23:54:48 UTC
d414c10 Merge tag 'please-pull-vm_unwrapped' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull ia64 update from Tony Luck: "ia64 vm patch series that was cooking in -mm tree" * tag 'please-pull-vm_unwrapped' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: mm: use vm_unmapped_area() in hugetlbfs on ia64 architecture mm: use vm_unmapped_area() on ia64 architecture 25 February 2013, 23:47:03 UTC
f6d43b9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem fixes from James Morris: "From Mimi: Both of these patches are bug fixes for patches, which were upstreamed in this open window. The first patch addresses a merge issue. The second patch addresses a CONFIG_BLOCK dependency." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: block: fix part_pack_uuid() build error ima: "remove enforce checking duplication" merge fix 25 February 2013, 23:45:29 UTC
c69d0a1 Merge tag 'ktest-v3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest Pull ktest update from Steven Rostedt: "Added ability to have all builds test warnings. Fixed failing reboot when the reboot produces a non fatal error. Config reading fixes and other cleanups" * tag 'ktest-v3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest: ktest: Remove indexes from warnings check ktest: Ignore warnings during reboot ktest: Search for linux banner for successful reboot ktest: Add make_warnings_file and process full warnings ktest: Allow a test option to use its default option ktest: Strip off '\n' when reading which files were modified ktest: Do not require CONSOLE for build or install bisects 25 February 2013, 23:43:21 UTC
9043a26 Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module update from Rusty Russell: "The sweeping change is to make add_taint() explicitly indicate whether to disable lockdep, but it's a mechanical change." * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: MODSIGN: Add option to not sign modules during modules_install MODSIGN: Add -s <signature> option to sign-file MODSIGN: Specify the hash algorithm on sign-file command line MODSIGN: Simplify Makefile with a Kconfig helper module: clean up load_module a little more. modpost: Ignore ARC specific non-alloc sections module: constify within_module_* taint: add explicit flag to show whether lock dep is still OK. module: printk message when module signature fail taints kernel. 25 February 2013, 23:41:43 UTC
446d64e block: fix part_pack_uuid() build error Commit "85865c1 ima: add policy support for file system uuid" introduced a CONFIG_BLOCK dependency. This patch defines a wrapper called blk_part_pack_uuid(), which returns -EINVAL, when CONFIG_BLOCK is not defined. security/integrity/ima/ima_policy.c:538:4: error: implicit declaration of function 'part_pack_uuid' [-Werror=implicit-function-declaration] Changelog v2: - Reference commit number in patch description Changelog v1: - rename ima_part_pack_uuid() to blk_part_pack_uuid() - resolve scripts/checkpatch.pl warnings Changelog v0: - fix UUID scripts/Lindent msgs Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: David Rientjes <rientjes@google.com> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: James Morris <james.l.morris@oracle.com> 25 February 2013, 16:10:52 UTC
a2c2c3a ima: "remove enforce checking duplication" merge fix Commit "750943a ima: remove enforce checking duplication" combined the 'in IMA policy' and 'enforcing file integrity' checks. For the non-file, kernel module verification, a specific check for 'enforcing file integrity' was not added. This patch adds the check. Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Signed-off-by: James Morris <james.l.morris@oracle.com> 25 February 2013, 15:46:38 UTC
3b8cd8a ARM: KVM: fix compilation after removal of user_alloc from struct kvm_memory_slot Commit 7a905b1 (KVM: Remove user_alloc from struct kvm_memory_slot) broke KVM/ARM by removing the user_alloc field from a public structure. As we only used this field to alert the user that we didn't support this operation mode, there is no harm in discarding this bit of code without any remorse. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> 25 February 2013, 09:48:41 UTC
2b5e1e4 ARM: KVM: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS Commit bbacc0c (KVM: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS) broke KVM/ARM by changing a global #define. Apply the same change to fix the compilation breakage. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> 25 February 2013, 09:47:59 UTC
bef103a ARM: KVM: fix kvm_arch_{prepare,commit}_memory_region Commit f82a8cfe9 (KVM: struct kvm_memory_slot.user_alloc -> bool) broke the ARM KVM port by changing the prototype of two global functions. Apply the same change to fix the compilation breakage. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> 25 February 2013, 09:47:28 UTC
ab78265 Merge tag 'mfd-3.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6 Pull MFS updates from Samuel Ortiz: "This is the MFD pull request for the 3.9 merge window. No new drivers this time, but a bunch of fairly big cleanups: - Roger Quadros worked on a OMAP USBHS and TLL platform data consolidation, OMAP5 support and clock management code cleanup. - The first step of a major sync for the ab8500 driver from Lee Jones. In particular, the debugfs and the sysct interfaces got extended and improved. - Peter Ujfalusi sent a nice patchset for cleaning and fixing the twl-core driver, with a much needed module id lookup code improvement. - The regular wm5102 and arizona cleanups and fixes from Mark Brown. - Laxman Dewangan extended the palmas APIs in order to implement the palmas GPIO and rt drivers. - Laxman also added DT support for the tps65090 driver. - The Intel SCH and ICH drivers got a couple fixes from Aaron Sierra and Darren Hart. - Linus Walleij patchset for the ab8500 driver allowed ab8500 and ab9540 based devices to switch to the new abx500 pin-ctrl driver. - The max8925 now has device tree and irqdomain support thanks to Qing Xu. - The recently added rtsx driver got a few cleanups and fixes for a better card detection code path and now also supports the RTS5227 chipset, thanks to Wei Wang and Roger Tseng." * tag 'mfd-3.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6: (109 commits) mfd: lpc_ich: Use devres API to allocate private data mfd: lpc_ich: Add Device IDs for Intel Wellsburg PCH mfd: lpc_sch: Accomodate partial population of the MFD devices mfd: da9052-i2c: Staticize da9052_i2c_fix() mfd: syscon: Fix sparse warning mfd: twl-core: Fix kernel panic on boot mfd: rtsx: Fix issue that booting OS with SD card inserted mfd: ab8500: Fix compile error mfd: Add missing GENERIC_HARDIRQS dependecies Documentation: Add docs for max8925 dt mfd: max8925: Add dts mfd: max8925: Support dt for backlight mfd: max8925: Fix onkey driver irq base mfd: max8925: Fix mfd device register failure mfd: max8925: Add irqdomain for dt mfd: vexpress: Allow vexpress-sysreg to self-initialise mfd: rtsx: Support RTS5227 mfd: rtsx: Implement driving adjustment to device-dependent callbacks mfd: vexpress: Add pseudo-GPIO based LEDs mfd: ab8500: Rename ab8500 to abx500 for hwmon driver ... 25 February 2013, 04:00:58 UTC
21fbd58 Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media updates from Mauro Carvalho Chehab: - Some cleanups at V4L2 documentation - new drivers: ts2020 frontend, ov9650 sensor, s5c73m3 sensor, sh-mobile veu mem2mem driver, radio-ma901, davinci_vpfe staging driver - Lots of missing MAINTAINERS entries added - several em28xx driver improvements, including its conversion to videobuf2 - several fixups on drivers to make them to better comply with the API - DVB core: add support for DVBv5 stats, allowing the implementation of statistics for new standards like ISDB - mb86a20s: add statistics to the driver - lots of new board additions, cleanups, and driver improvements. * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (596 commits) [media] media: Add 0x3009 USB PID to ttusb2 driver (fixed diff) [media] rtl28xxu: Add USB IDs for Compro VideoMate U620F [media] em28xx: add usb id for terratec h5 rev. 3 [media] media: rc: gpio-ir-recv: add support for device tree parsing [media] mceusb: move check earlier to make smatch happy [media] radio-si470x doc: add info about v4l2-ctl and sox+alsa [media] staging: media: Remove unnecessary OOM messages [media] sh_vou: Use vou_dev instead of vou_file wherever possible [media] sh_vou: Use video_drvdata() [media] drivers/media/platform/soc_camera/pxa_camera.c: use devm_ functions [media] mt9t112: mt9t111 format set up differs from mt9t112 [media] sh-mobile-ceu-camera: fix SHARPNESS control default Revert "[media] fc0011: Return early, if the frequency is already tuned" [media] cx18/ivtv: fix regression: remove __init from a non-init function [media] em28xx: fix analog streaming with USB bulk transfers [media] stv0900: remove unnecessary null pointer check [media] fc0011: Return early, if the frequency is already tuned [media] fc0011: Add some sanity checks and cleanups [media] fc0011: Fix xin value clamping Revert "[media] [PATH,1/2] mxl5007 move reset to attach" ... 25 February 2013, 01:35:10 UTC
d9978ec Merge tag 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev Pull libata updates from Jeff Garzik: 1) apply, and then revert, the sysfs export of ATA host controller number. Discussion was continuing after patch application, trying to figure out how to best mesh exported data with the installers, boot-time agents and other parties that want this info. 2) Merge Zero-Power Optical Device Driver (ZPODD) support, bringing the wonderfulness of sane power management to your CD/DVD device. Includes one SCSI-subsystem patch (with appropriate ACKs), adding runtime PM support to 'sr' driver. That is the ZPODD interaction bits. Patchset went through some 13 revisions before it got here; kudos to Intel for persistence. 3) pata_samsung_cf: use devm_clk_get() 4) more ata_piix, ahci PCI IDs 5) Add SATA driver for R-Car SoC 6) Convert libata to use devm_ioremap_resource (Note: I think Greg sent this to you, also) 7) Set proper Sense Key (SK) in the SCSI simulator when ATA passthrough indicates check condition. Google and specification hawks everywhere shall rejoice. * tag 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev: (22 commits) [libata] fix smatch warning for zpodd_wake_dev [libata] Set proper SK when CK_COND is set. [libata] Convert to devm_ioremap_resource() libata: add R-Car SATA driver ahci: Add Device IDs for Intel Wellsburg PCH ata_piix: Add Device IDs for Intel Wellsburg PCH [SCSI] remove can_power_off flag from scsi_device [libata] scsi: no poll when ODD is powered off [SCSI] sr: support runtime pm ahci: AHCI-mode SATA patch for Intel Avoton DeviceIDs ata_piix: IDE-mode SATA patch for Intel Avoton DeviceIDs [libata] PM code cleanup for ata port [libata] pm: differentiate system and runtime pm for ata port Revert "libata: export host controller number thru /sys" libata: do not suspend port if normal ODD is attached libata: expose pm qos flags for ata device libata: handle power transition of ODD libata: check zero power ready status for ZPODD libata: move acpi notification code to zpodd libata: identify and init ZPODD devices ... 25 February 2013, 01:32:15 UTC
a883b70 tty vt: fix character insertion overflow Commit 81732c3b2fed ("tty vt: Fix line garbage in virtual console on command line edition") broke insert_char() in multiple ways. Then commit b1a925f44a3a ("tty vt: Fix a regression in command line edition") partially fixed it. However, the buffer being moved is still too large and overflowing beyond the end of the current line, corrupting existing characters on the next line. Example test case: echo -e "abc\nde\x1b[A\x1b[4h \x1b[4l\x1b[B" Expected result: ab c de Current result: ab c e Needless to say that this is very annoying when inserting words in the middle of paragraphs with certain text editors. Signed-off-by: Nicolas Pitre <nico@linaro.org> Cc: Jean-François Moine <moinejf@free.fr> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 25 February 2013, 01:29:49 UTC
77be36d Merge tag 'stable/for-linus-3.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Pull Xen update from Konrad Rzeszutek Wilk: "This has two new ACPI drivers for Xen - a physical CPU offline/online and a memory hotplug. The way this works is that ACPI kicks the drivers and they make the appropiate hypercall to the hypervisor to tell it that there is a new CPU or memory. There also some changes to the Xen ARM ABIs and couple of fixes. One particularly nasty bug in the Xen PV spinlock code was fixed by Stefan Bader - and has been there since the 2.6.32! Features: - Xen ACPI memory and CPU hotplug drivers - allowing Xen hypervisor to be aware of new CPU and new DIMMs - Cleanups Bug-fixes: - Fixes a long-standing bug in the PV spinlock wherein we did not kick VCPUs that were in a tight loop. - Fixes in the error paths for the event channel machinery" Fix up a few semantic conflicts with the ACPI interface changes in drivers/xen/xen-acpi-{cpu,mem}hotplug.c. * tag 'stable/for-linus-3.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen: event channel arrays are xen_ulong_t and not unsigned long xen: Send spinlock IPI to all waiters xen: introduce xen_remap, use it instead of ioremap xen: close evtchn port if binding to irq fails xen-evtchn: correct comment and error output xen/tmem: Add missing %s in the printk statement. xen/acpi: move xen_acpi_get_pxm under CONFIG_XEN_DOM0 xen/acpi: ACPI cpu hotplug xen/acpi: Move xen_acpi_get_pxm to Xen's acpi.h xen/stub: driver for CPU hotplug xen/acpi: ACPI memory hotplug xen/stub: driver for memory hotplug xen: implement updated XENMEM_add_to_physmap_range ABI xen/smp: Move the common CPU init code a bit to prep for PVH patch. 25 February 2013, 00:18:31 UTC
89f8833 Merge tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Marcelo Tosatti: "KVM updates for the 3.9 merge window, including x86 real mode emulation fixes, stronger memory slot interface restrictions, mmu_lock spinlock hold time reduction, improved handling of large page faults on shadow, initial APICv HW acceleration support, s390 channel IO based virtio, amongst others" * tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits) Revert "KVM: MMU: lazily drop large spte" x86: pvclock kvm: align allocation size to page size KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr x86 emulator: fix parity calculation for AAD instruction KVM: PPC: BookE: Handle alignment interrupts booke: Added DBCR4 SPR number KVM: PPC: booke: Allow multiple exception types KVM: PPC: booke: use vcpu reference from thread_struct KVM: Remove user_alloc from struct kvm_memory_slot KVM: VMX: disable apicv by default KVM: s390: Fix handling of iscs. KVM: MMU: cleanup __direct_map KVM: MMU: remove pt_access in mmu_set_spte KVM: MMU: cleanup mapping-level KVM: MMU: lazily drop large spte KVM: VMX: cleanup vmx_set_cr0(). KVM: VMX: add missing exit names to VMX_EXIT_REASONS array KVM: VMX: disable SMEP feature when guest is in non-paging mode KVM: Remove duplicate text in api.txt Revert "KVM: MMU: split kvm_mmu_free_page" ... 24 February 2013, 21:07:18 UTC
9e2d59a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull signal handling cleanups from Al Viro: "This is the first pile; another one will come a bit later and will contain SYSCALL_DEFINE-related patches. - a bunch of signal-related syscalls (both native and compat) unified. - a bunch of compat syscalls switched to COMPAT_SYSCALL_DEFINE (fixing several potential problems with missing argument validation, while we are at it) - a lot of now-pointless wrappers killed - a couple of architectures (cris and hexagon) forgot to save altstack settings into sigframe, even though they used the (uninitialized) values in sigreturn; fixed. - microblaze fixes for delivery of multiple signals arriving at once - saner set of helpers for signal delivery introduced, several architectures switched to using those." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (143 commits) x86: convert to ksignal sparc: convert to ksignal arm: switch to struct ksignal * passing alpha: pass k_sigaction and siginfo_t using ksignal pointer burying unused conditionals make do_sigaltstack() static arm64: switch to generic old sigaction() (compat-only) arm64: switch to generic compat rt_sigaction() arm64: switch compat to generic old sigsuspend arm64: switch to generic compat rt_sigqueueinfo() arm64: switch to generic compat rt_sigpending() arm64: switch to generic compat rt_sigprocmask() arm64: switch to generic sigaltstack sparc: switch to generic old sigsuspend sparc: COMPAT_SYSCALL_DEFINE does all sign-extension as well as SYSCALL_DEFINE sparc: kill sign-extending wrappers for native syscalls kill sparc32_open() sparc: switch to use of generic old sigaction sparc: switch sys_compat_rt_sigaction() to COMPAT_SYSCALL_DEFINE mips: switch to generic sys_fork() and sys_clone() ... 24 February 2013, 02:50:11 UTC
28ee461 Merge branch 'drm/hdmi-for-3.9' of git://anongit.freedesktop.org/tegra/linux into drm-next Thierry writes: "Remove a duplicate implementation of the CEA VIC lookup and move the CEA and other mode tables to drm_edid.c to make it more difficult to create duplicates of the tables. Add some helpers to pack CEA-861/HDMI AVI, audio and SPD infoframes into binary buffers that can easily be written into hardware registers. A new helper function makes it easy construct an AVI infoframe from a DRM display mode. Convert the Tegra and Radeon drivers to use the new HDMI helpers." * 'drm/hdmi-for-3.9' of git://anongit.freedesktop.org/tegra/linux: drm/radeon: Use generic HDMI infoframe helpers drm/tegra: Use generic HDMI infoframe helpers drm: Add EDID helper documentation drm: Add HDMI infoframe helpers video: Add generic HDMI infoframe helpers drm: Add some missing forward declarations drm: Move mode tables to drm_edid.c drm: Remove duplicate drm_mode_cea_vic() 24 February 2013, 02:39:42 UTC
a497bfe Merge branch 'drm-intel-fixes' of git://people.freedesktop.org/~danvet/drm-intel into drm-next Two regressions fixes from snowboarding land * 'drm-intel-fixes' of git://people.freedesktop.org/~danvet/drm-intel: drm/i915: Revert hdmi HDP pin checks drm/i915: Handle untiled planes when computing their offsets 24 February 2013, 02:39:02 UTC
a3b1097 Merge branch 'drm/tegra-for-3.9' of git://anongit.freedesktop.org/tegra/linux into drm-next Thierry writes: "Add support for 2 hardware overlays found on Tegra. These support YUV pixel formats and can be used as video overlays. .mode_set_base() is implemented and support for VBLANK and page-flipping is added. A few minor bug fixes are also included and a new debugfs file allows to inspect the framebuffers attached to the Tegra DRM device." * 'drm/tegra-for-3.9' of git://anongit.freedesktop.org/tegra/linux: drm/tegra: Add list of framebuffers to debugfs drm/tegra: Fix color expansion drm/tegra: Split DC_CMD_STATE_CONTROL register write drm/tegra: Implement page-flipping support drm/tegra: Implement VBLANK support drm/tegra: Implement .mode_set_base() drm/tegra: Add plane support drm/tegra: Remove bogus tegra_framebuffer structure drm: Add consistency check for page-flipping 24 February 2013, 02:38:22 UTC
5ce1a70 Merge branch 'akpm' (more incoming from Andrew) Merge second patch-bomb from Andrew Morton: - A little DM fix - the MM queue * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (154 commits) ksm: allocate roots when needed mm: cleanup "swapcache" in do_swap_page mm,ksm: swapoff might need to copy mm,ksm: FOLL_MIGRATION do migration_entry_wait ksm: shrink 32-bit rmap_item back to 32 bytes ksm: treat unstable nid like in stable tree ksm: add some comments tmpfs: fix mempolicy object leaks tmpfs: fix use-after-free of mempolicy object mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages mm: export mmu notifier invalidates mm: accelerate mm_populate() treatment of THP pages mm: use long type for page counts in mm_populate() and get_user_pages() mm: accurately document nr_free_*_pages functions with code comments HWPOISON: change order of error_states[]'s elements HWPOISON: fix misjudgement of page_action() for errors on mlocked pages memcg: stop warning on memcg_propagate_kmem net: change type of virtio_chan->p9_max_pages vmscan: change type of vm_total_pages to unsigned long fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used ... 24 February 2013, 01:50:35 UTC
ef53d16 ksm: allocate roots when needed It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically allocated, particularly when very few users will ever actually tune merge_across_nodes 0 to use more than 1+1 of those trees. Not a big deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity. Start off with 1+1 statically allocated, then if merge_across_nodes is ever tuned, allocate for nr_node_ids+nr_node_ids. Do not attempt to free up the extra if it's tuned back, that would be a waste of effort. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:24 UTC
56f3180 mm: cleanup "swapcache" in do_swap_page I dislike the way in which "swapcache" gets used in do_swap_page(): there is always a page from swapcache there (even if maybe uncached by the time we lock it), but tests are made according to "swapcache". Rework that with "page != swapcache", as has been done in unuse_pte(). Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:24 UTC
9e16b7f mm,ksm: swapoff might need to copy Before establishing that KSM page migration was the cause of my WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which in many respects is equivalent to faulting in a page. In fact I've never caught that as the cause: but in theory it does at least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(), to avoid bringing a KSM page back in when it's not supposed to be. I intended to copy how it's done in do_swap_page(), but have a strong aversion to how "swapcache" ends up being used there: rework it with "page != swapcache". Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:23 UTC
5117b3b mm,ksm: FOLL_MIGRATION do migration_entry_wait In "ksm: remove old stable nodes more thoroughly" I said that I'd never seen its WARN_ON_ONCE(page_mapped(page)). True at the time of writing, but it soon appeared once I tried fuller tests on the whole series. It turned out to be due to the KSM page migration itself: unmerge_and_ remove_all_rmap_items() failed to locate and replace all the KSM pages, because of that hiatus in page migration when old pte has been replaced by migration entry, but not yet by new pte. follow_page() finds no page at that instant, but a KSM page reappears shortly after, without a fault. Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait() for KSM's break_cow(). I'd have preferred to avoid another flag, and do it every time, in case someone else makes the same easy mistake; but did not find another transgressor (the common get_user_pages() is of course safe), and cannot be sure that every follow_page() caller is prepared to sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can already sleep there, since anon_vma locking was changed to mutex, but maybe that's somehow excluded. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:23 UTC
bc56620 ksm: shrink 32-bit rmap_item back to 32 bytes Think of struct rmap_item as an extension of struct page (restricted to MADV_MERGEABLE areas): there may be a lot of them, we need to keep them small, especially on 32-bit architectures of limited lowmem. Siting "int nid" after "unsigned int checksum" works nicely on 64-bit, making no change to its 64-byte struct rmap_item; but bloats the 32-bit struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which rounds up to 40 bytes once allocated from slab. We'd better avoid that. Hey, I only just remembered that the anon_vma pointer in struct rmap_item has no purpose until the rmap_item is hung from a stable tree node (which has its own nid field); and rmap_item's nid field no purpose than to say which tree root to tell rb_erase() when unlinking from an unstable tree. Double them up in a union. There's just one place where we set anon_vma early (when we already hold mmap_sem): now we must remove tree_rmap_item from its unstable tree there, before overwriting nid. No need to spatter BUG()s around: we'd be seeing oopses if this were wrong. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:23 UTC
b599cbd ksm: treat unstable nid like in stable tree An inconsistency emerged in reviewing the NUMA node changes to KSM: when meeting a page from the wrong NUMA node in a stable tree, we say that it's okay for comparisons, but not as a leaf for merging; whereas when meeting a page from the wrong NUMA node in an unstable tree, we bail out immediately. Now, it might be that a wrong NUMA node in an unstable tree is more likely to correlate with instablility (different content, with rbnode now misplaced) than page migration; but even so, we are accustomed to instablility in the unstable tree. Without strong evidence for which strategy is generally better, I'd rather be consistent with what's done in the stable tree: accept a page from the wrong NUMA node for comparison, but not as a leaf for merging. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:23 UTC
8fdb3db ksm: add some comments Added slightly more detail to the Documentation of merge_across_nodes, a few comments in areas indicated by review, and renamed get_ksm_page()'s argument from "locked" to "lock_it". No functional change. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:23 UTC
49cd0a5 tmpfs: fix mempolicy object leaks Fix several mempolicy leaks in the tmpfs mount logic. These leaks are slow - on the order of one object leaked per mount attempt. Leak 1 (umount doesn't free mpol allocated in mount): while true; do mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt umount /mnt done Leak 2 (errors parsing remount options will leak mpol): mount -t tmpfs -o size=100M nodev /mnt while true; do mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null done umount /mnt Leak 3 (multiple mpol per mount leak mpol): while true; do mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt umount /mnt done This patch fixes all of the above. I could have broken the patch into three pieces but is seemed easier to review as one. [akpm@linux-foundation.org: fix handling of mpol_parse_str() errors, per Hugh] Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:23 UTC
5f00110 tmpfs: fix use-after-free of mempolicy object The tmpfs remount logic preserves filesystem mempolicy if the mpol=M option is not specified in the remount request. A new policy can be specified if mpol=M is given. Before this patch remounting an mpol bound tmpfs without specifying mpol= mount option in the remount request would set the filesystem's mempolicy object to a freed mempolicy object. To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run: # mkdir /tmp/x # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x # grep /tmp/x /proc/mounts nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0 # mount -o remount,size=200M nodev /tmp/x # grep /tmp/x /proc/mounts nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0 # note ? garbage in mpol=... output above # dd if=/dev/zero of=/tmp/x/f count=1 # panic here Panic: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [< (null)>] (null) [...] Oops: 0010 [#1] SMP DEBUG_PAGEALLOC Call Trace: mpol_shared_policy_init+0xa5/0x160 shmem_get_inode+0x209/0x270 shmem_mknod+0x3e/0xf0 shmem_create+0x18/0x20 vfs_create+0xb5/0x130 do_last+0x9a1/0xea0 path_openat+0xb3/0x4d0 do_filp_open+0x42/0xa0 do_sys_open+0xfe/0x1e0 compat_sys_open+0x1b/0x20 cstar_dispatch+0x7/0x1f Non-debug kernels will not crash immediately because referencing the dangling mpol will not cause a fault. Instead the filesystem will reference a freed mempolicy object, which will cause unpredictable behavior. The problem boils down to a dropped mpol reference below if shmem_parse_options() does not allocate a new mpol: config = *sbinfo shmem_parse_options(data, &config, true) mpol_put(sbinfo->mpol) sbinfo->mpol = config.mpol /* BUG: saves unreferenced mpol */ This patch avoids the crash by not releasing the mempolicy if shmem_parse_options() doesn't create a new mpol. How far back does this issue go? I see it in both 2.6.36 and 3.3. I did not look back further. Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:23 UTC
67d46b2 mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages Rob van der Heij reported the following (paraphrased) on private mail. The scenario is that I want to avoid backups to fill up the page cache and purge stuff that is more likely to be used again (this is with s390x Linux on z/VM, so I don't give it as much memory that we don't care anymore). So I have something with LD_PRELOAD that intercepts the close() call (from tar, in this case) and issues a posix_fadvise() just before closing the file. This mostly works, except for small files (less than 14 pages) that remains in page cache after the face. Unfortunately Rob has not had a chance to test this exact patch but the test program below should be reproducing the problem he described. The issue is the per-cpu pagevecs for LRU additions. If the pages are added by one CPU but fadvise() is called on another then the pages remain resident as the invalidate_mapping_pages() only drains the local pagevecs via its call to pagevec_release(). The user-visible effect is that a program that uses fadvise() properly is not obeyed. A possible fix for this is to put the necessary smarts into invalidate_mapping_pages() to globally drain the LRU pagevecs if a pagevec page could not be discarded. The downside with this is that an inode cache shrink would send a global IPI and memory pressure potentially causing global IPI storms is very undesirable. Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to check if invalidate_mapping_pages() discarded all the requested pages. If a subset of pages are discarded it drains the LRU pagevecs and tries again. If the second attempt fails, it assumes it is due to the pages being mapped, locked or dirty and does not care. With this patch, an application using fadvise() correctly will be obeyed but there is a downside that a malicious application can force the kernel to send global IPIs and increase overhead. If accepted, I would like this to be considered as a -stable candidate. It's not an urgent issue but it's a system call that is not working as advertised which is weak. The following test program demonstrates the problem. It should never report that pages are still resident but will without this patch. It assumes that CPU 0 and 1 exist. int main() { int fd; int pagesize = getpagesize(); ssize_t written = 0, expected; char *buf; unsigned char *vec; int resident, i; cpu_set_t set; /* Prepare a buffer for writing */ expected = FILESIZE_PAGES * pagesize; buf = malloc(expected + 1); if (buf == NULL) { printf("ENOMEM\n"); exit(EXIT_FAILURE); } buf[expected] = 0; memset(buf, 'a', expected); /* Prepare the mincore vec */ vec = malloc(FILESIZE_PAGES); if (vec == NULL) { printf("ENOMEM\n"); exit(EXIT_FAILURE); } /* Bind ourselves to CPU 0 */ CPU_ZERO(&set); CPU_SET(0, &set); if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) { perror("sched_setaffinity"); exit(EXIT_FAILURE); } /* open file, unlink and write buffer */ fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR); if (fd == -1) { perror("open"); exit(EXIT_FAILURE); } unlink("fadvise-test-file"); while (written < expected) { ssize_t this_write; this_write = write(fd, buf + written, expected - written); if (this_write == -1) { perror("write"); exit(EXIT_FAILURE); } written += this_write; } free(buf); /* * Force ourselves to another CPU. If fadvise only flushes the local * CPUs pagevecs then the fadvise will fail to discard all file pages */ CPU_ZERO(&set); CPU_SET(1, &set); if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) { perror("sched_setaffinity"); exit(EXIT_FAILURE); } /* sync and fadvise to discard the page cache */ fsync(fd); if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) == -1) { perror("posix_fadvise"); exit(EXIT_FAILURE); } /* map the file and use mincore to see which parts of it are resident */ buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0); if (buf == NULL) { perror("mmap"); exit(EXIT_FAILURE); } if (mincore(buf, expected, vec) == -1) { perror("mincore"); exit(EXIT_FAILURE); } /* Check residency */ for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) { if (vec[i]) resident++; } if (resident != 0) { printf("Nr unexpected pages resident: %d\n", resident); exit(EXIT_FAILURE); } munmap(buf, expected); close(fd); free(vec); exit(EXIT_SUCCESS); } Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Rob van der Heij <rvdheij@gmail.com> Tested-by: Rob van der Heij <rvdheij@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:23 UTC
fa79419 mm: export mmu notifier invalidates We at SGI have a need to address some very high physical address ranges with our GRU (global reference unit), sometimes across partitioned machine boundaries and sometimes with larger addresses than the cpu supports. We do this with the aid of our own 'extended vma' module which mimics the vma. When something (either unmap or exit) frees an 'extended vma' we use the mmu notifiers to clean them up. We had been able to mimic the functions __mmu_notifier_invalidate_range_start() and __mmu_notifier_invalidate_range_end() by locking the per-mm lock and walking the per-mm notifier list. But with the change to a global srcu lock (static in mmu_notifier.c) we can no longer do that. Our module has no access to that lock. So we request that these two functions be exported. Signed-off-by: Cliff Wickman <cpw@sgi.com> Acked-by: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:23 UTC
240aade mm: accelerate mm_populate() treatment of THP pages This change adds a follow_page_mask function which is equivalent to follow_page, but with an extra page_mask argument. follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters a THP page, and to 0 in other cases. __get_user_pages() makes use of this in order to accelerate populating THP ranges - that is, when both the pages and vmas arrays are NULL, we don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and we also avoid taking mm->page_table_lock that many times). Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:23 UTC
28a3571 mm: use long type for page counts in mm_populate() and get_user_pages() Use long type for page counts in mm_populate() so as to avoid integer overflow when running the following test code: int main(void) { void *p = mmap(NULL, 0x100000000000, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0); printf("p: %p\n", p); mlockall(MCL_CURRENT); printf("done\n"); return 0; } Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:22 UTC
e0fb581 mm: accurately document nr_free_*_pages functions with code comments nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages() are horribly badly named, so accurately document them with code comments in case of the misuse of them. [akpm@linux-foundation.org: tweak comments] Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:22 UTC
5f4b9fc HWPOISON: change order of error_states[]'s elements error_states[] has two separate states "unevictable LRU page" and "mlocked LRU page", and the former one has the higher priority now. But because of that the latter one is rarely chosen because pages with PageMlocked highly likely have PG_unevictable set. On the other hand, PG_unevictable without PageMlocked is common for ramfs or SHM_LOCKed shared memory, so reversing the priority of these two states helps us clearly distinguish them. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:22 UTC
524fca1 HWPOISON: fix misjudgement of page_action() for errors on mlocked pages memory_failure() can't handle memory errors on mlocked pages correctly, because page_action() judges such errors as ones on "unknown pages" instead of ones on "unevictable LRU page" or "mlocked LRU page". In order to determine page_state page_action() checks page flags at the timing of the judgement, but such page flags are not the same with those just after memory_failure() is called, because memory_failure() does unmapping of the error pages before doing page_action(). This unmapping changes the page state, especially page_remove_rmap() (called from try_to_unmap_one()) clears PG_mlocked, so page_action() can't catch mlocked pages after that. With this patch, we store the page flag of the error page before doing unmap, and (only) if the first check with page flags at the time decided the error page is unknown, we do the second check with the stored page flag. This implementation doesn't change error handling for the page types for which the first check can determine the page state correctly. [akpm@linux-foundation.org: tweak comments] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:22 UTC
6d04399 memcg: stop warning on memcg_propagate_kmem Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand, I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:22 UTC
7293bfb net: change type of virtio_chan->p9_max_pages This member of struct virtio_chan is calculated from nr_free_buffer_pages so change its type to unsigned long in case of overflow. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: David Miller <davem@davemloft.net> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:22 UTC
b21e0b9 vmscan: change type of vm_total_pages to unsigned long This variable is calculated from nr_free_pagecache_pages so change its type to unsigned long. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:22 UTC
697ce9b fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used The three variables are calculated from nr_free_buffer_pages so change their types to unsigned long in case of overflow. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:22 UTC
43be594 fs/buffer.c: change type of max_buffer_heads to unsigned long max_buffer_heads is calculated from nr_free_buffer_pages(), so change its type to unsigned long in case of overflow. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:22 UTC
6434b94 ia64: use %ld to print pages calculated in nr_free_buffer_pages Now the function nr_free_buffer_pages returns unsigned long, so use %ld to print its return value. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:22 UTC
ebec386 mm: fix return type for functions nr_free_*_pages Currently, the amount of RAM that functions nr_free_*_pages return is held in unsigned int. But in machines with big memory (exceeding 16TB), the amount may be incorrect because of overflow, so fix it. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: David Miller <davem@davemloft.net> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:21 UTC
1081312 memcg: cleanup mem_cgroup_init comment We should encourage all memcg controller initialization independent on a specific mem_cgroup to be done here rather than exploit css_alloc callback and assume that nothing happens before root cgroup is created. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:21 UTC
e477749 memcg: move memcg_stock initialization to mem_cgroup_init memcg_stock are currently initialized during the root cgroup allocation which is OK but it pointlessly pollutes memcg allocation code with something that can be called when the memcg subsystem is initialized by mem_cgroup_init along with other controller specific parts. This patch wraps the current memcg_stock initialization code into a helper calls it from the controller subsystem initialization code. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:21 UTC
8787a1d memcg: move mem_cgroup_soft_limit_tree_init to mem_cgroup_init Per-node-zone soft limit tree is currently initialized when the root cgroup is created which is OK but it pointlessly pollutes memcg allocation code with something that can be called when the memcg subsystem is initialized by mem_cgroup_init along with other controller specific parts. While we are at it let's make mem_cgroup_soft_limit_tree_init void because it doesn't make much sense to report memory failure because if we fail to allocate memory that early during the boot then we are screwed anyway (this saves some code). Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:21 UTC
0e50ce3 mm: use up free swap space before reaching OOM kill Recently, Luigi reported there are lots of free swap space when OOM happens. It's easily reproduced on zram-over-swap, where many instance of memory hogs are running and laptop_mode is enabled. He said there was no problem when he disabled laptop_mode. The problem when I investigate problem is following as. Assumption for easy explanation: There are no page cache page in system because they all are already reclaimed. 1. try_to_free_pages disable may_writepage when laptop_mode is enabled. 2. shrink_inactive_list isolates victim pages from inactive anon lru list. 3. shrink_page_list adds them to swapcache via add_to_swap but it doesn't pageout because sc->may_writepage is 0 so the page is rotated back into inactive anon lru list. The add_to_swap made the page Dirty by SetPageDirty. 4. 3 couldn't reclaim any pages so do_try_to_free_pages increase priority and retry reclaim with higher priority. 5. shrink_inactlive_list try to isolate victim pages from inactive anon lru list but got failed because it try to isolate pages with ISOLATE_CLEAN mode but inactive anon lru list is full of dirty pages by 3 so it just returns without any reclaim progress. 6. do_try_to_free_pages doesn't set may_writepage due to zero total_scanned. Because sc->nr_scanned is increased by shrink_page_list but we don't call shrink_page_list in 5 due to short of isolated pages. Above loop is continued until OOM happens. The problem didn't happen before [1] was merged because old logic's isolatation in shrink_inactive_list was successful and tried to call shrink_page_list to pageout them but it still ends up failed to page out by may_writepage. But important point is that sc->nr_scanned was increased although we couldn't swap out them so do_try_to_free_pages could set may_writepages. Since commit f80c0673610e ("mm: zone_reclaim: make isolate_lru_page() filter-aware") was introduced, it's not a good idea any more to depends on only the number of scanned pages for setting may_writepage. So this patch adds new trigger point of setting may_writepage as below DEF_PRIOIRTY - 2 which is used to show the significant memory pressure in VM so it's good fit for our purpose which would be better to lose power saving or clickety rather than OOM killing. Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Luigi Semenzato <semenzato@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:21 UTC
00ef2d2 mm: use NUMA_NO_NODE Make a sweep through mm/ and convert code that uses -1 directly to using the more appropriate NUMA_NO_NODE. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:21 UTC
751efd8 mmu_notifier_unregister NULL Pointer deref and multiple ->release() callouts There is a race condition between mmu_notifier_unregister() and __mmu_notifier_release(). Assume two tasks, one calling mmu_notifier_unregister() as a result of a filp_close() ->flush() callout (task A), and the other calling mmu_notifier_release() from an mmput() (task B). A B t1 srcu_read_lock() t2 if (!hlist_unhashed()) t3 srcu_read_unlock() t4 srcu_read_lock() t5 hlist_del_init_rcu() t6 synchronize_srcu() t7 srcu_read_unlock() t8 hlist_del_rcu() <--- NULL pointer deref. Additionally, the list traversal in __mmu_notifier_release() is not protected by the by the mmu_notifier_mm->hlist_lock which can result in callouts to the ->release() notifier from both mmu_notifier_unregister() and __mmu_notifier_release(). -stable suggestions: The stable trees prior to 3.7.y need commits 21a92735f660 and 70400303ce0c cherry-picked in that order prior to cherry-picking this commit. The 3.7.y tree already has those two commits. Signed-off-by: Robin Holt <holt@sgi.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Sagi Grimberg <sagig@mellanox.co.il> Cc: Haggai Eran <haggaie@mellanox.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:21 UTC
c1f1949 mm/memory_hotplug: use pgdat_end_pfn() instead of open coding the same. Replace open coded pgdat_end_pfn() with helper function. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:21 UTC
64dd1b2 mm/memory_hotplug: use ensure_zone_is_initialized() Remove open coding of ensure_zone_is_initialzied(). Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:21 UTC
f6bbb78 mm: add helper ensure_zone_is_initialized() ensure_zone_is_initialized() checks if a zone is in a empty & not initialized state (typically occuring after it is created in memory hotplugging), and, if so, calls init_currently_empty_zone() to initialize the zone. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:21 UTC
b5e6a5a mm/page_alloc: add informative debugging message in page_outside_zone_boundaries() Add a debug message which prints when a page is found outside of the boundaries of the zone it should belong to. Format is: "page $pfn outside zone [ $start_pfn - $end_pfn ]" [akpm@linux-foundation.org: s/pr_debug/pr_err/] Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:20 UTC
da3649e mmzone: add pgdat_{end_pfn,is_empty}() helpers & consolidate. Add pgdat_end_pfn() and pgdat_is_empty() helpers which match the similar zone_*() functions. Change node_end_pfn() to be a wrapper of pgdat_end_pfn(). Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:20 UTC
d29bb97 mm/page_alloc: add a VM_BUG in __free_one_page() if the zone is uninitialized. Freeing pages to uninitialized zones is not handled by __free_one_page(), and should never happen when the code is correct. Ran into this while writing some code that dynamically onlines extra zones. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:20 UTC
2a6e3eb mm: add zone_is_empty() and zone_is_initialized() Factoring out these 2 checks makes it more clear what we are actually checking for. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:20 UTC
108bcc9 mm: add & use zone_end_pfn() and zone_spans_pfn() Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code duplication. This also switches to using them in compaction (where an additional variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and kmemleak. Note that in compaction.c I avoid calling zone_end_pfn() repeatedly because I expect at some point the sycronization issues with start_pfn & spanned_pages will need fixing, either by actually using the seqlock or clever memory barrier usage. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:20 UTC
9127ab4 mm: add SECTION_IN_PAGE_FLAGS Instead of directly utilizing a combination of config options to determine this, add a macro to specifically address it. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:20 UTC
4805b02 mm/mlock.c: document scary-looking stack expansion mlock chain The fact that mlock calls get_user_pages, and get_user_pages might call mlock when expanding a stack looks like a potential recursion. However, mlock makes sure the requested range is already contained within a vma, so no stack expansion will actually happen from mlock. Should this ever change: the stack expansion mlocks only the newly expanded range and so will not result in recursive expansion. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:20 UTC
e379014 mm: refactor inactive_file_is_low() to use get_lru_size() An inactive file list is considered low when its active counterpart is bigger, regardless of whether it is a global zone LRU list or a memcg zone LRU list. The only difference is in how the LRU size is assessed. get_lru_size() does the right thing for both global and memcg reclaim situations. Get rid of inactive_file_is_low_global() and mem_cgroup_inactive_file_is_low() by using get_lru_size() and compare the numbers in common code. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:20 UTC
860f275 mm: shmem: use new radix tree iterator In shmem_find_get_pages_and_swap(), use the faster radix tree iterator construct from commit 78c1d78488a3 ("radix-tree: introduce bit-optimized iterator"). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:20 UTC
ef4d43a ksm: stop hotremove lockdep warning Complaints are rare, but lockdep still does not understand the way ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a problem because notifier callbacks are made under down_read of blocking_notifier_head->rwsem (so first the mutex is taken while holding the rwsem, then later the rwsem is taken while still holding the mutex); but is not in fact a problem because mem_hotplug_mutex is held throughout the dance. There was an attempt to fix this with mutex_lock_nested(); but if that happened to fool lockdep two years ago, apparently it does so no longer. I had hoped to eradicate this issue in extending KSM page migration not to need the ksm_thread_mutex. But then realized that although the page migration itself is safe, we do still need to lock out ksmd and other users of get_ksm_page() while offlining memory - at some point between MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may vanish, and get_ksm_page()'s accesses to them become a violation. So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining() checks, to achieve the same lockout without being caught by lockdep. This is less elegant for KSM, but it's more important to keep lockdep useful to other users - and I apologize for how long it took to fix. Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:20 UTC
9c620e2 mm: remove offlining arg to migrate_pages No functional change, but the only purpose of the offlining argument to migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a KSM page for memory hotremove (which took ksm_thread_mutex) but not for other callers. Now all cases are safe, remove the arg. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:19 UTC
b79bc0a ksm: enable KSM page migration Migration of KSM pages is now safe: remove the PageKsm restrictions from mempolicy.c and migrate.c. But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which are irrelevant to KSM: it looks as if that code was preventing hotremove migration of KSM pages, unless they happened to be in swapcache. There is some question as to whether enforcing a NUMA mempolicy migration ought to migrate KSM pages, mapped into entirely unrelated processes; but moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway, and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on any area where this is a worry. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:19 UTC
4146d2d ksm: make !merge_across_nodes migration safe The new KSM NUMA merge_across_nodes knob introduces a problem, when it's set to non-default 0: if a KSM page is migrated to a different NUMA node, how do we migrate its stable node to the right tree? And what if that collides with an existing stable node? ksm_migrate_page() can do no more than it's already doing, updating stable_node->kpfn: the stable tree itself cannot be manipulated without holding ksm_thread_mutex. So accept that a stable tree may temporarily indicate a page belonging to the wrong NUMA node, leave updating until the next pass of ksmd, just be careful not to merge other pages on to a misplaced page. Note nid of holding tree in stable_node, and recognize that it will not always match nid of kpfn. A misplaced KSM page is discovered, either when ksm_do_scan() next comes around to one of its rmap_items (we now have to go to cmp_and_merge_page even on pages in a stable tree), or when stable_tree_search() arrives at a matching node for another page, and this node page is found misplaced. In each case, move the misplaced stable_node to a list of migrate_nodes (and use the address of migrate_nodes as magic by which to identify them): we don't need them in a tree. If stable_tree_search() finds no match for a page, but it's currently exiled to this list, then slot its stable_node right there into the tree, bringing all of its mappings with it; otherwise they get migrated one by one to the original page of the colliding node. stable_tree_search() is now modelled more like stable_tree_insert(), in order to handle these insertions of migrated nodes. remove_node_from_stable_tree(), remove_all_stable_nodes() and ksm_check_stable_tree() have to handle the migrate_nodes list as well as the stable tree itself. Less obviously, we do need to prune the list of stale entries from time to time (scan_get_next_rmap_item() does it once each full scan): whereas stale nodes in the stable tree get naturally pruned as searches try to brush past them, these migrate_nodes may get forgotten and accumulate. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:19 UTC
c8d6553 ksm: make KSM page migration possible KSM page migration is already supported in the case of memory hotremove, which takes the ksm_thread_mutex across all its migrations to keep life simple. But the new KSM NUMA merge_across_nodes knob introduces a problem, when it's set to non-default 0: if a KSM page is migrated to a different NUMA node, how do we migrate its stable node to the right tree? And what if that collides with an existing stable node? So far there's no provision for that, and this patch does not attempt to deal with it either. But how will I test a solution, when I don't know how to hotremove memory? The best answer is to enable KSM page migration in all cases now, and test more common cases. With THP and compaction added since KSM came in, page migration is now mainstream, and it's a shame that a KSM page can frustrate freeing a page block. Without worrying about merge_across_nodes 0 for now, this patch gets KSM page migration working reliably for default merge_across_nodes 1 (but leave the patch enabling it until near the end of the series). It's much simpler than I'd originally imagined, and does not require an additional tier of locking: page migration relies on the page lock, KSM page reclaim relies on the page lock, the page lock is enough for KSM page migration too. Almost all the care has to be in get_ksm_page(): that's the function which worries about when a stable node is stale and should be freed, now it also has to worry about the KSM page being migrated. The only new overhead is an additional put/get/lock/unlock_page when stable_tree_search() arrives at a matching node: to make sure migration respects the raised page count, and so does not migrate the page while we're busy with it here. That's probably avoidable, either by changing internal interfaces from using kpage to stable_node, or by moving the ksm_migrate_page() callsite into a page_freeze_refs() section (even if not swapcache); but this works well, I've no urge to pull it apart now. (Descents of the stable tree may pass through nodes whose KSM pages are under migration: being unlocked, the raised page count does not prevent that, nor need it: it's safe to memcmp against either old or new page.) You might worry about mremap, and whether page migration's rmap_walk to remove migration entries will find all the KSM locations where it inserted earlier: that should already be handled, by the satisfyingly heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,). Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:19 UTC
cbf86cf ksm: remove old stable nodes more thoroughly Switching merge_across_nodes after running KSM is liable to oops on stale nodes still left over from the previous stable tree. It's not something that people will often want to do, but it would be lame to demand a reboot when they're trying to determine which merge_across_nodes setting is best. How can this happen? We only permit switching merge_across_nodes when pages_shared is 0, and usually set run 2 to force that beforehand, which ought to unmerge everything: yet oopses still occur when you then run 1. Three causes: 1. The old stable tree (built according to the inverse merge_across_nodes) has not been fully torn down. A stable node lingers until get_ksm_page() notices that the page it references no longer references it: but the page is not necessarily freed as soon as expected, particularly when swapcache. Fix this with a pass through the old stable tree, applying get_ksm_page() to each of the remaining nodes (most found stale and removed immediately), with forced removal of any left over. Unless the page is still mapped: I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE and EBUSY than BUG. 2. __ksm_enter() has a nice little optimization, to insert the new mm just behind ksmd's cursor, so there's a full pass for it to stabilize (or be removed) before ksmd addresses it. Nice when ksmd is running, but not so nice when we're trying to unmerge all mms: we were missing those mms forked and inserted behind the unmerge cursor. Easily fixed by inserting at the end when KSM_RUN_UNMERGE. 3. It is possible for a KSM page to be faulted back from swapcache into an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. A long outstanding, unrelated bugfix sneaks in with that third fix: ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O error when read in from swap) to a page which it then marks Uptodate. Fix this case by not copying, letting do_swap_page() discover the error. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:19 UTC
8aafa6a ksm: get_ksm_page locked In some places where get_ksm_page() is used, we need the page to be locked. When KSM migration is fully enabled, we shall want that to make sure that the page just acquired cannot be migrated beneath us (raised page count is only effective when there is serialization to make sure migration notices). Whereas when navigating through the stable tree, we certainly do not want to lock each node (raised page count is enough to guarantee the memcmps, even if page is migrated to another node). Since we're about to add another use case, add the locked argument to get_ksm_page() now. Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I really got the wrong end of the stick on that! There's a configuration in which page_cache_get_speculative() can do something cheaper than get_page_unless_zero(), relying on its caller's rcu_read_lock() to have disabled preemption for it. There's no need for rcu_read_lock() around get_page_unless_zero() (and mapping checks) here. Cut out that silliness before making this any harder to understand. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:19 UTC
ee0ea59 ksm: reorganize ksm_check_stable_tree Memory hotremove's ksm_check_stable_tree() is pitifully inefficient (restarting whenever it finds a stale node to remove), but rearrange so that at least it does not needlessly restart from nid 0 each time. And add a couple of comments: here is why we keep pfn instead of page. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:19 UTC
e850dcf ksm: trivial tidyups Add NUMA() and DO_NUMA() macros to minimize blight of #ifdef CONFIG_NUMAs (but indeed we don't want to expand struct rmap_item by nid when not NUMA). Add comment, remove "unsigned" from rmap_item->nid, as "int nid" elsewhere. Define ksm_merge_across_nodes 1U when #ifndef NUMA to help optimizing out. Use ?: in get_kpfn_nid(). Adjust a few comments noticed in ongoing work. Leave stable_tree_insert()'s rb_linkage until after the node has been set up, as unstable_tree_search_insert() does: ksm_thread_mutex and page lock make either way safe, but we're going to copy and I prefer this precedent. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:19 UTC
f00dc0e ksm: add sysfs ABI Documentation Add sysfs documentation for Kernel Samepage Merging (KSM) including new merge_across_nodes knob. Signed-off-by: Petr Holasek <pholasek@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:19 UTC
90bd6fd ksm: allow trees per NUMA node Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues we had with that, fully enabling KSM page migration on the way. (A different kind of KSM/NUMA issue which I've certainly not begun to address here: when KSM pages are unmerged, there's usually no sense in preferring to allocate the new pages local to the caller's node.) This patch: Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes which control merging pages across different numa nodes. When it is set to zero only pages from the same node are merged, otherwise pages from all nodes can be merged together (default behavior). Typical use-case could be a lot of KVM guests on NUMA machine and cpus from more distant nodes would have significant increase of access latency to the merged ksm page. Sysfs knob was choosen for higher variability when some users still prefers higher amount of saved physical memory regardless of access latency. Every numa node has its own stable & unstable trees because of faster searching and inserting. Changing of merge_across_nodes value is possible only when there are not any ksm shared pages in system. I've tested this patch on numa machines with 2, 4 and 8 nodes and measured speed of memory access inside of KVM guests with memory pinned to one of nodes with this benchmark: http://pholasek.fedorapeople.org/alloc_pg.c Population standard deviations of access times in percentage of average were following: merge_across_nodes=1 2 nodes 1.4% 4 nodes 1.6% 8 nodes 1.7% merge_across_nodes=0 2 nodes 1% 4 nodes 0.32% 8 nodes 0.018% RFC: https://lkml.org/lkml/2011/11/30/91 v1: https://lkml.org/lkml/2012/1/23/46 v2: https://lkml.org/lkml/2012/6/29/105 v3: https://lkml.org/lkml/2012/9/14/550 v4: https://lkml.org/lkml/2012/9/23/137 v5: https://lkml.org/lkml/2012/12/10/540 v6: https://lkml.org/lkml/2012/12/23/154 v7: https://lkml.org/lkml/2012/12/27/225 Hugh notes that this patch brings two problems, whose solution needs further support in mm/ksm.c, which follows in subsequent patches: 1) switching merge_across_nodes after running KSM is liable to oops on stale nodes still left over from the previous stable tree; 2) memory hotremove may migrate KSM pages, but there is no provision here for !merge_across_nodes to migrate nodes to the proper tree. Signed-off-by: Petr Holasek <pholasek@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:19 UTC
22b751c mm: rename page struct field helpers The function names page_xchg_last_nid(), page_last_nid() and reset_page_last_nid() were judged to be inconsistent so rename them to a struct_field_op style pattern. As it looked jarring to have reset_page_mapcount() and page_nid_reset_last() beside each other in memmap_init_zone(), this patch also renames reset_page_mapcount() to page_mapcount_reset(). There are others like init_page_count() but as it is used throughout the arch code a rename would likely cause more conflicts than it is worth. [akpm@linux-foundation.org: fix zcache] Signed-off-by: Mel Gorman <mgorman@suse.de> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:18 UTC
e4715f0 memcg: avoid dangling reference count in creation failure. When use_hierarchy is enabled, we acquire an extra reference count in our parent during cgroup creation. We don't release it, though, if any failure exist in the creation process. Signed-off-by: Glauber Costa <glommer@parallels.com> Reported-by: Michal Hocko <mhocko@suse.cz> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:18 UTC
692e89a memcg: increment static branch right after limit set We were deferring the kmemcg static branch increment to a later time, due to a nasty dependency between the cpu_hotplug lock, taken by the jump label update, and the cgroup_lock. Now we no longer take the cgroup lock, and we can save ourselves the trouble. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:18 UTC
0999821 memcg: replace cgroup_lock with memcg specific memcg_lock After the preparation work done in earlier patches, the cgroup_lock can be trivially replaced with a memcg-specific lock. This is an automatic translation at every site where the values involved were queried. The sites where values are written, however, used to be naturally called under cgroup_lock. This is the case for instance in the css_online callback. For those, we now need to explicitly add the memcg lock. With this, all the calls to cgroup_lock outside cgroup core are gone. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:18 UTC
b5f99b5 memcg: fast hierarchy-aware child test Currently, we use cgroups' provided list of children to verify if it is safe to proceed with any value change that is dependent on the cgroup being empty. This is less than ideal, because it enforces a dependency over cgroup core that we would be better off without. The solution proposed here is to iterate over the child cgroups and if any is found that is already online, we bounce and return: we don't really care how many children we have, only if we have any. This is also made to be hierarchy aware. IOW, cgroups with hierarchy disabled, while they still exist, will be considered for the purpose of this interface as having no children. [akpm@linux-foundation.org: tweak comments] Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:18 UTC
d142e3e memcg: split part of memcg creation to css_online This patch is a preparatory work for later locking rework to get rid of big cgroup lock from memory controller code. The memory controller uses some tunables to adjust its operation. Those tunables are inherited from parent to children upon children intialization. For most of them, the value cannot be changed after the parent has a new children. cgroup core splits initialization in two phases: css_alloc and css_online. After css_alloc, the memory allocation and basic initialization are done. But the new group is not yet visible anywhere, not even for cgroup core code. It is only somewhere between css_alloc and css_online that it is inserted into the internal children lists. Copying tunable values in css_alloc will lead to inconsistent values: the children will copy the old parent values, that can change between the copy and the moment in which the groups is linked to any data structure that can indicate the presence of children. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:18 UTC
ee5e847 memcg: prevent changes to move_charge_at_immigrate during task attach In memcg, we use the cgroup_lock basically to synchronize against attaching new children to a cgroup. We do this because we rely on cgroup core to provide us with this information. We need to guarantee that upon child creation, our tunables are consistent. For those, the calls to cgroup_lock() all live in handlers like mem_cgroup_hierarchy_write(), where we change a tunable in the group that is hierarchy-related. For instance, the use_hierarchy flag cannot be changed if the cgroup already have children. Furthermore, those values are propagated from the parent to the child when a new child is created. So if we don't lock like this, we can end up with the following situation: A B memcg_css_alloc() mem_cgroup_hierarchy_write() copy use hierarchy from parent change use hierarchy in parent finish creation. This is mainly because during create, we are still not fully connected to the css tree. So all iterators and the such that we could use, will fail to show that the group has children. My observation is that all of creation can proceed in parallel with those tasks, except value assignment. So what this patch series does is to first move all value assignment that is dependent on parent values from css_alloc to css_online, where the iterators all work, and then we lock only the value assignment. This will guarantee that parent and children always have consistent values. Together with an online test, that can be derived from the observation that the refcount of an online memcg can be made to be always positive, we should be able to synchronize our side without the cgroup lock. This patch: Currently, we rely on the cgroup_lock() to prevent changes to move_charge_at_immigrate during task migration. However, this is only needed because the current strategy keeps checking this value throughout the whole process. Since all we need is serialization, one needs only to guarantee that whatever decision we made in the beginning of a specific migration is respected throughout the process. We can achieve this by just saving it in mc. By doing this, no kind of locking is needed. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:18 UTC
45cf7eb memcg: reduce the size of struct memcg 244-fold. In order to maintain all the memcg bookkeeping, we need per-node descriptors, which will in turn contain a per-zone descriptor. Because we want to statically allocate those, this array ends up being very big. Part of the reason is that we allocate something large enough to hold MAX_NUMNODES, the compile time constant that holds the maximum number of nodes we would ever consider. However, we can do better in some cases if the firmware help us. This is true for modern x86 machines; coincidentally one of the architectures in which MAX_NUMNODES tends to be very big. By using the firmware-provided maximum number of nodes instead of MAX_NUMNODES, we can reduce the memory footprint of struct memcg considerably. In the extreme case in which we have only one node, this reduces the size of the structure from ~ 64k to ~2k. This is particularly important because it means that we will no longer resort to the vmalloc area for the struct memcg on defconfigs. We also have enough room for an extra node and still be outside vmalloc. One also has to keep in mind that with the industry's ability to fit more processors in a die as fast as the FED prints money, a nodes = 2 configuration is already respectably big. [akpm@linux-foundation.org: add check for invalid nid, remove inline] Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:18 UTC
a4e1b4c mm: init: report on last-nid information stored in page->flags Answering the question "how much space remains in the page->flags" is time-consuming. mminit_loglevel can help answer the question but it does not take last_nid information into account. This patch corrects it and while there it corrects the messages related to page flag usage, pgshifts and node/zone id. When applied the relevant output looks something like this but will depend on the kernel configuration. mminit::pageflags_layout_widths Section 0 Node 9 Zone 2 Lastnid 9 Flags 25 mminit::pageflags_layout_shifts Section 19 Node 9 Zone 2 Lastnid 9 mminit::pageflags_layout_pgshifts Section 0 Node 55 Zone 53 Lastnid 44 mminit::pageflags_layout_nodezoneid Node/Zone ID: 64 -> 53 mminit::pageflags_layout_usage location: 64 -> 44 layout 44 -> 25 unused 25 -> 0 page-flags Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:18 UTC
4468b8f mm: uninline page_xchg_last_nid() Andrew Morton pointed out that page_xchg_last_nid() and reset_page_last_nid() were "getting nuttily large" and asked that it be investigated. reset_page_last_nid() is on the page free path and it would be unfortunate to make that path more expensive than it needs to be. Due to the internal use of page_xchg_last_nid() it is already too expensive but fortunately, it should also be impossible for the page->flags to be updated in parallel when we call reset_page_last_nid(). Instead of unlining the function, it uses a simplier implementation that assumes no parallel updates and should now be sufficiently short for inlining. page_xchg_last_nid() is called in paths that are already quite expensive (splitting huge page, fault handling, migration) and it is reasonable to uninline. There was not really a good place to place the function but mm/mmzone.c was the closest fit IMO. This patch saved 128 bytes of text in the vmlinux file for the kernel configuration I used for testing automatic NUMA balancing. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:18 UTC
6acc8b0 memcg: clean up swap accounting initialization code Memcg swap accounting is currently enabled by enable_swap_cgroup when the root cgroup is created. mem_cgroup_init acts as a memcg subsystem initializer which sounds like a much better place for enable_swap_cgroup as well. We already register memsw files from there so it makes a lot of sense to merge those two into a single enable_swap_cgroup function. This patch doesn't introduce any semantic changes. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Zhouping Liu <zliu@redhat.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Li Zefan <lizefan@huawei.com> Cc: CAI Qian <caiqian@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:17 UTC
2d11085 memcg: do not create memsw files if swap accounting is disabled Zhouping Liu has reported that memsw files are exported even though swap accounting is runtime disabled if MEMCG_SWAP is enabled. This behavior has been introduced by commit af36f906c0f4 ("memcg: always create memsw files if CGROUP_MEM_RES_CTLR_SWAP") and it causes any attempt to open the file to return EOPNOTSUPP. Although EOPNOTSUPP should say be clear that memsw operations are not supported in the given configuration it is fair to say that this behavior could be quite confusing. Let's tear memsw files out of default cgroup files and add them only if the swap accounting is really enabled (either by MEMCG_SWAP_ENABLED or swapaccount=1 boot parameter). We can hook into mem_cgroup_init which is called when the memcg subsystem is initialized and which happens after boot command line is processed. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Zhouping Liu <zliu@redhat.com> Tested-by: Zhouping Liu <zliu@redhat.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Li Zefan <lizefan@huawei.com> Cc: CAI Qian <caiqian@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:17 UTC
75f7ad8 page-writeback.c: subtract min_free_kbytes from dirtyable memory When calculating amount of dirtyable memory, min_free_kbytes should be subtracted because it is not intended for dirty pages. Addresses http://bugs.debian.org/695182 [akpm@linux-foundation.org: fix up min_free_kbytes extern declarations] [akpm@linux-foundation.org: fix min() warning] Signed-off-by: Paul Szabo <psz@maths.usyd.edu.au> Acked-by: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:17 UTC
08b5270 mm/rmap: rename anon_vma_unlock() => anon_vma_unlock_write() The comment in commit 4fc3f1d66b1e ("mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable") says: | Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(), | to make it clearer that it's an exclusive write-lock in | that case - suggested by Rik van Riel. But that commit renames only anon_vma_lock() Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Ingo Molnar <mingo@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:17 UTC
ec8acf2 swap: add per-partition lock for swapfile swap_lock is heavily contended when I test swap to 3 fast SSD (even slightly slower than swap to 2 such SSD). The main contention comes from swap_info_get(). This patch tries to fix the gap with adding a new per-partition lock. Global data like nr_swapfiles, total_swap_pages, least_priority and swap_list are still protected by swap_lock. nr_swap_pages is an atomic now, it can be changed without swap_lock. In theory, it's possible get_swap_page() finds no swap pages but actually there are free swap pages. But sounds not a big problem. Accessing partition specific data (like scan_swap_map and so on) is only protected by swap_info_struct.lock. Changing swap_info_struct.flags need hold swap_lock and swap_info_struct.lock, because scan_scan_map() will check it. read the flags is ok with either the locks hold. If both swap_lock and swap_info_struct.lock must be hold, we always hold the former first to avoid deadlock. swap_entry_free() can change swap_list. To delete that code, we add a new highest_priority_index. Whenever get_swap_page() is called, we check it. If it's valid, we use it. It's a pity get_swap_page() still holds swap_lock(). But in practice, swap_lock() isn't heavily contended in my test with this patch (or I can say there are other much more heavier bottlenecks like TLB flush). And BTW, looks get_swap_page() doesn't really need the lock. We never free swap_info[] and we check SWAP_WRITEOK flag. The only risk without the lock is we could swapout to some low priority swap, but we can quickly recover after several rounds of swap, so sounds not a big deal to me. But I'd prefer to fix this if it's a real problem. "swap: make each swap partition have one address_space" improved the swapout speed from 1.7G/s to 2G/s. This patch further improves the speed to 2.3G/s, so around 15% improvement. It's a multi-process test, so TLB flush isn't the biggest bottleneck before the patches. [arnd@arndb.de: fix it for nommu] [hughd@google.com: add missing unlock] [minchan@kernel.org: get rid of lockdep whinge on sys_swapon] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:17 UTC
33806f0 swap: make each swap partition have one address_space When I use several fast SSD to do swap, swapper_space.tree_lock is heavily contended. This makes each swap partition have one address_space to reduce the lock contention. There is an array of address_space for swap. The swap entry type is the index to the array. In my test with 3 SSD, this increases the swapout throughput 20%. [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:17 UTC
9800339 mm: don't inline page_mapping() According to akpm, this saves 1/2k text and makes things simple for the next patch. Numbers from Minchan: add/remove: 1/0 grow/shrink: 6/22 up/down: 92/-516 (-424) function old new delta page_mapping - 48 +48 do_task_stat 2292 2308 +16 page_remove_rmap 240 248 +8 load_elf_binary 4500 4508 +8 update_queue 532 536 +4 scsi_probe_and_add_lun 2892 2896 +4 lookup_fast 644 648 +4 vcs_read 1040 1036 -4 __ip_route_output_key 1904 1900 -4 ip_route_input_noref 2508 2500 -8 shmem_file_aio_read 784 772 -12 __isolate_lru_page 272 256 -16 shmem_replace_page 708 688 -20 mark_buffer_dirty 228 208 -20 __set_page_dirty_buffers 240 220 -20 __remove_mapping 276 256 -20 update_mmu_cache 500 476 -24 set_page_dirty_balance 92 68 -24 set_page_dirty 172 148 -24 page_evictable 88 64 -24 page_cache_pipe_buf_steal 248 224 -24 clear_page_dirty_for_io 340 316 -24 test_set_page_writeback 400 372 -28 test_clear_page_writeback 516 488 -28 invalidate_inode_page 156 128 -28 page_mkclean 432 400 -32 flush_dcache_page 360 328 -32 __set_page_dirty_nobuffers 324 280 -44 shrink_page_list 2412 2356 -56 Signed-off-by: Shaohua Li <shli@fusionio.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:17 UTC
340ef39 mm: numa: cleanup flow of transhuge page migration When correcting commit 04fa5d6a6547 ("mm: migrate: check page_count of THP before migrating") Hugh Dickins noted that the control flow for transhuge migration was difficult to follow. Unconditionally calling put_page() in numamigrate_isolate_page() made the failure paths of both migrate_misplaced_transhuge_page() and migrate_misplaced_page() more complex that they should be. Further, he was extremely wary that an unlock_page() should ever happen after a put_page() even if the put_page() should never be the final put_page. Hugh implemented the following cleanup to simplify the path by calling putback_lru_page() inside numamigrate_isolate_page() if it failed to isolate and always calling unlock_page() within migrate_misplaced_transhuge_page(). There is no functional change after this patch is applied but the code is easier to follow and unlock_page() always happens before put_page(). [mgorman@suse.de: changelog only] Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:17 UTC
75980e9 mm: fold page->_last_nid into page->flags where possible page->_last_nid fits into page->flags on 64-bit. The unlikely 32-bit NUMA configuration with NUMA Balancing will still need an extra page field. As Peter notes "Completely dropping 32bit support for CONFIG_NUMA_BALANCING would simplify things, but it would also remove the warning if we grow enough 64bit only page-flags to push the last-cpu out." [mgorman@suse.de: minor modifications] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 24 February 2013, 01:50:17 UTC
back to top