https://github.com/torvalds/linux

Revision Author Date Message Commit Date
272b98c Linux 3.12-rc1 16 September 2013, 20:17:51 UTC
a4ae54f Merge branch 'timers/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer code update from Thomas Gleixner: - armada SoC clocksource overhaul with a trivial merge conflict - Minor improvements to various SoC clocksource drivers * 'timers/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: armada-370-xp: Add detailed clock requirements in devicetree binding clocksource: armada-370-xp: Get reference fixed-clock by name clocksource: armada-370-xp: Replace WARN_ON with BUG_ON clocksource: armada-370-xp: Fix device-tree binding clocksource: armada-370-xp: Introduce new compatibles clocksource: armada-370-xp: Use CLOCKSOURCE_OF_DECLARE clocksource: armada-370-xp: Simplify TIMER_CTRL register access clocksource: armada-370-xp: Use BIT() ARM: timer-sp: Set dynamic irq affinity ARM: nomadik: add dynamic irq flag to the timer clocksource: sh_cmt: 32-bit control register support clocksource: em_sti: Convert to devm_* managed helpers 16 September 2013, 20:10:26 UTC
3369d11 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 Pull CIFS fixes from Steve French: "Two minor cifs fixes and a minor documentation cleanup for cifs.txt" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: cifs: update cifs.txt and remove some outdated infos cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache cifs: Do not take a reference to the page in cifs_readpage_worker() 16 September 2013, 19:39:21 UTC
f1da345 Merge tag 'upstream-3.12-rc1' of git://git.infradead.org/linux-ubi Pull UBI fixes from Artem Bityutskiy: "Just a single fastmap fix plus a regression fix" * tag 'upstream-3.12-rc1' of git://git.infradead.org/linux-ubi: UBI: Fix invalidate_fastmap() UBI: Fix PEB leak in wear_leveling_worker() 16 September 2013, 19:37:52 UTC
098e7f1 Merge tag 'upstream-3.12-rc1' of git://git.infradead.org/linux-ubifs Pull ubifs fix from Artem Bityutskiy: "Just one patch which fixes the power-cut recovery testing mode. I'll start using a single UBI/UBIFS tree instead of 2 trees from now on. So in the future you'll get 1 small pull request instead of 2 tiny ones" * tag 'upstream-3.12-rc1' of git://git.infradead.org/linux-ubifs: UBIFS: remove invalid warn msg with tst_recovery enabled 16 September 2013, 19:36:55 UTC
d8efd82 Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus Pull MIPS fixes from Ralf Baechle: "These are four patches for three construction sites: - Fix register decoding for the combination of multi-core processors and multi-threading. - Two more fixes that are part of the ongoing DECstation resurrection work. One of these touches a DECstation-only network driver. - Finally Markos' trivial build fix for the AP/SP support. (With this applied now all MIPS defconfigs are building again)" * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: MIPS: kernel: vpe: Make vpe_attrs an array of pointers. MIPS: Fix SMP core calculations when using MT support. MIPS: DECstation I/O ASIC DMA interrupt handling fix MIPS: DECstation HRT initialization rearrangement 15 September 2013, 21:45:52 UTC
cd619e2 Merge branch 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86 Pull x86 platform updates from Matthew Garrett: "Nothing amazing here, almost entirely cleanups and minor bugfixes and one bit of hardware enablement in the amilo-rfkill driver" * 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86: platform/x86: panasonic-laptop: reuse module_acpi_driver samsung-laptop: fix config build error platform: x86: remove unnecessary platform_set_drvdata() amilo-rfkill: Enable using amilo-rfkill with the FSC Amilo L1310. wmi: parse_wdg() should return kernel error codes hp_wmi: Fix unregister order in hp_wmi_rfkill_setup() platform: replace strict_strto*() with kstrto*() x86: irst: use module_acpi_driver to simplify the code x86: smartconnect: use module_acpi_driver to simplify the code platform samsung-q10: use ACPI instead of direct EC calls thinkpad_acpi: add the ability setting TPACPI_LED_NONE by quirk thinkpad_acpi: return -NODEV while operating uninitialized LEDs 15 September 2013, 21:42:59 UTC
0375ec5 Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull misc SCSI driver updates from James Bottomley: "This patch set is a set of driver updates (megaraid_sas, fnic, lpfc, ufs, hpsa) we also have a couple of bug fixes (sd out of bounds and ibmvfc error handling) and the first round of esas2r checker fixes and finally the much anticipated big endian additions for megaraid_sas" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (47 commits) [SCSI] fnic: fnic Driver Tuneables Exposed through CLI [SCSI] fnic: Kernel panic while running sh/nosh with max lun cfg [SCSI] fnic: Hitting BUG_ON(io_req->abts_done) in fnic_rport_exch_reset [SCSI] fnic: Remove QUEUE_FULL handling code [SCSI] fnic: On system with >1.1TB RAM, VIC fails multipath after boot up [SCSI] fnic: FC stat param seconds_since_last_reset not getting updated [SCSI] sd: Fix potential out-of-bounds access [SCSI] lpfc 8.3.42: Update lpfc version to driver version 8.3.42 [SCSI] lpfc 8.3.42: Fixed issue of task management commands having a fixed timeout [SCSI] lpfc 8.3.42: Fixed inconsistent spin lock usage. [SCSI] lpfc 8.3.42: Fix driver's abort loop functionality to skip IOs already getting aborted [SCSI] lpfc 8.3.42: Fixed failure to allocate SCSI buffer on PPC64 platform for SLI4 devices [SCSI] lpfc 8.3.42: Fix WARN_ON when driver unloads [SCSI] lpfc 8.3.42: Avoided making pci bar ioremap call during dual-chute WQ/RQ pci bar selection [SCSI] lpfc 8.3.42: Fixed driver iocbq structure's iocb_flag field running out of space [SCSI] lpfc 8.3.42: Fix crash on driver load due to cpu affinity logic [SCSI] lpfc 8.3.42: Fixed logging format of setting driver sysfs attributes hard to interpret [SCSI] lpfc 8.3.42: Fixed back to back RSCNs discovery failure. [SCSI] lpfc 8.3.42: Fixed race condition between BSG I/O dispatch and timeout handling [SCSI] lpfc 8.3.42: Fixed function mode field defined too small for not recognizing dual-chute mode ... 15 September 2013, 21:41:30 UTC
bff157b Merge branch 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull SLAB update from Pekka Enberg: "Nothing terribly exciting here apart from Christoph's kmalloc unification patches that brings sl[aou]b implementations closer to each other" * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: slab: Use correct GFP_DMA constant slub: remove verify_mem_not_deleted() mm/sl[aou]b: Move kmallocXXX functions to common code mm, slab_common: add 'unlikely' to size check of kmalloc_slab() mm/slub.c: beautify code for removing redundancy 'break' statement. slub: Remove unnecessary page NULL check slub: don't use cpu partial pages on UP mm/slub: beautify code for 80 column limitation and tab alignment mm/slub: remove 'per_cpu' which is useless variable 15 September 2013, 11:15:06 UTC
8bf5e36 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input update from Dmitry Torokhov: "The only change is David Hermann's new EVIOCREVOKE evdev ioctl that allows safely passing file descriptors to input devices to session processes and later being able to stop delivery of events through these fds so that inactive sessions will no longer receive user input that does not belong to them" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: evdev - add EVIOCREVOKE ioctl 15 September 2013, 11:13:39 UTC
05a8252 vfs: fix typo in comment in recent dentry work Sedat points out that I transposed some letters in "LRU" and wrote "RLU" instead in one of the new comments explaining the flow. Let's just fix it. Reported-by: Sedat Dilek <sedat.dilek@jpberlin.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 15 September 2013, 11:11:01 UTC
6b02fa5 partitions/efi: loosen check fot pmbr size in lba Matt found that commit 27a7c642174e ("partitions/efi: account for pmbr size in lba") caused his GPT formatted eMMC device not to boot. The reason is that this commit enforced Linux to always check the lesser of the whole disk or 2Tib for the pMBR size in LBA. While most disk partitioning tools out there create a pMBR with these characteristics, Microsoft does not, as it always sets the entry to the maximum 32-bit limitation - even though a drive may be smaller than that[1]. Loosen this check and only verify that the size is either the whole disk or 0xFFFFFFFF. No tool in its right mind would set it to any value other than these. [1] http://thestarman.pcministry.com/asm/mbr/GPT.htm#GPTPT Reported-and-tested-by: Matt Porter <matt.porter@linaro.org> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 15 September 2013, 11:10:16 UTC
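The loosened pMBR check described in commit 6b02fa5 can be sketched roughly as follows. This is an illustrative Python model, not the kernel's C code: the function name and the constant name are hypothetical, and the reading of "whole disk" as the total sector count minus the pMBR sector itself is an assumption.

```python
# Illustrative model of the loosened check; names are hypothetical.
PMBR_MAX_SIZE_IN_LBA = 0xFFFFFFFF  # the 32-bit maximum the commit says Microsoft always writes


def pmbr_size_is_acceptable(size_in_lba, total_sectors):
    """Return True if the pMBR size field passes the loosened check."""
    # Assumption: "whole disk" means every sector after the pMBR at LBA 0.
    whole_disk = total_sectors - 1
    # Accept exactly two values: the whole disk, or the 32-bit maximum.
    return size_in_lba in (whole_disk, PMBR_MAX_SIZE_IN_LBA)
```

Under this model a 1000-sector disk passes with a size field of 999 or 0xFFFFFFFF and fails with any other value, which matches the commit's statement that no other value is expected from a sane tool.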
3711d86 Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback fix from Wu Fengguang: "A trivial writeback fix" * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Do not sort b_io list only because of block device inode 14 September 2013, 03:06:40 UTC
89dc77b vfs: fix dentry LRU list handling and nr_dentry_unused accounting The LRU list changes interacted badly with our nr_dentry_unused accounting, and even worse with the new DCACHE_LRU_LIST bit logic. This introduces helper functions to make sure everything follows the proper dcache d_lru list rules: the dentry cache is complicated by the fact that some of the hotpaths don't even want to look at the LRU list at all, and the fact that we use the same list entry in the dentry for both the LRU list and for our temporary shrinking lists when removing things from the LRU. The helper functions temporarily have some extra sanity checking for the flag bits that have to match the current LRU state of the dentry. We'll remove that before the final 3.12 release, but considering how easy it is to get wrong, this first cleanup version has some very particular sanity checking. Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 14 September 2013, 02:55:10 UTC
81b6622 cifs: update cifs.txt and remove some outdated infos Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Björn JACKE <bj@sernet.de> Signed-off-by: Steve French <smfrench@gmail.com> 13 September 2013, 21:29:58 UTC
466bd31 cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache When reading a single page with cifs_readpage(), we make a call to fscache_read_or_alloc_page() which once done, asynchronously calls the completion function cifs_readpage_from_fscache_complete(). This completion function unlocks the page once it has been populated from cache. The module then attempts to unlock the page a second time in cifs_readpage() which leads to warning messages. In case of a successful call to fscache_read_or_alloc_page() we should skip the second unlock_page() since this will be called by the cifs_readpage_from_fscache_complete() once the page has been populated by fscache. With the modifications to cifs_readpage_worker(), we will need to re-grab the page lock in cifs_write_begin(). The problem was first noticed when testing new fscache patches for cifs. https://bugzilla.redhat.com/show_bug.cgi?id=1005737 Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> 13 September 2013, 21:24:49 UTC
a9e9b7b cifs: Do not take a reference to the page in cifs_readpage_worker() We do not need to take a reference to the pagecache in cifs_readpage_worker() since the calling function will have already taken one before passing the pointer to the page as an argument to the function. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> 13 September 2013, 21:24:43 UTC
bdbdfde Merge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging Pull hwmon fixes from Guenter Roeck: "Some more low risk cleanup patches: - Remove unnecessary pci_set_drvdata in k10temp driver from Jingoo Han - Fix return values in several drivers from Sachin Kamat - Remove redundant break in amc6821 driver from Sachin Kamat" * tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: hwmon: (k10temp) remove unnecessary pci_set_drvdata() hwmon: (tmp421) Fix return value hwmon: (amc6821) Remove redundant break hwmon: (amc6821) Fix return value hwmon: (ibmaem) Fix return value hwmon: (emc2103) Fix return value 13 September 2013, 17:58:41 UTC
6700215 Merge tag 'xtensa-next-20130912' of git://github.com/czankel/xtensa-linux Pull Xtensa updates from Chris Zankel. * tag 'xtensa-next-20130912' of git://github.com/czankel/xtensa-linux: xtensa: Fix broken allmodconfig build xtensa: remove CCOUNT_PER_JIFFY xtensa: fix !CONFIG_XTENSA_CALIBRATE_CCOUNT build failure xtensa: don't use echo -e needlessly xtensa: new fast_alloca handler xtensa: keep a3 and excsave1 on entry to exception handlers xtensa: enable kernel preemption xtensa: check thread flags atomically on return from user exception 13 September 2013, 17:57:48 UTC
9bf12df Merge git://git.kvack.org/~bcrl/aio-next Pull aio changes from Ben LaHaise: "First off, sorry for this pull request being late in the merge window. Al had raised a couple of concerns about 2 items in the series below. I addressed the first issue (the race introduced by Gu's use of mm_populate()), but he has not provided any further details on how he wants to rework the anon_inode.c changes (which were sent out months ago but have yet to be commented on). The bulk of the changes have been sitting in the -next tree for a few months, with all the issues raised being addressed" * git://git.kvack.org/~bcrl/aio-next: (22 commits) aio: rcu_read_lock protection for new rcu_dereference calls aio: fix race in ring buffer page lookup introduced by page migration support aio: fix rcu sparse warnings introduced by ioctx table lookup patch aio: remove unnecessary debugging from aio_free_ring() aio: table lookup: verify ctx pointer staging/lustre: kiocb->ki_left is removed aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3" aio: be defensive to ensure request batching is non-zero instead of BUG_ON() aio: convert the ioctx list to table lookup v3 aio: double aio_max_nr in calculations aio: Kill ki_dtor aio: Kill ki_users aio: Kill unneeded kiocb members aio: Kill aio_rw_vect_retry() aio: Don't use ctx->tail unnecessarily aio: io_cancel() no longer returns the io_event aio: percpu ioctx refcount aio: percpu reqs_available aio: reqs_active -> reqs_available aio: fix build when migration is disabled ... 13 September 2013, 17:55:58 UTC
399a946 Merge branch 'genirq' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull generic hardirq option removal from Martin Schwidefsky: "All architectures now use generic hardirqs, s390 has been last to switch. With that the code under !CONFIG_GENERIC_HARDIRQS and the related HAVE_GENERIC_HARDIRQS and GENERIC_HARDIRQS config options can be removed. Yay!" * 'genirq' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: Remove GENERIC_HARDIRQ config option 13 September 2013, 14:31:38 UTC
183c420 Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kconfig fix from Michal Marek: "This is a fix for a regression caused by my previous pull request. A sed command in scripts/config that used colons as separator was accidentally changed to use slashes, which fails when you use slashes in a value. Changing it back to colons is of course not a proper fix, but at least it will be broken in the same way it had been for four years. A proper fix is pending" * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: scripts/config: fix variable substitution command 13 September 2013, 14:30:17 UTC
951a730 Merge tag 'blackfin-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux Pull blackfin updates from Steven Miao. * tag 'blackfin-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux: blackfin: Ignore generated uImages blackfin: Add STMMAC platform data to enable dwmac1000 driver on BF60x. bf609: adv7343: add S-Video and Component output support bf609: add adv7343 video encoder support clock: add stmmac clock for ethernet driver blackfin: scb: Add SCB1 to SCB9 config options and data. blackfin: scb: Add system crossbar init code. 13 September 2013, 14:23:49 UTC
0898d2a Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "This fixes a 7+ year race condition in the crypto API that causes sporadic crashes when multiple threads load the same algorithm. It also fixes the crct10dif algorithm again to prevent boot failures on systems where the initramfs tool ignores module softdeps" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: crct10dif - Add fallback for broken initrds crypto: api - Fix race condition in larval lookup 13 September 2013, 14:11:14 UTC
1b46763 MIPS: kernel: vpe: Make vpe_attrs an array of pointers. Commit 567b21e973ccf5b0d13776e408d7c67099749eb8 "mips: convert vpe_class to use dev_groups" broke the build on MIPS since vpe_attrs should be an array of 'struct device_attribute' pointers. Fixes the following build problem: arch/mips/kernel/vpe.c:1372:2: error: missing braces around initializer [-Werror=missing-braces] arch/mips/kernel/vpe.c:1372:2: error: (near initialization for 'vpe_attrs[0]') [-Werror=missing-braces] Cc: Ralf Baechle <ralf@linux-mips.org> Cc: John Crispin <blogic@openwrt.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Markos Chandras <markos.chandras@imgtec.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/5819/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org> 13 September 2013, 13:12:48 UTC
0244ad0 Remove GENERIC_HARDIRQ config option After the last architecture switched to generic hard irqs the config options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code for !CONFIG_GENERIC_HARDIRQS can be removed. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> 13 September 2013, 13:09:52 UTC
86eb781 scripts/config: fix variable substitution command Commit 229455bc02b87f7128f190c4491b4ceffff38648 accidentally changed the separator between sed `s' command and its parameters from ':' to '/'. Revert this change. Reported-and-tested-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Clement Chauplannaz <chauplac@gmail.com> Signed-off-by: Michal Marek <mmarek@suse.cz> 13 September 2013, 11:06:59 UTC
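Why the separator matters in commit 86eb781 can be shown with a small model of how a sed `s` command is split on its delimiter. This is a simplified Python sketch, not sed itself and not the scripts/config code: it ignores backslash escapes, and `CONFIG_FOO` and the path value are hypothetical.

```python
def split_sed_substitution(expr):
    """Split a sed 's' command on its delimiter, the character right after 's'.

    Simplification: backslash-escaped delimiters, which real sed honours,
    are not handled here.
    """
    if len(expr) < 2 or expr[0] != "s":
        raise ValueError("not a substitution command")
    delim = expr[1]
    return expr[2:].split(delim)


def is_well_formed(expr):
    # A valid command yields exactly three fields: pattern, replacement, flags.
    return len(split_sed_substitution(expr)) == 3


value = "/usr/src/linux"  # hypothetical option value containing slashes
colon_form = "s:CONFIG_FOO=.*:CONFIG_FOO=" + value + ":"
slash_form = "s/CONFIG_FOO=.*/CONFIG_FOO=" + value + "/"
```

With these inputs the colon form splits into three fields and the slash form into six, so only the colon form is a valid substitution. A value containing a colon breaks the colon form in the same way, which is presumably why the kconfig pull request (183c420) says that reverting to colons is not a proper fix.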
670bac3 MIPS: Fix SMP core calculations when using MT support. The TCBIND register is only available if the core has MT support. It should not be read otherwise. Secondly, the number of TCs (siblings) are calculated differently depending on if the kernel is configured as SMVP or SMTC. Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com> Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/5822/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org> 13 September 2013, 09:59:51 UTC
5359b93 MIPS: DECstation I/O ASIC DMA interrupt handling fix This change complements commit d0da7c002f7b2a93582187a9e3f73891a01d8ee4 and brings clear_ioasic_irq back, renaming it to clear_ioasic_dma_irq at the same time, to make I/O ASIC DMA interrupts functional. Unlike ordinary I/O ASIC interrupts DMA interrupts need to be deasserted by software by writing 0 to the respective bit in I/O ASIC's System Interrupt Register (SIR), similarly to how CP0.Cause.IP0 and CP0.Cause.IP1 bits are handled in the CPU (the difference is SIR DMA interrupt bits are R/W0C so there's no need for an RMW cycle). Otherwise the handler is reentered over and over again. The only current user is the DEC LANCE Ethernet driver and its extremely uncommon DMA memory error handler that does not care when exactly the interrupt is cleared. Anticipating the use of DMA interrupts by the Zilog SCC driver this change however exports clear_ioasic_dma_irq for device drivers to choose the right application-specific sequence to clear the request explicitly rather than calling it implicitly in the .irq_eoi handler of `struct irq_chip'. Previously these interrupts were cleared in the .end handler of the said structure, before it was removed. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/5826/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org> 13 September 2013, 09:57:40 UTC
daed128 MIPS: DECstation HRT initialization rearrangement Not all I/O ASIC versions have the free-running counter implemented, an early revision used in the 5000/1xx models aka 3MIN and 4MIN did not have it. Therefore we cannot unconditionally use it as a clock source. Fortunately if not implemented its register slot has a fixed value so it is enough if we check for the value at the end of the calibration period being the same as at the beginning. This also means we need to look for another high-precision clock source on the systems affected. The 5000/1xx can have an R4000SC processor installed where the CP0 Count register can be used as a clock source. Unfortunately all the R4k DECstations suffer from the missed timer interrupt on CP0 Count reads erratum, so we cannot use the CP0 timer as a clock source and a clock event both at a time. However we never need an R4k clock event device because all DECstations have a DS1287A RTC chip whose periodic interrupt can be used as a clock source. This gives us the following four configuration possibilities for I/O ASIC DECstations: 1. No I/O ASIC counter and no CP0 timer, e.g. R3k 5000/1xx (3MIN). 2. No I/O ASIC counter but the CP0 timer, i.e. R4k 5000/150 (4MIN). 3. The I/O ASIC counter but no CP0 timer, e.g. R3k 5000/240 (3MAX+). 4. The I/O ASIC counter and the CP0 timer, e.g. R4k 5000/260 (4MAX+). For #1 and #2 this change stops the I/O ASIC free-running counter from being installed as a clock source of a 0Hz frequency. For #2 it also arranges for the CP0 timer to be used as a clock source rather than a clock event device, because having an accurate wall clock is more important than a high-precision interval timer. For #3 there is no change. For #4 the change makes the I/O ASIC free-running counter installed as a clock source so that the CP0 timer can be used as a clock event device. Unfortunately the use of the CP0 timer as a clock event device relies on a succesful completion of c0_compare_interrupt. That never happens, because while waiting for a CP0 Compare interrupt to happen the function spins in a loop reading the CP0 Count register. This makes the CP0 Count erratum trigger reliably causing the interrupt waited for to be lost in all cases. As a result #4 resorts to using the CP0 timer as a clock source as well, just as #2. However we want to keep this separate arrangement in case (hope) c0_compare_interrupt is eventually rewritten such that it avoids the erratum. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/5825/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org> 13 September 2013, 09:56:13 UTC
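The detection idea in commit daed128 (an unimplemented I/O ASIC counter reads back a fixed value, so comparing the readings at the start and end of the calibration period is enough) might look roughly like this. It is a hedged Python model, not the MIPS kernel code: `read_counter` and `wait_calibration_period` are hypothetical callables standing in for the register read and the calibration delay.

```python
from typing import Callable


def counter_is_implemented(
    read_counter: Callable[[], int],
    wait_calibration_period: Callable[[], None],
) -> bool:
    """Report whether a free-running counter actually advances.

    Both callables are hypothetical stand-ins for hardware access.
    """
    start = read_counter()
    wait_calibration_period()
    end = read_counter()
    # An unimplemented register slot returns the same fixed value both times.
    return end != start
```

A caller could use the result to decide whether to register the counter as a clock source at all, rather than installing one with a 0Hz frequency as the commit says happened before. The sketch ignores the unlikely case of a running counter wrapping to exactly its starting value.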
08b67fa blackfin: Ignore generated uImages We have the build infrastructure to generate uImages so we should ignore the resulting generated files. Signed-off-by: Mark Brown <broonie@linaro.org> Acked-by: Mike Frysinger <vapier@gentoo.org> 13 September 2013, 02:42:39 UTC
1d899fd blackfin: Add STMMAC platform data to enable dwmac1000 driver on BF60x. - Enable GMAC - Set propler DMA PBL - Disable DMA store and forward mode - Select PTP input clock from MII clock. Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> Signed-off-by: Steven Miao <realmz6@gmail.com> 13 September 2013, 02:42:38 UTC
e578609 bf609: adv7343: add S-Video and Component output support Signed-off-by: Scott Jiang <scott.jiang.linux@gmail.com> 13 September 2013, 02:42:36 UTC
4940c53 bf609: add adv7343 video encoder support Signed-off-by: Scott Jiang <scott.jiang.linux@gmail.com> 13 September 2013, 02:42:34 UTC
3036dcc clock: add stmmac clock for ethernet driver Signed-off-by: Steven Miao <realmz6@gmail.com> 13 September 2013, 02:42:32 UTC
206f060 blackfin: scb: Add SCB1 to SCB9 config options and data. Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> 13 September 2013, 02:42:31 UTC
24a70cf blackfin: scb: Add system crossbar init code. If SCB exists in select blackfin cpu, developer can change the SCB priority in kernel configuration. Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> Signed-off-by: Steven Miao <realmz6@gmail.com> 13 September 2013, 02:42:27 UTC
5a7d8a2 Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus Pull MIPS updates from Ralf Baechle: "This has been sitting in -next for a while with no objections and all MIPS defconfigs except one are building fine; that one platform got broken by another patch in your tree and I'm going to submit a patch separately. - a handful of fixes that didn't make 3.11 - a few bits of Octeon 3 support with more to come for a later release - platform enhancements for Octeon, ath79, Lantiq, Netlogic and Ralink SOCs - a GPIO driver for the Octeon - some dusting off of the DECstation code - the usual dose of cleanups" * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (65 commits) MIPS: DMA: Fix BUG due to smp_processor_id() in preemptible code MIPS: kexec: Fix random crashes while loading crashkernel MIPS: kdump: Skip walking indirection page for crashkernels MIPS: DECstation HRT calibration bug fixes MIPS: Export copy_from_user_page() (needed by lustre) MIPS: Add driver for the built-in PCI controller of the RT3883 SoC MIPS: DMA: For BMIPS5000 cores flush region just like non-coherent R10000 MIPS: ralink: Add support for reset-controller API MIPS: ralink: mt7620: Add cpu-feature-override header MIPS: ralink: mt7620: Add spi clock definition MIPS: ralink: mt7620: Add wdt clock definition MIPS: ralink: mt7620: Improve clock frequency detection MIPS: ralink: mt7620: This SoC has EHCI and OHCI hosts MIPS: ralink: mt7620: Add verbose ram info MIPS: ralink: Probe clocksources from OF MIPS: ralink: Add support for systick timer found on newer ralink SoC MIPS: ralink: Add support for periodic timer irq MIPS: Netlogic: Built-in DTB for XLP2xx SoC boards MIPS: Netlogic: Add support for USB on XLP2xx MIPS: Netlogic: XLP2xx update for I2C controller ... 12 September 2013, 23:14:49 UTC
e0ea404 Merge tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs Pull xfs update #2 from Ben Myers: "Here we have defrag support for v5 superblock, a number of bugfixes and a cleanup or two. - defrag support for CRC filesystems - fix endian worning in xlog_recover_get_buf_lsn - fixes for sparse warnings - fix for assert in xfs_dir3_leaf_hdr_from_disk - fix for log recovery of remote symlinks - fix for log recovery of btree root splits - fixes formemory allocation failures with ACLs - fix for assert in xfs_buf_item_relse - fix for assert in xfs_inode_buf_verify - fix an assignment in an assert that should be a test in xfs_bmbt_change_owner - remove dead code in xlog_recover_inode_pass2" * tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs: xfs: remove dead code from xlog_recover_inode_pass2 xfs: = vs == typo in ASSERT() xfs: don't assert fail on bad inode numbers xfs: aborted buf items can be in the AIL. xfs: factor all the kmalloc-or-vmalloc fallback allocations xfs: fix memory allocation failures with ACLs xfs: ensure we copy buffer type in da btree root splits xfs: set remote symlink buffer type for recovery xfs: recovery of swap extents operations for CRC filesystems xfs: swap extents operations for CRC filesystems xfs: check magic numbers in dir3 leaf verifier first xfs: fix some minor sparse warnings xfs: fix endian warning in xlog_recover_get_buf_lsn() 12 September 2013, 23:13:41 UTC
48efe45 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending Pull SCSI target updates from Nicholas Bellinger: "Lots of activity again this round for I/O performance optimizations (per-cpu IDA pre-allocation for vhost + iscsi/target), and the addition of new fabric independent features to target-core (COMPARE_AND_WRITE + EXTENDED_COPY). The main highlights include: - Support for iscsi-target login multiplexing across individual network portals - Generic Per-cpu IDA logic (kent + akpm + clameter) - Conversion of vhost to use per-cpu IDA pre-allocation for descriptors, SGLs and userspace page pointer list - Conversion of iscsi-target + iser-target to use per-cpu IDA pre-allocation for descriptors - Add support for generic COMPARE_AND_WRITE (AtomicTestandSet) emulation for virtual backend drivers - Add support for generic EXTENDED_COPY (CopyOffload) emulation for virtual backend drivers. - Add support for fast memory registration mode to iser-target (Vu) The patches to add COMPARE_AND_WRITE and EXTENDED_COPY support are of particular significance, which make us the first and only open source target to support the full set of VAAI primitives. Currently Linux clients are lacking upstream support to actually utilize these primitives. However, with server side support now in place for folks like MKP + ZAB working on the client, this logic once reserved for the highest end of storage arrays, can now be run in VMs on their laptops" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (50 commits) target/iscsi: Bump versions to v4.1.0 target: Update copyright ownership/year information to 2013 iscsi-target: Bump default TCP listen backlog to 256 target: Fix >= v3.9+ regression in PR APTPL + ALUA metadata write-out iscsi-target; Bump default CmdSN Depth to 64 iscsi-target: Remove unnecessary wait_for_completion in iscsi_get_thread_set iscsi-target: Add thread_set->ts_activate_sem + use common deallocate iscsi-target: Fix race with thread_pre_handler flush_signals + ISCSI_THREAD_SET_DIE target: remove unused including <linux/version.h> iser-target: introduce fast memory registration mode (FRWR) iser-target: generalize rdma memory registration and cleanup iser-target: move rdma wr processing to a shared function target: Enable global EXTENDED_COPY setup/release target: Add Third Party Copy (3PC) bit in INQUIRY response target: Enable EXTENDED_COPY setup in spc_parse_cdb target: Add support for EXTENDED_COPY copy offload emulation target: Avoid non-existent tg_pt_gp_mem in target_alua_state_check target: Add global device list for EXTENDED_COPY target: Make helpers non static for EXTENDED_COPY command setup target: Make spc_parse_naa_6h_vendor_specific non static ... 12 September 2013, 23:11:45 UTC
ac4de95 Merge branch 'akpm' (patches from Andrew Morton) Merge more patches from Andrew Morton: "The rest of MM. Plus one misc cleanup" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits) mm/Kconfig: add MMU dependency for MIGRATION. kernel: replace strict_strto*() with kstrto*() mm, thp: count thp_fault_fallback anytime thp fault fails thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() thp: do_huge_pmd_anonymous_page() cleanup thp: move maybe_pmd_mkwrite() out of mk_huge_pmd() mm: cleanup add_to_page_cache_locked() thp: account anon transparent huge pages into NR_ANON_PAGES truncate: drop 'oldsize' truncate_pagecache() parameter mm: make lru_add_drain_all() selective memcg: document cgroup dirty/writeback memory statistics memcg: add per cgroup writeback pages accounting memcg: check for proper lock held in mem_cgroup_update_page_stat memcg: remove MEMCG_NR_FILE_MAPPED memcg: reduce function dereference memcg: avoid overflow caused by PAGE_ALIGN memcg: rename RESOURCE_MAX to RES_COUNTER_MAX memcg: correct RESOURCE_MAX to ULLONG_MAX mm: memcg: do not trap chargers with full callstack on OOM mm: memcg: rework and document OOM waiting and wakeup ... 12 September 2013, 22:44:27 UTC
de32a81 mm/Kconfig: add MMU dependency for MIGRATION. MIGRATION must depend on MMU, or allmodconfig for the nommu sh architecture fails to build: CC mm/migrate.o mm/migrate.c: In function 'remove_migration_pte': mm/migrate.c:134:3: error: implicit declaration of function 'pmd_trans_huge' [-Werror=implicit-function-declaration] if (pmd_trans_huge(*pmd)) ^ mm/migrate.c:149:2: error: implicit declaration of function 'is_swap_pte' [-Werror=implicit-function-declaration] if (!is_swap_pte(pte)) ^ ... Also let CMA depend on MMU, or when NOMMU, if we select CMA, it will select MIGRATION by force. Signed-off-by: Chen Gang <gang.chen@asianux.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:03 UTC
6072ddc kernel: replace strict_strto*() with kstrto*() The usage of strict_strto*() is not preferred, because strict_strto*() is obsolete. Thus, kstrto*() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:03 UTC
17766dd mm, thp: count thp_fault_fallback anytime thp fault fails Currently, thp_fault_fallback in vmstat only gets incremented if a hugepage allocation fails. If current's memcg hits its limit or the page fault handler returns an error, it is incorrectly accounted as a successful thp_fault_alloc. Count thp_fault_fallback anytime the page fault handler falls back to using regular pages and only count thp_fault_alloc when a hugepage has actually been faulted. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:03 UTC
c029255 thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault() to handle fallback path. Let's consolidate code back by introducing VM_FAULT_FALLBACK return code. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:03 UTC
128ec03 thp: do_huge_pmd_anonymous_page() cleanup Minor cleanup: unindent most code of the function by inverting one condition. It's preparation for the next patch. No functional changes. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:03 UTC
3122359 thp: move maybe_pmd_mkwrite() out of mk_huge_pmd() It's confusing that mk_huge_pmd() has semantics different from mk_pte() or mk_pmd(). I spent some time debugging an issue caused by this inconsistency. Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust prototype to match mk_pte(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:03 UTC
66a0c8e mm: cleanup add_to_page_cache_locked() Make add_to_page_cache_locked() cleaner: - unindent most code of the function by inverting one condition; - streamline code no-error path; - move insert error path outside normal code path; - call radix_tree_preload_end() earlier; No functional changes. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:03 UTC
3cd14fc thp: account anon transparent huge pages into NR_ANON_PAGES We use NR_ANON_PAGES as base for reporting AnonPages to user. There's not much sense in not accounting transparent huge pages there, but add them on printing to user. Let's account transparent huge pages in NR_ANON_PAGES in the first place. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Ning Qu <quning@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:03 UTC
7caef26 truncate: drop 'oldsize' truncate_pagecache() parameter truncate_pagecache() doesn't care about old size since commit cedabed49b39 ("vfs: Fix vmtruncate() regression"). Let's drop it. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:02 UTC
5fbc461 mm: make lru_add_drain_all() selective make lru_add_drain_all() only selectively interrupt the cpus that have per-cpu free pages that can be drained. This is important in nohz mode where calling mlockall(), for example, otherwise will interrupt every core unnecessarily. This is important on workloads where nohz cores are handling 10 Gb traffic in userspace. Those CPUs do not enter the kernel and place pages into LRU pagevecs and they really, really don't want to be interrupted, or they drop packets on the floor. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:02 UTC
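The selection step described in 5fbc461 can be modeled in userspace roughly as follows. This is a toy sketch, not kernel code: the `pending_pages` mapping and the `cpus_to_drain` function are hypothetical stand-ins for the per-CPU pagevec state and for whatever CPU set the patch actually builds.

```python
def cpus_to_drain(pending_pages):
    """Return only the CPUs that have pages queued for draining.

    pending_pages maps a CPU number to the count of pages sitting in its
    per-CPU vectors (an assumption made for this model).
    """
    return sorted(cpu for cpu, count in pending_pages.items() if count > 0)


# CPUs 1 and 2 model nohz cores busy in userspace with nothing queued.
pending = {0: 3, 1: 0, 2: 0, 3: 14}
print(cpus_to_drain(pending))
```

In this model the idle-in-userspace CPUs are simply never selected, which is the property the commit message cares about: cores with nothing to drain are not interrupted.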
9cb2dc1 memcg: document cgroup dirty/writeback memory statistics Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:02 UTC
3ea67d0 memcg: add per cgroup writeback pages accounting Add memcg routines to count writeback pages, later dirty pages will also be accounted. After Kame's commit 89c06bd52fb9 ("memcg: use new logic for page stat accounting"), we can use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg has a feature to move a page from a cgroup to another one and may have race between "move" and "page stat accounting". So in order to avoid the race we have designed a new lock: mem_cgroup_begin_update_page_stat() modify page information -->(a) mem_cgroup_update_page_stat() -->(b) mem_cgroup_end_update_page_stat() It requires both (a) and (b) (writeback pages accounting) to be protected in mem_cgroup_{begin/end}_update_page_stat(). It's full no-op for !CONFIG_MEMCG, almost no-op if memcg is disabled (but compiled in), rcu read lock in the most cases (no task is moving), and spin_lock_irqsave on top in the slow path. There're two writeback interfaces to modify: test_{clear/set}_page_writeback(). And the lock order is: --> memcg->move_lock --> mapping->tree_lock Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:02 UTC
658b72c memcg: check for proper lock held in mem_cgroup_update_page_stat We should call mem_cgroup_begin_update_page_stat() before mem_cgroup_update_page_stat() to get proper locks, however the latter doesn't do any checking that we use proper locking, which would be hard. As suggested by Michal Hocko, we could at least test for rcu_read_lock_held() because RCU is held if !mem_cgroup_disabled(). Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:02 UTC
68b4876 memcg: remove MEMCG_NR_FILE_MAPPED While accounting memcg page stat, it's not worth using MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the complexity and presumed performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:02 UTC
1a36e59 memcg: reduce function dereference This function dereferences res far too often, so optimize it. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:02 UTC
3af3351 memcg: avoid overflow caused by PAGE_ALIGN Since PAGE_ALIGN aligns up (to the next page boundary), the value might overflow after PAGE_ALIGN, for example when writing the MAX value to *.limit_in_bytes. $ cat /cgroup/memory/memory.limit_in_bytes 18446744073709551615 # echo 18446744073709551615 > /cgroup/memory/memory.limit_in_bytes bash: echo: write error: Invalid argument Some user programs might depend on such behaviours (like libcg, where we read the value in a snapshot and then use it to reset the cgroup later), and that will cause confusion. So we need to fix it. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:02 UTC
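The wraparound behind 3af3351 can be reproduced with plain integer arithmetic. The sketch below is a userspace model only: it assumes a 64-bit unsigned type and 4 KiB pages, and `page_align` is a hypothetical Python helper that imitates round-up-to-page-boundary arithmetic, not the kernel macro itself.

```python
U64_MASK = (1 << 64) - 1  # model of a 64-bit unsigned type (assumption)
PAGE_SIZE = 4096          # assumption: 4 KiB pages


def page_align(value):
    """Round value up to the next page boundary with 64-bit wraparound."""
    return ((value + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)) & U64_MASK


limit = 18446744073709551615  # value read back from memory.limit_in_bytes
print(page_align(4097))       # an ordinary value rounds up to the next page
print(page_align(limit))      # the maximum wraps around instead of staying put
```

Under these assumptions the maximum value cannot survive the round-up, so a value that was read from the file no longer round-trips when written back, which matches the symptom shown in the commit message.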
6de5a8b memcg: rename RESOURCE_MAX to RES_COUNTER_MAX RESOURCE_MAX is far too general a name, change it to RES_COUNTER_MAX. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:02 UTC
34ff8dc memcg: correct RESOURCE_MAX to ULLONG_MAX Current RESOURCE_MAX is ULONG_MAX, but the value we used to set resource limit is unsigned long long, so we can set a bigger value than that, which is strange. XXX_MAX should be a reasonable max value; anything bigger than that should overflow. Notice that this change will affect user output of default *.limit_in_bytes: before change: $ cat /cgroup/memory/memory.limit_in_bytes 9223372036854775807 after change: $ cat /cgroup/memory/memory.limit_in_bytes 18446744073709551615 But it doesn't alter the API in terms of input - we can still use "echo -1 > *.limit_in_bytes" to reset the numbers to "unlimited". Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:02 UTC
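For reference, the two default values quoted in 34ff8dc can be checked against powers of two. This is only an arithmetic check on the numbers printed in the commit message; it makes no claim about how the kernel defines its constants.

```python
# Defaults quoted in the commit message, before and after the change.
before = 9223372036854775807
after = 18446744073709551615

print(before == 2**63 - 1)  # compares against the largest signed 64-bit value
print(after == 2**64 - 1)   # compares against the largest unsigned 64-bit value
print(after // before)      # integer ratio between the new and old default
```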
3812c8c mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. 
For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:02 UTC
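The two cases described in 3812c8c can be illustrated with a small userspace model. Everything here is a hypothetical sketch: the `Task` class, its `oom_memcg` field, and the `charge`, `user_fault` and `after_unwind` functions are invented for illustration and only mirror the control flow the commit message describes, not the kernel implementation.

```python
ENOMEM = 12  # errno value for "out of memory" on Linux


class Task:
    def __init__(self):
        self.held_locks = set()
        self.oom_memcg = None  # hypothetical field: memcg whose OOM we wait for


def charge(task, memcg, someone_else_handles_oom, log):
    """Model a failing charge: record state, never loop or sleep here."""
    if someone_else_handles_oom:
        task.oom_memcg = memcg           # case 2: remember where to wait later
    else:
        log.append("invoke OOM killer")  # case 1: kick the killer, do not loop
    return -ENOMEM


def after_unwind(task):
    """Stand-in for the check made once the fault stack is unwound."""
    assert not task.held_locks           # nobody sleeps with locks held
    if task.oom_memcg is not None:
        memcg, task.oom_memcg = task.oom_memcg, None
        return "sleep on OOM waitqueue of " + memcg
    return "restart fault"


def user_fault(task, memcg, someone_else_handles_oom, log):
    """Model one user fault that hits a memcg OOM during the charge."""
    task.held_locks.add("mmap_sem")
    ret = charge(task, memcg, someone_else_handles_oom, log)
    task.held_locks.clear()              # fault stack unwound, locks released
    if ret == -ENOMEM:
        log.append(after_unwind(task))
    return log


print(user_fault(Task(), "groupA", False, []))
print(user_fault(Task(), "groupA", True, []))
```

The point of the model is the ordering: the decision to sleep or restart is taken only after the locks acquired during the fault have been dropped, so an OOM victim can never be blocked by a task waiting in the charge path.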
fb2a6fc mm: memcg: rework and document OOM waiting and wakeup The memcg OOM handler open-codes a sleeping lock for OOM serialization (trylock, wait, repeat) because the required locking is so specific to memcg hierarchies. However, it would be nice if this construct would be clearly recognizable and not be as obfuscated as it is right now. Clean up as follows: 1. Remove the return value of mem_cgroup_oom_unlock() 2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock(). 3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This makes it more obvious that the task has to be on the waitqueue before attempting to OOM-trylock the hierarchy, to not miss any wakeups before going to sleep. It just didn't matter until now because it was all lumped together into the global memcg_oom_lock spinlock section. 4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope. It is protected by the hierarchical OOM-lock. 5. The memcg_oom_lock spinlock is only required to propagate the OOM lock in any given hierarchy atomically. Restrict its scope to mem_cgroup_oom_(trylock|unlock). 6. Do not wake up the waitqueue unconditionally at the end of the function. Only the lockholder has to wake up the next in line after releasing the lock. Note that the lockholder kicks off the OOM-killer, which in turn leads to wakeups from the uncharges of the exiting task. But a contender is not guaranteed to see them if it enters the OOM path after the OOM kills but before the lockholder releases the lock. Thus there has to be an explicit wakeup after releasing the lock. 7. Put the OOM task on the waitqueue before marking the hierarchy as under OOM as that is the point where we start to receive wakeups. No point in listening before being on the waitqueue. 8. Likewise, unmark the hierarchy before finishing the sleep, for symmetry. 
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:01 UTC
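The idea of an OOM trylock that covers a whole hierarchy (items 1, 2 and 5 in fb2a6fc) can be sketched as a toy model. The all-or-nothing rollback shown here is an assumption about what "propagate the OOM lock in any given hierarchy atomically" means; the `Memcg` class and the `subtree`, `oom_trylock` and `oom_unlock` helpers are hypothetical and single-threaded, with no spinlock or waitqueue.

```python
class Memcg:
    """Toy memory cgroup node; only tracks the OOM lock flag."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.oom_locked = False
        if parent is not None:
            parent.children.append(self)


def subtree(group):
    """Yield group and all of its descendants."""
    yield group
    for child in group.children:
        yield from subtree(child)


def oom_trylock(root):
    """Lock the whole hierarchy under root, or nothing at all."""
    taken = []
    for group in subtree(root):
        if group.oom_locked:
            for locked in taken:  # roll back the partial acquisition
                locked.oom_locked = False
            return False
        group.oom_locked = True
        taken.append(group)
    return True


def oom_unlock(root):
    """Release the hierarchy; deliberately has no return value."""
    for group in subtree(root):
        group.oom_locked = False


root = Memcg("root")
a = Memcg("a", root)
b = Memcg("b", a)
print(oom_trylock(a))     # a and b are free at this point
print(oom_trylock(root))  # a is already locked, so the attempt is rolled back
oom_unlock(a)
print(oom_trylock(root))  # nothing below root is locked any more
```

A failed attempt leaves no group marked, so a contender that loses the race can go to sleep and retry later without having disturbed the holder's state.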
519e524 mm: memcg: enable memcg OOM killer only for user faults System calls and kernel faults (uaccess, gup) can handle an out of memory situation gracefully and just return -ENOMEM. Enable the memcg OOM killer only for user faults, where it's really the only option available. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:01 UTC
3a13c4d x86: finish user fault error path with fatal signal The x86 fault handler bails in the middle of error handling when the task has a fatal signal pending. For a subsequent patch this is a problem in OOM situations because it relies on pagefault_out_of_memory() being called even when the task has been killed, to perform proper per-task OOM state unwinding. Shortcutting the fault like this is a rather minor optimization that saves a few instructions in rare cases. Just remove it for user-triggered faults. Use the opportunity to split the fault retry handling from actual fault errors and add locking documentation that reads surprisingly similar to ARM's. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:01 UTC
759496b arch: mm: pass userspace fault flag to generic fault handler Unlike global OOM handling, memory cgroup code will invoke the OOM killer in any OOM situation because it has no way of telling faults occurring in kernel context - which could be handled more gracefully - from user-triggered faults. Pass a flag that identifies faults originating in user space from the architecture-specific fault handlers to generic code so that memcg OOM handling can be improved. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:01 UTC
8713410 arch: mm: do not invoke OOM killer on kernel fault OOM Kernel faults are expected to handle OOM conditions gracefully (gup, uaccess etc.), so they should never invoke the OOM killer. Reserve this for faults triggered in user context when it is the only option. Most architectures already do this, fix up the remaining few. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:01 UTC
94bce45 arch: mm: remove obsolete init OOM protection The memcg code can trap tasks in the context of the failing allocation until an OOM situation is resolved. They can hold all kinds of locks (fs, mm) at this point, which makes it prone to deadlocking. This series converts memcg OOM handling into a two step process that is started in the charge context, but any waiting is done after the fault stack is fully unwound. Patches 1-4 prepare architecture handlers to support the new memcg requirements, but in doing so they also remove old cruft and unify out-of-memory behavior across architectures. Patch 5 disables the memcg OOM handling for syscalls, readahead, kernel faults, because they can gracefully unwind the stack with -ENOMEM. OOM handling is restricted to user triggered faults that have no other option. Patch 6 reworks memcg's hierarchical OOM locking to make it a little more obvious wth is going on in there: reduce locked regions, rename locking functions, reorder and document. Patch 7 implements the two-part OOM handling such that tasks are never trapped with the full charge stack in an OOM situation. This patch: Back before smart OOM killing, when faulting tasks were killed directly on allocation failures, the arch-specific fault handlers needed special protection for the init process. Now that all fault handlers call into the generic OOM killer (see commit 609838cfed97: "mm: invoke oom-killer from remaining unconverted page fault handlers"), which already provides init protection, the arch-specific leftovers can be removed. 
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Acked-by: Vineet Gupta <vgupta@synopsys.com> [arch/arc bits] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:01 UTC
f894ffa memcg: trivial cleanups Clean up some mess made by the "Soft limit rework" series, and a few other things. Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:01 UTC
e975de9 memcg, vmscan: do not fall into reclaim-all pass too quickly shrink_zone starts with soft reclaim pass first and then falls back to regular reclaim if nothing has been scanned. This behavior is natural but there is a catch. Memcg iterators, when used with the reclaim cookie, are designed to help to prevent from over reclaim by interleaving reclaimers (per node-zone-priority) so the tree walk might miss many (even all) nodes in the hierarchy e.g. when there are direct reclaimers racing with each other or with kswapd in the global case or multiple allocators reaching the limit for the target reclaim case. To make it even more complicated, targeted reclaim doesn't do the whole tree walk because it stops reclaiming once it reclaims sufficient pages. As a result groups over the limit might be missed, thus nothing is scanned, and reclaim would fall back to the reclaim all mode. This patch checks for the incomplete tree walk in shrink_zone. If no group has been visited and the hierarchy is soft reclaimable then we must have missed some groups, in which case the __shrink_zone is called again. This doesn't guarantee there will be some progress of course because the current reclaimer might be still racing with others but it would at least give a chance to start the walk without a big risk of reclaim latencies. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Glauber Costa <glommer@openvz.org> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:01 UTC
1be171d memcg: track all children over limit in the root Children in soft limit excess are currently tracked up the hierarchy in memcg->children_in_excess. Nevertheless there still might exist tons of groups that are not in hierarchy relation to the root cgroup (e.g. all first level groups if root_mem_cgroup->use_hierarchy == false). As the whole tree walk has to be done when the iteration starts at root_mem_cgroup the iterator should be able to skip the walk if there is no child above the limit without iterating them. This can be done easily if the root tracks all children rather than only hierarchical children. This is done by this patch which updates root_mem_cgroup children_in_excess if root_mem_cgroup->use_hierarchy == false so the root knows about all children in excess. Please note that this is not an issue for inner memcgs which have use_hierarchy == false because then only the single group is visited so no special optimization is necessary. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Glauber Costa <glommer@openvz.org> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:01 UTC
e839b6a memcg, vmscan: do not attempt soft limit reclaim if it would not scan anything mem_cgroup_should_soft_reclaim controls whether soft reclaim pass is done and it always says yes currently. Memcg iterators are clever to skip nodes that are not soft reclaimable quite efficiently but mem_cgroup_should_soft_reclaim can be more clever and do not start the soft reclaim pass at all if it knows that nothing would be scanned anyway. In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for the target group of the reclaim and allow the pass only if the whole subtree wouldn't be skipped. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Glauber Costa <glommer@openvz.org> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:01 UTC
7d910c0 memcg: track children in soft limit excess to improve soft limit Soft limit reclaim has to check the whole reclaim hierarchy while doing the first pass of the reclaim. This leads to a higher system time which can be visible especially when there are many groups in the hierarchy. This patch adds a per-memcg counter of children in excess. It also restores MEM_CGROUP_TARGET_SOFTLIMIT into mem_cgroup_event_ratelimit for proper batching. If a group crosses soft limit for the first time it increases parent's children_in_excess up the hierarchy. Similarly, if a group gets below the limit, it decreases the counter. The transition phase is recorded in soft_contributed flag. mem_cgroup_soft_reclaim_eligible then uses this information to better decide whether to skip the node or the whole subtree. The rule is simple. Skip only the node if it has children in excess; otherwise skip the whole subtree. This has been tested by a stream IO (dd if=/dev/zero of=file with 4*MemTotal size) which is quite sensitive to overhead during reclaim. The load is running in a group with soft limit set to 0 and without any limit. Apart from that there was a hierarchy with ~500, 2k and 8k groups (two groups on each level) without any pages in them. 
base denotes to the kernel on which the whole series is based on, rework is the kernel before this patch and reworkoptim is with this patch applied: * Run with soft limit set to 0 Elapsed 0-0-limit/base: min: 88.21 max: 94.61 avg: 91.73 std: 2.65 runs: 3 0-0-limit/rework: min: 76.05 [86.2%] max: 79.08 [83.6%] avg: 77.84 [84.9%] std: 1.30 runs: 3 0-0-limit/reworkoptim: min: 77.98 [88.4%] max: 80.36 [84.9%] avg: 78.92 [86.0%] std: 1.03 runs: 3 System 0.5k-0-limit/base: min: 34.86 max: 36.42 avg: 35.89 std: 0.73 runs: 3 0.5k-0-limit/rework: min: 43.26 [124.1%] max: 48.95 [134.4%] avg: 46.09 [128.4%] std: 2.32 runs: 3 0.5k-0-limit/reworkoptim: min: 46.98 [134.8%] max: 50.98 [140.0%] avg: 48.49 [135.1%] std: 1.77 runs: 3 Elapsed 0.5k-0-limit/base: min: 88.50 max: 97.52 avg: 93.92 std: 3.90 runs: 3 0.5k-0-limit/rework: min: 75.92 [85.8%] max: 78.45 [80.4%] avg: 77.34 [82.3%] std: 1.06 runs: 3 0.5k-0-limit/reworkoptim: min: 75.79 [85.6%] max: 79.37 [81.4%] avg: 77.55 [82.6%] std: 1.46 runs: 3 System 2k-0-limit/base: min: 34.57 max: 37.65 avg: 36.34 std: 1.30 runs: 3 2k-0-limit/rework: min: 64.17 [185.6%] max: 68.20 [181.1%] avg: 66.21 [182.2%] std: 1.65 runs: 3 2k-0-limit/reworkoptim: min: 49.78 [144.0%] max: 52.99 [140.7%] avg: 51.00 [140.3%] std: 1.42 runs: 3 Elapsed 2k-0-limit/base: min: 92.61 max: 97.83 avg: 95.03 std: 2.15 runs: 3 2k-0-limit/rework: min: 78.33 [84.6%] max: 84.08 [85.9%] avg: 81.09 [85.3%] std: 2.35 runs: 3 2k-0-limit/reworkoptim: min: 75.72 [81.8%] max: 78.57 [80.3%] avg: 76.73 [80.7%] std: 1.30 runs: 3 System 8k-0-limit/base: min: 39.78 max: 42.09 avg: 41.09 std: 0.97 runs: 3 8k-0-limit/rework: min: 200.86 [504.9%] max: 265.42 [630.6%] avg: 241.80 [588.5%] std: 29.06 runs: 3 8k-0-limit/reworkoptim: min: 53.70 [135.0%] max: 54.89 [130.4%] avg: 54.43 [132.5%] std: 0.52 runs: 3 Elapsed 8k-0-limit/base: min: 95.11 max: 98.61 avg: 96.81 std: 1.43 runs: 3 8k-0-limit/rework: min: 246.96 [259.7%] max: 331.47 [336.1%] avg: 301.32 [311.2%] std: 38.52 runs: 3 
8k-0-limit/reworkoptim: min: 76.79 [80.7%] max: 81.71 [82.9%] avg: 78.97 [81.6%] std: 2.05 runs: 3 System time is increased by 30-40% but it is reduced a lot comparing to kernel without this patch. The higher time can be explained by the fact that the original soft reclaim scanned at priority 0 so it was much more effective for this workload (which is basically touch once and writeback). The Elapsed time looks better though (~20%). * Run with no soft limit set System 0-no-limit/base: min: 42.18 max: 50.38 avg: 46.44 std: 3.36 runs: 3 0-no-limit/rework: min: 40.57 [96.2%] max: 47.04 [93.4%] avg: 43.82 [94.4%] std: 2.64 runs: 3 0-no-limit/reworkoptim: min: 40.45 [95.9%] max: 45.28 [89.9%] avg: 42.10 [90.7%] std: 2.25 runs: 3 Elapsed 0-no-limit/base: min: 75.97 max: 78.21 avg: 76.87 std: 0.96 runs: 3 0-no-limit/rework: min: 75.59 [99.5%] max: 80.73 [103.2%] avg: 77.64 [101.0%] std: 2.23 runs: 3 0-no-limit/reworkoptim: min: 77.85 [102.5%] max: 82.42 [105.4%] avg: 79.64 [103.6%] std: 1.99 runs: 3 System 0.5k-no-limit/base: min: 44.54 max: 46.93 avg: 46.12 std: 1.12 runs: 3 0.5k-no-limit/rework: min: 42.09 [94.5%] max: 46.16 [98.4%] avg: 43.92 [95.2%] std: 1.69 runs: 3 0.5k-no-limit/reworkoptim: min: 42.47 [95.4%] max: 45.67 [97.3%] avg: 44.06 [95.5%] std: 1.31 runs: 3 Elapsed 0.5k-no-limit/base: min: 78.26 max: 81.49 avg: 79.65 std: 1.36 runs: 3 0.5k-no-limit/rework: min: 77.01 [98.4%] max: 80.43 [98.7%] avg: 78.30 [98.3%] std: 1.52 runs: 3 0.5k-no-limit/reworkoptim: min: 76.13 [97.3%] max: 77.87 [95.6%] avg: 77.18 [96.9%] std: 0.75 runs: 3 System 2k-no-limit/base: min: 62.96 max: 69.14 avg: 66.14 std: 2.53 runs: 3 2k-no-limit/rework: min: 76.01 [120.7%] max: 81.06 [117.2%] avg: 78.17 [118.2%] std: 2.12 runs: 3 2k-no-limit/reworkoptim: min: 62.57 [99.4%] max: 66.10 [95.6%] avg: 64.53 [97.6%] std: 1.47 runs: 3 Elapsed 2k-no-limit/base: min: 76.47 max: 84.22 avg: 79.12 std: 3.60 runs: 3 2k-no-limit/rework: min: 89.67 [117.3%] max: 93.26 [110.7%] avg: 91.10 [115.1%] std: 
1.55 runs: 3 2k-no-limit/reworkoptim: min: 76.94 [100.6%] max: 79.21 [94.1%] avg: 78.45 [99.2%] std: 1.07 runs: 3 System 8k-no-limit/base: min: 104.74 max: 151.34 avg: 129.21 std: 19.10 runs: 3 8k-no-limit/rework: min: 205.23 [195.9%] max: 285.94 [188.9%] avg: 258.98 [200.4%] std: 38.01 runs: 3 8k-no-limit/reworkoptim: min: 161.16 [153.9%] max: 184.54 [121.9%] avg: 174.52 [135.1%] std: 9.83 runs: 3 Elapsed 8k-no-limit/base: min: 125.43 max: 181.00 avg: 154.81 std: 22.80 runs: 3 8k-no-limit/rework: min: 254.05 [202.5%] max: 355.67 [196.5%] avg: 321.46 [207.6%] std: 47.67 runs: 3 8k-no-limit/reworkoptim: min: 193.77 [154.5%] max: 222.72 [123.0%] avg: 210.18 [135.8%] std: 12.13 runs: 3 Both System and Elapsed are within one standard deviation of the base kernel for all configurations except 8k, where both System and Elapsed are up by about 35%. I do not have a good explanation for this because no soft reclaim pass is going on: no group is above the limit, which is checked in mem_cgroup_should_soft_reclaim. I then tested a kernel build with the same configuration to see the behavior with a more general workload.
* Soft limit set to 0 for the build System 0-0-limit/base: min: 242.70 max: 245.17 avg: 243.85 std: 1.02 runs: 3 0-0-limit/rework min: 237.86 [98.0%] max: 240.22 [98.0%] avg: 239.00 [98.0%] std: 0.97 runs: 3 0-0-limit/reworkoptim: min: 241.11 [99.3%] max: 243.53 [99.3%] avg: 242.01 [99.2%] std: 1.08 runs: 3 Elapsed 0-0-limit/base: min: 348.48 max: 360.86 avg: 356.04 std: 5.41 runs: 3 0-0-limit/rework min: 286.95 [82.3%] max: 290.26 [80.4%] avg: 288.27 [81.0%] std: 1.43 runs: 3 0-0-limit/reworkoptim: min: 286.55 [82.2%] max: 289.00 [80.1%] avg: 287.69 [80.8%] std: 1.01 runs: 3 System 0.5k-0-limit/base: min: 251.77 max: 254.41 avg: 252.70 std: 1.21 runs: 3 0.5k-0-limit/rework min: 286.44 [113.8%] max: 289.30 [113.7%] avg: 287.60 [113.8%] std: 1.23 runs: 3 0.5k-0-limit/reworkoptim: min: 252.18 [100.2%] max: 253.16 [99.5%] avg: 252.62 [100.0%] std: 0.41 runs: 3 Elapsed 0.5k-0-limit/base: min: 347.83 max: 353.06 avg: 350.04 std: 2.21 runs: 3 0.5k-0-limit/rework min: 290.19 [83.4%] max: 295.62 [83.7%] avg: 293.12 [83.7%] std: 2.24 runs: 3 0.5k-0-limit/reworkoptim: min: 293.91 [84.5%] max: 294.87 [83.5%] avg: 294.29 [84.1%] std: 0.42 runs: 3 System 2k-0-limit/base: min: 263.05 max: 271.52 avg: 267.94 std: 3.58 runs: 3 2k-0-limit/rework min: 458.99 [174.5%] max: 468.31 [172.5%] avg: 464.45 [173.3%] std: 3.97 runs: 3 2k-0-limit/reworkoptim: min: 267.10 [101.5%] max: 279.38 [102.9%] avg: 272.78 [101.8%] std: 5.05 runs: 3 Elapsed 2k-0-limit/base: min: 372.33 max: 379.32 avg: 375.47 std: 2.90 runs: 3 2k-0-limit/rework min: 334.40 [89.8%] max: 339.52 [89.5%] avg: 337.44 [89.9%] std: 2.20 runs: 3 2k-0-limit/reworkoptim: min: 301.47 [81.0%] max: 319.19 [84.1%] avg: 307.90 [82.0%] std: 8.01 runs: 3 System 8k-0-limit/base: min: 320.50 max: 332.10 avg: 325.46 std: 4.88 runs: 3 8k-0-limit/rework min: 1115.76 [348.1%] max: 1165.66 [351.0%] avg: 1132.65 [348.0%] std: 23.34 runs: 3 8k-0-limit/reworkoptim: min: 403.75 [126.0%] max: 409.22 [123.2%] avg: 406.16 [124.8%] std: 2.28 runs: 3 
Elapsed 8k-0-limit/base: min: 475.48 max: 585.19 avg: 525.54 std: 45.30 runs: 3 8k-0-limit/rework min: 616.25 [129.6%] max: 625.90 [107.0%] avg: 620.68 [118.1%] std: 3.98 runs: 3 8k-0-limit/reworkoptim: min: 420.18 [88.4%] max: 428.28 [73.2%] avg: 423.05 [80.5%] std: 3.71 runs: 3 Apart from 8k the system time is comparable with the base kernel while Elapsed is up to 20% better with all configurations. * No soft limit set System 0-no-limit/base: min: 234.76 max: 237.42 avg: 236.25 std: 1.11 runs: 3 0-no-limit/rework min: 233.09 [99.3%] max: 238.65 [100.5%] avg: 236.09 [99.9%] std: 2.29 runs: 3 0-no-limit/reworkoptim: min: 236.12 [100.6%] max: 240.53 [101.3%] avg: 237.94 [100.7%] std: 1.88 runs: 3 Elapsed 0-no-limit/base: min: 288.52 max: 295.42 avg: 291.29 std: 2.98 runs: 3 0-no-limit/rework min: 283.17 [98.1%] max: 284.33 [96.2%] avg: 283.78 [97.4%] std: 0.48 runs: 3 0-no-limit/reworkoptim: min: 288.50 [100.0%] max: 290.79 [98.4%] avg: 289.78 [99.5%] std: 0.95 runs: 3 System 0.5k-no-limit/base: min: 286.51 max: 293.23 avg: 290.21 std: 2.78 runs: 3 0.5k-no-limit/rework min: 291.69 [101.8%] max: 294.38 [100.4%] avg: 292.97 [101.0%] std: 1.10 runs: 3 0.5k-no-limit/reworkoptim: min: 277.05 [96.7%] max: 288.76 [98.5%] avg: 284.17 [97.9%] std: 5.11 runs: 3 Elapsed 0.5k-no-limit/base: min: 294.94 max: 298.92 avg: 296.47 std: 1.75 runs: 3 0.5k-no-limit/rework min: 292.55 [99.2%] max: 294.21 [98.4%] avg: 293.55 [99.0%] std: 0.72 runs: 3 0.5k-no-limit/reworkoptim: min: 294.41 [99.8%] max: 301.67 [100.9%] avg: 297.78 [100.4%] std: 2.99 runs: 3 System 2k-no-limit/base: min: 443.41 max: 466.66 avg: 457.66 std: 10.19 runs: 3 2k-no-limit/rework min: 490.11 [110.5%] max: 516.02 [110.6%] avg: 501.42 [109.6%] std: 10.83 runs: 3 2k-no-limit/reworkoptim: min: 435.25 [98.2%] max: 458.11 [98.2%] avg: 446.73 [97.6%] std: 9.33 runs: 3 Elapsed 2k-no-limit/base: min: 330.85 max: 333.75 avg: 332.52 std: 1.23 runs: 3 2k-no-limit/rework min: 343.06 [103.7%] max: 349.59 [104.7%] avg: 345.95 
[104.0%] std: 2.72 runs: 3 2k-no-limit/reworkoptim: min: 330.01 [99.7%] max: 333.92 [100.1%] avg: 332.22 [99.9%] std: 1.64 runs: 3 System 8k-no-limit/base: min: 1175.64 max: 1259.38 avg: 1222.39 std: 34.88 runs: 3 8k-no-limit/rework: min: 1226.31 [104.3%] max: 1241.60 [98.6%] avg: 1233.74 [100.9%] std: 6.25 runs: 3 8k-no-limit/reworkoptim: min: 1023.45 [87.1%] max: 1056.74 [83.9%] avg: 1038.92 [85.0%] std: 13.69 runs: 3 Elapsed 8k-no-limit/base: min: 613.36 max: 619.60 avg: 616.47 std: 2.55 runs: 3 8k-no-limit/rework: min: 627.56 [102.3%] max: 642.33 [103.7%] avg: 633.44 [102.8%] std: 6.39 runs: 3 8k-no-limit/reworkoptim: min: 545.89 [89.0%] max: 555.36 [89.6%] avg: 552.06 [89.6%] std: 4.37 runs: 3 These numbers look good as well. System time is around 100% (surprisingly better for the 8k case) and Elapsed follows that trend. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Glauber Costa <glommer@openvz.org> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:00 UTC
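The bracketed percentages in the tables above appear to be each measurement expressed relative to the matching base-kernel value. A minimal sketch of that arithmetic, assuming this reading is correct (the function name relative_percent is hypothetical and not taken from any test script):

```c
#include <assert.h>

/*
 * Hypothetical helper: express a measurement from a patched kernel as a
 * percentage of the corresponding base-kernel measurement.
 */
double relative_percent(double measured, double base)
{
    assert(base > 0.0);              /* a zero base would be meaningless */
    return measured / base * 100.0;
}
```

For example, the 8k-no-limit Elapsed minimum of 545.89 for reworkoptim against 613.36 for base works out to roughly 89.0%, which matches the bracketed value in the table.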
de57780 memcg: enhance memcg iterator to support predicates The caller of the iterator might know that some nodes or even whole subtrees should be skipped, but there is no way to tell the iterator about that, so the only choice left is to let the iterator visit each node and do the selection outside of the iterating code. This, however, doesn't scale well for hierarchies with many groups where only a few groups are interesting. This patch adds the mem_cgroup_iter_cond variant of the iterator with a callback which gets called for every visited node. There are three possible ways the callback can influence the walk: the node is visited, the node is skipped but the tree walk continues down the tree, or the whole subtree of the current group is skipped. [hughd@google.com: fix memcg-less page reclaim] Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Glauber Costa <glommer@openvz.org> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:00 UTC
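The three callback outcomes described in this commit can be illustrated with a small userspace sketch. This is not the kernel's mem_cgroup_iter_cond; the struct layout, enum names and walk_cond function below are assumptions made only to show the idea of a predicate-driven pre-order walk:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes mirroring the three outcomes in the text. */
enum filter_result {
    VISIT,      /* hand this node to the caller */
    SKIP,       /* skip this node but keep walking down the tree */
    SKIP_TREE   /* skip this node and its whole subtree */
};

/* Hypothetical tree node; not the kernel's struct mem_cgroup. */
struct node {
    int id;
    int interesting;        /* nonzero if the caller wants this node */
    int interesting_below;  /* number of interesting descendants */
    struct node *child;     /* first child */
    struct node *sibling;   /* next sibling */
};

typedef enum filter_result (*filter_fn)(const struct node *n);

/* Predicate that prunes subtrees with nothing interesting in them. */
enum filter_result only_interesting(const struct node *n)
{
    if (n->interesting)
        return VISIT;
    return n->interesting_below ? SKIP : SKIP_TREE;
}

/* Predicate equivalent to an unconditional walk. */
enum filter_result visit_all(const struct node *n)
{
    (void)n;
    return VISIT;
}

/*
 * Pre-order walk over n and its siblings. Ids of visited nodes are stored
 * in out[] (up to cap entries); the running count is returned.
 */
size_t walk_cond(const struct node *n, filter_fn cond,
                 int *out, size_t cap, size_t count)
{
    assert(cond != NULL);
    for (; n; n = n->sibling) {
        enum filter_result r = cond(n);

        if (r == SKIP_TREE)
            continue;                       /* never descend */
        if (r == VISIT && count < cap)
            out[count++] = n->id;
        count = walk_cond(n->child, cond, out, cap, count);
    }
    return count;
}
```

With only_interesting as the predicate, a subtree whose top node reports no interesting descendants is never entered, which is the scaling benefit the commit message describes for hierarchies where only a few groups matter.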
a5b7c87 vmscan, memcg: do softlimit reclaim also for targeted reclaim Soft reclaim has been done only for the global reclaim (both background and direct). Since "memcg: integrate soft reclaim tighter with zone shrinking code" there is no reason for this limitation anymore, as the soft limit reclaim doesn't use any special code paths and it is a part of the zone shrinking code which is used by both global and targeted reclaims. From the semantic point of view it is natural to consider the soft limit before touching all groups in the hierarchy tree which is hitting the hard limit, because the soft limit tells us where to push back when there is memory pressure. It is not important whether the pressure comes from the limit or from imbalanced zones. This patch simply enables soft reclaim unconditionally in mem_cgroup_should_soft_reclaim so it is enabled for both global and targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn about the root of the reclaim to know where to stop checking the soft limit state of parents up the hierarchy. Say we have A (over soft limit) with a child B (below s.l., hit the hard limit), which in turn has two children, C and D (D below s.l.). B is the source of the outside memory pressure now for D but we shouldn't soft reclaim it because it is behaving well under the B subtree and we can still reclaim from C (presumably it is over the limit). mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the hierarchy at B (root of the memory pressure).
Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Glauber Costa <glommer@openvz.org> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:00 UTC
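The A/B/C/D example in this commit can be worked through with a small userspace sketch of the "stop climbing at the reclaim root" rule. The struct and function names are hypothetical and the usage numbers are made up for illustration; this is not the kernel's mem_cgroup_soft_reclaim_eligible:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup. */
struct group {
    struct group *parent;
    unsigned long usage;
    unsigned long soft_limit;
};

bool over_soft_limit(const struct group *g)
{
    return g->usage > g->soft_limit;
}

/*
 * A group is eligible if it, or any ancestor up to and including the
 * reclaim root, is over its soft limit. A NULL root models global
 * reclaim, where the climb continues to the top of the hierarchy.
 */
bool soft_reclaim_eligible(const struct group *g, const struct group *root)
{
    assert(g != NULL);
    for (; g; g = g->parent) {
        if (over_soft_limit(g))
            return true;
        if (g == root)
            break;      /* do not look above the source of the pressure */
    }
    return false;
}
```

With B as the reclaim root, D is not eligible because the climb stops at B and never sees that A is over its soft limit, while C (assumed over its own limit) is still eligible. Under global reclaim the same D becomes eligible through A.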
e883110 memcg: get rid of soft-limit tree infrastructure Now that the soft limit is integrated to the reclaim directly the whole soft-limit tree infrastructure is not needed anymore. Rip it out. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Glauber Costa <glommer@openvz.org> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:00 UTC
3b38722 memcg, vmscan: integrate soft reclaim tighter with zone shrinking code This patchset has been sitting out of tree for quite some time without any objections. I would be really happy if it made it into 3.12. I do not want to push it too hard but I think this work is basically ready and waiting longer doesn't help. The basic idea is quite simple. Pull soft reclaim into shrink_zone in the first step and get rid of the previous soft reclaim infrastructure. shrink_zone is done in two passes now. First it tries to do the soft limit reclaim and it falls back to reclaim-all mode if no group is over the limit or no pages have been scanned. The second pass happens at the same priority so the only time we waste is the memcg tree walk, which has been updated in the third step to have only negligible overhead. As a bonus we will get rid of a _lot_ of code by this and soft reclaim will not stand out like before, when it wasn't integrated into the zone shrinking code and it reclaimed at priority 0 (the testing results show that some workloads suffer from such an aggressive reclaim). The clean up is in a separate patch because I felt it would be easier to review that way. The second step is soft limit reclaim integration into targeted reclaim. It should be rather straightforward. Soft limit has been used only for the global reclaim so far but it makes sense for any kind of pressure coming from up-the-hierarchy, including targeted reclaim. The third step (patches 4-8) addresses the tree walk overhead by enhancing memcg iterators to enable skipping whole subtrees and tracking the number of over-soft-limit children at each level of the hierarchy. This information is updated the same way the old soft limit tree was updated (from memcg_check_events) so we shouldn't see any additional overhead. In fact mem_cgroup_update_soft_limit is much simpler than the tree manipulation done previously.
__shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for mem_cgroup_iter so the decision whether a particular group should be visited is done at the iterator level, which allows us to decide to skip the whole subtree as well (if there is no child in excess). This reduces the tree walk overhead considerably. * TEST 1 ======== My primary test case was a parallel kernel build with 2 groups (make is running with -j8 with a distribution .config in a separate cgroup without any hard limit) on a 32 CPU machine booted with 1GB memory, with both builds bound by taskset to Node 0 cpus. I was mostly interested in 2 setups: the default with no soft limit set, and a soft limit of 0 set for both groups. The first one should tell us whether the rework regresses the default behavior while the second one should show us improvements in an extreme case where both workloads are always over the soft limit. /usr/bin/time -v has been used to collect the statistics and each configuration had 3 runs after fresh boot without any other load on the system. base is mmotm-2013-07-18-16-40; rework is all 8 patches applied on top of base. * No-limit User no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6 no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6 System no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6 no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6 Elapsed no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6 no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6 The results are within noise. Elapsed time has a bigger variance but the average looks good.
* 0-limit User 0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6 0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6 System 0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6 0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6 Elapsed 0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6 0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6 The improvement is really huge here (even bigger than with my previous testing and I suspect that this highly depends on the storage). Page fault statistics tell us at least part of the story: Minor 0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6 0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6 Major 0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6 0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6 Same as with my previous testing, Minor faults are more or less within noise but the Major fault count is well below the base kernel. While this looks like a nice win it is fair to say that the 0-limit configuration is quite artificial. So I was playing with 0-no-limit loads as well. * TEST 2 ======== The following results are from a 2-group configuration on a 16GB machine (single NUMA node). - A running stream IO (dd if=/dev/zero of=local.file bs=1024) with 2*TotalMem with 0 soft limit. - B running a mem_eater which consumes TotalMem-1G without any limit. The mem_eater consumes the memory in 100 chunks with a 1s nap after each mmap+populate so that both loads have a chance to fight for the memory. The expected result is that B shouldn't be reclaimed and A shouldn't see a big dropdown in elapsed time.
User base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3 rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3 System base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3 rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3 Elapsed base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3 rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3 System time improved slightly as well as Elapsed. My previous testing has shown worse numbers but this again seems to depend on the storage speed. My theory is that the writeback doesn't catch up and prio-0 soft reclaim falls into wait on writeback page too often in the base kernel. The patched kernel doesn't do that because the soft reclaim is done from the kswapd/direct reclaim context. This can be seen nicely on the following graph. Group A's usage_in_bytes regularly drops really low. All 3 runs http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png and a detail of a single run http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png mem_eater seems to be doing better as well. It gets to the full allocation size faster as can be seen on the following graph: http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png /proc/meminfo collected during the test also shows that the rework kernel hasn't swapped that much (well, almost not at all): base: max: 123900 K avg: 56388.29 K rework: max: 300 K avg: 128.68 K kswapd and direct reclaim statistics are of no use unfortunately because soft reclaim is not accounted properly as the counters are hidden by global_reclaim() checks in the base kernel. * TEST 3 ======== Another test was the same configuration as TEST2 except the stream IO was replaced by a single kbuild (16 parallel jobs bound to Node0 cpus same as in TEST1) and mem_eater allocated TotalMem-200M so kbuild had only 200MB left.
Kbuild did better with the rework kernel here as well: User base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3 rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3 System base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3 rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3 Elapsed base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3 rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3 Minor base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3 rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3 Major base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3 rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3 Again we can see a significant improvement in Elapsed (it also seems to be more stable), there is a huge dropdown for the Major page faults and much less swapping: base: max: 583736 K avg: 112547.43 K rework: max: 4012 K avg: 124.36 K Graphs from all three runs show the variability of the kbuild quite nicely. It even seems that it took longer after every run with the base kernel, which would be quite surprising as the source tree for the build is removed and caches are dropped after each run so the build operates on freshly extracted sources every time. http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png My other testing shows that this is just a matter of timing and other runs behave differently; the std for Elapsed time is similar (~50). Example of other three runs: http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png So to wrap this up. The series is still doing well and improves the soft limit.
The testing results for a bunch of cgroups with both stream IO and kbuild loads can be found in "memcg: track children in soft limit excess to improve soft limit". This patch: Memcg soft reclaim has traditionally been triggered from the global reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim then picked up the group which exceeds the soft limit the most and reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. The infrastructure requires per-node-zone trees which hold over-limit groups and keep them up-to-date (via memcg_check_events), which is not cost free. Although this overhead hasn't turned out to be a bottleneck, the implementation is suboptimal because mem_cgroup_update_tree has no idea which zones consumed memory over the limit, so we could easily end up having a group on a node-zone tree having only a few pages from that node-zone. This patch doesn't try to fix node-zone trees management because integrating soft reclaim into zone shrinking seems much easier and more appropriate for several reasons. First of all, 0 priority reclaim was a crude hack which might lead to big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot of dirty/writeback pages). Soft reclaim should also be applicable to the targeted reclaim, which is awkward right now without additional hacks. Last but not least, the whole infrastructure eats quite some code. After this patch shrink_zone is done in 2 passes. First it tries to do the soft reclaim if appropriate (only for global reclaim for now to keep compatible with the original state) and falls back to ignoring the soft limit if no group is eligible for soft reclaim or nothing has been scanned during the first pass. Only groups which are over their soft limit, or which have a parent up the hierarchy that is over its limit, are considered eligible during the first pass.
Soft limit tree which is not necessary anymore will be removed in the follow up patch to make this patch smaller and easier to review. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Glauber Costa <glommer@openvz.org> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Glauber Costa <glommer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:00 UTC
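The two-pass scheme described in this commit can be sketched in userspace as follows. The flat array of groups, the SCAN_BATCH constant and the function names are all assumptions for illustration; the sketch ignores the hierarchy (eligibility through a parent in excess) and does not model reclaim priority, so it is not the kernel's shrink_zone:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SCAN_BATCH 32UL  /* hypothetical per-group scan budget per pass */

/* Hypothetical stand-in for a memory cgroup's state. */
struct group {
    unsigned long usage;
    unsigned long soft_limit;
    unsigned long scanned;   /* pages scanned in this group so far */
};

/*
 * One pass over all groups. With soft_only set, groups at or below
 * their soft limit are left alone. Returns the pages scanned.
 */
unsigned long shrink_pass(struct group *groups, size_t n, bool soft_only)
{
    unsigned long total = 0;

    assert(groups != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct group *g = &groups[i];
        unsigned long batch;

        if (soft_only && g->usage <= g->soft_limit)
            continue;
        batch = g->usage < SCAN_BATCH ? g->usage : SCAN_BATCH;
        g->scanned += batch;
        total += batch;
    }
    return total;
}

/*
 * First try the soft limit pass; if nothing was eligible or nothing was
 * scanned, fall back to a pass that ignores the soft limit.
 */
unsigned long shrink_two_pass(struct group *groups, size_t n)
{
    unsigned long scanned = shrink_pass(groups, n, true);

    if (scanned == 0)
        scanned = shrink_pass(groups, n, false);
    return scanned;
}
```

When at least one group is over its soft limit, only such groups are touched; when none is, every group is scanned in the fallback pass, so the only extra cost is the first walk that found nothing.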
c33bd83 memcg: remove redundant code in mem_cgroup_force_empty_write() vfs guarantees the cgroup won't be destroyed, so it's redundant to get a css reference. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 22:38:00 UTC
26935fb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 4 from Al Viro: "list_lru pile, mostly" This came out of Andrew's pile, Al ended up doing the merge work so that Andrew didn't have to. Additionally, a few fixes. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits) super: fix for destroy lrus list_lru: dynamically adjust node arrays shrinker: Kill old ->shrink API. shrinker: convert remaining shrinkers to count/scan API staging/lustre/libcfs: cleanup linux-mem.h staging/lustre/ptlrpc: convert to new shrinker API staging/lustre/obdclass: convert lu_object shrinker to count/scan API staging/lustre/ldlm: convert to shrinkers to count/scan API hugepage: convert huge zero page shrinker to new shrinker API i915: bail out earlier when shrinker cannot acquire mutex drivers: convert shrinkers to new count/scan API fs: convert fs shrinkers to new scan/count API xfs: fix dquot isolation hang xfs-convert-dquot-cache-lru-to-list_lru-fix xfs: convert dquot cache lru to list_lru xfs: rework buffer dispose list tracking xfs-convert-buftarg-lru-to-generic-code-fix xfs: convert buftarg LRU to generic code fs: convert inode and dentry shrinking to be node aware vmscan: per-node deferred work ... 12 September 2013, 22:01:38 UTC
bf2ba3b Merge branch 'for-next' into for-linus 12 September 2013, 21:54:48 UTC
3cc69b6 Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC fixes from Olof Johansson: "A small batch of fixes that have trickled in over the last week of the merge window. Also included are few small devicetree updates for sunxi, since it enables me to use one of their newer boards (cubieboard2) for additional test coverage. The support for that SoC is new for 3.12, so there's no exposure to new regressions due to it" * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: ARM: dts: sun7i: olinuxino-micro: Enable the EMAC ARM: dts: sun7i: cubieboard2: Enable the EMAC ARM: dts: sun7i: Add the muxing options for the EMAC ARM: dts: sun7i: Enable the Ethernet in the A20 i2c: davinci: Fix bad dev_get_platdata() conversion ARM: vexpress: allow dcscb and tc2_pm in a combined ARMv6+v7 build ARM: shmobile: lager: Do not use register_type field of struct sh_eth_plat_data ARM: pxa: ssp: Check return values from phandle lookups ARM: PCI: versatile: Fix SMAP register offsets ARM: PCI: versatile: Fix PCI I/O ARM: PCI: versatile: Fix map_irq function to match hardware ARM: ep93xx: Don't use modem interface on the second UART ARM: shmobile: r8a7779: Update early timer initialisation order 12 September 2013, 20:59:31 UTC
0e6a1fb Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm Pull ARM fixes from Russell King: "Just two fixes here - one for the recent addition of Neon stuff which causes problems when this is built as a module. The other was one spotted by Olof with the fixed-HZ stuff. Last patch (which is at the very top) is not a fix per-se, but an almost-end-of-merge window sorting of the select symbols in arch/arm/Kconfig to keep them as akpm would like to reduce unnecessary conflicts. I've also taken the liberty this time to add a comment at the end to discourage the endless "add the next select to the bottom of a nicely sorted list" syndrome" * 'fixes' of git://git.linaro.org/people/rmk/linux-arm: ARM: sort arch/arm/Kconfig ARM: fix forced-HZ values ARM: 7835/2: fix modular build of xor_blocks() with NEON enabled 12 September 2013, 20:58:35 UTC
1d7b24f Merge tag 'nfs-for-3.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes (part 2) from Trond Myklebust: "Bugfixes: - Fix a few credential reference leaks resulting from the SP4_MACH_CRED NFSv4.1 state protection code. - Fix the SUNRPC bloatometer footprint: convert a 256K hashtable into the intended 64 byte structure. - Fix a long standing XDR issue with FREE_STATEID - Fix a potential WARN_ON spamming issue - Fix a missing dprintk() kuid conversion New features: - Enable the NFSv4.1 state protection support for the WRITE and COMMIT operations" * tag 'nfs-for-3.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: No, I did not intend to create a 256KiB hashtable sunrpc: Add missing kuids conversion for printing NFSv4.1: sp4_mach_cred: WARN_ON -> WARN_ON_ONCE NFSv4.1: sp4_mach_cred: no need to ref count creds NFSv4.1: fix SECINFO* use of put_rpccred NFSv4.1: sp4_mach_cred: ask for WRITE and COMMIT NFSv4.1 fix decode_free_stateid 12 September 2013, 20:39:34 UTC
68f0d9d vfs: make d_path() get the root path under RCU This avoids the spinlocks and refcounts in the d_path() sequence too (used by /proc and various other entities). See commit 8b19e34188a3 for the equivalent getcwd() system call path. And unlike getcwd(), d_path() doesn't copy the result to user space, so I don't need to fear _that_ particular bug happening again. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 20:24:55 UTC
171b3f0 ARM: sort arch/arm/Kconfig Keep arch/arm/Kconfig select statements sorted alphabetically. I've added a comment at the bottom of the main bank for CONFIG_ARM to this effect so hopefully this will keep things more in order. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> 12 September 2013, 20:24:42 UTC
3272c54 vfs: use __getname/__putname for getcwd() system call It's a pathname. It should use the pathname allocators and deallocators, and PATH_MAX instead of PAGE_SIZE. Never mind that the two are commonly the same. With this, the allocations scale up nicely too, and I can do getcwd() system calls at a rate of about 300M/s, with no lock contention anywhere. Of course, nobody sane does that, especially since getcwd() is traditionally a very slow operation in Unix. But this was also the simplest way to benchmark the prepend_path() improvements by Waiman, and once I saw the profiles I couldn't leave it well enough alone. But apart from being a performance improvement (from using per-cpu slab allocators instead of the raw page allocator), it's actually a valid and real cleanup. Signed-off-by: Linus "OCD" Torvalds <torvalds@linux-foundation.org> 12 September 2013, 19:40:15 UTC
f7ec00b ARM: dts: sun7i: olinuxino-micro: Enable the EMAC The A20-olinuxino-micro has the EMAC wired in. Enable it in the DT so that we can use it. Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com> Signed-off-by: Olof Johansson <olof@lixom.net> 12 September 2013, 19:22:48 UTC
0547433 ARM: dts: sun7i: cubieboard2: Enable the EMAC The Cubieboard2, just like its A10 counterpart, has the Ethernet wired in. Enable it in the DT. Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com> Signed-off-by: Olof Johansson <olof@lixom.net> 12 September 2013, 19:22:43 UTC
756084c ARM: dts: sun7i: Add the muxing options for the EMAC The A20 has several muxing options for the EMAC. Yet, the currently supported boards only use one set of them. Add that pin set to the DTSI. Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com> Signed-off-by: Olof Johansson <olof@lixom.net> 12 September 2013, 19:22:39 UTC
2e804d0 ARM: dts: sun7i: Enable the Ethernet in the A20 The Allwinner A20 SoC also has the EMAC found on the A10 and A10s. Enable the support for it in the DTSI. Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com> Signed-off-by: Olof Johansson <olof@lixom.net> 12 September 2013, 19:22:32 UTC
ff812d7 vfs: don't copy things to user space holding the rcu readlock Oops. That wasn't very smart. We don't actually need the RCU lock any more by the time we copy the cwd string to user space, but I had stupidly surrounded the whole thing with it. Introduced by commit 8b19e34188a3 ("vfs: make getcwd() get the root and pwd path under rcu") Is-a-big-hairy-idiot: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 18:57:01 UTC
5223161 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds Pull led updates from Bryan Wu: "Sorry for the late pull request, since I'm just back from vacation. LED subsystem updates for 3.12: - pca9633 driver DT supporting and pca9634 chip supporting - restore legacy device attributes for lp5521 - other fixing and updates" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds: (28 commits) leds: wm831x-status: Request a REG resource leds: trigger: ledtrig-backlight: Fix invalid memory access in fb_event notification callback leds-pca963x: Fix device tree parsing leds-pca9633: Rename to leds-pca963x leds-pca9633: Add mutex to the ledout register leds-pca9633: Unique naming of the LEDs leds-pca9633: Add support for PCA9634 leds: lp5562: use LP55xx common macros for device attributes Documentation: leds-lp5521,lp5523: update device attribute information leds: lp5523: remove unnecessary writing commands leds: lp5523: restore legacy device attributes leds: lp5523: LED MUX configuration on initializing leds: lp5523: make separate API for loading engine leds: lp5521: remove unnecessary writing commands leds: lp5521: restore legacy device attributes leds: lp55xx: add common macros for device attributes leds: lp55xx: add common data structure for program Documentation: leds: Fix a typo leds: ss4200: Fix incorrect placement of __initdata leds: clevo-mail: Fix incorrect placement of __initdata ... 12 September 2013, 18:35:33 UTC
e5d0c87 Merge tag 'iommu-updates-v3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull IOMMU Updates from Joerg Roedel: "This round the updates contain: - A new driver for the Freescale PAMU IOMMU from Varun Sethi. This driver has cooked for a while and required changes to the IOMMU-API and infrastructure that were already merged before. - Updates for the ARM-SMMU driver from Will Deacon - Various fixes, the most important one is probably a fix from Alex Williamson for a memory leak in the VT-d page-table freeing code In summary not all that much. The biggest part in the diffstat is the new PAMU driver" * tag 'iommu-updates-v3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: intel-iommu: Fix leaks in pagetable freeing iommu/amd: Fix resource leak in iommu_init_device() iommu/amd: Clean up unnecessary MSI/MSI-X capability find iommu/arm-smmu: Simplify VMID and ASID allocation iommu/arm-smmu: Don't use VMIDs for stage-1 translations iommu/arm-smmu: Tighten up global fault reporting iommu/arm-smmu: Remove broken big-endian check iommu/fsl: Remove unnecessary 'fsl-pamu' prefixes iommu/fsl: Fix whitespace problems noticed by git-am iommu/fsl: Freescale PAMU driver and iommu implementation. iommu/fsl: Add additional iommu attributes required by the PAMU driver. powerpc: Add iommu domain pointer to device archdata iommu/exynos: Remove dead code (set_prefbuf) 12 September 2013, 18:29:26 UTC
d5adf7e Merge tag 'stable/for-linus-3.12-rc0-tag-three' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull Xen balloon driver bug-fixes from Stefano Stabellini: - fix a preemption bug in xen/balloon.c; - remove a harmful BUG_ON in xen/balloon.c that can trigger in non-erroneous situations. * tag 'stable/for-linus-3.12-rc0-tag-three' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/balloon: remove BUG_ON in increase_reservation xen/balloon: ensure preemption is disabled when using a scratch page 12 September 2013, 18:28:24 UTC
02b9735 Merge tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management fixes from Rafael Wysocki: "All of these commits are fixes that have emerged recently and some of them fix bugs introduced during this merge window. Specifics: 1) ACPI-based PCI hotplug (ACPIPHP) fixes related to spurious events After the recent ACPIPHP changes we've seen some interesting breakage on a system that triggers device check notifications during boot for non-existing devices. Although those notifications are really spurious, we should be able to deal with them nevertheless and that shouldn't introduce too much overhead. Four commits to make that work properly. 2) Memory hotplug and hibernation mutual exclusion rework This was meant to be a cleanup, but it happens to fix a classical ABBA deadlock between system suspend/hibernation and ACPI memory hotplug which is possible if they are started roughly at the same time. Three commits rework memory hotplug so that it doesn't acquire pm_mutex and make hibernation use device_hotplug_lock which prevents it from racing with memory hotplug. 3) ACPI Intel LPSS (Low-Power Subsystem) driver crash fix The ACPI LPSS driver crashes during boot on Apple Macbook Air with Haswell that has slightly unusual BIOS configuration in which one of the LPSS device's _CRS method doesn't return all of the information expected by the driver. Fix from Mika Westerberg, for stable. 4) ACPICA fix related to Store->ArgX operation AML interpreter fix for obscure breakage that causes AML to be executed incorrectly on some machines (observed in practice). From Bob Moore. 5) ACPI core fix for PCI ACPI device objects lookup There still are cases in which there is more than one ACPI device object matching a given PCI device and we don't choose the one that the BIOS expects us to choose, so this makes the lookup take more criteria into account in those cases. 
6) Fix to prevent cpuidle from crashing in some rare cases If the result of cpuidle_get_driver() is NULL, which can happen on some systems, cpuidle_driver_ref() will crash trying to use that pointer and Daniel Fu's fix prevents that from happening. 7) cpufreq fixes related to CPU hotplug Stephen Boyd reported a number of concurrency problems with cpufreq related to CPU hotplug which are addressed by a series of fixes from Srivatsa S Bhat and Viresh Kumar. 8) cpufreq fix for time conversion in time_in_state attribute Time conversion carried out by cpufreq when user space attempts to read /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state won't work correctly if cputime_t doesn't map directly to jiffies. Fix from Andreas Schwab. 9) Revert of a troublesome cpufreq commit Commit 7c30ed5 (cpufreq: make sure frequency transitions are serialized) was intended to address some known concurrency problems in cpufreq related to the ordering of transitions, but unfortunately it introduced several problems of its own, so I decided to revert it now and address the original problems later in a more robust way. 10) Intel Haswell CPU models for intel_pstate from Nell Hardcastle. 11) cpufreq fixes related to system suspend/resume The recent cpufreq changes that made it preserve CPU sysfs attributes over suspend/resume cycles introduced a possible NULL pointer dereference that caused it to crash during the second attempt to suspend. Three commits from Srivatsa S Bhat fix that problem and a couple of related issues. 12) cpufreq locking fix cpufreq_policy_restore() should acquire the lock for reading, but it acquires it for writing. 
Fix from Lan Tianyu" * tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits) cpufreq: Acquire the lock in cpufreq_policy_restore() for reading cpufreq: Prevent problems in update_policy_cpu() if last_cpu == new_cpu cpufreq: Restructure if/else block to avoid unintended behavior cpufreq: Fix crash in cpufreq-stats during suspend/resume intel_pstate: Add Haswell CPU models Revert "cpufreq: make sure frequency transitions are serialized" cpufreq: Use signed type for 'ret' variable, to store negative error values cpufreq: Remove temporary fix for race between CPU hotplug and sysfs-writes cpufreq: Synchronize the cpufreq store_*() routines with CPU hotplug cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock cpufreq: Split __cpufreq_remove_dev() into two parts cpufreq: Fix wrong time unit conversion cpufreq: serialize calls to __cpufreq_governor() cpufreq: don't allow governor limits to be changed when it is disabled ACPI / bind: Prefer device objects with _STA to those without it ACPI / hotplug / PCI: Avoid parent bus rescans on spurious device checks ACPI / hotplug / PCI: Use _OST to notify firmware about notify status ACPI / hotplug / PCI: Avoid doing too much for spurious notifies ACPICA: Fix for a Store->ArgX when ArgX contains a reference to a field. ACPI / hotplug / PCI: Don't trim devices before scanning the namespace ... 12 September 2013, 18:22:45 UTC
75acebf Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Various fixes. The -g perf report lockup you reported is only partially addressed, patches that fix the excessive runtime are still being worked on" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix uncore PCI fixed counter handling uprobes: Fix utask->depth accounting in handle_trampoline() perf/x86: Add constraint for IVB CYCLE_ACTIVITY:CYCLES_LDM_PENDING perf: Fix up MMAP2 buffer space reservation perf tools: Add attr->mmap2 support perf kvm: Fix sample_type manipulation perf evlist: Fix id pos in perf_evlist__open() perf trace: Handle perf.data files with no tracepoints perf session: Separate progress bar update when processing events perf trace: Check if MAP_32BIT is defined perf hists: Fix formatting of long symbol names perf evlist: Fix parsing with no sample_id_all bit set perf tools: Add test for parsing with no sample_id_all bit perf trace: Check control+C more often 12 September 2013, 17:44:54 UTC
b55ee28 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Performance regression fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix load balancing performance regression in should_we_balance() 12 September 2013, 17:44:13 UTC
8b19e34 vfs: make getcwd() get the root and pwd path under rcu This allows us to skip all the crazy spinlocks and reference count updates, and instead use the fs sequence read-lock to get an atomic snapshot of the root and cwd information. We might want to make the rule that "prepend_path()" is always called with the RCU lock held, but the RCU lock nests fine and this is the minimal fix. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 17:35:47 UTC
5762482 vfs: move get_fs_root_and_pwd() to single caller Let's not pollute the include files with inline functions that are only used in a single place. Especially not if we decide we might want to change the semantics of said function to make it more efficient.. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 17:12:47 UTC
b7c09ad Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This is against 3.11-rc7, but was pulled and tested against your tree as of yesterday. We do have two small incrementals queued up, but I wanted to get this bunch out the door before I hop on an airplane. This is a fairly large batch of fixes, performance improvements, and cleanups from the usual Btrfs suspects. We've included Stefan Behrens' work to index subvolume UUIDs, which is targeted at speeding up send/receive with many subvolumes or snapshots in place. It closes a long standing performance issue that was built in to the disk format. Mark Fasheh's offline dedup work is also here. In this case offline means the FS is mounted and active, but the dedup work is not done inline during file IO. This is a building block where utilities are able to ask the FS to dedup a series of extents. The kernel takes care of verifying the data involved really is the same. 
Today this involves reading both extents, but we'll continue to evolve the patches" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits) Btrfs: optimize key searches in btrfs_search_slot Btrfs: don't use an async starter for most of our workers Btrfs: only update disk_i_size as we remove extents Btrfs: fix deadlock in uuid scan kthread Btrfs: stop refusing the relocation of chunk 0 Btrfs: fix memory leak of uuid_root in free_fs_info btrfs: reuse kbasename helper btrfs: return btrfs error code for dev excl ops err Btrfs: allow partial ordered extent completion Btrfs: convert all bug_ons in free-space-cache.c Btrfs: add support for asserts Btrfs: adjust the fs_devices->missing count on unmount Btrf: cleanup: don't check for root_refs == 0 twice Btrfs: fix for patch "cleanup: don't check the same thing twice" Btrfs: get rid of one BUG() in write_all_supers() Btrfs: allocate prelim_ref with a slab allocater Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl Btrfs: fix race between removing a dev and writing sbs Btrfs: remove ourselves from the cluster list under lock ... 12 September 2013, 16:58:51 UTC
1812997 dcache: get/release read lock in read_seqbegin_or_lock() & friend This patch modifies read_seqbegin_or_lock() and need_seqretry() to use newly introduced read_seqlock_excl() and read_sequnlock_excl() primitives so that they won't change the sequence number even if they fall back to take the lock. This is OK as no change to the protected data structure is being made. It will prevent one fallback to lock taking from cascading into a series of lock taking reducing performance because of the sequence number change. It will also allow other sequence readers to go forward while an exclusive reader lock is taken. This patch also updates some of the inaccurate comments in the code. Signed-off-by: Waiman Long <Waiman.Long@hp.com> To: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 16:25:23 UTC
1370e97 seqlock: Add a new locking reader type The sequence lock (seqlock) was originally designed for the cases where the readers do not need to block the writers by making the readers retry the read operation when the data change. Since then, the use cases have been expanded to include situations where a thread does not need to change the data (effectively a reader) at all but have to take the writer lock because it can't tolerate changes to the protected structure. Some examples are the d_path() function and the getcwd() syscall in fs/dcache.c where the functions take the writer lock on rename_lock even though they don't need to change anything in the protected data structure at all. This is inefficient as a reader is now blocking other sequence number reading readers from moving forward by pretending to be a writer. This patch tries to eliminate this inefficiency by introducing a new type of locking reader to the seqlock locking mechanism. This new locking reader will try to take an exclusive lock preventing other writers and locking readers from going forward. However, it won't affect the progress of the other sequence number reading readers as the sequence number won't be changed. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 12 September 2013, 16:25:23 UTC
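The locking-reader idea described in 1370e97 and used by 1812997 can be illustrated with a small model. This is only a sketch in Python, not the kernel's C implementation: the class name SeqLock and the methods write_lock, write_unlock, read_begin and read_retry are hypothetical names chosen for the illustration, while read_seqlock_excl and read_sequnlock_excl borrow the primitive names mentioned in the commit message. It assumes the usual seqlock convention that an odd sequence value means a write is in progress, and it simplifies read_begin so that it never spins.

```python
import threading


class SeqLock:
    """Toy model of a sequence lock with a locking reader (illustration only)."""

    def __init__(self):
        self.sequence = 0              # even: no write in progress; odd: writer active
        self._lock = threading.Lock()  # serializes writers and locking readers

    # Writer side: take the lock and bump the sequence so that
    # sequence-number readers can detect the modification.
    def write_lock(self):
        self._lock.acquire()
        self.sequence += 1

    def write_unlock(self):
        self.sequence += 1
        self._lock.release()

    # Sequence-number reader side: sample the counter, read the data,
    # then ask whether the read has to be retried.
    def read_begin(self):
        return self.sequence

    def read_retry(self, start):
        return (start & 1) != 0 or self.sequence != start

    # Locking reader: excludes writers and other locking readers,
    # but leaves the sequence number untouched.
    def read_seqlock_excl(self):
        self._lock.acquire()

    def read_sequnlock_excl(self):
        self._lock.release()


def demo():
    sl = SeqLock()
    start = sl.read_begin()

    # A locking reader comes and goes; the sequence reader is not disturbed.
    sl.read_seqlock_excl()
    sl.read_sequnlock_excl()
    unaffected = not sl.read_retry(start)

    # A real writer comes and goes; the sequence reader must now retry.
    sl.write_lock()
    sl.write_unlock()
    must_retry = sl.read_retry(start)

    return unaffected, must_retry
```

In this model, a caller that falls back to taking the lock only to read does so through read_seqlock_excl, so concurrent sequence-number readers are not forced into retries of their own, which is the cascading effect the 1812997 message describes.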