Jingoo Han [Fri, 9 Nov 2012 03:04:44 +0000 (14:04 +1100)]
backlight: lms283gf05: use devm_gpio_request_one
By using devm_gpio_request_one it is possible to set the direction
and initial value in one shot. Thus, using devm_gpio_request_one
can make the code simpler.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Acked-by: Marek Vasut <marex@denx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Fri, 9 Nov 2012 03:04:42 +0000 (14:04 +1100)]
backlight: locomolcd: fix checkpatch error and warning
This patch fixes the checkpatch error and warning as below:
WARNING: line over 80 characters
WARNING: space prohibited between function name and open parenthesis '('
ERROR: trailing statements should be on next line
Also, long comments are fixed for the preferred style and
unnecessary lines are removed.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Fri, 9 Nov 2012 03:04:41 +0000 (14:04 +1100)]
backlight: ili9320: fix checkpatch error and warning
This patch fixes the checkpatch error and warning as below:
WARNING: please, no space before tabs
WARNING: please, no spaces at the start of a line
WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
WARNING: braces {} are not necessary for single statement blocks
ERROR: code indent should use tabs where possible
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Fri, 9 Nov 2012 03:04:41 +0000 (14:04 +1100)]
backlight: hp680_bl: fix checkpatch error and warning
This patch fixes the checkpatch error and warning as below:
WARNING: please, no space before tabs
WARNING: please, no spaces at the start of a line
ERROR: do not initialise statics to 0 or NULL
ERROR: code indent should use tabs where possible
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Fri, 9 Nov 2012 03:04:39 +0000 (14:04 +1100)]
backlight: da903x_bl: use dev_get_drvdata() instead of platform_get_drvdata()
dev_get_drvdata() can be used instead of platform_get_drvdata()
to make the code smaller.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Mike Rapoport <mike@compulab.co.il> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Leach [Fri, 9 Nov 2012 03:04:37 +0000 (14:04 +1100)]
include/linux/init.h: use the stringify operator for the __define_initcall macro
Currently the __define_initcall() macro takes three arguments, fn, id and
level. The level argument is exactly the same as the id argument but
wrapped in quotes. To overcome this need to specify three arguments to
the __define_initcall macro, where one argument is the stringification of
another, we can just use the stringification macro instead.
Signed-off-by: Matthew Leach <matthew@mattleach.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wen Congyang [Fri, 9 Nov 2012 03:04:37 +0000 (14:04 +1100)]
Documentation/kernel-parameters.txt: update mem= option's spec according to its implementation
Current mem= implementation seems buggy because the specification and
implementation don't match. The current mem= has been working for many
years and it's not buggy - it works as expected. So we should update the
specification.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
PBM generated with current tools do not have a whitespace between the
digits. Therefore the pnmtologo tool fails to gernerate the required
C-Array for these images. This patch fixes that behaviour and can handle
both 'old style' and 'new style' PBM files.
Signed-off-by: Andreas Bießmann <andreas@biessmann.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wanpeng Li [Fri, 9 Nov 2012 03:04:36 +0000 (14:04 +1100)]
mm/memblock: reduce overhead in binary search
When checking that the indicated address belongs to the memory region, the
memory regions are checked one by one through a binary search, which will
be time consuming.
If the indicated address isn't in the memory region, then we needn't do
the time-consuming search. Add a check on the indicated address for that
purpose.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Fri, 9 Nov 2012 03:04:35 +0000 (14:04 +1100)]
swap: add a simple detector for inappropriate swapin readahead
The swapin readahead does a blind readahead whether or not the swapin is
sequential. This is ok for harddisk because large reads have relatively
small costs and if the readahead pages are unneeded they can be reclaimed
easily. But for SSD devices large reads are more expensive than small
one. If readahead pages are unneeded, reading them in caused significant
overhead
This patch addes a simple random read detection similar to file mmap
readahead. If a random read is detected, swapin readahead will be
skipped. This improves a lot for a swap workload with random IO in a fast
SSD.
I run anonymous mmap write micro benchmark, which will triger swapin/swapout.
For both harddisk and SSD, the randwrite swap workload run time is reduced
significantly. Sequential write swap workload hasn't chanage.
Interestingly, the randwrite harddisk test is improved too. This might be
because swapin readahead needs to allocate extra memory, which further
tights memory pressure, so more swapout/swapin.
Signed-off-by: Shaohua Li <shli@fusionio.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Fri, 9 Nov 2012 03:04:34 +0000 (14:04 +1100)]
drop_caches: add some documentation and info message
I would like to resurrect Dave's patch. The last time it was posted was
here https://lkml.org/lkml/2010/9/16/250 and there didn't seem to be any
strong opposition.
Kosaki was worried about possible excessive logging when somebody drops
caches too often (but then he claimed he didn't have a strong opinion on
that) but I would say opposite. If somebody does that then I would really
like to know that from the log when supporting a system because it almost
for sure means that there is something fishy going on. It is also worth
mentioning that only root can write drop caches so this is not an flooding
attack vector.
I am bringing that up again because this can be really helpful when
chasing strange performance issues which (surprise surprise) turn out to
be related to artificially dropped caches done because the admin thinks
this would help...
I have just refreshed the original patch on top of the current mm tree
but I could live with KERN_INFO as well if people think that KERN_NOTICE
is too hysterical.
: From: Dave Hansen <dave@linux.vnet.ibm.com>
: Date: Fri, 12 Oct 2012 14:30:54 +0200
:
: There is plenty of anecdotal evidence and a load of blog posts
: suggesting that using "drop_caches" periodically keeps your system
: running in "tip top shape". Perhaps adding some kernel
: documentation will increase the amount of accurate data on its use.
:
: If we are not shrinking caches effectively, then we have real bugs.
: Using drop_caches will simply mask the bugs and make them harder
: to find, but certainly does not fix them, nor is it an appropriate
: "workaround" to limit the size of the caches.
:
: It's a great debugging tool, and is really handy for doing things
: like repeatable benchmark runs. So, add a bit more documentation
: about it, and add a little KERN_NOTICE. It should help developers
: who are chasing down reclaim-related bugs.
[mhocko@suse.cz: refreshed to current -mm tree] Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commits 2139cbe627b89 ("cma: fix counting of isolated pages") and d95ea5d18e69951 ("cma: fix watermark checking") introduced a reliable
method of free page accounting when memory is being allocated from CMA
regions, so the workaround introduced earlier by commit 49f223a9cd96c72
("mm: trigger page reclaim in alloc_contig_range() to stabilise
watermarks") can be finally removed.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: cma: skip watermarks check for already isolated blocks in split_free_page()
Since commit 2139cbe627b8 ("cma: fix counting of isolated pages") free
pages in isolated pageblocks are not accounted to NR_FREE_PAGES counters,
so watermarks check is not required if one operates on a free page in
isolated pageblock.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Fri, 9 Nov 2012 03:04:34 +0000 (14:04 +1100)]
mm, oom: fix race when specifying a thread as the oom origin
test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
specify that current should be killed first if an oom condition occurs in
between the two calls.
The usage is
short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
...
compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);
to store the thread's oom_score_adj, temporarily change it to the maximum
score possible, and then restore the old value if it is still the same.
This happens to still be racy, however, if the user writes
OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
The compare_swap_oom_score_adj() will then incorrectly reset the old value
prior to the write of OOM_SCORE_ADJ_MAX.
To fix this, introduce a new oom_flags_t member in struct signal_struct
that will be used for per-thread oom killer flags. KSM and swapoff can
now use a bit in this member to specify that threads should be killed
first in oom conditions without playing around with oom_score_adj.
This also allows the correct oom_score_adj to always be shown when reading
/proc/pid/oom_score.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Fri, 9 Nov 2012 03:04:33 +0000 (14:04 +1100)]
mm, oom: change type of oom_score_adj to short
The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
so this range can be represented by the signed short type with no
functional change. The extra space this frees up in struct signal_struct
will be used for per-thread oom kill flags in the next patch.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Randy Dunlap [Fri, 9 Nov 2012 03:04:32 +0000 (14:04 +1100)]
mm: fix slab.c kernel-doc warnings
Fix new kernel-doc warnings in mm/slab.c:
Warning(mm/slab.c:2358): No description found for parameter 'cachep'
Warning(mm/slab.c:2358): Excess function parameter 'name' description in '__kmem_cache_create'
Warning(mm/slab.c:2358): Excess function parameter 'size' description in '__kmem_cache_create'
Warning(mm/slab.c:2358): Excess function parameter 'align' description in '__kmem_cache_create'
Warning(mm/slab.c:2358): Excess function parameter 'ctor' description in '__kmem_cache_create'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rafael Aquini [Fri, 9 Nov 2012 03:04:32 +0000 (14:04 +1100)]
mm: introduce putback_movable_pages()
The patch "mm: introduce compaction and migration for virtio ballooned
pages" hacks around putback_lru_pages() in order to allow ballooned pages
to be re-inserted on balloon page list as if a ballooned page was like a
LRU page.
As ballooned pages are not legitimate LRU pages, this patch introduces
putback_movable_pages() to properly cope with cases where the isolated
pageset contains ballooned pages and LRU pages, thus fixing the mentioned
inelegant hack around putback_lru_pages().
Signed-off-by: Rafael Aquini <aquini@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <andi@firstfloor.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rafael Aquini [Fri, 9 Nov 2012 03:04:32 +0000 (14:04 +1100)]
virtio_balloon: introduce migration primitives to balloon pages
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a
guest, thus imposing performance penalties associated with the reduced
number of transparent huge pages that could be used by the guest workload.
Besides making balloon pages movable at allocation time and introducing
the necessary primitives to perform balloon page migration/compaction,
this patch also introduces the following locking scheme, in order to
enhance the syncronization methods for accessing elements of struct
virtio_balloon, thus providing protection against concurrent access
introduced by parallel memory migration threads.
- balloon_lock (mutex) : synchronizes the access demand to elements of
struct virtio_balloon and its queue operations;
Signed-off-by: Rafael Aquini <aquini@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <andi@firstfloor.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rafael Aquini [Fri, 9 Nov 2012 03:04:31 +0000 (14:04 +1100)]
mm: introduce compaction and migration for ballooned pages
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a
guest, thus imposing performance penalties associated with the reduced
number of transparent huge pages that could be used by the guest workload.
This patch introduces the helper functions as well as the necessary
changes to teach compaction and migration bits how to cope with pages
which are part of a guest memory balloon, in order to make them movable by
memory compaction procedures.
Signed-off-by: Rafael Aquini <aquini@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <andi@firstfloor.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rafael Aquini [Fri, 9 Nov 2012 03:04:30 +0000 (14:04 +1100)]
mm: introduce a common interface for balloon pages mobility
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a
guest, thus imposing performance penalties associated with the reduced
number of transparent huge pages that could be used by the guest workload.
This patch introduces a common interface to help a balloon driver on
making its page set movable to compaction, and thus allowing the system to
better leverage the compation efforts on memory defragmentation.
Signed-off-by: Rafael Aquini <aquini@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <andi@firstfloor.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rafael Aquini [Fri, 9 Nov 2012 03:04:30 +0000 (14:04 +1100)]
mm: redefine address_space.assoc_mapping
Overhaul struct address_space.assoc_mapping renaming it to
address_space.private_data and its type is redefined to void*. By this
approach we consistently name the .private_* elements from struct
address_space as well as allow extended usage for address_space
association with other data structures through ->private_data.
Also, all users of old ->assoc_mapping element are converted to reflect
its new name and type change.
Signed-off-by: Rafael Aquini <aquini@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <andi@firstfloor.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a
guest, thus imposing performance penalties associated with the reduced
number of transparent huge pages that could be used by the guest workload.
This patchset follows the main idea discussed at 2012 LSFMMS session:
"Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/
to introduce the required changes to the virtio_balloon driver, as well as
the changes to the core compaction & migration bits, in order to make
those subsystems aware of ballooned pages and allow memory balloon pages
become movable within a guest, thus avoiding the aforementioned
fragmentation issue
Following are numbers that prove this patch benefits on allowing
compaction to be more effective at memory ballooned guests.
Results for STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite,
running on a 4gB RAM KVM guest which was ballooning 512mB RAM in 64mB
chunks, at every minute (inflating/deflating), while test was running:
Introduce MIGRATEPAGE_SUCCESS as the default return code for
address_space_operations.migratepage() method and documents the expected
return code for the same method in failure cases.
Signed-off-by: Rafael Aquini <aquini@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <andi@firstfloor.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fix the x86-64 cache alignment code to take pgoff into account. Use the
x86 and MIPS cache alignment code as the basis for a generic cache
alignment function.
The old x86 code will always align the mmap to aliasing boundaries,
even if the program mmaps the file with a non-zero pgoff.
If program A mmaps the file with pgoff 0, and program B mmaps the file
with pgoff 1. The old code would align the mmaps, resulting in misaligned
pages:
A: 0123
B: 123
After this patch, they are aligned so the pages line up:
A: 0123
B: 123
Proposed by Rik van Riel.
Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Implement vm_unmapped_area() using the rb_subtree_gap and highest_vm_end
information to look up for suitable virtual address space gaps.
struct vm_unmapped_area_info is used to define the desired allocation
request:
- lowest or highest possible address matching the remaining constraints
- desired gap length
- low/high address limits that the gap must fit into
- alignment mask and offset
Also update the generic arch_get_unmapped_area[_topdown] functions to make
use of vm_unmapped_area() instead of implementing a brute force search.
Signed-off-by: Michel Lespinasse <walken@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Fri, 9 Nov 2012 03:04:23 +0000 (14:04 +1100)]
mm: rearrange vm_area_struct for fewer cache misses
The kernel walks the VMA rbtree in various places, including the page
fault path. However, the vm_rb node spanned two cache lines, on 64 bit
systems with 64 byte cache lines (most x86 systems).
Rearrange vm_area_struct a little, so all the information we need to do a
VMA tree walk is in the first cache line.
Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Define vma->rb_subtree_gap as the largest gap between any vma in the
subtree rooted at that vma, and their predecessor. Or, for a recursive
definition, vma->rb_subtree_gap is the max of:
- vma->vm_start - vma->vm_prev->vm_end
- rb_subtree_gap fields of the vmas pointed by vma->rb.rb_left and
vma->rb.rb_right
This will allow get_unmapped_area_* to find a free area of the right size
in O(log(N)) time, instead of potentially having to do a linear walk
across all the VMAs.
Also define mm->highest_vm_end as the vm_end field of the highest vma, so
that we can easily check if the following gap is suitable.
This does have the potential to make unmapping VMAs more expensive,
especially for processes with very large numbers of VMAs, where the VMA
rbtree can grow quite deep.
Signed-off-by: Michel Lespinasse <walken@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andi Kleen [Fri, 9 Nov 2012 03:04:21 +0000 (14:04 +1100)]
mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB
There was some desire in large applications using MAP_HUGETLB/SHM_HUGETLB
to use 1GB huge pages on some mappings, and stay with 2MB on others. This
is useful together with NUMA policy: use 2MB interleaving on some
mappings, but 1GB on local mappings.
This patch extends the IPC/SHM syscall interfaces slightly to allow
specifying the page size.
It borrows some upper bits in the existing flag arguments and allows
encoding the log of the desired page size in addition to the *_HUGETLB
flag. When 0 is specified the default size is used, this makes the change
fully compatible.
Extending the internal hugetlb code to handle this is straight forward.
Instead of a single mount it just keeps an array of them and selects the
right mount based on the specified page size. When no page size is
specified it uses the mount of the default page size.
The change is not visible in /proc/mounts because internal mounts don't
appear there. It also has very little overhead: the additional mounts
just consume a super block, but not more memory when not used.
I also exported the new flags to the user headers (they were previously
under __KERNEL__). Right now only symbols for x86 and some other
architecture for 1GB and 2MB are defined. The interface should already
work for all other architectures though. Only architectures that define
multiple hugetlb sizes actually need it (that is currently x86, tile,
powerpc). However tile and powerpc have user configurable hugetlb sizes,
so it's not easy to add defines. A program on those architectures would
need to query sysfs and use the appropiate log2.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Fri, 9 Nov 2012 03:04:20 +0000 (14:04 +1100)]
mm: print out information of file affected by memory error
Printing out the information about which file can be affected by a memory
error in generic_error_remove_page() is helpful for user to estimate the
impact of the error.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Jan Kara <jack@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Fri, 9 Nov 2012 03:04:20 +0000 (14:04 +1100)]
mm: hwpoison: fix action_result() to print out dirty/clean
action_result() fails to print out "dirty" even if an error occurred on a
dirty pagecache, because when we check PageDirty in action_result() it was
cleared after page isolation even if it's dirty before error handling.
This can break some applications that monitor this message, so should be
fixed.
There are several callers of action_result() except page_action(), but
either of them are not for LRU pages but for free pages or kernel pages,
so we don't have to consider dirty or not for them.
Note that PG_dirty can be set outside page locks as described in commit 6746aff74da29 ("HWPOISON: shmem: call set_page_dirty() with locked page"),
so this patch does not completely closes the race window, but just narrows
it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:18 +0000 (14:04 +1100)]
slub: slub-specific propagation changes
SLUB allows us to tune a particular cache behavior with sysfs-based
tunables. When creating a new memcg cache copy, we'd like to preserve any
tunables the parent cache already had.
This can be done by tapping into the store attribute function provided by
the allocator. We of course don't need to mess with read-only fields.
Since the attributes can have multiple types and are stored internally by
sysfs, the best strategy is to issue a ->show() in the root cache, and
then ->store() in the memcg cache.
The drawback of that, is that sysfs can allocate up to a page in buffering
for show(), that we are likely not to need, but also can't guarantee. To
avoid always allocating a page for that, we can update the caches at store
time with the maximum attribute size ever stored to the root cache. We
will then get a buffer big enough to hold it. The corolary to this, is
that if no stores happened, nothing will be propagated.
It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:18 +0000 (14:04 +1100)]
slab: propagate tunable values
SLAB allows us to tune a particular cache behavior with tunables. When
creating a new memcg cache copy, we'd like to preserve any tunables the
parent cache already had.
This could be done by an explicit call to do_tune_cpucache() after the
cache is created. But this is not very convenient now that the caches are
created from common code, since this function is SLAB-specific.
Another method of doing that is taking advantage of the fact that
do_tune_cpucache() is always called from enable_cpucache(), which is
called at cache initialization. We can just preset the values, and then
things work as expected.
It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.
This change will require us to move the assignment of root_cache in
memcg_params a bit earlier. We need this to be already set - which
memcg_kmem_register_cache will do - when we reach __kmem_cache_create()
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:18 +0000 (14:04 +1100)]
memcg: aggregate memcg cache values in slabinfo
When we create caches in memcgs, we need to display their usage
information somewhere. We'll adopt a scheme similar to /proc/meminfo,
with aggregate totals shown in the global file, and per-group information
stored in the group itself.
For the time being, only reads are allowed in the per-group cache.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:17 +0000 (14:04 +1100)]
memcg/sl[au]b: shrink dead caches
This means that when we destroy a memcg cache that happened to be empty,
those caches may take a lot of time to go away: removing the memcg
reference won't destroy them - because there are pending references, and
the empty pages will stay there, until a shrinker is called upon for any
reason.
In this patch, we will call kmem_cache_shrink() for all dead caches that
cannot be destroyed because of remaining pages. After shrinking, it is
possible that it could be freed. If this is not the case, we'll schedule
a lazy worker to keep trying.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:17 +0000 (14:04 +1100)]
memcg/sl[au]b: track all the memcg children of a kmem_cache
This enables us to remove all the children of a kmem_cache being
destroyed, if for example the kernel module it's being used in gets
unloaded. Otherwise, the children will still point to the destroyed
parent.
Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:17 +0000 (14:04 +1100)]
memcg: destroy memcg caches
Implement destruction of memcg caches. Right now, only caches where our
reference counter is the last remaining are deleted. If there are any
other reference counters around, we just leave the caches lying around
until they go away.
When that happens, a destruction function is called from the cache code.
Caches are only destroyed in process context, so we queue them up for
later processing in the general case.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:16 +0000 (14:04 +1100)]
sl[au]b: allocate objects from memcg cache
We are able to match a cache allocation to a particular memcg. If the
task doesn't change groups during the allocation itself - a rare event,
this will give us a good picture about who is the first group to touch a
cache page.
This patch uses the now available infrastructure by calling
memcg_kmem_get_cache() before all the cache allocations.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:16 +0000 (14:04 +1100)]
sl[au]b: always get the cache from its page in kmem_cache_free()
struct page already has this information. If we start chaining caches,
this information will always be more trustworthy than whatever is passed
into the function.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:16 +0000 (14:04 +1100)]
memcg: skip memcg kmem allocations in specified code regions
Create a mechanism that skip memcg allocations during certain pieces of
our core code. It basically works in the same way as
preempt_disable()/preempt_enable(): By marking a region under which all
allocations will be accounted to the root memcg.
We need this to prevent races in early cache creation, when we
allocate data using caches that are not necessarily created already.
Signed-off-by: Glauber Costa <glommer@parallels.com>
yCc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:15 +0000 (14:04 +1100)]
memcg: infrastructure to match an allocation to the right cache
The page allocator is able to bind a page to a memcg when it is allocated.
But for the caches, we'd like to have as many objects as possible in a
page belonging to the same cache.
This is done in this patch by calling memcg_kmem_get_cache in the
beginning of every allocation function. This function is patched out by
static branches when kernel memory controller is not being used.
It assumes that the task allocating, which determines the memcg in the
page allocator, belongs to the same cgroup throughout the whole process.
Misaccounting can happen if the task calls memcg_kmem_get_cache() while
belonging to a cgroup, and later on changes. This is considered
acceptable, and should only happen upon task migration.
Before the cache is created by the memcg core, there is also a possible
imbalance: the task belongs to a memcg, but the cache being allocated from
is the global cache, since the child cache is not yet guaranteed to be
ready. This case is also fine, since in this case the GFP_KMEMCG will not
be passed and the page allocator will not attempt any cgroup accounting.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:15 +0000 (14:04 +1100)]
memcg: allocate memory for memcg caches whenever a new memcg appears
Every cache that is considered a root cache (basically the "original"
caches, tied to the root memcg/no-memcg) will have an array that should be
large enough to store a cache pointer per each memcg in the system.
Theoreticaly, this is as high as 1 << sizeof(css_id), which is currently
in the 64k pointers range. Most of the time, we won't be using that much.
What goes in this patch, is a simple scheme to dynamically allocate such
an array, in order to minimize memory usage for memcg caches. Because we
would also like to avoid allocations all the time, at least for now, the
array will only grow. It will tend to be big enough to hold the maximum
number of kmem-limited memcgs ever achieved.
We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have
more than that, we'll start doubling the size of this array every time the
limit is reached.
Because we are only considering kmem limited memcgs, a natural point for
this to happen is when we write to the limit. At that point, we already
have set_limit_mutex held, so that will become our natural synchronization
mechanism.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:15 +0000 (14:04 +1100)]
slab/slub: consider a memcg parameter in kmem_create_cache
Allow a memcg parameter to be passed during cache creation. When the slub
allocator is being used, it will only merge caches that belong to the same
memcg. We'll do this by scanning the global list, and then translating
the cache to a memcg-specific cache
Default function is created as a wrapper, passing NULL to the memcg
version. We only merge caches that belong to the same memcg.
A helper is provided, memcg_css_id: because slub needs a unique cache name
for sysfs. Since this is visible, but not the canonical location for slab
data, the cache name is not used, the css_id should suffice.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:14 +0000 (14:04 +1100)]
slab: annotate on-slab caches nodelist locks
We currently provide lockdep annotation for kmalloc caches, and also
caches that have SLAB_DEBUG_OBJECTS enabled. The reason for this is that
we can quite frequently nest in the l3->list_lock lock, which is not
something trivial to avoid.
My proposal with this patch, is to extend this to caches whose slab
management object lives within the slab as well ("on_slab"). The need for
this arose in the context of testing kmemcg-slab patches. With such
patchset, we can have per-memcg kmalloc caches. So the same path that led
to nesting between kmalloc caches will could then lead to in-memcg
nesting. Because they are not annotated, lockdep will trigger.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:14 +0000 (14:04 +1100)]
fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs
Because those architectures will draw their stacks directly from the page
allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
flag, and issue the corresponding free_pages.
This code path is taken when the architecture doesn't define
CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining
architectures fall in this category.
This will guarantee that every stack page is accounted to the memcg the
process currently lives on, and will have the allocations to fail if they
go over limit.
For the time being, I am defining a new variant of THREADINFO_GFP, not to
mess with the other path. Once the slab is also tracked by memcg, we can
get rid of that flag.
Tested to successfully protect against :(){ :|:& };:
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Frederic Weisbecker <fweisbec@redhat.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Fri, 9 Nov 2012 03:04:13 +0000 (14:04 +1100)]
memcg: execute the whole memcg freeing in free_worker()
A lot of the initialization we do in mem_cgroup_create() is done with
softirqs enabled. This include grabbing a css id, which holds
&ss->id_lock->rlock, and the per-zone trees, which holds
rtpz->lock->rlock. All of those signal to the lockdep mechanism that
those locks can be used in SOFTIRQ-ON-W context. This means that the
freeing of memcg structure must happen in a compatible context, otherwise
we'll get a deadlock, like the one below, caught by lockdep:
This usage pattern could not be triggered before kmem came into play.
With the introduction of kmem stack handling, it is possible that we call
the last mem_cgroup_put() from the task destructor, which is run in an rcu
callback. Such callbacks are run with softirqs disabled, leading to the
offensive usage pattern.
In general, we have little, if any, means to guarantee in which context
the last memcg_put will happen. The best we can do is test it and try to
make sure no invalid context releases are happening. But as we add more
code to memcg, the possible interactions grow in number and expose more
ways to get context conflicts. One thing to keep in mind, is that part of
the freeing process is already deferred to a worker, such as vfree(), that
can only be called from process context.
For the moment, the only two functions we really need moved away are:
* free_css_id(), and
* mem_cgroup_remove_from_trees().
But because the later accesses per-zone info,
free_mem_cgroup_per_zone_info() needs to be moved as well. With that, we
are left with the per_cpu stats only. Better move it all.
Signed-off-by: Glauber Costa <glommer@parallels.com> Tested-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>