Afzal Mohammed [Sat, 3 Nov 2012 00:42:53 +0000 (11:42 +1100)]
rtc: omap: kicker mechanism support
The OMAP RTC IP can have a kicker feature, which prevents spurious writes
to its registers. To write to the registers, the kicker lock has to be
released first. The procedure is as follows:
1. write 0x83e70b13 to the kick0 register
2. write 0x95a4f1e0 to the kick1 register
Writing any value other than 0x83e70b13 to kick0 re-enables write locking.
More details about the kicker mechanism can be found in section 20.3.3.5.3
of the AM335X TRM @www.ti.com/am335x
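A minimal sketch of the resulting unlock/lock helpers (register offsets
and helper names are illustrative; the magic values are the ones listed
above):

  #define KICK0_REG	0x6c	/* illustrative offsets; see the TRM */
  #define KICK1_REG	0x70
  #define KICK0_VALUE	0x83e70b13
  #define KICK1_VALUE	0x95a4f1e0

  /* release the kicker lock so the RTC registers become writable */
  static void rtc_writes_enable(void __iomem *base)
  {
      writel(KICK0_VALUE, base + KICK0_REG);
      writel(KICK1_VALUE, base + KICK1_REG);
  }

  /* any other value in kick0 re-enables write locking */
  static void rtc_writes_disable(void __iomem *base)
  {
      writel(0, base + KICK0_REG);
  }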
An id table is added here and is used to distinguish the instances that
require kicker handling from the ones that don't. Beyond the kicker, the
newer IPs have more features than the legacy ones which the driver
currently doesn't handle; supporting those additional features will be
easier with the id table in place.
The older IP (of OMAP1) doesn't have a revision register as per the TRM,
so the revision register can't always be relied upon to detect features,
hence the id table is used.
While at it, replace __raw_writeb/__raw_readb with writeb/readb; this
driver is used on ARMv7 (AM335X SoC)
Signed-off-by: Afzal Mohammed <afzal@ti.com> Acked-by: Sekhar Nori <nsekhar@ti.com> Cc: Grant Likely <grant.likely@secretlab.ca> Cc: Sekhar Nori <nsekhar@ti.com> Cc: Kevin Hilman <khilman@ti.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Daniel Mack <zonque@gmail.com> Cc: Vaibhav Hiremath <hvaibhav@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alan Cox [Sat, 3 Nov 2012 00:42:52 +0000 (11:42 +1100)]
binfmt_elf: fix corner case kfree of uninitialized data
If elf_core_dump() is called and fill_note_info() fails in the kmalloc()
then it returns 0 but has not yet initialised all the needed fields. As a
result we do a kfree(randomness) after correctly skipping the thread data.
Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Paton J. Lewis [Sat, 3 Nov 2012 00:42:52 +0000 (11:42 +1100)]
epoll: support for disabling items, and a self-test app
It is not currently possible to reliably delete epoll items when using the
same epoll set from multiple threads. After calling epoll_ctl with
EPOLL_CTL_DEL, another thread might still be executing code related to an
event for that epoll item (in response to epoll_wait). Therefore the
deleting thread does not know when it is safe to delete resources
pertaining to the associated epoll item because another thread might be
using those resources.
The deleting thread could wait an arbitrary amount of time after calling
epoll_ctl with EPOLL_CTL_DEL and before deleting the item, but this is
inefficient and could result in the destruction of resources before
another thread is done handling an event returned by epoll_wait.
This patch enhances epoll_ctl to support EPOLL_CTL_DISABLE, which disables
an epoll item. If epoll_ctl returns -EBUSY in this case, then another
thread may be handling a return from epoll_wait for this item. Otherwise if
epoll_ctl returns 0, then it is safe to delete the epoll item. This
allows multiple threads to use a mutex to determine when it is safe to
delete an epoll item and its associated resources, which allows epoll
items to be deleted both efficiently and without error in a multi-threaded
environment. Note that EPOLL_CTL_DISABLE is only useful in conjunction
with EPOLLONESHOT, and using EPOLL_CTL_DISABLE on an epoll item without
EPOLLONESHOT returns -EINVAL.
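A rough user-space sketch of the deletion protocol this enables; the item
structure, mutex discipline and the EPOLL_CTL_DISABLE value are
assumptions for illustration, only the epoll_ctl() semantics follow the
description above:

  #include <sys/epoll.h>
  #include <pthread.h>
  #include <errno.h>
  #include <stdlib.h>
  #include <unistd.h>

  #ifndef EPOLL_CTL_DISABLE
  #define EPOLL_CTL_DISABLE 4	/* assumed value for this patch series */
  #endif

  /* Assumed per-item bookkeeping; the item is registered with EPOLLONESHOT. */
  struct item {
      int fd;
      pthread_mutex_t mtx;	/* also held by the thread handling events */
  };

  /* Returns 0 if the item was deleted, -1 if an event handler still owns it. */
  static int try_delete_item(int epfd, struct item *it)
  {
      pthread_mutex_lock(&it->mtx);
      if (epoll_ctl(epfd, EPOLL_CTL_DISABLE, it->fd, NULL) < 0 &&
          errno == EBUSY) {
          /* another thread may be handling an epoll_wait() return */
          pthread_mutex_unlock(&it->mtx);
          return -1;
      }
      /* disable succeeded: no other thread will touch this item again */
      epoll_ctl(epfd, EPOLL_CTL_DEL, it->fd, NULL);
      pthread_mutex_unlock(&it->mtx);
      close(it->fd);
      free(it);
      return 0;
  }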
This patch also adds a new test_epoll self-test program to both
demonstrate the need for this feature and test it.
Signed-off-by: Paton J. Lewis <palewis@adobe.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jason Baron <jbaron@redhat.com> Cc: Paul Holland <pholland@adobe.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joe Perches [Sat, 3 Nov 2012 00:42:51 +0000 (11:42 +1100)]
checkpatch: warn on unnecessary line continuations
When the previous line is not a line continuation and the current line has
a line continuation but is not a #define, emit a warning.
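For illustration, the sort of construct this would now flag - a backslash
continuation that is not part of a #define and serves no purpose (made-up
example):

  /* warns: the trailing '\' is unnecessary outside a macro definition */
  int ret = some_function(arg1, \
                          arg2);

  /* fine: continuations are required for multi-line macro definitions */
  #define SOME_MACRO(a, b) \
      do_something(a, b)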
Signed-off-by: Joe Perches <joe@perches.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This function is used by sparc, powerpc, tile and arm64 for compat support.
The patch adds a generic implementation with a wrapper for PowerPC to do
the u32->int sign extension.
The reason for a single patch covering powerpc, tile, sparc and arm64 is
to keep it bisectable, otherwise kernel building may fail with mismatched
function declarations.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Chris Metcalf <cmetcalf@tilera.com> [for tile] Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Shevchenko [Sat, 3 Nov 2012 00:42:48 +0000 (11:42 +1100)]
lib: dynamic_debug: use kbasename()
Remove the custom implementation of the functionality similar to kbasename().
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Jason Baron <jbaron@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Shevchenko [Sat, 3 Nov 2012 00:42:48 +0000 (11:42 +1100)]
string: introduce helper to get base file name from given path
There are several places in the kernel that use functionality like
basename(3), with one exception: for '/foo/bar/' we expect to get an
empty string. Let's introduce a common helper for them.
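A minimal sketch of such a helper (close to, but not necessarily identical
to, what the patch adds):

  /* Return the part of the pathname after the last '/', or the whole
   * string if there is no '/'; "/foo/bar/" yields an empty string. */
  static inline const char *kbasename(const char *path)
  {
      const char *tail = strrchr(path, '/');

      return tail ? tail + 1 : path;
  }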
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Jason Baron <jbaron@redhat.com> Cc: YAMANE Toshiaki <yamanetoshi@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
WARNING: drivers/video/backlight/built-in.o(.data+0x110): Section mismatch in reference from the variable ep93xxbl_driver to the
function .init.text:ep93xxbl_probe()
The variable ep93xxbl_driver references
the function __init ep93xxbl_probe()
If the reference is valid then annotate the
variable with __init* or __refdata (see linux/init.h) or name the variable:
*_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Sat, 3 Nov 2012 00:42:47 +0000 (11:42 +1100)]
drivers/video/backlight/lm3639_bl.c: fix up world writable sysfs file
We don't need the sysfs file to be world writable or group writable.
This file is write-only, change it to S_IWUSR (0200).
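For instance, a write-only attribute would be declared roughly as follows
(the attribute and handler names are illustrative, not the driver's actual
ones):

  /* write-only sysfs file: no show() handler, mode S_IWUSR (0200) */
  static DEVICE_ATTR(bled_mode, S_IWUSR, NULL, lm3639_bled_mode_store);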
Signed-off-by: Axel Lin <axel.lin@ingics.com> Acked-by: "G.Shark Jeong" <gshark.jeong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Sat, 3 Nov 2012 00:42:46 +0000 (11:42 +1100)]
drivers/video/backlight/max8925_bl.c: drop devm_kfree of devm_kzalloc'd data
devm_kzalloc() allocates memory that is released automatically when a
driver detaches. Thus, there is no reason to explicitly call devm_kfree()
in probe or remove functions.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Sat, 3 Nov 2012 00:42:46 +0000 (11:42 +1100)]
drivers/video/backlight/88pm860x_bl.c: drop devm_kfree of devm_kzalloc'd data
devm_kzalloc() allocates memory that is released automatically when a
driver detaches. Thus, there is no reason to explicitly call devm_kfree()
in probe or remove functions.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add 'const' to a static array that was missing it in its definition.
Also, 'const' is added to ili9320_write_regs(), because it is called by
the vgg2432a4 driver.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Ben Dooks <ben-linux@fluff.org> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add 'const' to a static array that was missing it in its definition.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Eric Miao <eric.y.miao@gmail.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add 'const' to a static array that was missing it in its definition.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Acked-by: Marek Vasut <marex@denx.de> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The mutex for accessing lp855x registers is only used for user-space
interaction: when the brightness is changed via sysfs, the mutex is
required. But the backlight class device already provides its own lock,
so the lp855x mutex is unnecessary.
Signed-off-by: Milo(Woogyom) Kim <milo.kim@ti.com> Cc: Thierry Reding <thierry.reding@avionic-design.de> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Bryan Wu <bryan.wu@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kim, Milo [Sat, 3 Nov 2012 00:42:43 +0000 (11:42 +1100)]
drivers/video/backlight/lp855x_bl.c: use generic PWM functions
The LP855x family devices support a PWM input for backlight control.
The period of the PWM is configurable on the platform side.
Platform-specific functions are no longer necessary because the generic
PWM functions are used inside the driver.
(PWM input mode)
To set the brightness, the new lp855x_pwm_ctrl() is used. If a PWM device
has not been allocated yet, devm_pwm_get() is called. The PWM consumer
name comes from the chip name, such as 'lp8550' or 'lp8556'.
To get the brightness value, no additional handling is required; the value
of 'props.brightness' is simply returned.
If the PWM driver is not ready while the LP855x driver initializes, that's
OK: the PWM device can be retrieved later, when the brightness value is
changed.
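A rough sketch of what such a PWM control path looks like with the generic
PWM API (the struct fields and the duty-cycle math are assumptions for
illustration, not the driver's actual code):

  static void lp855x_pwm_ctrl(struct lp855x *lp, int br, int max_br)
  {
      unsigned int period = lp->pdata->period_ns;	/* from platform data */
      unsigned int duty = br * period / max_br;

      if (!lp->pwm) {
          /* consumer name derived from the chip name, e.g. "lp8556" */
          struct pwm_device *pwm = devm_pwm_get(lp->dev, lp->chipname);

          if (IS_ERR(pwm))
              return;	/* PWM provider not ready yet; retry on next update */
          lp->pwm = pwm;
      }

      pwm_config(lp->pwm, duty, period);
      if (duty)
          pwm_enable(lp->pwm);
      else
          pwm_disable(lp->pwm);
  }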
Documentation is updated with an example.
Signed-off-by: Milo(Woogyom) Kim <milo.kim@ti.com> Cc: Thierry Reding <thierry.reding@avionic-design.de> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Bryan Wu <bryan.wu@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Sat, 3 Nov 2012 00:42:43 +0000 (11:42 +1100)]
backlight: tosa: use devm_gpio_request_one
By using devm_gpio_request_one it is possible to set the direction and
initial value in one shot. Thus, using devm_gpio_request_one can make the
code simpler.
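For example, the request-then-configure pair collapses into a single call
(the GPIO number and label here are illustrative):

  /* before: gpio_request() followed by gpio_direction_output() */
  ret = devm_gpio_request_one(&pdev->dev, TOSA_GPIO_BL_C20MA,
                              GPIOF_OUT_INIT_LOW, "backlight");
  if (ret)
      return ret;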
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Dmitry Baryshkov <dbaryshkov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Sat, 3 Nov 2012 00:42:43 +0000 (11:42 +1100)]
backlight: lms283gf05: use devm_gpio_request_one
By using devm_gpio_request_one it is possible to set the direction
and initial value in one shot. Thus, using devm_gpio_request_one
can make the code simpler.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Acked-by: Marek Vasut <marex@denx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Sat, 3 Nov 2012 00:42:41 +0000 (11:42 +1100)]
backlight: locomolcd: fix checkpatch error and warning
This patch fixes the checkpatch error and warning as below:
WARNING: line over 80 characters
WARNING: space prohibited between function name and open parenthesis '('
ERROR: trailing statements should be on next line
Also, long comments are fixed for the preferred style and
unnecessary lines are removed.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Sat, 3 Nov 2012 00:42:39 +0000 (11:42 +1100)]
backlight: ili9320: fix checkpatch error and warning
This patch fixes the checkpatch error and warning as below:
WARNING: please, no space before tabs
WARNING: please, no spaces at the start of a line
WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
WARNING: braces {} are not necessary for single statement blocks
ERROR: code indent should use tabs where possible
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Sat, 3 Nov 2012 00:42:39 +0000 (11:42 +1100)]
backlight: hp680_bl: fix checkpatch error and warning
This patch fixes the checkpatch error and warning as below:
WARNING: please, no space before tabs
WARNING: please, no spaces at the start of a line
ERROR: do not initialise statics to 0 or NULL
ERROR: code indent should use tabs where possible
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Sat, 3 Nov 2012 00:42:37 +0000 (11:42 +1100)]
backlight: da903x_bl: use dev_get_drvdata() instead of platform_get_drvdata()
dev_get_drvdata() can be used instead of platform_get_drvdata()
to make the code smaller.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Mike Rapoport <mike@compulab.co.il> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Leach [Sat, 3 Nov 2012 00:42:36 +0000 (11:42 +1100)]
include/linux/init.h: use the stringify operator for the __define_initcall macro
Currently the __define_initcall() macro takes three arguments: fn, id and
level. The level argument is exactly the same as the id argument, just
wrapped in quotes. Rather than requiring callers to pass three arguments,
one of which is merely the stringification of another, we can use the
stringification macro inside __define_initcall() instead.
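A rough sketch of the change (simplified; attribute details may differ
from the actual macro):

  /* The old macro took (level, fn, id), with level being the quoted id.
   * Using the '#' stringify operator, the quoted form can be produced
   * inside the macro instead: */
  #define __define_initcall(fn, id) \
      static initcall_t __initcall_##fn##id __used \
      __attribute__((__section__(".initcall" #id ".init"))) = fn

  #define core_initcall(fn)	__define_initcall(fn, 1)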
Signed-off-by: Matthew Leach <matthew@mattleach.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wen Congyang [Sat, 3 Nov 2012 00:42:36 +0000 (11:42 +1100)]
Documentation/kernel-parameters.txt: update mem= option's spec according to its implementation
The current mem= implementation seems buggy at first glance because the
specification and the implementation don't match. But mem= has been
working this way for many years and it's not buggy - it works as expected.
So we should update the specification instead.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
PBM files generated with current tools do not have whitespace between the
digits. Therefore the pnmtologo tool fails to generate the required
C array for these images. This patch fixes that behaviour so that both
'old style' and 'new style' PBM files can be handled.
Signed-off-by: Andreas Bießmann <andreas@biessmann.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wanpeng Li [Sat, 3 Nov 2012 00:42:35 +0000 (11:42 +1100)]
mm/memblock: reduce overhead in binary search
When checking whether an address belongs to a memory region, the memory
regions are searched with a binary search, which can be time consuming.
If the address isn't within the overall memory range at all, then we
needn't do the time-consuming search. Add an up-front check on the
address for that purpose.
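A hedged sketch of the shortcut (the function name is made up; the region
layout follows the usual memblock structures):

  /* Regions are sorted by base, so a quick bounds check can rule the
   * address out before the binary search even starts. */
  static bool memblock_addr_in_bounds(struct memblock_type *type,
                                      phys_addr_t addr)
  {
      struct memblock_region *last = &type->regions[type->cnt - 1];

      if (addr < type->regions[0].base || addr >= last->base + last->size)
          return false;	/* cannot match any region: skip the search */
      return true;
  }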
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Sat, 3 Nov 2012 00:42:35 +0000 (11:42 +1100)]
swap: add a simple detector for inappropriate swapin readahead
Swapin readahead does a blind readahead whether or not the swapin is
sequential. This is OK for hard disks because large reads have relatively
small costs and, if the readahead pages are unneeded, they can be
reclaimed easily. But for SSD devices large reads are more expensive than
small ones. If readahead pages are unneeded, reading them in causes
significant overhead.
This patch adds a simple random read detection similar to file mmap
readahead. If a random read is detected, swapin readahead is skipped.
This improves a lot for a swap workload with random IO on a fast SSD.
I ran an anonymous mmap write micro benchmark, which triggers
swapin/swapout. For both hard disk and SSD, the randwrite swap workload
run time is reduced significantly. The sequential write swap workload
hasn't changed.
Interestingly, the randwrite hard disk test is improved too. This might be
because swapin readahead needs to allocate extra memory, which further
tightens memory pressure, leading to more swapout/swapin.
Signed-off-by: Shaohua Li <shli@fusionio.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Sat, 3 Nov 2012 00:42:34 +0000 (11:42 +1100)]
drop_caches: add some documentation and info message
I would like to resurrect Dave's patch. The last time it was posted was
here https://lkml.org/lkml/2010/9/16/250 and there didn't seem to be any
strong opposition.
Kosaki was worried about possible excessive logging when somebody drops
caches too often (but then he said he didn't have a strong opinion on
that), but I would say the opposite. If somebody does that, I would really
like to know it from the log when supporting a system, because it almost
certainly means that something fishy is going on. It is also worth
mentioning that only root can write to drop_caches, so this is not a
flooding attack vector.
I am bringing this up again because it can be really helpful when chasing
strange performance issues which (surprise surprise) turn out to be caused
by caches artificially dropped because the admin thinks this would help...
I have just refreshed the original patch on top of the current mm tree
but I could live with KERN_INFO as well if people think that KERN_NOTICE
is too hysterical.
: From: Dave Hansen <dave@linux.vnet.ibm.com>
: Date: Fri, 12 Oct 2012 14:30:54 +0200
:
: There is plenty of anecdotal evidence and a load of blog posts
: suggesting that using "drop_caches" periodically keeps your system
: running in "tip top shape". Perhaps adding some kernel
: documentation will increase the amount of accurate data on its use.
:
: If we are not shrinking caches effectively, then we have real bugs.
: Using drop_caches will simply mask the bugs and make them harder
: to find, but certainly does not fix them, nor is it an appropriate
: "workaround" to limit the size of the caches.
:
: It's a great debugging tool, and is really handy for doing things
: like repeatable benchmark runs. So, add a bit more documentation
: about it, and add a little KERN_NOTICE. It should help developers
: who are chasing down reclaim-related bugs.
[mhocko@suse.cz: refreshed to current -mm tree] Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:33 +0000 (11:42 +1100)]
slub: slub-specific propagation changes
SLUB allows us to tune a particular cache behavior with sysfs-based
tunables. When creating a new memcg cache copy, we'd like to preserve any
tunables the parent cache already had.
This can be done by tapping into the store attribute function provided by
the allocator. We of course don't need to mess with read-only fields.
Since the attributes can have multiple types and are stored internally by
sysfs, the best strategy is to issue a ->show() in the root cache, and
then ->store() in the memcg cache.
The drawback is that sysfs can allocate up to a page of buffering for
show(), which we likely won't need, but also can't guarantee. To avoid
always allocating a page for that, we can update the caches at store time
with the maximum attribute size ever stored to the root cache. We will
then get a buffer big enough to hold it. The corollary is that if no
stores have happened, nothing will be propagated.
It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:33 +0000 (11:42 +1100)]
slab: propagate tunable values
SLAB allows us to tune a particular cache behavior with tunables. When
creating a new memcg cache copy, we'd like to preserve any tunables the
parent cache already had.
This could be done by an explicit call to do_tune_cpucache() after the
cache is created. But this is not very convenient now that the caches are
created from common code, since this function is SLAB-specific.
Another method of doing that is taking advantage of the fact that
do_tune_cpucache() is always called from enable_cpucache(), which is
called at cache initialization. We can just preset the values, and then
things work as expected.
It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.
This change will require us to move the assignment of root_cache in
memcg_params a bit earlier. We need this to already be set - which
memcg_kmem_register_cache will do - by the time we reach
__kmem_cache_create().
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:33 +0000 (11:42 +1100)]
memcg: aggregate memcg cache values in slabinfo
When we create caches in memcgs, we need to display their usage
information somewhere. We'll adopt a scheme similar to /proc/meminfo,
with aggregate totals shown in the global file, and per-group information
stored in the group itself.
For the time being, only reads are allowed in the per-group cache.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:32 +0000 (11:42 +1100)]
memcg/sl[au]b: shrink dead caches
This means that when we destroy a memcg cache that happened to be empty,
those caches may take a lot of time to go away: removing the memcg
reference won't destroy them - because there are pending references, and
the empty pages will stay there, until a shrinker is called upon for any
reason.
In this patch, we will call kmem_cache_shrink() for all dead caches that
cannot be destroyed because of remaining pages. After shrinking, it is
possible that it could be freed. If this is not the case, we'll schedule
a lazy worker to keep trying.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:32 +0000 (11:42 +1100)]
memcg/sl[au]b: track all the memcg children of a kmem_cache
This enables us to remove all the children of a kmem_cache being
destroyed, if for example the kernel module it's being used in gets
unloaded. Otherwise, the children will still point to the destroyed
parent.
Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:32 +0000 (11:42 +1100)]
memcg: destroy memcg caches
Implement destruction of memcg caches. Right now, only caches where our
reference counter is the last remaining are deleted. If there are any
other reference counters around, we just leave the caches lying around
until they go away.
When that happens, a destruction function is called from the cache code.
Caches are only destroyed in process context, so we queue them up for
later processing in the general case.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:31 +0000 (11:42 +1100)]
sl[au]b: allocate objects from memcg cache
We are able to match a cache allocation to a particular memcg. If the
task doesn't change groups during the allocation itself - a rare event -
this will give us a good picture of which group is the first to touch a
cache page.
This patch uses the now available infrastructure by calling
memcg_kmem_get_cache() before all the cache allocations.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:31 +0000 (11:42 +1100)]
sl[au]b: always get the cache from its page in kmem_cache_free()
struct page already has this information. If we start chaining caches,
this information will always be more trustworthy than whatever is passed
into the function.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:31 +0000 (11:42 +1100)]
memcg: skip memcg kmem allocations in specified code regions
Create a mechanism that skips memcg allocations during certain pieces of
our core code. It basically works in the same way as
preempt_disable()/preempt_enable(): by marking a region under which all
allocations will be accounted to the root memcg.
We need this to prevent races in early cache creation, when we allocate
data using caches that are not necessarily created already.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:30 +0000 (11:42 +1100)]
memcg: infrastructure to match an allocation to the right cache
The page allocator is able to bind a page to a memcg when it is allocated.
But for the caches, we'd like to have as many objects as possible in a
page belonging to the same cache.
This is done in this patch by calling memcg_kmem_get_cache() at the
beginning of every allocation function. This routine is patched out by
static branches when the kernel memory controller is not being used.
It assumes that the allocating task, which determines the memcg in the
page allocator, belongs to the same cgroup throughout the whole process.
Misaccounting can happen if the task calls memcg_kmem_get_cache() while
belonging to one cgroup and later changes to another. This is considered
acceptable, and should only happen upon task migration.
Before the cache is created by the memcg core, there is also a possible
imbalance: the task belongs to a memcg, but the cache being allocated from
is the global cache, since the child cache is not yet guaranteed to be
ready. This case is also fine, since in this case the GFP_KMEMCG will not
be passed and the page allocator will not attempt any cgroup accounting.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:30 +0000 (11:42 +1100)]
memcg: allocate memory for memcg caches whenever a new memcg appears
Every cache that is considered a root cache (basically the "original"
caches, tied to the root memcg/no-memcg) will have an array that should be
large enough to store a cache pointer per each memcg in the system.
Theoretically, this is as high as 1 << sizeof(css_id), which is currently
in the 64k pointers range. Most of the time, we won't be using that much.
What goes in this patch, is a simple scheme to dynamically allocate such
an array, in order to minimize memory usage for memcg caches. Because we
would also like to avoid allocations all the time, at least for now, the
array will only grow. It will tend to be big enough to hold the maximum
number of kmem-limited memcgs ever achieved.
We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have
more than that, we'll start doubling the size of this array every time the
limit is reached.
Because we are only considering kmem limited memcgs, a natural point for
this to happen is when we write to the limit. At that point, we already
have set_limit_mutex held, so that will become our natural synchronization
mechanism.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:30 +0000 (11:42 +1100)]
slab/slub: consider a memcg parameter in kmem_create_cache
Allow a memcg parameter to be passed during cache creation. When the slub
allocator is being used, it will only merge caches that belong to the same
memcg. We'll do this by scanning the global list, and then translating
the cache to a memcg-specific cache.
A default function is created as a wrapper, passing NULL to the memcg
version. We only merge caches that belong to the same memcg.
A helper, memcg_css_id, is provided because slub needs a unique cache name
for sysfs. Since this is visible, but not the canonical location for slab
data, the full cache name is not used; the css_id should suffice.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:29 +0000 (11:42 +1100)]
slab: annotate on-slab caches nodelist locks
We currently provide lockdep annotation for kmalloc caches, and also
caches that have SLAB_DEBUG_OBJECTS enabled. The reason for this is that
we can quite frequently nest in the l3->list_lock lock, which is not
something trivial to avoid.
My proposal with this patch is to extend this to caches whose slab
management object lives within the slab as well ("on_slab"). The need for
this arose in the context of testing the kmemcg-slab patches. With that
patchset, we can have per-memcg kmalloc caches, so the same path that led
to nesting between kmalloc caches could then lead to in-memcg nesting.
Because those caches are not annotated, lockdep will trigger.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:28 +0000 (11:42 +1100)]
fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs
Because those architectures will draw their stacks directly from the page
allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
flag, and issue the corresponding free_pages.
This code path is taken when the architecture doesn't define
CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining
architectures fall in this category.
This will guarantee that every stack page is accounted to the memcg the
process currently lives on, and will have the allocations to fail if they
go over limit.
For the time being, I am defining a new variant of THREADINFO_GFP, not to
mess with the other path. Once the slab is also tracked by memcg, we can
get rid of that flag.
Tested to successfully protect against :(){ :|:& };:
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Frederic Weisbecker <fweisbec@redhat.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:28 +0000 (11:42 +1100)]
memcg: execute the whole memcg freeing in free_worker()
A lot of the initialization we do in mem_cgroup_create() is done with
softirqs enabled. This includes grabbing a css id, which holds
&ss->id_lock->rlock, and initializing the per-zone trees, which hold
rtpz->lock->rlock. All of those signal to the lockdep mechanism that
those locks can be used in SOFTIRQ-ON-W context. This means that the
freeing of the memcg structure must happen in a compatible context,
otherwise we'll get a deadlock, like the one below, caught by lockdep:
This usage pattern could not be triggered before kmem came into play.
With the introduction of kmem stack handling, it is possible that we call
the last mem_cgroup_put() from the task destructor, which is run in an rcu
callback. Such callbacks are run with softirqs disabled, leading to the
offensive usage pattern.
In general, we have little, if any, means to guarantee in which context
the last memcg_put will happen. The best we can do is test it and try to
make sure no invalid context releases are happening. But as we add more
code to memcg, the possible interactions grow in number and expose more
ways to get context conflicts. One thing to keep in mind, is that part of
the freeing process is already deferred to a worker, such as vfree(), that
can only be called from process context.
For the moment, the only two functions we really need moved away are:
* free_css_id(), and
* mem_cgroup_remove_from_trees().
But because the latter accesses per-zone info,
free_mem_cgroup_per_zone_info() needs to be moved as well. With that, we
are left with the per_cpu stats only. Better move it all.
Signed-off-by: Glauber Costa <glommer@parallels.com> Tested-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:28 +0000 (11:42 +1100)]
memcg: allow a memcg with kmem charges to be destructed
Because the ultimate goal of the kmem tracking in memcg is to track slab
pages as well, we can't guarantee that we'll always be able to point a
page to a particular process, and migrate the charges along with it -
since in the common case, a page will contain data belonging to multiple
processes.
Because of that, when we destroy a memcg, we only make sure the
destruction will succeed by discounting the kmem charges from the user
charges when we try to empty the cgroup.
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:27 +0000 (11:42 +1100)]
memcg: use static branches when code not in use
We can use static branches to patch the code in or out when not used.
Because the _ACTIVE bit on kmem_accounted is only set after the increment
is done, we guarantee that the root memcg will always be selected for kmem
charges until all call sites are patched (see memcg_kmem_enabled). This
guarantees that no mischarges are applied.
Static branch decrement happens when the last reference count from the
kmem accounting in memcg dies. This will only happen when the charges
drop down to 0.
When that happens, we need to disable the static branch only on those
memcgs that enabled it. To achieve this, we would be forced to complicate
the code by keeping track of which memcgs were the ones that actually
enabled limits, and which ones got it from its parents.
It is a lot simpler just to do static_key_slow_inc() on every child
that is accounted.
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:27 +0000 (11:42 +1100)]
memcg: kmem accounting lifecycle management
Because kmem charges can outlive the cgroup, we need to make sure that we
won't free the memcg structure while charges are still in flight. For
reviewing simplicity, the charge functions will issue mem_cgroup_get() at
every charge, and mem_cgroup_put() at every uncharge.
This can get expensive, however, and we can do better. mem_cgroup_get()
only really needs to be issued once: when the first limit is set. In the
same spirit, we only need to issue mem_cgroup_put() when the last charge
is gone.
We'll need an extra bit in kmem_account_flags for that:
KMEM_ACCOUNTED_DEAD. It will be set when the cgroup dies if there are
still charges in the group. If there aren't, we can proceed right away.
Our uncharge function will have to test that bit every time the charges
drop to 0. Because that is not the likely output of res_counter_uncharge,
this should not impose a big hit on us: it is certainly much better than a
reference count decrease at every operation.
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:27 +0000 (11:42 +1100)]
res_counter: return amount of charges after res_counter_uncharge()
It is useful to know how many charges are still left after a call to
res_counter_uncharge. While it is possible to issue a res_counter_read
after uncharge, this can be racy.
If we need, for instance, to take some action when the counters drop down
to 0, only one of the callers should see it. This is the same semantics
as the atomic variables in the kernel.
Since the current return value is void, we don't need to worry about
anything breaking due to this change: nobody relied on that, and only
users appearing from now on will be checking this value.
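A hedged sketch of the intended caller pattern (the kmem counter and the
follow-up action are illustrative):

  u64 remaining;

  /* exactly one uncharger will observe the counter dropping to zero */
  remaining = res_counter_uncharge(&memcg->kmem, PAGE_SIZE << order);
  if (remaining == 0)
      mem_cgroup_put(memcg);	/* e.g. drop the last kmem reference */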
Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:26 +0000 (11:42 +1100)]
mm: allocate kernel pages to the right memcg
When a process tries to allocate a page with the __GFP_KMEMCG flag, the
page allocator will call the corresponding memcg functions to validate the
allocation. Tasks in the root memcg can always proceed.
To avoid adding markers to the page - and a kmem flag that would
necessarily follow - as well as doing page_cgroup lookups for no reason,
whoever marks its allocations with the __GFP_KMEMCG flag is responsible
for telling the page allocator that this is such an allocation at
free_pages() time. This is done by invoking __free_accounted_pages() or
free_accounted_pages().
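A minimal sketch of the resulting convention for callers (error handling
trimmed; the function names are the ones introduced above):

  struct page *page;

  /* allocation is accounted to current's memcg when kmem limits are active */
  page = alloc_pages(GFP_KERNEL | __GFP_KMEMCG, 0);
  if (!page)
      return -ENOMEM;

  /* ...and the free must go through the accounted variant */
  __free_accounted_pages(page, 0);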
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:26 +0000 (11:42 +1100)]
memcg: kmem controller infrastructure
Introduce infrastructure for tracking kernel memory pages to a given
memcg. This will happen whenever the caller includes the flag
__GFP_KMEMCG flag, and the task belong to a memcg other than the root.
In memcontrol.h those functions are wrapped in inline accessors. The idea
is to later on patch those with static branches, so we don't incur any
overhead when no mem cgroups with limited kmem are being used.
Users of this functionality shall interact with the memcg core code
through the following functions:
memcg_kmem_newpage_charge: will return true if the group can handle the
    allocation. At this point, struct page is not yet allocated.
memcg_kmem_commit_charge: will either revert the charge, if struct page
    allocation failed, or embed memcg information into page_cgroup.
memcg_kmem_uncharge_page: called at free time, will revert the charge.
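A rough sketch of how the page allocator threads these hooks together
(argument lists and control flow are simplified assumptions based on the
description above):

  struct mem_cgroup *memcg = NULL;
  struct page *page;

  /* 1. ask the memcg core whether current's group can take the charge */
  if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
      return NULL;

  page = __alloc_pages(gfp_mask, order, zonelist);

  /* 2. bind the charge to the page, or revert it if allocation failed */
  memcg_kmem_commit_charge(page, memcg, order);

  /* ...later, at free time */
  memcg_kmem_uncharge_page(page, order);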
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 3 Nov 2012 00:42:26 +0000 (11:42 +1100)]
mm: add a __GFP_KMEMCG flag
This flag is used to indicate to the callees that this allocation is a
kernel allocation in process context, and should be accounted to current's
memcg. It takes the numerical place of the recently removed
__GFP_NO_KSWAPD.
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
They have the same meaning as their user memory counterparts. They
reflect the state of the "kmem" res_counter.
Per-cgroup kmem memory accounting is not enabled until a limit is set for
the group. Once the limit is set, the accounting cannot be disabled for
that group. This means that after the patch is applied, no behavioral
change exists for whoever is still using memcg to control their memory
usage, until memory.kmem.limit_in_bytes is set for the first time.
We always account to both the user and kernel resource counters. This
effectively means that an independent kernel limit is in place when the
limit is set to a lower value than the user memory limit. An equal or
higher value means that the user limit will always be hit first, meaning
that kmem is effectively unlimited.
People who want to track kernel memory but not limit it can set this limit
to a very high number (like RESOURCE_MAX - 1 page - which no one will ever
hit, or equal to the user memory limit).
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mem_cgroup_do_charge() was written before kmem accounting, and expects
three cases: being called for 1 page, being called for a stock of 32
pages, or being called for a hugepage. If we call for 2 or 3 pages (and
both the stack and several slabs used in process creation are such, at
least with the debug options I had), it assumed it's being called for
stock and just retried without reclaiming.
Fix that by passing down a minsize argument in addition to the csize.
And what to do about that (csize == PAGE_SIZE && ret) retry? If it's
needed at all (and presumably is since it's there, perhaps to handle
races), then it should be extended to more than PAGE_SIZE, yet how far?
And should there be a retry count limit, of what? For now retry up to
COSTLY_ORDER (as page_alloc.c does) and make sure not to do it if
__GFP_NORETRY.
v4: fixed nr pages calculation pointed out by Christoph Lameter.
Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
memcg: make it possible to use the stock for more than one page
We currently have a percpu stock cache scheme that charges one page at a
time from memcg->res, the user counter. When the kernel memory controller
comes into play, we'll need to charge more than that.
This is because kernel memory allocations will also draw from the user
counter, and can be bigger than a single page, as it is the case with the
stack (usually 2 pages) or some higher order slabs.
[glommer@parallels.com: added a changelog ] Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Sat, 3 Nov 2012 00:42:24 +0000 (11:42 +1100)]
mm, oom: allow exiting threads to have access to memory reserves
Exiting threads, those with PF_EXITING set, can pagefault and require
memory before they can make forward progress. This happens, for instance,
when a process must fault task->robust_list, a userspace structure, before
detaching its memory.
These threads also aren't guaranteed to get access to memory reserves
unless oom killed or killed from userspace. The oom killer won't grant
memory reserves if other threads are also exiting other than current and
stalling at the same point. This prevents needlessly killing processes
when others are already exiting.
Instead of special casing all the possible situations between PF_EXITING
getting set and a thread detaching its mm where it may allocate memory,
which probably wouldn't get updated when a change is made to the exit
path, the solution is to give all exiting threads access to memory
reserves if they call the oom killer. This allows them to quickly
allocate, detach their mm, and free the memory it represents.
Summary of Luigi's bug report:
: He had an oom condition where threads were faulting on task->robust_list
: and repeatedly called the oom killer but it would defer killing a thread
: because it saw other PF_EXITING threads. This can happen anytime we need
: to allocate memory after setting PF_EXITING and before detaching our mm;
: if there are other threads in the same state then the oom killer won't do
: anything unless one of them happens to be killed from userspace.
:
: So instead of only deferring for PF_EXITING and !task->robust_list, it's
: better to just give them access to memory reserves to prevent a potential
: livelock so that any other faults that may be introduced in the future in
: the exit path don't cause the same problem (and hopefully we don't allow
: too many of those!).
Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mem_cgroup_charge_common() is invoked as the entry point for the cgroup
limits charge rather than mem_cgroup_charge(), as the latter has been
removed for years. Update cgroup/memory.txt to reflect this change.
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Yoknis [Sat, 3 Nov 2012 00:42:23 +0000 (11:42 +1100)]
mm: memmap_init_zone() performance improvement
We have what we call an "architectural simulator". It is a computer
program that pretends that it is a computer system. We use it to test the
firmware before real hardware is available. We have booted Linux on our
simulator. As you would expect it takes longer to boot on the simulator
than it does on real hardware.
With my patch - boot time 41 minutes
Without patch - boot time 94 minutes
These numbers do not scale linearly to real hardware, but they indicate
to me a place where Linux can be improved.
memmap_init_zone() loops through every Page Frame Number (pfn), including
pfn values that are within the gaps between existing memory sections. The
unneeded looping will become a boot performance issue when machines
configure larger memory ranges that will contain larger and more numerous
gaps.
The code will skip across invalid pfn values to reduce the number of loops
executed.
Signed-off-by: Mike Yoknis <mike.yoknis@hp.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Will Deacon [Sat, 3 Nov 2012 00:42:23 +0000 (11:42 +1100)]
mm: thp: set the accessed flag for old pages on access fault
On x86 memory accesses to pages without the ACCESSED flag set result in
the ACCESSED flag being set automatically. With the ARM architecture a
page access fault is raised instead (and it will continue to be raised
until the ACCESSED flag is set for the appropriate PTE/PMD).
For normal memory pages, handle_pte_fault will call pte_mkyoung
(effectively setting the ACCESSED flag). For transparent huge pages,
pmd_mkyoung will only be called for a write fault.
This patch ensures that faults on transparent hugepages which do not
result in a CoW update the access flags for the faulting pmd.
Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Ni zhan Chen <nizhan.chen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Sat, 3 Nov 2012 00:42:23 +0000 (11:42 +1100)]
mm, highmem: get virtual address of the page using PKMAP_ADDR()
In flush_all_zero_pkmaps(), we have an index of the pkmap associated with
the page. Using this index, we can simply get the virtual address of the
page. So change it.
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the current code, after flush_all_zero_pkmaps() is invoked, we
re-iterate over all pkmaps. This can be optimized if
flush_all_zero_pkmaps() returns the index of the first flushed entry.
With this index, we can immediately map a highmem page to the virtual
address represented by that index. So change the return type of
flush_all_zero_pkmaps() and return the index of the first flushed entry.
Additionally, update last_pkmap_nr to this index. It is certain that any
entry below this index is occupied by another mapping, therefore updating
last_pkmap_nr to this index is a reasonable optimization.
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Sat, 3 Nov 2012 00:42:22 +0000 (11:42 +1100)]
mm, highmem: makes flush_all_zero_pkmaps() return index of last flushed entry
In the current code, after flush_all_zero_pkmaps() is invoked we
re-iterate over all pkmaps. This can be optimized if
flush_all_zero_pkmaps() returns the index of the last flushed entry. With
this index, we can immediately map a highmem page to the virtual address
represented by that index. So change the return type of
flush_all_zero_pkmaps() and return the index of the last flushed entry.
Additionally, update last_pkmap_nr to this index. It is certain that any
entry below this index is occupied by another mapping, therefore updating
last_pkmap_nr to this index is a reasonable optimization.
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Sat, 3 Nov 2012 00:42:22 +0000 (11:42 +1100)]
mm-highmem-remove-page_address_pool-list-v2
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Sat, 3 Nov 2012 00:42:22 +0000 (11:42 +1100)]
mm, highmem: remove page_address_pool list
We can find free page_address_map instance without the page_address_pool.
So remove it.
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Sat, 3 Nov 2012 00:42:21 +0000 (11:42 +1100)]
mm, highmem: remove useless pool_lock
The pool_lock protects the page_address_pool from concurrent access. But,
access to the page_address_pool is already protected by kmap_lock. So
remove it.
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Sat, 3 Nov 2012 00:42:21 +0000 (11:42 +1100)]
mm, highmem: use PKMAP_NR() to calculate an index of pkmap
To calculate the index of a pkmap entry, using PKMAP_NR() is more
understandable and maintainable, so change it.
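For reference, the two helpers involved look roughly like this
(PKMAP_BASE is the architecture-defined start of the pkmap area):

  /* translate between a pkmap slot index and its virtual address */
  #define PKMAP_NR(virt)	(((virt) - PKMAP_BASE) >> PAGE_SHIFT)
  #define PKMAP_ADDR(nr)	(PKMAP_BASE + ((nr) << PAGE_SHIFT))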
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>