lib/bitmap.c: quiet sparse noise about address space
__bitmap_parse() and __bitmap_parselist() both take a pointer to a kernel
buffer as a parameter and then cast it to a pointer to user buffer for use
in cases when the parameter is_user indicates that the buffer is actually
located in user space. This casting, and the casts in the callers,
results in sparse noise like the following:
warning: incorrect type in initializer (different address spaces)
expected char const [noderef] <asn:1>*ubuf
got char const *buf
warning: cast removes address space of expression
Since these casts are intentional, use __force to quiet the noise.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Len Brown <len.brown@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Mon, 24 Oct 2011 14:58:56 +0000 (01:58 +1100)]
lib/spinlock_debug.c: print owner on spinlock lockup
When SPIN_BUG_ON is triggered, the lock owner information is reported.
But it is omitted when spinlock lockup is detected.
This information is useful especially on the architectures which don't
implement trigger_all_cpu_backtrace() that is called just after detecting
lockup. So report it and also avoid message format duplication.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alexey Dobriyan [Mon, 24 Oct 2011 14:58:55 +0000 (01:58 +1100)]
lib/kstrtox: common code between kstrto*() and simple_strto*() functions
Currently termination logic (\0 or \n\0) is hardcoded in _kstrtoull(),
avoid that for code reuse between kstrto*() and simple_strtoull().
Essentially, make them different only in termination logic.
simple_strtoull() (and scanf(), BTW) ignores integer overflow, that's a
bug we currently don't have guts to fix, making KSTRTOX_OVERFLOW hack
necessary.
Almost forgot: patch shrinks code size by about ~80 bytes on x86_64.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Srinidhi KASAGAR [Mon, 24 Oct 2011 14:58:55 +0000 (01:58 +1100)]
drivers/leds/leds-lp5521.c: check if reset is successful
Make sure that the reset is successful by issuing a dummy read to R
channel current register and check its default value. On some platforms,
without this dummy read, any further access to {R/G/B}_EXEC will not have
any impact.
Signed-off-by: srinidhi kasagar <srinidhi.kasagar@stericsson.com> Tested-by: Naga Radhesh <naga.radheshy@stericsson.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Cc: Richard Purdie <richard.purdie@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Antonio Ospite [Mon, 24 Oct 2011 14:58:54 +0000 (01:58 +1100)]
leds: turn the blink_timer off before starting to blink
Depending on the implementation of the hardware blinking function in
blink_set(), the led can support hardware blinking for some values of
delay_on and delay_off and fall-back to software blinking for some other
values.
Turning off the blink_timer unconditionally before starting to blink
make sure that a sequence like:
OFF
hardware blinking
software blinking
hardware blinking
does not leave the software blinking timer active.
Signed-off-by: Antonio Ospite <ospite@studenti.unina.it> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Antonio Ospite <ospite@studenti.unina.it> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
My GPIOs are on an I2C port expander, so we must use the *_cansleep()
variant of the GPIO functions. This is was not being done in
create_gpio_led().
We can change gpio_get_value() to gpio_get_value_cansleep() because it is
only called from the platform_driver probe function, which is a context
where we can sleep.
Only tested on my gpio_cansleep() system, but it seems safe for all
systems.
Signed-off-by: David Daney <david.daney@cavium.com> Cc: Richard Purdie <rpurdie@rpsys.net> Acked-by: Trent Piepho <tpiepho@gmail.com> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Mon, 24 Oct 2011 14:58:53 +0000 (01:58 +1100)]
drivers/leds/leds-lm3530.c: add __devexit_p where needed
According to the comments in include/linux/init.h:
"Pointers to __devexit functions must use __devexit_p(function_name), the
wrapper will insert either the function_name or NULL, depending on the config
options."
We have __devexit annotation for lm3530_remove(), so add __devexit_p to
the `struct i2c_driver'.
Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: Shreshtha Kumar SAHU <shreshthakumar.sahu@stericsson.com> Cc: Richard Purdie <rpurdie@rpsys.net> Acked-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Lin <axel.lin@gmail.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Samu Onkalo <samu.p.onkalo@nokia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Mon, 24 Oct 2011 14:58:53 +0000 (01:58 +1100)]
drivers/leds/leds-lp5521.c: avoid writing uninitialized value to LP5521_REG_OP_MODE register
If lp5521_read fails, engine_state variable is not initialized.
If lp5521_read fails, we should return error.
This patch fixes below warning.
CC drivers/leds/leds-lp5521.o
drivers/leds/leds-lp5521.c: In function 'lp5521_set_engine_mode':
drivers/leds/leds-lp5521.c:168: warning: 'engine_state' may be used uninitialized in this function
Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: Samu Onkalo <samu.p.onkalo@nokia.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Magnus Damm [Mon, 24 Oct 2011 14:58:52 +0000 (01:58 +1100)]
drivers/leds/leds-renesas-tpu.c: move Renesas TPU LED driver platform data
Use the platform_data include directory for the TPU LED driver, as
suggested by Paul Mundt.
Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Magnus Damm [Mon, 24 Oct 2011 14:58:52 +0000 (01:58 +1100)]
drivers/leds/leds-renesas-tpu.c: update driver to use workqueue
Use a workqueue in the Renesas TPU LED driver to allow the Runtime PM code
to sleep.
Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wolfram Sang [Mon, 24 Oct 2011 14:58:51 +0000 (01:58 +1100)]
drivers/leds/leds-lm3530.c: remove obsolete cleanup for clientdata
A few new i2c-drivers came into the kernel which clear the
clientdata-pointer on exit or error. This is obsolete meanwhile, the core
will do it.
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Cc: Richard Purdie <rpurdie@rpsys.net> Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Mon, 24 Oct 2011 14:58:51 +0000 (01:58 +1100)]
leds-renesas-tpu-led-driver-v2-fix
include linux/module.h
Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Magnus Damm [Mon, 24 Oct 2011 14:58:50 +0000 (01:58 +1100)]
leds: Renesas TPU LED driver
Add V2 of the LED driver for a single timer channel for the TPU hardware
block commonly found in Renesas SoCs.
The driver has been written with optimal Power Management in mind, so to
save power the LED is driven as a regular GPIO pin in case of maximum
brightness and power off which allows the TPU hardware to be idle and
which in turn allows the clocks to be stopped and the power domain to be
turned off transparently.
Any other brightness level requires use of the TPU hardware in PWM mode.
TPU hardware device clocks and power are managed through Runtime PM.
System suspend and resume is known to be working - during suspend the LED
is set to off by the generic LED code.
The TPU hardware timer is equipeed with a 16-bit counter together with an
up-to-divide-by-64 prescaler which makes the hardware suitable for
brightness control. Hardware blink is unsupported.
The LED PWM waveform has been verified with a Fluke 123 Scope meter on a
sh7372 Mackerel board. Tested with experimental sh7372 A3SP power domain
patches. Platform device bind/unbind tested ok.
V2 has been tested on the DS2 LED of the sh73a0-based AG5EVM.
Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Brown [Mon, 24 Oct 2011 14:58:49 +0000 (01:58 +1100)]
backlight: fix broken regulator API usage in l4f00242t03
The regulator support in the l4f00242t03 is very non-idiomatic. Rather
than requesting the regulators based on the device name and the supply
names used by the device the driver requires boards to pass system
specific supply names around through platform data. The driver also
conditionally requests the regulators based on this platform data, adding
unneeded conditional code to the driver.
Fix this by removing the platform data and converting to the standard
idiom, also updating all in tree users of the driver. As no datasheet
appears to be available for the LCD I'm guessing the names for the
supplies based on the existing users and I've no ability to do anything
more than compile test.
The use of regulator_set_voltage() in the driver is also problematic,
since fixed voltages are required the expectation would be that the
voltages would be fixed in the constraints set by the machines rather than
manually configured by the driver, but is less problematic.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Tested-by: Fabio Estevam <fabio.estevam@freescale.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wolfram Sang [Mon, 24 Oct 2011 14:58:49 +0000 (01:58 +1100)]
video/backlight: remove obsolete cleanup for clientdata
A few new i2c-drivers came into the kernel which clear the
clientdata-pointer on exit or error. This is obsolete meanwhile, the core
will do it.
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiri Kosina [Mon, 24 Oct 2011 14:58:48 +0000 (01:58 +1100)]
MAINTAINERS: add ASLR maintainer
Since achieving the full ASLR by merging the PIE randomization in commit cc503c1b43 ("x86: PIE executable randomization"), I have been dealing with
most (if not all) of the bugreports reported against userspace address
space randomization, so it might be a good idea to provide a decent
contact point in MAINTAINERS.
Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Josh Boyer <jwboyer@redhat.com> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hans Verkuil [Mon, 24 Oct 2011 14:58:47 +0000 (01:58 +1100)]
poll: add poll_requested_events() function
In some cases the poll() implementation in a driver has to do different
things depending on the events the caller wants to poll for. An example
is when a driver needs to start a DMA engine if the caller polls for
POLLIN, but doesn't want to do that if POLLIN is not requested but instead
only POLLOUT or POLLPRI is requested. This is something that can happen
in the video4linux subsystem.
Unfortunately, the current epoll/poll/select implementation doesn't
provide that information reliably. The poll_table_struct does have it: it
has a key field with the event mask. But once a poll() call matches one
or more bits of that mask any following poll() calls are passed a NULL
poll_table_struct pointer.
The solution is to set the qproc field to NULL in poll_table_struct once
poll() matches the events, not the poll_table_struct pointer itself. That
way drivers can obtain the mask through a new poll_requested_events
inline.
The poll_table_struct can still be NULL since some kernel code calls it
internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h). In
that case poll_requested_events() returns ~0 (i.e. all events).
Since eventpoll always leaves the key field at ~0 instead of using the
requested events mask, that source was changed as well to properly fill in
the key field.
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Reviewed-by: Jonathan Corbet <corbet@lwn.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
William Douglas [Mon, 24 Oct 2011 14:58:46 +0000 (01:58 +1100)]
printk: remove bounds checking for log_prefix
Currently log_prefix is testing that the first character of the log level
and facility is less than '0' and greater than '9' (which is always
false).
Since the code being updated works because strtoul bombs out (endp isn't
updated) and 0 is returned anyway just remove the check and don't change
the behavior of the function.
Signed-off-by: William Douglas <william.douglas@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
William Douglas [Mon, 24 Oct 2011 14:58:45 +0000 (01:58 +1100)]
printk: fix bounds checking for log_prefix
Currently log_prefix is testing that the first character of the log level
and facility is less than '0' and greater than '9' (which is always
false). It should be testing to see if the character less than '0' or
greater than '9' instead. This patch makes that change.
The code being changed worked because strtoul bombs out (endp isn't
updated) and 0 is returned anyway.
Signed-off-by: William Douglas <william.douglas@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Ballard [Mon, 24 Oct 2011 14:58:42 +0000 (01:58 +1100)]
kernel/sysctl.c: add cap_last_cap to /proc/sys/kernel
Userspace needs to know the highest valid capability of the running
kernel, which right now cannot reliably be retrieved from the header files
only. The fact that this value cannot be determined properly right now
creates various problems for libraries compiled on newer header files
which are run on older kernels. They assume capabilities are available
which actually aren't. libcap-ng is one example. And we ran into the
same problem with systemd too.
Now the capability is exported in /proc/sys/kernel/cap_last_cap.
Signed-off-by: Dan Ballard <dan@mindstab.net> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Lennart Poettering <lennart@poettering.net> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Ulrich Drepper <drepper@akkadia.org> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vasily Averin [Mon, 24 Oct 2011 14:58:41 +0000 (01:58 +1100)]
watchdog: move watchdog_*_all_cpus under CONFIG_SYSCTL
Fix compilation warnings for CONFIG_SYSCTL=n:
fixed compilation warnings in case of disabled CONFIG_SYSCTL
kernel/watchdog.c:483:13: warning: `watchdog_enable_all_cpus' defined but not used
kernel/watchdog.c:500:13: warning: `watchdog_disable_all_cpus' defined but not used
these functions are static and are used only in sysctl handler, so move
them inside #ifdef CONFIG_SYSCTL too
Signed-off-by: Vasily Averin <vvs@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
stop_machine: make stop_machine safe and efficient to call early
Make stop_machine() safe to call early in boot, before SMP has been set
up, by simply calling the callback function directly if there's only one
CPU online.
[ Fixes from AKPM:
- add comment
- local_irq_flags, not save_flags
- also call hard_irq_disable() for systems which need it
Tejun suggested using an explicit flag rather than just looking at
the online cpu count. ]
Cc: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Subject: stop_machine-make-stop_machine-safe-and-efficient-to-call-early-v3.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by: Tejun Heo <tj@kernel.org>
From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Subject: stop_machine-make-stop_machine-safe-and-efficient-to-call-early-v3.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Korsgaard [Mon, 24 Oct 2011 14:58:40 +0000 (01:58 +1100)]
drivers/misc/ad525x_dpot-i2c.c: add i2c support for AD5161
Commit 6c536e4ce8e ("ad525x_dpot: add support for SPI parts") added
support for the AD5161 through SPI, but the device supports both I2C and
SPI (depending on the DIS pin), so add it to -i2c as well.
Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk> Acked-by: Mike Frysinger <vapier@gentoo.org> Acked-by: Michael Hennerich <michael.hennerich@analog.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Brown [Mon, 24 Oct 2011 14:58:40 +0000 (01:58 +1100)]
lis3lv02d: make regulator API usage unconditional
The regulator API contains a range of features for stubbing itself out
when not in use and for transparently restricting the actual effect of
regulator API calls where they can't be supported on a particular system
so that drivers don't need to individually implement this. Simplify the
driver slightly by making use of this idiom.
The only in tree user is ecovec24 which does not use the regulator API.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Éric Piel <eric.piel@tremplin-utc.net> Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com> Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>From my POV, it looks like the hardware is not working as expected
and returns a bogus data rate. The driver doesn't check the result
and directly uses it as some sort of divisor in some places:
msleep(lis3->pwron_delay / lis3lv02d_get_odr());
Under this circumstances, this could very well cause the
"divide by zero" exception from above.
For now, I fixed it the easiest and most obvious way:
Check if the result is sane and if it isn't use a sane default
instead. I went for "100" in the latter case, simply because
/sys/devices/platform/lis3lv02d/rate returns it on a successful
boot.
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com> Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net> Cc: Matthew Garrett <mjg@redhat.com> Cc: Witold Pilat <witold.pilat@gmail.com> Cc: Lyall Pearce <lyall.pearce@hp.com> Cc: Malte Starostik <m-starostik@versanet.de> Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com> Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com> Cc: Christian Lamparter <chunkeey@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jonathan Cameron [Mon, 24 Oct 2011 14:58:35 +0000 (01:58 +1100)]
drivers/hwmon/hwmon.c: convert idr to ida and use ida_simple_get()
A straightforward looking use of idr for a device id.
Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Cc: Guenter Roeck <guenter.roeck@ericsson.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Acked-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pavel Emelyanov [Mon, 24 Oct 2011 14:58:34 +0000 (01:58 +1100)]
fs/pipe.c: add ->statfs callback for pipefs
Currently a statfs on a pipe's /proc/<pid>/fd/ link returns -ENOSYS. Wire
pipfs up so that the statfs succeeds.
This is required by checkpoint-restart in the userspace to make it
possible to distinguish pipes from fifos.
When we dump information about task's open files we use the /proc/pid/fd
directoy's symlinks and the fact that opening any of them gives us exactly
the same dentry->inode pair as the original process has. Now if a task
we're dumping has opened pipe and fifo we need to detect this and act
accordingly. Knowing that an fd with type S_ISFIFO resides on a pipefs is
the most precise way.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Mon, 24 Oct 2011 14:58:33 +0000 (01:58 +1100)]
intel_idle: fix API misuse
smp_call_function() only lets all other CPUs execute a specific function,
while we expect all CPUs do in intel_idle. Without the fix, we could have
one cpu which has auto_demotion enabled or has no boradcast timer setup.
Usually we don't see impact because auto demotion just harms power and the
intel_idle init is called in CPU 0, where boradcast timer delivers
interrupt, but this still could be a problem.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Len Brown <lenb@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michael Cree [Mon, 24 Oct 2011 14:58:33 +0000 (01:58 +1100)]
alpha: wire up sendmmsg syscall
Signed-off-by: Michael Cree <mcree@orcon.net.nz> Reviewed-by: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michael Cree [Mon, 24 Oct 2011 14:58:32 +0000 (01:58 +1100)]
alpha: wire up accept4 syscall
Somehow wiring up the accept4 syscall on Alpha was missed long ago.
This commit rectifies that oversight.
Signed-off-by: Michael Cree <mcree@orcon.net.nz> Reviewed-by: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Magnus Lynch [Mon, 24 Oct 2011 14:58:32 +0000 (01:58 +1100)]
hpet: factor timer allocate from open
The current implementation of the /dev/hpet driver couples opening the
device with allocating one of the (scarce) timers (aka comparators). This
is a limitation in that the main counter may be valuable to applications
seeking a high-resolution timer who have no use for the interrupt
generating functionality of the comparators.
This patch alters the open semantics so that when the device is opened, no
timer is allocated. Operations that depend on a timer being in context
implicitly attempt allocating a timer, to maintain backward compatibility.
There is also an IOCTL (HPET_ALLOC_TIMER _IO) added so that the
allocation may be done explicitly. (I prefer the explicit open then
allocate pattern but don't know how practical it would be to require all
existing code to be changed.)
/dev/hpet is accessed via mmap(). This is the only interface of /dev/hpet
that is actually used in practice.
[akpm@linux-foundation.org: coding-style tweaks]
[arnd@arndb.de: fix build] Signed-off-by: Magnus Lynch <maglyx@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Acked-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Shevchenko [Mon, 24 Oct 2011 14:58:31 +0000 (01:58 +1100)]
selinuxfs: remove custom hex_to_bin()
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Eric Paris <eparis@parisplace.org> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Mon, 24 Oct 2011 14:58:30 +0000 (01:58 +1100)]
mm: munlock use mapcount to avoid terrible overhead
A process spent 30 minutes exiting, just munlocking the pages of a large
anonymous area that had been alternately mprotected into page-sized vmas:
for every single page there's an anon_vma walk through all the other
little vmas to find the right one.
A general fix to that would be a lot more complicated (use prio_tree on
anon_vma?), but there's one very simple thing we can do to speed up the
common case: if a page to be munlocked is mapped only once, then it is our
vma that it is mapped into, and there's no need whatever to walk through
all the others.
Okay, there is a very remote race in munlock_vma_pages_range(), if between
its follow_page() and lock_page(), another process were to munlock the
same page, then page reclaim remove it from our vma, then another process
mlock it again. We would find it with page_mapcount 1, yet it's still
mlocked in another process. But never mind, that's much less likely than
the down_read_trylock() failure which munlocking already tolerates (in
try_to_unmap_one()): in due course page reclaim will discover and move the
page to unevictable instead.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hillf Danton [Mon, 24 Oct 2011 14:58:30 +0000 (01:58 +1100)]
mm/huge_memory: fix typo when updating mmu cache
There are three cases of update_mmu_cache() in the file, and the case in
function collapse_huge_page() has a typo, namely the last parameter used,
which is corrected based on the other two cases.
Due to the define of update_mmu_cache by X86, the only arch that
implements THP currently, the change here has no really crystal point, but
one or two minutes of efforts could be saved for those archs that are
likely to support THP in future.
Signed-off-by: Hillf Danton <dhillf@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hillf Danton [Mon, 24 Oct 2011 14:58:12 +0000 (01:58 +1100)]
mm/huge_memory: fix copying user highpage
The THP copy-on-write handler falls back to regular-sized pages for a huge
page replacement upon allocation failure or if THP has been individually
disabled in the target VMA. The loop responsible for copying page-sized
chunks accidentally uses multiples of PAGE_SHIFT instead of PAGE_SIZE as
the virtual address arg for copy_user_highpage().
Signed-off-by: Hillf Danton <dhillf@gmail.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Mon, 24 Oct 2011 14:54:32 +0000 (01:54 +1100)]
oom: thaw threads if oom killed thread is frozen before deferring
If a thread has been oom killed and is frozen, thaw it before returning to
the page allocator. Otherwise, it can stay frozen indefinitely and no
memory will be freed.
Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nicolas Pitre [Mon, 24 Oct 2011 14:54:32 +0000 (01:54 +1100)]
mm: add vm_area_add_early()
The existing vm_area_register_early() allows for early vmalloc space
allocation. However upcoming cleanups in the ARM architecture require
that some fixed locations in the vmalloc area be reserved also very early.
The name "vm_area_register_early" would have been a good name for the
reservation part without the allocation. Since it is already in use with
Both vm_area_register_early() and vm_area_add_early() can be used together
meaning that the former is now implemented using the later where it is
ensured that no conflicting areas are added, but no attempt is made to
make the allocation scheme in vm_area_register_early() more sophisticated.
After all, you must know what you're doing when using those functions.
Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 24 Oct 2011 14:54:31 +0000 (01:54 +1100)]
vmscan: abort reclaim/compaction if compaction can proceed
If compaction can proceed, shrink_zones() stops doing any work but its
callers still call shrink_slab() which raises the priority and potentially
sleeps. This is unnecessary and wasteful so this patch aborts direct
reclaim/compaction entirely if compaction can proceed.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Cc: Josh Boyer <jwboyer@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Mon, 24 Oct 2011 14:54:31 +0000 (01:54 +1100)]
vmscan: limit direct reclaim for higher order allocations
When suffering from memory fragmentation due to unfreeable pages, THP page
faults will repeatedly try to compact memory. Due to the unfreeable
pages, compaction fails.
Needless to say, at that point page reclaim also fails to create free
contiguous 2MB areas. However, that doesn't stop the current code from
trying, over and over again, and freeing a minimum of 4MB (2UL <<
sc->order pages) at every single invocation.
This resulted in my 12GB system having 2-3GB free memory, a corresponding
amount of used swap and very sluggish response times.
This can be avoided by having the direct reclaim code not reclaim from
zones that already have plenty of free memory available for compaction.
If compaction still fails due to unmovable memory, doing additional
reclaim will only hurt the system, not help.
Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Mon, 24 Oct 2011 14:54:30 +0000 (01:54 +1100)]
vmscan: add barrier to prevent evictable page in unevictable list
When a race between putback_lru_page() and shmem_lock with lock=0 happens,
progrom execution order is as follows, but clear_bit in processor #1 could
be reordered right before spin_unlock of processor #1. Then, the page
would be stranded on the unevictable list.
spin_lock
SetPageLRU
spin_unlock
clear_bit(AS_UNEVICTABLE)
spin_lock
if PageLRU()
if !test_bit(AS_UNEVICTABLE)
move evictable list
smp_mb
if !test_bit(AS_UNEVICTABLE)
move evictable list
spin_unlock
But, pagevec_lookup() in scan_mapping_unevictable_pages() has
rcu_read_[un]lock() so it could protect reordering before reaching
test_bit(AS_UNEVICTABLE) on processor #1 so this problem never happens.
But it's a unexpected side effect and we should solve this problem
properly.
This patch adds a barrier after mapping_clear_unevictable.
I didn't meet this problem but just found during review.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
warning: symbol 'default_policy' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Stephen Wilson <wilsons@start.ca> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
warning: symbol 'memblock_overlaps_region' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers,com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Tomi Valkeinen <tomi.valkeinen@nokia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Mon, 24 Oct 2011 14:54:28 +0000 (01:54 +1100)]
mm: disable user interface to manually rescue unevictable pages
At one point, anonymous pages were supposed to go on the unevictable list
when no swap space was configured, and the idea was to manually rescue
those pages after adding swap and making them evictable again. But
nowadays, swap-backed pages on the anon LRU list are not scanned without
available swap space anyway, so there is no point in moving them to a
separate list anymore.
The manual rescue could also be used in case pages were stranded on the
unevictable list due to race conditions. But the code has been around for
a while now and newly discovered bugs should be properly reported and
dealt with instead of relying on such a manual fixup.
In addition to the lack of a usecase, the sysfs interface to rescue pages
from a specific NUMA node has been broken since its introduction, so it's
unlikely that anybody ever relied on that.
This patch removes the functionality behind the sysctl and the
node-interface and emits a one-time warning when somebody tries to access
either of them.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reported-by: Kautuk Consul <consul.kautuk@gmail.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kautuk Consul [Mon, 24 Oct 2011 14:54:28 +0000 (01:54 +1100)]
vmscan.c: fix invalid strict_strtoul() check in write_scan_unevictable_node()
write_scan_unevictable_node() checks the value req returned by
strict_strtoul() and returns 1 if req is 0.
However, when strict_strtoul() returns 0, it means successful conversion
of buf to unsigned long.
Due to this, the function was not proceeding to scan the zones for
unevictable pages even though we write a valid value to the
scan_unevictable_pages sys file.
Change this check slightly to check for invalid value in buf as well as 0
value stored in res after successful conversion via strict_strtoul. In
both cases, we do not perform the scanning of this node's zones.
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kyungmin Park [Mon, 24 Oct 2011 14:54:27 +0000 (01:54 +1100)]
mm: compaction: make compact_zone_order() static
There's no compact_zone_order() user outside file scope, so make it static.
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dean Nelson [Mon, 24 Oct 2011 14:54:27 +0000 (01:54 +1100)]
HWPOISON: convert pr_debug()s to pr_info()s
Commit fb46e73520940b ("HWPOISON: Convert pr_debugs to pr_info) authored
by Andi Kleen converted a number of pr_debug()s to pr_info()s.
About the same time additional code with pr_debug()s was added by two
other commits 8c6c2ecb4466 ("HWPOSION, hugetlb: recover from free hugepage
error when !MF_COUNT_INCREASED") and d950b95882f3d ("HWPOISON, hugetlb:
soft offlining for hugepage"). And these pr_debug()s failed to get
converted to pr_info()s.
This patch converts them as well. And does some minor related whitespace
cleanup.
Signed-off-by: Dean Nelson <dnelson@redhat.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tao Ma [Mon, 24 Oct 2011 14:54:26 +0000 (01:54 +1100)]
fs/buffer.c: add device information for error output in __find_get_block_slow()
On the ext4 mailing list[1], we got some report about errors in
__find_get_block_slow(), but the information is very limited.
If the device information is given, we can know the name of the sick
volume. Futhermore, we can get the corresponding status of that
block(group, inode block etc) by analyzing the disk layout.
Signed-off-by: Tao Ma <boyu.mt@taobao.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kautuk Consul [Mon, 24 Oct 2011 14:54:26 +0000 (01:54 +1100)]
mm/mmap.c: eliminate the ret variable from mm_take_all_locks()
The ret variable is really not needed in mm_take_all_locks().
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 24 Oct 2011 14:54:23 +0000 (01:54 +1100)]
s390: gup_huge_pmd() return 0 if pte changes
s390 didn't return 0 in that case, if it's rolling back the *nr pointer it
should also return zero to avoid adding pages to the array at the wrong
offset.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 24 Oct 2011 14:54:23 +0000 (01:54 +1100)]
powerpc: gup_huge_pmd() return 0 if pte changes
powerpc didn't return 0 in that case, if it's rolling back the *nr pointer
it should also return zero to avoid adding pages to the array at the wrong
offset.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 24 Oct 2011 14:54:22 +0000 (01:54 +1100)]
powerpc: get_hugepte() don't put_page() the wrong page
"page" may have changed to point to the next hugepage after the loop
completed, The references have been taken on the head page, so the
put_page must happen there too.
This is a longstanding issue pre-thp inclusion.
It's totally unclear how these page_cache_add_speculative and pte_val(pte)
!= pte_val(*ptep) checks are necessary across all the powerpc gup_fast
code, when x86 doesn't need any of that: there's no way the page can be
freed with irq disabled so we're guaranteed the atomic_inc will happen on
a page with page_count > 0 (so not needing the speculative check). The
pte check is also meaningless on x86: no need to rollback on x86 if the
pte changed, because the pte can still change a CPU tick after the check
succeeded and it won't be rolled back in that case. The important thing
is we got a reference on a valid page that was mapped there a CPU tick
ago. So not knowing the soft tlb refill code of ppc64 in great detail I'm
not removing the "speculative" page_count increase and the pte checks
across all the code, but unless there's a strong reason for it they should
be later cleaned up too.
If a pte can change from huge to non-huge (like it could happen with THP)
passing a pte_t *ptep to gup_hugepte() would also require to repeat the
is_hugepd in gup_hugepte(), but that shouldn't happen with hugetlbfs only
so I'm not altering that.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 24 Oct 2011 14:54:21 +0000 (01:54 +1100)]
powerpc: remove superfluous PageTail checks on the pte gup_fast
This part of gup_fast doesn't seem capable of handling hugetlbfs ptes,
those should be handled by gup_hugepd only, so these checks are
superfluous.
Plus if this wasn't a noop, it would have oopsed because, the insistence
of using the speculative refcounting would trigger a VM_BUG_ON if a tail
page was encountered in the page_cache_get_speculative().
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 24 Oct 2011 14:54:21 +0000 (01:54 +1100)]
mm: thp: tail page refcounting fix
Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't
safe, if the pfn ended up being a tail page of a transparent hugepage
under splitting by __split_huge_page_refcount(). He then found the
problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().
So the best way to fix the problem is to keep page_tail->_count zero at
all times. This will guarantee that get_page_unless_zero() can never
succeed on any tail page. page_tail->_mapcount is guaranteed zero and is
unused for all tail pages of a compound page, so we can simply account the
tail page references there and transfer them to tail_page->_count in
__split_huge_page_refcount() (in addition to the head_page->_mapcount).
While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic. As
opposed other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns. It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()). get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge). The standard get_page() as invoked by direct-io instead
will now take the compound_lock but still only for tail pages. The
direct-io paths are usually I/O bound and the compound_lock is per THP so
very finegrined, so there's no risk of scalability issues with it. A
simple direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead. So it's worth it. Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages(). The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation and
usually only run in I/O paths.
This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Michel Lespinasse <walken@google.com> Reviewed-by: Michel Lespinasse <walken@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Mon, 24 Oct 2011 14:54:20 +0000 (01:54 +1100)]
mm-add-extra-free-kbytes-tunable-update
All the fixes suggested by Andrew Morton. Not much of a changelog
since the patch should probably be folded into
mm-add-extra-free-kbytes-tunable.patch
Thank you for pointing these out, Andrew.
Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Mon, 24 Oct 2011 14:54:20 +0000 (01:54 +1100)]
mm: add extra free kbytes tunable
Add a userspace visible knob to tell the VM to keep an extra amount of
memory free, by increasing the gap between each zone's min and low
watermarks.
This is useful for realtime applications that call system calls and have a
bound on the number of allocations that happen in any short time period.
In this application, extra_free_kbytes would be left at an amount equal to
or larger than than the maximum number of allocations that happen in any
burst.
It may also be useful to reduce the memory use of virtual machines
(temporarily?), in a way that does not cause memory fragmentation like
ballooning does.
Testing results from Satoru Moriya:
: I ran some sample workloads and measure memory allocation latency
: (latency of __alloc_page_nodemask()).
: The test is like following:
:
: - CPU: 1 socket, 4 core
: - Memory: 4GB
:
: - Background load:
: $ dd if=3D/dev/zero of=3D/tmp/tmp1
: $ dd if=3D/dev/zero of=3D/tmp/tmp2
: $ dd if=3D/dev/zero of=3D/tmp/tmp3
:
: - Main load:
: $ mapped-file-stream 1 $((1024 * 1024 * 640)) --(*)
:
: (*) This is made by Johannes Weiner
: https://lkml.org/lkml/2010/8/30/226
:
: It allocates/access 640MByte memory at a burst.
:
: The result is follwoing:
:
: | | extra |
: | default | kbytes |
: --------------------------------------------------------------
: min_free_kbytes | 8113 | 8113 |
: extra_free_kbytes | 0 | 640*1024 | (KB)
: --------------------------------------------------------------
: worst latency | 517.762 | 20.775 | (usec)
: --------------------------------------------------------------
: vmstat result | | |
: nr_vmscan_write | 0 | 0 |
: pgsteal_dma | 0 | 0 |
: pgsteal_dma32 | 143667 | 144882 |
: pgsteal_normal | 31486 | 27001 |
: pgsteal_movable | 0 | 0 |
: pgscan_kswapd_dma | 0 | 0 |
: pgscan_kswapd_dma32 | 138617 | 156351 |
: pgscan_kswapd_normal | 30593 | 27955 |
: pgscan_kswapd_movable | 0 | 0 |
: pgscan_direct_dma | 0 | 0 |
: pgscan_direct_dma32 | 5050 | 0 |
: pgscan_direct_normal | 896 | 0 |
: pgscan_direct_movable | 0 | 0 |
: kswapd_steal | 169207 | 171883 |
: kswapd_inodesteal | 0 | 0 |
: kswapd_low_wmark_hit_quickly | 43 | 45 |
: kswapd_high_wmark_hit_quickly | 1 | 0 |
: allocstall | 32 | 0 |
:
:
: As you can see, in the default case there were 32 direct reclaim
: (allocstal= l) and its worst latency was 517.762 usecs. This value may be
: larger if a process would sleep or issue I/O in the direct reclaim path.
: OTOH, ii the other case where I add extra free bytes, there were no direct
: reclaim and its worst latency was 20.775 usecs.
:
: In this test case, we can avoid direct reclaim and keep a latency low.
Signed-off-by: Rik van Riel<riel@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Tested-by: Satoru Moriya <satoru.moriya@hds.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>