Setting the alarm to a time not on a minute boundary results in repeated
interrupts being generated by the DA9052/3 PMIC device until the kernel
RTC core sees that the alarm has rung. Sometimes the number and frequency
of interrupts can cause the kernel to disable the IRQ line used by the
DA9052/3 PMIC with disasterous consequences. This patch fixes the
problem.
Even though the DA9052/3 PMIC is capable generating periodic interrupts,
ie TICKS, the method used to distinguish RTC_AF from RTC_PF events was
flawed and can not work in conjunction with the regmap_irq kernel core.
Thus that flawed detection has also been removed by the DA9052/3 PMIC RTC
driver's irq handler, so that it no longer reports the wrong type of event
to the kernel RTC core.
The internal static functions within the DA9052/3 PMIC RTC driver have
been changed to pass the 'da9052_rtc' structure instead of the 'da9052'
because there is no backwards pointer from the 'da9052' structure.
This patch fixes the three issues described above. The first is serious
because usiing the RTC alarm set to a non minute boundary will eventually
cause all component drivers that depend on the interrupt line to fail.
The solution adopted is to round up to alarm time to the next highest
minute.
The second bug, reporting a RTC_PF event instead of an RTC_AF event turns
out to not matter with the current implementation of the kernel RTC core
as it seems to ignore the event type. However, should that change in the
future it is better to fix the issue now and not have 'problems waiting to
happen'
The third set of changes are to make the da9052_rtc structure available to
all the local internal functions in the driver. This was done during
testing so that diagnostic data could be stored there. Should the
solution to the first issue be found not acceptable, then the alternative
of using the TICKS interrupt at the fixed one second interval in order to
step to the exact second of the requested alarm requires an extra (alarm
time) piece of data to be stored. In devices that use the alarm function
to wake up from sleep, accuracy to the second will result in the device
being awake for up to nearly a minute longer than expected.
Signed-off-by: Anthony Olech <anthony.olech.opensource@diasemi.com> Cc: David Dajun Chen <dchen@diasemi.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
drivers/rtc/rtc-cmos.c: drivers/char/rtc.c features for DECstation support
This brings in drivers/char/rtc.c functionality required for DECstation
and, should the maintainers decide to switch, Alpha systems to use
rtc-cmos.
Specifically these features are made available:
* RTC iomem rather than x86/PCI port I/O mapping, controlled with the
RTC_IOMAPPED macro as with the original driver. The DS1287A chip in all
DECstation systems is mapped in the host bus address space as a
contiguous block of 64 32-bit words of which the least significant byte
accesses the RTC chip for both reads and writes. All the address and
data window register accesses are made transparently by the chipset glue
logic so that the device appears directly mapped on the host bus.
* A way to set the size of the address space explicitly with the
newly-added `address_space' member of the platform part of the RTC
device structure. This avoids the unreliable heuristics that does not
work in a setup where the RTC is not explicitly accessed with the usual
address and data window register pair.
* The ability to use the RTC periodic interrupt as a system clock
device, which is implemented by arch/mips/kernel/cevt-ds1287.c for
DECstation systems and takes the RTC interrupt away from the RTC driver.
Eventually hooking back to the clock device's interrupt handler should
be possible for the purpose of the alarm clock and possibly also
update-in-progress interrupt, but this is not done by this change.
o To avoid interfering with the clock interrupt all the places where
the RTC interrupt mask is fiddled with are only executed if and IRQ
has been assigned to the RTC driver.
o To avoid changing the clock setup Register A is not fiddled with
if CMOS_RTC_FLAGS_NOFREQ is set in the newly-added `flags' member of
the platform part of the RTC device structure. Originally, in
drivers/char/rtc.c, this was keyed with the absence of the RTC
interrupt, just like the interrupt mask, but there only the periodic
interrupt frequency is set, whereas rtc-cmos also sets the divider
bits. Therefore a new flag is introduced so that systems where the
RTC interrupt is not usable rather than used as a system clock device
can fully initialise the RTC.
* A small clean-up is made to the IRQ assignment code that makes the IRQ
number hardcoded to -1 rather than arbitrary -ENXIO (or whatever error
happens to be returned by platform_get_irq) where no IRQ has been
assigned to the RTC driver (NO_IRQ might be another candidate, but it
looks like this macro has inconsistent or missing definitions and
limited use and might therefore be unsafe).
Verified to work correctly with a DECstation 5000/240 system.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lee, Chun-Yi [Wed, 14 May 2014 00:02:38 +0000 (10:02 +1000)]
drivers/rtc/rtc-efi.c: avoid subtracting day twice when computing year days
Compared source code of rtc-lib.c::rtc_year_days() with
efirtc.c::rtc_year_days(), found the code in rtc-efi decreases value of
day twice when it computing year days. rtc-lib.c::rtc_year_days() has
already decrease days and return the year days from 0 to 365.
Ales Novak [Wed, 14 May 2014 00:02:36 +0000 (10:02 +1000)]
drivers/rtc/interface.c: fix for fix of alarm initialization
Seems the previous patch "fix infinite loop in initializing the alarm"
did break the infinite loop in alarm initialization, but not in the right
way. The loop indeed should walk through the not-leap years and stop on
the leap one.
This patch does apply on top of the previous one.
Signed-off-by: Ales Novak <alnovak@suse.cz> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ales Novak [Wed, 14 May 2014 00:02:36 +0000 (10:02 +1000)]
drivers/rtc/interface.c: fix infinite loop in initializing the alarm
In __rtc_read_alarm(), if the alarm time retrieved by
rtc_read_alarm_internal() from the device contains invalid values (e.g.
month=2,mday=31) and the year not set (=-1), the initialization will loop
infinitely because the year-fixing loop expects the time being invalid due
to leap year.
Fix reduces the loop to the leap years and adds final validity check.
Signed-off-by: Ales Novak <alnovak@suse.cz> Acked-by: Alessandro Zummo <a.zummo@towertech.it> Reported-by: Jiri Bohac <jbohac@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oleg Nesterov [Wed, 14 May 2014 00:02:35 +0000 (10:02 +1000)]
kthreads: kill CLONE_KERNEL, change kernel_thread(kernel_init) to avoid CLONE_SIGHAND
1. Remove CLONE_KERNEL, it has no users and it is dangerous.
The (old) comment says "List of flags we want to share for kernel
threads" but this is not true, we do not want to share ->sighand by
default. This flag can only be used if the caller is sure that both
parent/child will never play with signals (say, allow_signal/etc).
2. Change rest_init() to clone kernel_init() without CLONE_SIGHAND.
In this case CLONE_SIGHAND does not really hurt, and it looks like
optimization because copy_sighand() can avoid kmem_cache_alloc().
But in fact this only adds the minor pessimization. kernel_init()
is going to exec the init process, and de_thread() will need to
unshare ->sighand and do kmem_cache_alloc(sighand_cachep) anyway,
but it needs to do more work and take tasklist_lock and siglock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When a module is built into the kernel the module_init() function becomes
an initcall. Sometimes debugging through dynamic debug can help, however,
debugging built in kernel modules is typically done by changing the
.config, recompiling, and booting the new kernel in an effort to determine
exactly which module caused a problem.
This patchset can be useful stand-alone or combined with initcall_debug.
There are cases where some initcalls can hang the machine before the
console can be flushed, which can make initcall_debug output inaccurate.
Having the ability to skip initcalls can help further debugging of these
scenarios.
Usage: initcall_blacklist=<list of comma separated initcalls>
ex) added "initcall_blacklist=sgi_uv_sysfs_init" as a kernel parameter and
the log contains:
Andrew Morton [Wed, 14 May 2014 00:02:35 +0000 (10:02 +1000)]
init/main.c: don't use pr_debug()
Pertially revert ea676e846a8171b8 ("init/main.c: convert to pr_foo()").
Unbeknownst to me, pr_debug() is different from the other pr_foo() levels:
pr_debug() is a no-op when DEBUG is not defined.
Happily, init/main.c does have a #define DEBUG so we didn't break
initcall_debug. But the functioning of initcall_debug should not be
dependent upon the presence of that #define DEBUG.
Reported-by: Russell King <rmk@arm.linux.org.uk> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We observed this problem has been occurring since 2.6.30 with
fs/binfmt_elf.c: create_elf_tables()->get_random_bytes(), introduced by f06295b44c296c8f ("ELF: implement AT_RANDOM for glibc PRNG seeding").
/*
* Generate 16 random bytes for userspace PRNG seeding.
*/
get_random_bytes(k_rand_bytes, sizeof(k_rand_bytes));
The patch introduces a wrapper around get_random_int() which has lower
overhead than calling get_random_bytes() directly.
With this patch applied:
$ cat /proc/sys/kernel/random/entropy_avail
2731
$ cat /proc/sys/kernel/random/entropy_avail
2802
$ cat /proc/sys/kernel/random/entropy_avail
2878
Analyzed by John Sobecki.
This has been applied on a specific Oracle kernel and has been running on
the customer's production environment (the original bug reporter) for
several months; it has worked fine until now.
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <aedilger@gmail.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Arnd Bergmann <arnn@arndb.de> Cc: John Sobecki <john.sobecki@oracle.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joe Perches [Wed, 14 May 2014 00:02:33 +0000 (10:02 +1000)]
checkpatch: make --strict a default for files in drivers/net and net/
Networking files are generally more strictly conformant to linux-kernel
style so make checkpatch more verbose by default for patches to files or
when checking files in these directories.
Signed-off-by: Joe Perches <joe@perches.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joe Perches [Wed, 14 May 2014 00:02:33 +0000 (10:02 +1000)]
checkpatch: improve missing blank line after declarations test
A couple more modifications to the declarations tests.
o Declarations can also be bitfields so exclude things with a colon
o Make sure the current and previous lines are indented the same
to avoid matching some macro where a struct type is passed on
the previous line like:
next = list_entry(buffer->entry.next,
struct binder_buffer, entry);
if (buffer_start_page(next) == buffer_end_page(buffer))
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We attempt to search for compatible strings which use a variable token in
the documented name such as <chip> or <soc>. While this was attempted to
be handled, it's utterly broken.
The desired forms of matching are:
vendor,<chip>-*
vendor,name<part#>-*
For <chip>, lower case characters and numbers are permitted. For <part#>,
only numeric values are allowed.
With this change, the number of missing compatible strings reported in
arch/arm/boot/dts is reduced from 1071 to 960.
Reported-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Signed-off-by: Rob Herring <robh@kernel.org> Cc: Florian Vaussard <florian.vaussard@epfl.ch> Cc: Joe Perches <joe@perches.com> Cc: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Marian Chereji [Wed, 14 May 2014 00:02:32 +0000 (10:02 +1000)]
lib: Add CRC64 ECMA module
Add implementation of CRC64 ECMA checksum.
We have an IP Acceleration driver for Freescale network processors which
is using this CRC64. However, it still needs some work in order for it to
become upstreamable.
Signed-off-by: Marian Chereji <marian.chereji@freescale.com> Reviewed-by: Varvara Andrei-B21317 <andrei.varvara@freescale.com> Reviewed-by: Fleming Andrew-AFLEMING <AFLEMING@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kstrimdup() creates a whitespace-trimmed duplicate of the passed in
null-terminated string. This is useful for strings coming from sysfs that
often include trailing whitespace due to user input.
Thanks to Joe Perches for this implementation.
Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org> Cc: Joe Perches <joe@perches.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Streetman [Wed, 14 May 2014 00:02:32 +0000 (10:02 +1000)]
lib/plist.c: make CONFIG_DEBUG_PI_LIST selectable
Change CONFIG_DEBUG_PI_LIST to be user-selectable, and add a title and
description. Remove the dependency on DEBUG_RT_MUTEXES since they were
changed to use rbtrees, and there are other users of plists now.
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lasse Collin [Wed, 14 May 2014 00:02:30 +0000 (10:02 +1000)]
lib/xz: enable all filters by default in Kconfig
This restores the old behavior that existed before 2013-02-22, when
changes were made by 64dbfb444c150 ("decompressors: drop dependency on
CONFIG_EXPERT") and 5dc49c75a2 ("decompressors: make the default XZ_DEC_*
config match the selected architecture").
Disabling the filters only makes sense on embedded systems.
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Acked-by: Kyle McMartin <kyle@infradead.org> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Phillip Lougher <phillip@lougher.demon.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Streetman [Wed, 14 May 2014 00:02:30 +0000 (10:02 +1000)]
lib/plist.c: replace pr_debug with printk in plist_test()
Replace pr_debug() in lib/plist.c test function plist_test() with
printk(KERN_DEBUG ...).
Without DEBUG defined, pr_debug() is complied out, but the entire
plist_test() function is already inside CONFIG_DEBUG_PI_LIST, so printk
should just be used directly.
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Carpenter [Wed, 14 May 2014 00:02:30 +0000 (10:02 +1000)]
lib/string.c: use the name "C-string" in comments
For strncpy() and friends the source string may or may not have an actual
NUL character at the end. The documentation is confusing in this because
it specifically mentions that you are passing a "NUL-terminated" string.
Wikipedia says that "C-string" is an alternative name we can use instead.
Most mobile phones have Ambient Light Sensors and it changes brightness
according to the lux. It means it changes backlight brightness frequently
by just writing sysfs node, so it generates uevent.
Usually there's no user to use this backlight changes. But it forks udev
worker threads and it takes about 5ms. The main problem is that it hurts
other process activities. so remove it.
Kay said
"Uevents are for the major, low-frequent, global device state-changes,
not for carrying-out any sort of measurement data. Subsystems which
need that should use other facilities like poll()-able sysfs file or
any other subscription-based, client-tracking interface which does not
cause overhead if it isn't used. Uevents are not the right thing to
use here, and upstream udev should not paper-over broken kernel
subsystems."
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Acked-by: Jingoo Han <jg1.han@samsung.com> Cc: Henrique de Moraes Holschuh <ibm-acpi@hmh.eng.br> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Josh Triplett [Wed, 14 May 2014 00:02:29 +0000 (10:02 +1000)]
MAINTAINERS: add linux-api for review of API/ABI changes
This makes it more likely that patch submitters will CC API/ABI changes to
the linux-api list, and tools like get_maintainer.pl will do so
automatically.
Signed-off-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Michael Kerrisk <mtk.man-pages@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tetsuo Handa [Wed, 14 May 2014 00:02:29 +0000 (10:02 +1000)]
lib/vsprintf: add %pT format specifier
Since task_struct->comm can be modified by other threads while the current
thread is reading it, it is recommended to use get_task_comm() for reading
it.
However, since get_task_comm() holds task_struct->alloc_lock spinlock,
some users cannot use get_task_comm(). Also, a lot of users are directly
reading from task_struct->comm even if they can use get_task_comm(). Such
users might obtain inconsistent result.
This patch introduces %pT format specifier for printing task_struct->comm.
Currently %pT does not provide consistency. I'm planning to change to
use RCU in the future. By using RCU, the comm name read from
task_struct->comm will be guaranteed to be consistent. But before
modifying set_task_comm() to use RCU, we need to kill direct ->comm users
who do not use get_task_comm().
An example for converting direct ->comm users is shown below. Since many
debug printings use p == current, you can pass NULL instead of p if p ==
current.
Will Deacon [Wed, 14 May 2014 00:02:29 +0000 (10:02 +1000)]
printk: report dropping of messages from logbuf
If the log ring buffer becomes full, we silently overwrite old messages
with new data. console_unlock will detect this case and fast-forward the
console_* pointers to skip over the corrupted data, but nothing will be
reported to the user.
This patch hijacks the first valid log message after detecting that we
dropped messages and prefixes it with a note detailing how many messages
were dropped. For long (~1000 char) messages, this will result in some
truncation of the real message, but given that we're dropping things
anyway, that doesn't seem to be the end of the world.
Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Kay Sievers <kay@vrfy.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Streetman [Wed, 14 May 2014 00:02:28 +0000 (10:02 +1000)]
Documentation: expand/clarify debug documentation
The pr_debug() and related debug print macros all differ from the normal
pr_XXX() macros, in that the normal ones print unconditionally, while the
debug macros are compiled out unless DEBUG is defined or
CONFIG_DYNAMIC_DEBUG is set. This isn't obvious, and the only way to find
this out is either to review the actual printk.h code or to read
CodingStyle, and the message there doesn't highlight the fact.
Change Documentation/CodingStyle to clearly indicate that pr_debug() and
related debug printing macros behave differently than all other pr_XXX()
macros, and attempt to clarify when and where the different debug printing
methods might be used.
Add short comment to printk.h above the pr_XXX() macros indicating that
while these macros print unconditionally, pr_debug() does not.
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Joe Perches <joe@perches.com> Cc: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
John Stultz [Wed, 14 May 2014 00:02:28 +0000 (10:02 +1000)]
timekeeping: use printk_deferred when holding timekeeping seqlock
Jiri Bohac pointed out that there are rare but potential deadlock
possibilities when calling printk while holding the timekeeping
seqlock.
This is due to printk() triggering console sem wakeup, which can
cause scheduling code to trigger hrtimers which may try to read
the time.
Specifically, as Jiri pointed out, that path is:
printk
vprintk_emit
console_unlock
up(&console_sem)
__up
wake_up_process
try_to_wake_up
ttwu_do_activate
ttwu_activate
activate_task
enqueue_task
enqueue_task_fair
hrtick_update
hrtick_start_fair
hrtick_start_fair
get_time
ktime_get
--> endless loop on
read_seqcount_retry(&timekeeper_seq, ...)
This patch tries to avoid this issue by using printk_deferred (previously
named printk_sched) which should defer printing via a irq_work_queue.
Signed-off-by: John Stultz <john.stultz@linaro.org> Reported-by: Jiri Bohac <jbohac@suse.cz> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
John Stultz [Wed, 14 May 2014 00:02:28 +0000 (10:02 +1000)]
printk: Add printk_deferred_once
Two of the three prink_deferred uses are really printk_once style
uses, so add a printk_deferred_once macro to simplify those call
sites.
Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
John Stultz [Wed, 14 May 2014 00:02:28 +0000 (10:02 +1000)]
printk: rename printk_sched to printk_deferred
After learning we'll need some sort of deferred printk functionality in
the timekeeping core, Peter suggested we rename the printk_sched function
so it can be reused by needed subsystems.
This only changes the function name. No logic changes.
Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
John Stultz [Wed, 14 May 2014 00:02:27 +0000 (10:02 +1000)]
printk: disable preemption for printk_sched
An earlier change in -mm (printk: remove separate printk_sched
buffers...), removed the printk_sched irqsave/restore lines since it was
safe for current users. Since we may be expanding usage of
printk_sched(), disable preepmtion for this function to make it more
generally safe to call.
Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Steven Rostedt [Wed, 14 May 2014 00:02:27 +0000 (10:02 +1000)]
printk: remove separate printk_sched buffers and use printk buf instead
To prevent deadlocks with doing a printk inside the scheduler,
printk_sched() was created. The issue is that printk has a console_sem
that it can grab and release. The release does a wake up if there's a
task pending on the sem, and this wake up grabs the rq locks that is held
in the scheduler. This leads to a possible deadlock if the wake up uses
the same rq as the one with the rq lock held already.
What printk_sched() does is to save the printk write in a per cpu buffer
and sets the PRINTK_PENDING_SCHED flag. On a timer tick, if this flag is
set, the printk() is done against the buffer.
There's a couple of issues with this approach.
1) If two printk_sched()s are called before the tick, the second one
will overwrite the first one.
2) The temporary buffer is 512 bytes and is per cpu. This is a quite a
bit of space wasted for something that is seldom used.
In order to remove this, the printk_sched() can use the printk buffer
instead, and delay the console_trylock()/console_unlock() to the queued
work.
Because printk_sched() would then be taking the logbuf_lock, the
logbuf_lock must not be held while doing anything that may call into the
scheduler functions, which includes wake ups. Unfortunately, printk()
also has a console_sem that it uses, and on release, the up(&console_sem)
may do a wake up of any pending waiters. This must be avoided while
holding the logbuf_lock.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jan Kara [Wed, 14 May 2014 00:02:27 +0000 (10:02 +1000)]
printk: enable interrupts before calling console_trylock_for_printk()
We need interrupts disabled when calling console_trylock_for_printk() only
so that cpu id we pass to can_use_console() remains valid (for other
things console_sem provides all the exclusion we need and deadlocks on
console_sem due to interrupts are impossible because we use
down_trylock()). However if we are rescheduled, we are guaranteed to run
on an online cpu so we can easily just get the cpu id in
can_use_console().
We can lose a bit of performance when we enable interrupts in
vprintk_emit() and then disable them again in console_unlock() but OTOH it
can somewhat reduce interrupt latency caused by console_unlock()
especially since later in the patch series we will want to spin on
console_sem in console_trylock_for_printk().
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jan Kara [Wed, 14 May 2014 00:02:27 +0000 (10:02 +1000)]
printk: fix lockdep instrumentation of console_sem
Printk calls mutex_acquire() / mutex_release() by hand to instrument
lockdep about console_sem. However in some corner cases the
instrumentation is missing. Fix the problem by creating helper functions
for locking / unlocking console_sem which take care of lockdep
instrumentation as well.
Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Fabio Estevam <festevam@gmail.com> Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Tested-by: Fabio Estevam <fabio.estevam@freescale.com> Tested-By: Valdis Kletnieks <valdis.kletnieks@vt.edu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jan Kara [Wed, 14 May 2014 00:02:26 +0000 (10:02 +1000)]
printk: release lockbuf_lock before calling console_trylock_for_printk()
There's no reason to hold lockbuf_lock when entering
console_trylock_for_printk().
The first thing this function does is to call down_trylock(console_sem)
and if that fails it immediately unlocks lockbuf_lock. So lockbuf_lock
isn't needed for that branch. When down_trylock() succeeds, the rest of
console_trylock() is OK without lockbuf_lock (it is called without it from
other places), and the only remaining thing in
console_trylock_for_printk() is can_use_console() call. For that call
console_sem is enough (it iterates all consoles and checks CON_ANYTIME
flag).
So we drop logbuf_lock before entering console_trylock_for_printk() which
simplifies the code.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
At that time release_console_sem() (today's equivalent is
console_unlock()) was indeed using lockbuf_lock to avoid races between
trylock on console_sem in printk() and unlock of console_sem. However
these days the interlocking is gone and the races are avoided by
rechecking logbuf state after releasing console_sem.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Petr Mladek [Wed, 14 May 2014 00:02:26 +0000 (10:02 +1000)]
printk: return really stored message length
I wonder if anyone uses printk return value but it is there and should be
counted correctly.
This patch modifies log_store() to return the number of really stored
bytes from the 'text' part. Also it handles the return value in
vprintk_emit().
Note that log_store() is used also in cont_flush() but we could ignore the
return value there. The function works with characters that were already
counted earlier. In addition, the store could newer fail here because the
length of the printed text is limited by the "cont" buffer and "dict" is
NULL.
Signed-off-by: Petr Mladek <pmladek@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Petr Mladek [Wed, 14 May 2014 00:02:25 +0000 (10:02 +1000)]
printk: shrink too long messages
We might want to print at least part of too long messages and add some
warning for debugging purpose.
The question is how long the shrunken message should be. If we use the
whole buffer, it might get rotated too soon. Let's try to use only 1/4 of
the buffer for now.
Also shrink the whole dictionary. We do not want to parse it or break it
in the middle of some pair of values. It would not cause any real harm
but still.
Signed-off-by: Petr Mladek <pmladek@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Petr Mladek [Wed, 14 May 2014 00:02:25 +0000 (10:02 +1000)]
printk: split message size computation
We will want to recompute the message size when shrinking too long
messages. Let's put the code into separate function.
The side effect of setting "pad_len" is not nice but it is worth removing
the code duplication. Note that I will probably have one more usage for
this function when handling messages safe way in NMI context.
This patch does not change the existing behavior.
Signed-off-by: Petr Mladek <pmladek@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Petr Mladek [Wed, 14 May 2014 00:02:25 +0000 (10:02 +1000)]
printk: ignore too long messages
There was no check for too long messages. The check for free space always
passed when first_seq and next_seq were equal. Enough free space was not
guaranteed, though.
log_store() might be called to store messages up to 64kB + 64kB + 16B.
This is sum of maximal text_len, dict_len values, and the size of the
structure printk_log.
On the other hand, the minimal size for the main log buffer currently is
4kB and it is enforced only by Kconfig.
The good news is that the usage looks safe right now. log_store() is
called only from vprintk_emit() and cont_flush(). Here the "text" part is
always passed via a static buffer and the length is limited to
LOG_LINE_MAX which is 1024. The "dict" part is NULL in most cases. The
only exceptions is when vprintk_emit() is called from printk_emit() and
dev_vprintk_emit(). But printk_emit() is currently used only in
devkmsg_writev() and here "dict" is NULL as well. In dev_vprintk_emit(),
"dict" is limited by the static buffer "hdr" of the size 128 bytes. It
meas that the current maximal printed text is 1024B + 128B + 16B and it
always fit the log buffer.
But it is only matter of time when someone calls printk_emit() with unsafe
parameters, especially the "dict" one.
This patch adds a check for the free space when the buffer is empty. It
reuses the already existing log_has_space() function but it has to add an
extra parameter. It defines whether the buffer is empty. Note that the
same values of "first_idx" and "next_idx" might also mean that the buffer
is full.
If the buffer is empty, we must respect the current position of the
indexes. We cannot reset them to the beginning of the buffer. Otherwise,
the functions reading the buffer would get crazy.
The question is what to do when the message is too long. This patch uses
the easiest solution and just ignores the problematic message. Let's do
something better in a followup patch.
Signed-off-by: Petr Mladek <pmladek@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Petr Mladek [Wed, 14 May 2014 00:02:25 +0000 (10:02 +1000)]
printk: split code for making free space in the log buffer
The check for free space in the log buffer always passes when "first_seq"
and "next_seq" are equal. In theory, it might cause writing outside of
the log buffer.
Fortunately, the current usage looks safe because the used "text" and
"dict" buffers are quite limited. See the second patch for more details.
Anyway, it is better to be on the safe side and add a check. An easy
solution is done in the 2nd patch and it is improved in the 4th patch.
5th patch fixes the computation of the printed message length.
1st and 3rd patches just do some code refactoring to make the other
patches easier.
This patch (of 5):
There will be needed some fixes in the check for free space. They will be
easier if the code is moved outside of the quite long log_store()
function.
This patch does not change the existing behavior.
Signed-off-by: Petr Mladek <pmladek@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gustavo Padovan [Wed, 14 May 2014 00:02:24 +0000 (10:02 +1000)]
drivers/misc/ti-st/st_core.c: fix NULL dereference on protocol type check
If the type we receive is greater than ST_MAX_CHANNELS we can't rely on
type as vector index since we would be accessing unknown memory when we use the type
as index.
Fabian Frederick [Wed, 14 May 2014 00:02:21 +0000 (10:02 +1000)]
kernel/cpu.c: convert printk to pr_foo()
no level printk converted to pr_warn (if err)
no level printk converted to pr_info (disabling non-boot cpus)
Other printk converted to respective level.
Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
sys_sgetmask and sys_ssetmask are obsolete system calls no longer
supported in libc.
This patch replaces architecture related __ARCH_WANT_SYS_SGETMAX by expert
mode configuration.That option is enabled by default for those
architectures.
Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Steven Miao <realmz6@gmail.com> Cc: Mikael Starvik <starvik@axis.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Greg Ungerer <gerg@uclinux.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Weijie Yang [Wed, 14 May 2014 00:02:20 +0000 (10:02 +1000)]
zram: correct offset usage in zram_bio_discard
We want to skip the physical block(PAGE_SIZE) which is partially covered
by the discard bio, so we check the remaining size and subtract it if
there is a need to goto the next physical block.
The current offset usage in zram_bio_discard is incorrect, it will cause
its upper filesystem breakdown. Consider the following scenario:
On some architecture or config, PAGE_SIZE is 64K for example, filesystem
is set up on zram disk without PAGE_SIZE aligned, a discard bio leads to a
offset = 4K and size=72K, normally, it should not really discard any
physical block as it partially cover two physical blocks. However, with
the current offset usage, it will discard the second physical block and
free its memory, which will cause filesystem breakdown.
This patch corrects the offset usage in zram_bio_discard.
Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In file included from include/linux/suspend.h:8,
from arch/x86/kernel/asm-offsets.c:12:
include/linux/mm.h: In function 'check_freepage_migratetype':
include/linux/mm.h:296: error: implicit declaration of function 'page_to_section'
In file included from include/linux/suspend.h:8,
from arch/x86/kernel/asm-offsets.c:12:
include/linux/mm.h: At top level:
include/linux/mm.h:892: error: conflicting types for 'page_to_section'
include/linux/mm.h:296: note: previous implicit declaration of 'page_to_section' was here
Vlastimil Babka [Wed, 14 May 2014 00:02:19 +0000 (10:02 +1000)]
mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages
For the MIGRATE_RESERVE pages, it is important they do not get misplaced
on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
pageblock might be changed to other migratetype in
try_to_steal_freepages(). For MIGRATE_CMA, the pages also must not go to
a different free_list, otherwise they could get allocated as unmovable and
result in CMA failure.
This is ensured by setting the freepage_migratetype appropriately when
placing pages on pcp lists, and using the information when releasing them
back to free_list. It is also assumed that CMA and RESERVE pageblocks are
created only in the init phase. This patch adds DEBUG_VM checks to catch
any regressions introduced for this invariant.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Yong-Taek Lee <ytk.lee@samsung.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 14 May 2014 00:02:19 +0000 (10:02 +1000)]
mm: non-atomically mark page accessed during page cache allocation where possible
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage. The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.
The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.
The primary APIs of concern in this care are the following and are used
by most filesystems.
All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not. Then
old API is preserved but is basically a thin wrapper around this core
function.
Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted. This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change. It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.
The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.
The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
do to writback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.
The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.
Mel Gorman [Wed, 14 May 2014 00:02:19 +0000 (10:02 +1000)]
fs: buffer: do not use unnecessary atomic operations when discarding buffers
Discarding buffers uses a bunch of atomic operations when discarding
buffers because ...... I can't think of a reason. Use a cmpxchg loop to
clear all the necessary flags. In most (all?) cases this will be a single
atomic operations.
Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 14 May 2014 00:02:18 +0000 (10:02 +1000)]
mm: do not use unnecessary atomic operations when adding pages to the LRU
When adding pages to the LRU we clear the active bit unconditionally. As
the page could be reachable from other paths we cannot use unlocked
operations without risk of corruption such as a parallel
mark_page_accessed. This patch tests if is necessary to clear the active
flag before using an atomic operation. This potentially opens a tiny race
when PageActive is checked as mark_page_accessed could be called after
PageActive was checked. The race already exists but this patch changes it
slightly. The consequence is that that the page may be promoted to the
active list that might have been left on the inactive list before the
patch. It's too tiny a race and too marginal a consequence to always use
atomic operations for.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 14 May 2014 00:02:18 +0000 (10:02 +1000)]
mm: shmem: avoid atomic operation during shmem_getpage_gfp
shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
before it's even added to the LRU or visible. This is unnecessary as what
could it possible race against? Use an unlocked variant.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 14 May 2014 00:02:18 +0000 (10:02 +1000)]
mm: page_alloc: convert hot/cold parameter and immediate callers to bool
cold is a bool, make it one. Make the likely case the "if" part of the
block instead of the else as according to the optimisation manual this is
preferred.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 14 May 2014 00:02:17 +0000 (10:02 +1000)]
mm: page_alloc: use unsigned int for order in more places
X86 prefers the use of unsigned types for iterators and there is a
tendency to mix whether a signed or unsigned type if used for page order.
This converts a number of sites in mm/page_alloc.c to use unsigned int for
order where possible.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 14 May 2014 00:02:17 +0000 (10:02 +1000)]
mm: page_alloc: use word-based accesses for get/set pageblock bitmaps
The test_bit operations in get/set pageblock flags are expensive. This
patch reads the bitmap on a word basis and use shifts and masks to isolate
the bits of interest. Similarly masks are used to set a local copy of the
bitmap and then use cmpxchg to update the bitmap if there have been no
other changes made in parallel.
In a test running dd onto tmpfs the overhead of the pageblock-related
functions went from 1.27% in profiles to 0.5%.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 14 May 2014 00:02:16 +0000 (10:02 +1000)]
mm: page_alloc: take the ALLOC_NO_WATERMARK check out of the fast path
ALLOC_NO_WATERMARK is set in a few cases. Always by kswapd, always for
__GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these
cases are relatively rare events but the ALLOC_NO_WATERMARK check is an
unlikely branch in the fast path. This patch moves the check out of the
fast path and after it has been determined that the watermarks have not
been met. This helps the common fast path at the cost of making the slow
path slower and hitting kswapd with a performance cost. It's a reasonable
tradeoff.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>