Alexey Dobriyan [Tue, 22 Mar 2011 23:34:40 +0000 (16:34 -0700)]
kstrto*: converting strings to integers done (hopefully) right
1. simple_strto*() do not contain overflow checks and use the crufty
libc way to indicate failure.
2. strict_strto*() also do not have overflow checks but the name and
comments pretend they do.
3. Both families have only "long long" and "long" variants,
but users want strtou8()
4. Both "simple" and "strict" prefixes are wrong:
Simple doesn't exactly say what's so simple, strict should not exist
because conversion should be strict by default.
The solution is to use "k" prefix and add converters for more types.
Enter
kstrtoull()
kstrtoll()
kstrtoul()
kstrtol()
kstrtouint()
kstrtoint()
Include runtime testsuite (somewhat incomplete) as well.
strict_strto*() become deprecated, stubbed to kstrto*() and
eventually will be removed altogether.
Use kstrto*() in code today!
Note: on some archs _kstrtoul() and _kstrtol() are left in tree, even if
they'll be unused at runtime. This is a temporary solution,
because I don't want to hardcode list of archs where these
functions aren't needed. Current solution with sizeof() and
__alignof__ at least always works.
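For illustration, a minimal userspace sketch of the strict contract argued for above: reject garbage, reject overflow, and report failure through the return value. It is built on libc strtoull() and is not the kernel implementation. The name strict_strtouint(), the -EINVAL/-ERANGE codes, and the tolerance for one trailing newline are assumptions made for this example, and only bases 0, 8, 10 and 16 are expected.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse all of s as an unsigned int.
 * Returns 0 on success, -EINVAL on malformed input, -ERANGE on overflow.
 */
int strict_strtouint(const char *s, unsigned int base, unsigned int *res)
{
	char *end;
	unsigned long long val;

	assert(s != NULL && res != NULL);

	if (*s == '+')
		s++;
	/* strtoull() would skip whitespace and accept '-'; refuse both. */
	if (!isxdigit((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	val = strtoull(s, &end, (int)base);
	if (end == s)
		return -EINVAL;		/* no digits valid in this base */
	if (errno == ERANGE)
		return -ERANGE;		/* does not even fit unsigned long long */

	/* Allow a single trailing newline and nothing else. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (val > UINT_MAX)
		return -ERANGE;		/* fits the wide type, not the target */

	*res = (unsigned int)val;
	return 0;
}
```

The caller checks one return value instead of inspecting an end pointer and errno, which is the point of the "strict by default" argument.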
Joe Perches [Tue, 22 Mar 2011 23:34:35 +0000 (16:34 -0700)]
MAINTAINERS: remove SHARP LH7A40X section
commit 82e6923e186 ("ARM: lh7a40x: remove unmaintained platform support")
removed support, remove it from MAINTAINERS.
Signed-off-by: Joe Perches <joe@perches.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Tue, 22 Mar 2011 23:34:31 +0000 (16:34 -0700)]
MAINTAINERS: update clkdev location
Commit 6d803ba736a ("ARM: 6483/1: arm & sh: factorised duplicated
clkdev.c") moved it to a separate directory.
Signed-off-by: Joe Perches <joe@perches.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Acked-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Tue, 22 Mar 2011 23:34:24 +0000 (16:34 -0700)]
get_maintainer.pl: allow "K:" pattern tests to match non-patch text
Extend the usage of the K section in the MAINTAINERS file to support
matching regular expressions to any arbitrary text that may precede the
patch itself. For example, the commit message or mail headers generated
by git-format-patch.
Signed-off-by: Joe Perches <joe@perches.com> Original-patch-by: L. Alberto Giménez <agimenez@sysvalve.es> Acked-by: L. Alberto Giménez <agimenez@sysvalve.es> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
printk: allow setting DEFAULT_MESSAGE_LEVEL via Kconfig
We've been burned by regressions/bugs which we later realized could have
been triaged quicker if only we'd paid closer attention to dmesg. To make
it easier to audit dmesg, we'd like to make DEFAULT_MESSAGE_LEVEL
Kconfig-settable. That way we can set it to KERN_NOTICE and audit any
messages <= KERN_WARNING.
Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joe Perches <joe@perches.com> Cc: Olof Johansson <olofj@chromium.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Tue, 22 Mar 2011 23:34:22 +0000 (16:34 -0700)]
printk: use %pK for /proc/kallsyms and /proc/modules
In an effort to reduce kernel address leaks that might be used to help
target kernel privilege escalation exploits, this patch uses %pK when
displaying addresses in /proc/kallsyms, /proc/modules, and
/sys/module/*/sections/*.
Note that this changes %x to %p, so some legitimately 0 values in
/proc/kallsyms would have changed from 00000000 to "(null)". To avoid
this, "(null)" is not used when using the "K" format. Anything that was
already successfully parsing "(null)" in addition to full hex digits
should have no problem with this change. (Thanks to Joe Perches for the
suggestion.) Due to the %x to %p, "void *" casts are needed since these
addresses are already "unsigned long" everywhere internally, due to their
starting life as ELF section offsets.
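The formatting behaviour described above can be sketched in userspace roughly as follows. This is an approximation and not the vsprintf code: toy_format_addr() and its hide flag are hypothetical, and printing zeros for a hidden address is an assumption of the sketch. What it does show is that a zero address comes out as full-width hex zeros and never as "(null)".

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format an address as fixed-width hex. When hide is non-zero the
 * real value is replaced by zeros. Returns the snprintf() result.
 */
int toy_format_addr(char *buf, size_t len, const void *ptr, int hide)
{
	uintptr_t val = hide ? 0 : (uintptr_t)ptr;
	int width = (int)(2 * sizeof(void *));	/* two hex digits per byte */

	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%0*llx", width, (unsigned long long)val);
}
```

Because the width is fixed, a parser that expects full hex digits keeps working whether the address is real, zero, or hidden.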
Signed-off-by: Kees Cook <kees.cook@canonical.com> Cc: Eugene Teo <eugene@redhat.com> Cc: Dan Rosenberg <drosenberg@vsecurity.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Feng Tang [Tue, 22 Mar 2011 23:34:21 +0000 (16:34 -0700)]
console: prevent registered consoles from dumping old kernel message over again
For a platform with many consoles like:
"console=tty1 console=ttyMFD2 console=ttyS0 earlyprintk=mrst"
Each time a non-"selected_console" (tty1 and ttyMFD2 here) gets
registered, the existing kernel messages are printed out on all registered
consoles again: the "mrst" early console gets some messages three times,
and "tty1" gets some twice.
As suggested by Andrew Morton, every time a new console is registered, it
will be set as the "exclusive" console which will dump the already
existing kernel messages.
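A toy model of the "exclusive" console idea, for illustration only. All types and names here are hypothetical and the real console code is considerably more involved. The backlog is replayed to the newly registered console alone, so every console ends up seeing each message exactly once.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CONSOLES 4

struct toy_console {
	int lines_seen;			/* messages delivered to this console */
};

struct toy_log {
	int nr_msgs;			/* messages buffered so far */
	struct toy_console *cons[TOY_MAX_CONSOLES];
	int nr_cons;
};

/* A new message goes to every console registered at that time. */
void toy_printk(struct toy_log *log)
{
	int i;

	log->nr_msgs++;
	for (i = 0; i < log->nr_cons; i++)
		log->cons[i]->lines_seen++;
}

/* Register a console and replay the backlog to it, and only to it. */
void toy_register(struct toy_log *log, struct toy_console *con)
{
	struct toy_console *exclusive = con;

	assert(log->nr_cons < TOY_MAX_CONSOLES);
	log->cons[log->nr_cons++] = con;
	exclusive->lines_seen += log->nr_msgs;
}
```

Without the exclusive replay, each registration would re-deliver the backlog to the consoles that already printed it, which is the duplication described above.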
Signed-off-by: Feng Tang <feng.tang@intel.com> Cc: Greg KH <gregkh@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
console: allow to retain boot console via boot option keep_bootcon
On some architectures, the boot process involves de-registering the boot
console (early boot), initializing drivers and then re-registering the console.
This mechanism introduces a window in which no printk can happen on the
console and messages are buffered and then printed once the new console is
available.
If a kernel crashes during this window, all that's left on the boot console
is "console [foo] enabled, bootconsole disabled" making debug of the crash
rather 'interesting'.
With the "keep_bootcon" option the boot console is not unregistered, which
allows printk to show everything that happens up to the crash.
The option is clearly meant only for debugging purposes as it introduces
lots of duplicated info printed on console, but will make bug report from
users easier as it doesn't require a kernel build just to figure out where
we crash.
Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net> Acked-by: David S. Miller <davem@davemloft.net> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Greg KH <gregkh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don Zickus [Tue, 22 Mar 2011 23:34:17 +0000 (16:34 -0700)]
kernel/watchdog.c: always return NOTIFY_OK during cpu up/down events
This patch addresses a couple of problems. One was that when the
hardlockup failed to start, the softlockup was not started either. There
were valid cases when the hardlockup shouldn't start and that shouldn't
block the softlockup (no lapic, bios controls perf counters).
The second problem was when the hardlockup failed to start on boxes (from
a no lapic or bios controlled perf counter case), it reported failure to
the cpu notifier chain. This blocked the notifier from continuing to
start other more critical pieces of cpu bring-up (in our case based on a
2.6.32 fork, it was the mce). As a result, during soft cpu online/offline
testing, the system would panic when a cpu was offlined because the cpu
notifier would succeed in processing a watchdog disable cpu event and
would panic in the mce case as a result of un-initialized variables from a
never executed cpu up event.
I realized the hardlockup/softlockup cases are really just debugging aids
and should never impede the progress of a cpu up/down event. Therefore I
modified the code to always return NOTIFY_OK and instead rely on printks
to inform the user of problems.
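The failure mode can be sketched with a toy notifier chain. The types, values and callback names below are local to this example and are not the kernel notifier API. A chain stops at the first callback that reports failure, so a debugging aid that returns failure keeps more critical callbacks from ever running.

```c
#include <assert.h>
#include <stddef.h>

/* Toy return codes; only the names echo the kernel's. */
enum { TOY_NOTIFY_OK, TOY_NOTIFY_BAD };

typedef int (*toy_notifier_fn)(void *data);

/*
 * Run callbacks in order and stop at the first failure.
 * Returns how many callbacks actually ran.
 */
int toy_call_chain(const toy_notifier_fn *chain, size_t n, void *data)
{
	size_t i;

	assert(chain != NULL);
	for (i = 0; i < n; i++)
		if (chain[i](data) == TOY_NOTIFY_BAD)
			return (int)(i + 1);	/* later callbacks never run */
	return (int)n;
}

/* Watchdog that propagates its start failure (old behaviour). */
int toy_watchdog_strict(void *data)
{
	(void)data;
	return TOY_NOTIFY_BAD;
}

/* Watchdog that would only log the problem (new behaviour). */
int toy_watchdog_lenient(void *data)
{
	(void)data;
	return TOY_NOTIFY_OK;
}

/* Stand-in for a more critical bring-up step such as the mce one. */
int toy_critical_setup(void *data)
{
	*(int *)data = 1;
	return TOY_NOTIFY_OK;
}
```

With the strict watchdog first in the chain, the critical step is skipped and its state stays uninitialized; with the lenient one it runs.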
Signed-off-by: Don Zickus <dzickus@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don Zickus [Tue, 22 Mar 2011 23:34:16 +0000 (16:34 -0700)]
kernel/watchdog.c: allow hardlockup to panic by default
When a cpu is considered stuck, instead of limping along and just printing
a warning, it is sometimes preferred to just panic, let kdump capture the
vmcore and reboot. This gets the machine back into a stable state quickly
while saving the info that got it into a stuck state to begin with.
Add a Kconfig option to allow users to set the hardlockup to panic
by default. Also add in a 'nmi_watchdog=nopanic' to override this.
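A rough sketch of how such an override might be parsed, with the strncmp() length matching the literal it is compared against. toy_parse_watchdog_opt() is a hypothetical helper written for this example, not the function in kernel/watchdog.c.

```c
#include <assert.h>
#include <string.h>

/*
 * Return the panic setting selected by the option value str,
 * or panic_default when the value says nothing about panicking.
 */
int toy_parse_watchdog_opt(const char *str, int panic_default)
{
	assert(str != NULL);

	/* The length must cover the whole literal, no more and no less. */
	if (!strncmp(str, "panic", 5))
		return 1;
	if (!strncmp(str, "nopanic", 7))
		return 0;
	return panic_default;
}
```

For example, toy_parse_watchdog_opt("nopanic", 1) turns off a panic that was enabled by the build-time default.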
[akpm@linux-foundation.org: fix strncmp length] Signed-off-by: Don Zickus <dzickus@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Phil Carmody [Tue, 22 Mar 2011 23:34:15 +0000 (16:34 -0700)]
calibrate: retry with wider bounds when converge seems to fail
Systems with unmaskable interrupts such as SMIs may massively
underestimate loops_per_jiffy, and fail to converge anywhere near the real
value. A case seen on x86_64 was an initial estimate of 256<<12, which
converged to 511<<12 where the real value should have been over 630<<12.
This admittedly requires bypassing the TSC calibration (lpj_fine), and a
failure to settle in the direct calibration too, but is physically
possible. This failure does not depend on my previous calibration
optimisation, but by luck is easy to fix with the optimisation in place
with a trivial retry loop.
In the context of the optimised converging method, as we can no longer
trust the starting estimate, enlarge the search bounds exponentially so
that the number of retries is logarithmically bounded.
[akpm@linux-foundation.org: mention x86_64 SMIs in comment] Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Tested-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Phil Carmody [Tue, 22 Mar 2011 23:34:13 +0000 (16:34 -0700)]
calibrate: home in on correct lpj value more quickly
Binary chop with a jiffy-resync on each step to find an upper bound is
slow, so just race in a tight-ish loop to find an underestimate.
If done with lots of individual steps, sometimes several hundreds of
iterations would be required, which would impose a significant overhead,
and make the initial estimate very low. By taking slowly increasing steps
there will be less overhead.
E.g. an x86_64 2.67GHz could have fitted in 613 individual small delays,
but in reality should have been able to fit in a single delay 644 times
longer, so underestimated by 31 steps. To reach the equivalent of 644
small delays with the accelerating scheme now requires about 130
iterations, so has <1/4th of the overhead, and can therefore be expected
to underestimate by only 7 steps.
As now we have a better initial estimate we can binary chop over a smaller
range. With the loop overhead in the initial estimate kept low, and the
step sizes moderate, we won't have under-estimated by much, so choose as
tight a range as we can.
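To make the iteration count above concrete, here is a small simulation of an accelerating scheme in which the step size ("band") goes up by one after a doubling number of iterations. The exact growth rule is an assumption of this sketch, and the real loop is bounded by the jiffy changing rather than by a target count.

```c
#include <assert.h>

/*
 * Count the iterations needed until the accumulated small-delay
 * equivalents ("trials") reach target. Band b is kept for 2^b
 * iterations, so steps grow slowly while the overhead stays low.
 */
unsigned long toy_accelerating_iterations(unsigned long target)
{
	unsigned long trials = 0, iterations = 0;
	unsigned int band = 1, trial_in_band = 0;

	/* Keeps band far below the width of unsigned int. */
	assert(target < (1UL << 30));

	while (trials < target) {
		if (++trial_in_band == (1u << band)) {
			++band;
			trial_in_band = 0;
		}
		trials += band;		/* one delay of band units */
		iterations++;
	}
	return iterations;
}
```

For a target of 644 this model needs 126 iterations, in line with the "about 130" quoted above, compared with 644 iterations when every step is a single small delay.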
Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Tested-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Phil Carmody [Tue, 22 Mar 2011 23:34:12 +0000 (16:34 -0700)]
calibrate: extract fall-back calculation into own helper
The motivation for this patch series is that currently our OMAP calibrates
itself using the trial-and-error binary chop fallback that some other
architectures no longer need to perform. This is a lengthy process,
taking 0.2s in an environment where boot time is of great interest.
Patch 2/4 has two optimisations. Firstly, it replaces the initial
repeated-doubling to find the relevant power of 2 with a tight loop that
just does as much as it can in a jiffy. Secondly, it doesn't binary chop
over an entire power of 2 range, it chooses a much smaller range based on
how much it squeezed in, and failed to squeeze in, during the first stage.
Both are significant optimisations, and bring our calibration down from
23 jiffies to 5, and, in the process, often arrive at a more accurate lpj
value.
The 'bands' and 'sub-logarithmic' growth may look over-engineered, but
they only cost a small level of inaccuracy in the initial guess (for all
architectures) in order to avoid the very large inaccuracies that appeared
during testing (on x86_64 architectures, and presumably others with less
metronomic operation). Note that due to the existence of the TSC and
other timers, the x86_64 will not typically use this fallback routine, but
I wanted to code defensively, able to cope with all kinds of processor
behaviours and kernel command line options.
Patch 3/4 is an additional trap for the nightmare scenario where the
initial estimate is very inaccurate, possibly due to things like SMIs.
It simply retries with a larger bound.
Stephen said:
: I tried this patch set out on an MSM7630.
:
: Before:
:
: Calibrating delay loop... 681.57 BogoMIPS (lpj=3407872)
:
: After:
:
: Calibrating delay loop... 680.75 BogoMIPS (lpj=3403776)
:
: But the really good news is calibration time dropped from ~247ms to ~56ms.
: Sadly we won't be able to benefit from this should my udelay patches make
: it into ARM because we would be using calibrate_delay_direct() instead (at
: least on machines who choose to). Can we somehow reapply the logic behind
: this to calibrate_delay_direct()? That would be even better, but this is
: definitely a boot time improvement.
:
: Or maybe we could just replace calibrate_delay_direct() with this fallback
: calculation? If __delay() is a thin wrapper around read_current_timer()
: it should work just as well (plus patch 3 makes it handle SMIs). I'll try
: that out.
This patch:
... so that it can be modified more clinically.
This is almost entirely cosmetic. The only change to the operation
is that the global variable is only set once after the estimation is
completed, rather than taking on all the intermediate values. However,
there are no readers of that variable, so this change is unimportant.
Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Tested-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Tue, 22 Mar 2011 23:34:09 +0000 (16:34 -0700)]
sys_unshare: remove the dead CLONE_THREAD/SIGHAND/VM code
Cleanup: kill the dead code which does nothing but complicate the code
and confuse the reader.
sys_unshare(CLONE_THREAD/SIGHAND/VM) is not really implemented, and I
doubt very much it will ever work. At least, nobody even tried since the
original 99d1419d96d7df9cfa56 ("unshare system call -v5: system call
handler function") was applied more than 4 years ago.
And the code is not consistent. unshare_thread() always fails
unconditionally, while unshare_sighand() and unshare_vm() pretend to work
if there is nothing to unshare.
Remove unshare_thread(), unshare_sighand(), unshare_vm() helpers and
related variables and add a simple CLONE_THREAD | CLONE_SIGHAND| CLONE_VM
check into check_unshare_flags().
Also, move the "CLONE_NEWNS needs CLONE_FS" check from
check_unshare_flags() to sys_unshare(). This looks more consistent and
matches the similar do_sysvsem check in sys_unshare().
Note: with or without this patch "atomic_read(mm->mm_users) > 1" can give
a false positive due to get_task_mm().
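The resulting flag handling can be sketched roughly like this. The TOY_ constants are local stand-ins so that the example compiles without kernel headers, and passing mm_users as a plain int parameter is a simplification of the sketch.

```c
#include <assert.h>
#include <errno.h>

/* Local stand-ins for the clone flags discussed above. */
#define TOY_CLONE_VM		0x00000100UL
#define TOY_CLONE_FS		0x00000200UL
#define TOY_CLONE_SIGHAND	0x00000800UL
#define TOY_CLONE_THREAD	0x00010000UL
#define TOY_CLONE_NEWNS		0x00020000UL

/*
 * Unsharing THREAD/SIGHAND/VM is not implemented: it is accepted only
 * when there is nothing to unshare, i.e. the mm has a single user.
 */
int toy_check_unshare_flags(unsigned long flags, int mm_users)
{
	assert(mm_users >= 1);

	if ((flags & (TOY_CLONE_THREAD | TOY_CLONE_SIGHAND | TOY_CLONE_VM)) &&
	    mm_users > 1)
		return -EINVAL;
	return 0;
}

/* "CLONE_NEWNS needs CLONE_FS": a new namespace implies unsharing fs. */
unsigned long toy_imply_fs(unsigned long flags)
{
	if (flags & TOY_CLONE_NEWNS)
		flags |= TOY_CLONE_FS;
	return flags;
}
```

One up-front test replaces three helpers that either always failed or only pretended to work.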
Change the printk() calls to carry KERN_INFO/KERN_ERR levels, and
fix other coding style errors. Not _all_ of them are gone, though.
[akpm@linux-foundation.org: revert the bits I disagree with] Signed-off-by: Michael Rodriguez <dkingston02@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Uwe Kleine-König [Tue, 22 Mar 2011 23:34:05 +0000 (16:34 -0700)]
include/linux/err.h: add a function to cast error-pointers to a return value
PTR_RET() can be used if you have an error-pointer and are only interested
in the eventual error value, but not the pointer. Yields the usual 0 for
no error, -ESOMETHING otherwise.
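A userspace emulation of the error-pointer convention, to show what such a helper computes. The toy_ names are hypothetical, and the assumption that error codes occupy the top 4095 values of the address space belongs to this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
void *toy_err_ptr(long error)
{
	assert(error < 0 && error >= -TOY_MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Does the pointer fall in the range reserved for error codes? */
int toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Recover the errno value stored in an error pointer. */
long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* 0 for a valid pointer, the negative errno for an error pointer. */
int toy_ptr_ret(const void *ptr)
{
	return toy_is_err(ptr) ? (int)toy_ptr_err(ptr) : 0;
}
```

A function that only needs to report success or failure can then end with a single return of toy_ptr_ret(p) instead of an if/else pair.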
Axel Lin [Tue, 22 Mar 2011 23:34:01 +0000 (16:34 -0700)]
drivers/misc/atmel_tclib.c: fix a memory leak
request_mem_region() will call kzalloc to allocate memory for struct
resource. release_resource() unregisters the resource but does not free
the allocated memory, thus use release_mem_region() instead to fix the
memory leak.
Signed-off-by: Axel Lin <axel.lin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Axel Lin [Tue, 22 Mar 2011 23:34:00 +0000 (16:34 -0700)]
drivers/misc/hmc6352.c: fix wrong return value checking for i2c_master_recv()
i2c_master_recv() returns negative errno, or else the number of bytes
read. Thus i2c_master_recv(client, i2c_data, 2) returns 2 instead of 1 in
the success case.
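The corrected check amounts to comparing the return value against the number of bytes requested. A minimal sketch follows; toy_check_recv() is a hypothetical helper and mapping a short read to -EIO is an assumption of this example.

```c
#include <assert.h>
#include <errno.h>

/*
 * ret is what a receive call returned: a negative errno, or the
 * number of bytes read. wanted is the number of bytes requested.
 */
int toy_check_recv(int ret, int wanted)
{
	assert(wanted > 0);

	if (ret < 0)
		return ret;		/* pass the error through */
	if (ret != wanted)
		return -EIO;		/* short read */
	return 0;
}
```

Comparing a successful 2-byte read against 1 would treat every success as a failure, which is the bug described above.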
[akpm@linux-foundation.org: make `ret' signed] Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: Kalhan Trisal <kalhan.trisal@intel.com> Cc: Alan Cox <alan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hong Liu [Tue, 22 Mar 2011 23:33:59 +0000 (16:33 -0700)]
drivers/misc/apds9802als.c: put the device into runtime suspend after resume()/probe() is handled
Put the device into runtime suspend after resume()/probe() is handled by
the PM core and the device core code. No need to manually add them in
each single driver. And correct the runtime state in remove().
Signed-off-by: Hong Liu <hong.liu@intel.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pratyush Anand [Tue, 22 Mar 2011 23:33:58 +0000 (16:33 -0700)]
ST SPEAr: PCIE gadget support
This is a configurable gadget. It can be configured through the configfs
interface. Any IP available on the PCIE bus can be programmed to be used by
the host controller. It supports both INTX and MSI.
By default, the gadget is configured for INTX and SYSRAM1 is mapped to
BAR0 with size 0x1000
Richard Kennedy [Tue, 22 Mar 2011 23:33:56 +0000 (16:33 -0700)]
fs.h: remove 8 bytes of padding from block_device on 64bit builds
Re-ordering struct block_device to remove 8 bytes of padding on 64 bit
builds, which also shrinks bdev_inode by 8 bytes (776 -> 768) allowing it
to fit into one fewer cache lines.
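The effect of member ordering on padding can be seen with two toy structs. These are not the block_device members, only an illustration of the technique, and the exact sizes depend on the ABI.

```c
#include <assert.h>
#include <stddef.h>

/* A pointer between two ints forces padding on typical 64-bit ABIs. */
struct toy_padded {
	int a;
	void *p;
	int b;
};

/* Same members with the ints adjacent, so they can share a slot. */
struct toy_reordered {
	void *p;
	int a;
	int b;
};

/* Bytes saved by the reordering on the ABI this is compiled for. */
size_t toy_bytes_saved(void)
{
	assert(sizeof(struct toy_reordered) <= sizeof(struct toy_padded));
	return sizeof(struct toy_padded) - sizeof(struct toy_reordered);
}
```

On a typical LP64 target the first layout takes 24 bytes and the second 16, while a 32-bit build usually shows no difference.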
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
printk()s without a priority level default to KERN_WARNING. To reduce
noise at KERN_WARNING, this patch sets the priority level appropriately
for unleveled printks()s. This should be useful to folks that look at
dmesg warnings closely.
Commit 6caa76b ("tty: now phase out the ioctl file pointer for good")
removed the ioctl file pointer. User Mode Linux's line driver uses this
ioctl and needs a signature update too.
Signed-off-by: Richard Weinberger <richard@nod.at> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Greg KH <greg@kroah.com> Cc: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Examining the core shows that NT_PRSTATUS notes for all threads other than
the one that crashed are zeroed out.
I believe this is happening because neither ELF_CORE_COPY_TASK_REGS nor
task_pt_regs are defined under ARCH=um, and so elf_core_copy_task_regs()
becomes a no-op.
Attached patch fixes this for SUBARCH={x86_64,i386}.
Signed-off-by: Paul Pluzhnikov <ppluzhnikov@google.com> Cc: Jeff Dike <jdike@addtoit.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 22 Mar 2011 23:33:43 +0000 (16:33 -0700)]
shmem: let shared anonymous be nonlinear again
Up to 2.6.22, you could use remap_file_pages(2) on a tmpfs file or a
shared mapping of /dev/zero or a shared anonymous mapping. In 2.6.23 we
disabled it by default, but set VM_CAN_NONLINEAR to enable it on safe
mappings. We made sure to set it in shmem_mmap() for tmpfs files, but
missed it in shmem_zero_setup() for the others. Fix that at last.
mm/memblock: properly handle overlaps and fix error path
Currently memblock_reserve() or memblock_free() don't handle overlaps of
any kind. There is some special casing for coalescing exactly adjacent
regions but that's about it.
This is annoying because typically memblock_reserve() is used to mark
regions passed by the firmware as reserved and we all know how much we can
trust our firmwares...
Also, with the current code, if we do something it doesn't handle right
such as trying to memblock_reserve() a large range spanning multiple
existing smaller reserved regions for example, or doing overlapping
reservations, it can silently corrupt the internal region array, causing
odd errors much later on, such as allocations returning reserved regions
etc...
This patch rewrites the underlying functions that add or remove a region
to the arrays. The new code is a lot more robust as it fully handles
overlapping regions. It's also, imho, simpler than the previous
implementation.
In addition, while doing so, I found a bug where if we fail to double the
array while adding a region, we would remove the last region of the array
rather than the region we just allocated. This fixes it too.
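A toy version of overlap-aware region insertion, to illustrate the kind of handling described above. It is not the memblock code: the fixed-size array, the names, and merging of exactly adjacent regions are assumptions of this sketch, and base + size is assumed not to overflow.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define TOY_MAX_REGIONS 16

struct toy_region {
	unsigned long base, size;
};

struct toy_region_set {
	struct toy_region r[TOY_MAX_REGIONS];	/* sorted, non-overlapping */
	int cnt;
};

/* Add [base, base + size), merging with anything it touches. */
int toy_region_add(struct toy_region_set *s, unsigned long base,
		   unsigned long size)
{
	unsigned long end = base + size;
	int first, last;

	assert(s != NULL && s->cnt >= 0 && s->cnt <= TOY_MAX_REGIONS);
	if (size == 0)
		return 0;

	/* First region that ends at or after the new base. */
	for (first = 0; first < s->cnt; first++)
		if (s->r[first].base + s->r[first].size >= base)
			break;

	/* Absorb every region that starts at or before the new end. */
	for (last = first; last < s->cnt && s->r[last].base <= end; last++) {
		if (s->r[last].base < base)
			base = s->r[last].base;
		if (s->r[last].base + s->r[last].size > end)
			end = s->r[last].base + s->r[last].size;
	}

	if (last == first) {
		/* Nothing touched: open a slot at position first. */
		if (s->cnt == TOY_MAX_REGIONS)
			return -ENOMEM;
		memmove(&s->r[first + 1], &s->r[first],
			(size_t)(s->cnt - first) * sizeof(s->r[0]));
		s->cnt++;
	} else {
		/* Regions [first, last) collapse into the slot at first. */
		memmove(&s->r[first + 1], &s->r[last],
			(size_t)(s->cnt - last) * sizeof(s->r[0]));
		s->cnt -= last - first - 1;
	}

	s->r[first].base = base;
	s->r[first].size = end - base;
	return 0;
}
```

Reserving a large range that spans several existing regions leaves one merged entry and an array that is still sorted, instead of a corrupted one.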
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Namhyung Kim [Tue, 22 Mar 2011 23:33:41 +0000 (16:33 -0700)]
vmalloc: remove confusing comment on vwrite()
KM_USER1 is never used for vwrite() path so the caller doesn't need to
guarantee it is not used. The only thing the caller must guarantee concerns
KM_USER0, and that is already commented.
Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jun'ichi Nomura [Tue, 22 Mar 2011 23:33:40 +0000 (16:33 -0700)]
writeback: make mapping->writeback_index to point to the last written page
For range-cyclic writeback (e.g. kupdate), the writeback code sets a
continuation point of the next writeback to mapping->writeback_index which
is set to the page after the last written page. This happens so that we
evenly write the whole file even if pages in it get continuously
redirtied.
However, in some cases, sequential writer is writing in the middle of the
page and it just redirties the last written page by continuing from that.
For example with an application which uses a file as a big ring buffer we
see:
[1st writeback session]
...
flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898514 + 8
flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898522 + 8
flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898530 + 8
flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898538 + 8
flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898546 + 8
kworker/0:1-11 4571: block_rq_issue: 8,0 W 0 () 94898514 + 40
>> flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898554 + 8
>> flush-8:0-2743 4571: block_rq_issue: 8,0 W 0 () 94898554 + 8
[2nd writeback session after 35sec]
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898562 + 8
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898570 + 8
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898578 + 8
...
kworker/0:1-11 4606: block_rq_issue: 8,0 W 0 () 94898562 + 640
kworker/0:1-11 4606: block_rq_issue: 8,0 W 0 () 94899202 + 72
...
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899962 + 8
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899970 + 8
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899978 + 8
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899986 + 8
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899994 + 8
kworker/0:1-11 4606: block_rq_issue: 8,0 W 0 () 94899962 + 40
>> flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898554 + 8
>> flush-8:0-2743 4606: block_rq_issue: 8,0 W 0 () 94898554 + 8
So we seeked back to 94898554 after we wrote all the pages at the end of
the file.
This extra seek seems unnecessary. If we continue writeback from the last
written page, we can avoid it and do not cause harm to other cases. The
original intent of even writeout over the whole file is preserved and if
the page does not get redirtied pagevec_lookup_tag() just skips it.
As an exceptional case, when I/O error happens, set done_index to the next
page as the comment in the code suggests.
Tested-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_swapon: separate final enabling of the swapfile
The block in sys_swapon which does the final adjustments to the
swap_info_struct and to swap_list is the same as the block which
re-inserts it again at sys_swapoff on failure of try_to_unuse(). Move
this code to a separate function, and use it both in sys_swapon and
sys_swapoff.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The block in sys_swapon which does the final adjustments to the
swap_info_struct and to swap_list is the same as the block which
re-inserts it again at sys_swapoff on failure of try_to_unuse(), except
for the order of the operations within the lock. Since the order should
not matter, arbitrarily change sys_swapoff to match sys_swapon, in
preparation to making both share the same code.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The block in sys_swapon which does the final adjustments to the
swap_info_struct and to swap_list is the same as the block which
re-inserts it again at sys_swapoff on failure of try_to_unuse(). To be
able to make both share the same code, move the printk() call in the
middle of it to just after it.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It still exists within setup_swap_map_and_extents(), but after it
nr_good_pages == p->pages.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_swapon: simplify error flow in setup_swap_map_and_extents()
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_swapon: separate parsing of bad blocks and extents
Move the code which parses the bad block list and the extents to a
separate function. Only code movement, no functional changes.
This change uses the fact that, after the success path, nr_good_pages ==
p->pages.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The call to swap_cgroup_swapon is in the middle of loading the swap map
and extents. As it only does memory allocation and does not depend on
the swapfile layout (map/extents), it can be called earlier (or later).
Move it to just after the allocation of swap_map, since it is
conceptually similar (allocates a map).
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the code which parses and checks the swapfile header (except for
the bad block list) to a separate function. Only code movement, no
functional changes.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_swapon: move setting of swapfilepages near use
There is no reason I can see to read inode->i_size long before it is
needed. Move its read to just before it is needed, to reduce the
variable lifetime.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_swapon currently has two error labels, bad_swap and bad_swap_2.
bad_swap does the same as bad_swap_2 plus destroy_swap_extents() and
swap_cgroup_swapoff(); both are noops in the places where bad_swap_2 is
jumped to. With a single extra test for inode (matching the one in the
S_ISREG case below), all the error paths in the function can go to
bad_swap.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The only way error is 0 in the cleanup blocks is when the function is
returning successfully. In this case, the cleanup blocks were setting
S_SWAPFILE in the S_ISREG case. But this is not a cleanup.
Move the setting of S_SWAPFILE to just before the "goto out;" to make
this more clear. At this point, we do not need to test for inode because
it will never be NULL.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The bdev variable is always equivalent to (S_ISBLK(inode->i_mode) ?
p->bdev : NULL), as long as it being set is moved to a bit earlier. Use
this fact to remove the bdev variable.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now there is nothing which jumps to the cleanup blocks before the name
variable is set. There is no need to set it initially to NULL anymore.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_swapon: simplify error flow in alloc_swap_info()
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_swapon: simplify error return from swap_info allocation
At this point in sys_swapon, there is nothing to free. Return directly
instead of jumping to the cleanup block at the end of the function.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the swap_info allocation to its own function. Only code movement,
no functional changes.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_swapon: do not depend on "type" after allocation
Within sys_swapon, after the swap_info entry has been allocated, we
always have type == p->type and swap_info[type] == p. Use this fact to
reduce the dependency on the "type" local variable within the function,
as a preparation to move the allocation of the swap_info entry to a
separate function.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_swapon: remove changelog from function comment
Changelogs belong in the git history instead of in the source code.
Also, "The swapon system call" is redundant with
"SYSCALL_DEFINE2(swapon, ...)".
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Gaah. That's a _historical_ comment. But the patch-series depends on removal ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_swapon: use vzalloc() instead of vmalloc/memset
This patch series refactors the sys_swapon function.
sys_swapon is currently a very large function, with 313 lines (more than
12 25-line screens), which can make it a bit hard to read. This patch
series reduces this size by half, by extracting large chunks of related
code to new helper functions.
One of these chunks of code was nearly identical to the part of
sys_swapoff which is used in case of a failure return from
try_to_unuse(), so this patch series also makes both share the same
code.
As a side effect of all this refactoring, the compiled code gets a bit
smaller (from v1 of this patch series):
text data bss dec hex filename
14012 944 276 15232 3b80 mm/swapfile.o.before
13941 944 276 15161 3b39 mm/swapfile.o.after
This patch:
Use vzalloc() instead of vmalloc/memset.
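A userspace analogy of this cleanup, not the kernel code itself: vzalloc() exists only inside the kernel, but the change has the same shape as replacing malloc() followed by memset() with a single calloc(). The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Old shape: allocate, then zero in a second step (like vmalloc + memset). */
unsigned char *alloc_map_two_step(size_t nr_entries)
{
    unsigned char *map;

    assert(nr_entries > 0);    /* malloc(0) is implementation-defined */
    map = malloc(nr_entries);
    if (!map)
        return NULL;
    memset(map, 0, nr_entries);
    return map;
}

/* New shape: one call that already returns zeroed memory (like vzalloc). */
unsigned char *alloc_map_zeroed(size_t nr_entries)
{
    assert(nr_entries > 0);
    return calloc(nr_entries, 1);
}
```

Both helpers hand back an all-zero buffer; the second has one fewer step to get wrong.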
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andi Kleen [Tue, 22 Mar 2011 23:33:13 +0000 (16:33 -0700)]
mm: use __GFP_OTHER_NODE for transparent huge pages
Pass __GFP_OTHER_NODE for transparent hugepages NUMA allocations done by the
hugepages daemon. This way the low level accounting for local versus
remote pages works correctly.
Contains improvements from Andrea Arcangeli
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andi Kleen [Tue, 22 Mar 2011 23:33:12 +0000 (16:33 -0700)]
mm: add __GFP_OTHER_NODE flag
Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
zone_statistics() that an allocation is on behalf of another thread. This
way the local and remote counters can be still correct, even when
background daemons like khugepaged are changing memory mappings.
This only affects the accounting, but I think it's worth doing that right
to avoid confusing users.
I first tried to just pass down the right node, but this required a lot of
changes to pass down this parameter and at least one addition of a 10th
argument to a 9 argument function. Using the flag is a lot less
intrusive.
Open: should be also used for migration?
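A simplified userspace model of the accounting idea, not the kernel's zone_statistics(): the flag value, structure and function names are assumptions made for illustration. When the flag is set, locality is judged against the node the allocation was meant for rather than the node the allocating thread happens to run on.

```c
#include <assert.h>

#define SKETCH_GFP_OTHER_NODE 0x1u   /* hypothetical stand-in for __GFP_OTHER_NODE */
#define SKETCH_MAX_NODES 4

struct numa_counters {
    unsigned long local[SKETCH_MAX_NODES];
    unsigned long other[SKETCH_MAX_NODES];
};

/*
 * alloc_node:     node the page actually came from
 * preferred_node: node the allocation was meant for
 * running_node:   node of the CPU doing the allocation (e.g. a daemon)
 */
void count_allocation(struct numa_counters *c, int alloc_node,
                      int preferred_node, int running_node,
                      unsigned int flags)
{
    int reference;

    assert(alloc_node >= 0 && alloc_node < SKETCH_MAX_NODES);

    /* On behalf of another thread: compare against the preferred node. */
    reference = (flags & SKETCH_GFP_OTHER_NODE) ? preferred_node
                                                : running_node;
    if (alloc_node == reference)
        c->local[alloc_node]++;
    else
        c->other[alloc_node]++;
}
```

With a daemon running on node 0 that allocates on node 1 for a task on node 1, the page counts as "other" without the flag and as "local" with it.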
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Tue, 22 Mar 2011 23:33:11 +0000 (16:33 -0700)]
mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce no writeback
__GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
to succeed, as they have a graceful fallback. Waiting for I/O in those
tends to be overkill in terms of latencies, so we can reduce their latency
by disabling sync migrate.
Unfortunately, even with async migration it's still possible for the
process to be blocked waiting for a request slot (e.g. get_request_wait
in the block layer) when ->writepage is called. To prevent
__GFP_NO_KSWAPD blocking, this patch prevents ->writepage being called on
dirty page cache for asynchronous migration.
Andrea Arcangeli [Tue, 22 Mar 2011 23:33:10 +0000 (16:33 -0700)]
mm: compaction: minimise the time IRQs are disabled while isolating pages for migration
compaction_alloc() isolates pages for migration in isolate_migratepages.
While it's scanning, IRQs are disabled on the mistaken assumption the
scanning should be short. Tests show this to be true for the most part
but contention times on the LRU lock can be increased. Before this patch,
the IRQ disabled times for a simple test looked like
Total sampled time IRQs off (not real total time): 5493
Event shrink_inactive_list..shrink_zone 1596 us count 1
Event shrink_inactive_list..shrink_zone 1530 us count 1
Event shrink_inactive_list..shrink_zone 956 us count 1
Event shrink_inactive_list..shrink_zone 541 us count 1
Event shrink_inactive_list..shrink_zone 531 us count 1
Event split_huge_page..add_to_swap 232 us count 1
Event save_args..call_softirq 36 us count 1
Event save_args..call_softirq 35 us count 2
Event __wake_up..__wake_up 1 us count 1
This patch reduces the worst-case IRQs-disabled latencies by releasing the
lock every SWAP_CLUSTER_MAX pages that are scanned and releasing the CPU if
necessary. The cost of this is that the process performing compaction will
be slower, but leaving IRQs disabled for too long has worse consequences,
as the following report shows;
Total sampled time IRQs off (not real total time): 4367
Event shrink_inactive_list..shrink_zone 881 us count 1
Event shrink_inactive_list..shrink_zone 875 us count 1
Event shrink_inactive_list..shrink_zone 868 us count 1
Event shrink_inactive_list..shrink_zone 555 us count 1
Event split_huge_page..add_to_swap 495 us count 1
Event compact_zone..compact_zone_order 269 us count 1
Event split_huge_page..add_to_swap 266 us count 1
Event shrink_inactive_list..shrink_zone 85 us count 1
Event save_args..call_softirq 36 us count 2
Event __wake_up..__wake_up 1 us count 1
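A toy model of the batching described above, under stated assumptions: no real lock or IRQ state is involved, the names are hypothetical, and the batch size of 32 merely stands in for SWAP_CLUSTER_MAX. The counters show that the longest stretch scanned under one lock hold is bounded by the batch size, at the price of more lock acquisitions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CLUSTER_MAX 32   /* assumed batch size, stands in for SWAP_CLUSTER_MAX */

struct scan_stats {
    size_t lock_acquisitions;   /* times the simulated lock was taken */
    size_t longest_hold;        /* most pages scanned under one hold */
};

struct scan_stats scan_in_batches(size_t nr_pages, size_t batch)
{
    struct scan_stats st = { 0, 0 };
    size_t held = 0;
    int locked = 0;

    assert(batch > 0);

    for (size_t i = 0; i < nr_pages; i++) {
        /* Drop the lock (IRQs back on) after every full batch. */
        if (locked && held == batch) {
            locked = 0;
            held = 0;
        }
        if (!locked) {
            locked = 1;
            st.lock_acquisitions++;
        }
        held++;                 /* "scan" one page under the lock */
        if (held > st.longest_hold)
            st.longest_hold = held;
    }
    return st;
}
```

Scanning 100 pages with a batch of SKETCH_CLUSTER_MAX takes the lock four times and never holds it for more than 32 pages; a single unbatched pass would hold it for all 100.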
[akpm@linux-foundation.org: simplify with s/unlocked/locked/] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Arthur Marsh <arthur.marsh@internode.on.net> Cc: Clemens Ladisch <cladisch@googlemail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Tue, 22 Mar 2011 23:33:08 +0000 (16:33 -0700)]
mm: compaction: minimise the time IRQs are disabled while isolating free pages
compaction_alloc() isolates free pages to be used as migration targets.
While it's scanning, IRQs are disabled on the mistaken assumption the
scanning should be short. Analysis showed that IRQs were in fact being
disabled for substantial time. A simple test was run using large
anonymous mappings with transparent hugepage support enabled to trigger
frequent compactions. A monitor sampled what the worst IRQ-off latencies
were and a post-processing tool found the following;
Total sampled time IRQs off (not real total time): 22355
Event compaction_alloc..compaction_alloc 8409 us count 1
Event compaction_alloc..compaction_alloc 7341 us count 1
Event compaction_alloc..compaction_alloc 2463 us count 1
Event compaction_alloc..compaction_alloc 2054 us count 1
Event shrink_inactive_list..shrink_zone 1864 us count 1
Event shrink_inactive_list..shrink_zone 88 us count 1
Event save_args..call_softirq 36 us count 1
Event save_args..call_softirq 35 us count 2
Event __make_request..__blk_run_queue 24 us count 1
Event __alloc_pages_nodemask..__alloc_pages_nodemask 6 us count 1
i.e. compaction is disabling IRQs for a prolonged period of time - 8ms in
one instance. The full report generated by the tool can be found at
This patch reduces the time IRQs are disabled by simply disabling IRQs at
the last possible minute. An updated IRQs-off summary report then looks
like;
Total sampled time IRQs off (not real total time): 5493
Event shrink_inactive_list..shrink_zone 1596 us count 1
Event shrink_inactive_list..shrink_zone 1530 us count 1
Event shrink_inactive_list..shrink_zone 956 us count 1
Event shrink_inactive_list..shrink_zone 541 us count 1
Event shrink_inactive_list..shrink_zone 531 us count 1
Event split_huge_page..add_to_swap 232 us count 1
Event save_args..call_softirq 36 us count 1
Event save_args..call_softirq 35 us count 2
Event __wake_up..__wake_up 1 us count 1
Hugh Dickins [Tue, 22 Mar 2011 23:33:07 +0000 (16:33 -0700)]
mm: don't return 0 too early from find_get_pages()
Callers of find_get_pages(), or its wrapper pagevec_lookup() - notably
truncate_inode_pages_range() - stop looking further when it returns 0.
But if an interrupt comes just after its radix_tree_gang_lookup_slot(),
especially if we have preemptible RCU enabled, isn't it conceivable that
all 14 pages returned could be removed from the page cache by
shrink_page_list(), before find_get_pages() gets to process them? So
causing it to return 0 although there may be plenty more pages beyond.
Make find_get_pages() and find_get_pages_tag() check for this unlikely
case, and restart should it occur; but callers of find_get_pages_contig()
have no such expectation, it's okay for that to return 0 early.
I have not seen this in practice, just worried by the possibility.
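A userspace sketch of the restart rule, assuming a flat array in place of the radix tree; gang_lookup(), get_pages() and the slot states are hypothetical stand-ins. A slot marked VANISHING is visible to the lookup but gone by the time a reference would be taken, which models the race with reclaim.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 14    /* lookup batch size quoted in the commit message */

enum { EMPTY = 0, CACHED = 1, VANISHING = 2 };

/* Collect up to max indices of non-empty slots, starting at start. */
static size_t gang_lookup(const int *cache, size_t len, size_t start,
                          size_t *found, size_t max)
{
    size_t n = 0;

    for (size_t i = start; i < len && n < max; i++)
        if (cache[i] != EMPTY)
            found[n++] = i;
    return n;
}

/* pages[] must have room for at least BATCH entries. */
size_t get_pages(int *cache, size_t len, size_t start,
                 size_t *pages, size_t max)
{
    size_t found[BATCH];
    size_t nr_found, ret;

    assert(start <= len);
    if (max > BATCH)
        max = BATCH;
restart:
    nr_found = gang_lookup(cache, len, start, found, max);
    ret = 0;
    for (size_t i = 0; i < nr_found; i++) {
        if (cache[found[i]] == VANISHING) {
            cache[found[i]] = EMPTY;    /* removal is now visible */
            continue;
        }
        pages[ret++] = found[i];
    }
    /* Candidates existed but all vanished: look again, do not return 0. */
    if (ret == 0 && nr_found > 0)
        goto restart;
    return ret;
}
```

Each restart clears at least one vanished slot, so the loop terminates; a return value of 0 now means the lookup itself found nothing.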
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Nick Piggin <npiggin@kernel.dk> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Salman Qazi <sqazi@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 22 Mar 2011 23:33:06 +0000 (16:33 -0700)]
mm: remove worrying dead code from find_get_pages()
The radix_tree_deref_retry() case in find_get_pages() has a strange little
excrescence, not seen in the other gang lookups: it looks like the start
of an abandoned attempt to guarantee forward progress in a case that
cannot arise.
ret should always be 0 here: if it isn't, then going back to restart will
leak references to pages already gotten. There used to be a comment
saying nr_found is necessarily 1 here: that's not quite true, but the
radix_tree_deref_retry() case is peculiar to the entry at index 0, when we
race with it being moved out of the radix_tree root or back.
Remove the worrisome two lines, add a brief comment here and in
find_get_pages_contig() and find_get_pages_tag(), and a WARN_ON in
find_get_pages() should it ever be seen elsewhere than at 0.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Nick Piggin <npiggin@kernel.dk> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Salman Qazi <sqazi@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Holasek [Tue, 22 Mar 2011 23:33:05 +0000 (16:33 -0700)]
hugetlbfs: correct handling of negative input to /proc/sys/vm/nr_hugepages
When the user inserts a negative value into /proc/sys/vm/nr_hugepages it
will cause the kernel to allocate as many hugepages as possible and to
then update /proc/meminfo to reflect this.
This changes the behavior so that the negative input will result in
nr_hugepages value being unchanged.
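A userspace illustration of the intended behaviour, not the kernel's sysctl parser; parse_count() is a hypothetical helper. It also shows why negative input is dangerous for an unsigned setting: strtoul() accepts a leading minus sign and negates the result in unsigned arithmetic, so "-1" turns into a huge count unless it is rejected first.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse an unsigned count from buf. On any error, including a negative
 * number, return -1 and leave *count untouched.
 */
int parse_count(const char *buf, unsigned long *count)
{
    char *end;
    unsigned long val;

    assert(buf != NULL && count != NULL);

    while (isspace((unsigned char)*buf))
        buf++;
    if (*buf == '-')
        return -1;          /* strtoul() would silently wrap this */

    errno = 0;
    val = strtoul(buf, &end, 10);
    if (errno == ERANGE || end == buf)
        return -1;
    while (isspace((unsigned char)*end))
        end++;              /* tolerate a trailing newline */
    if (*end != '\0')
        return -1;

    *count = val;
    return 0;
}
```

A caller that keeps its current value in the variable passed as count sees it unchanged after a rejected write.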
Signed-off-by: Petr Holasek <pholasek@redhat.com> Signed-off-by: Anton Arapov <anton@redhat.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Eric B Munson <emunson@mgebm.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Tue, 22 Mar 2011 23:33:04 +0000 (16:33 -0700)]
mm: vmscan: kswapd should not free an excessive number of pages when balancing small zones
When reclaiming for order-0 pages, kswapd requires that all zones be
balanced. Each cycle through balance_pgdat() does background ageing on
all zones if necessary and applies equal pressure on the inactive zone
unless a lot of pages are free already.
A "lot of free pages" is defined as a "balance gap" above the high
watermark which is currently 7*high_watermark. Historically this was
reasonable as min_free_kbytes was small. However, on systems using huge
pages, it is recommended that min_free_kbytes is higher and it is tuned
with hugeadm --set-recommended-min_free_kbytes. With the introduction of
transparent huge page support, this recommended value is also applied. On
X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would
expect around 68M of memory to be free. The Normal zone is approximately
35000 pages so under even normal memory pressure such as copying a large
file, it gets exhausted quickly. As it is getting exhausted, kswapd
applies pressure equally to all zones, including the DMA32 zone. DMA32 is
approximately 700,000 pages with a high watermark of around 23,000 pages.
In this situation, kswapd will reclaim around (23000*8 where 8 is the high
watermark + balance gap of 7 * high watermark) pages or 718M of pages
before the zone is ignored. What the user sees is that free memory is far
higher than it should be.
To avoid an excessive number of pages being reclaimed from the larger
zones, explicitly define the "balance gap" to be either 1% of the zone
or the low watermark for the zone, whichever is smaller. While kswapd
will check all zones to apply pressure, it'll ignore zones that meet the
(high_wmark + balance_gap) watermark.
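The arithmetic can be sketched as follows. This is an illustration, not the patch itself: the function and macro names are hypothetical, the 1% figure is rounded up, and a 4 KiB page size is assumed.

```c
#include <assert.h>

#define SKETCH_GAP_RATIO 100UL  /* balance gap is at most 1% of the zone */
#define SKETCH_PAGE_KIB  4UL    /* assumed page size in KiB */

/* Old rule: keep reclaiming until free pages reach high + 7 * high. */
unsigned long old_reclaim_target(unsigned long high_wmark)
{
    return high_wmark * 8;
}

/* New rule: the gap is min(low watermark, 1% of the zone). */
unsigned long balance_gap(unsigned long zone_pages, unsigned long low_wmark)
{
    unsigned long one_percent;

    assert(zone_pages > 0);
    one_percent = (zone_pages + SKETCH_GAP_RATIO - 1) / SKETCH_GAP_RATIO;
    return low_wmark < one_percent ? low_wmark : one_percent;
}

unsigned long new_reclaim_target(unsigned long zone_pages,
                                 unsigned long high_wmark,
                                 unsigned long low_wmark)
{
    return high_wmark + balance_gap(zone_pages, low_wmark);
}

unsigned long pages_to_mib(unsigned long pages)
{
    return pages * SKETCH_PAGE_KIB / 1024;
}
```

With the DMA32 figures quoted above (700,000 pages, high watermark 23,000), the old target is 184,000 pages, about 718M. Under the new rule, any low watermark of at least 7,000 pages gives a gap of 7,000 and a target of 30,000 pages, about 117M.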
To test this, 80G were copied from a partition and the amount of memory
being used was recorded. A comparison of a patched and an unpatched kernel can
be seen at
http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
and shows that kswapd is not reclaiming as much memory with the patch
applied.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Namhyung Kim [Tue, 22 Mar 2011 23:33:02 +0000 (16:33 -0700)]
mempolicy: remove redundant check in __mpol_equal()
The 'flags' field is already checked, no need to do it again.
Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Tue, 22 Mar 2011 23:33:01 +0000 (16:33 -0700)]
smaps: have smaps show transparent huge pages
Now that the mere act of _looking_ at /proc/$pid/smaps will not destroy
transparent huge pages, tell how much of the VMA is actually mapped with
them.
This way, we can make sure that we're getting THPs where we
expect to see them.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Tue, 22 Mar 2011 23:33:00 +0000 (16:33 -0700)]
smaps: teach smaps_pte_range() about THP pmds
This adds code to explicitly detect and handle pmd_trans_huge() pmds. It
then passes HPAGE_SIZE units in to the smap_pte_entry() function instead
of PAGE_SIZE.
This means that using /proc/$pid/smaps now will no longer cause THPs to be
broken down in to small pages.
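A userspace sketch of the accounting shape this enables, with hypothetical types and names rather than the kernel's page table structures; it assumes 4 KiB pages and 2 MiB huge pages as on x86-64. A huge pmd is accounted once at huge-page size instead of being split and walked pte by pte.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE  4096UL
#define SKETCH_PTES       512
#define SKETCH_HPAGE_SIZE (SKETCH_PTES * SKETCH_PAGE_SIZE)  /* 2 MiB */

struct mem_stats {
    unsigned long resident;     /* bytes accounted as present */
};

struct pmd_sketch {
    int is_trans_huge;                          /* maps one huge page */
    unsigned char pte_present[SKETCH_PTES];     /* used when not huge */
};

/* Account one entry of the given size, small or huge. */
static void account_entry(struct mem_stats *mss, int present,
                          unsigned long size)
{
    if (present)
        mss->resident += size;
}

void account_pmd(struct mem_stats *mss, const struct pmd_sketch *pmd)
{
    assert(mss != NULL && pmd != NULL);

    if (pmd->is_trans_huge) {
        /* One call in huge-page units, no split needed. */
        account_entry(mss, 1, SKETCH_HPAGE_SIZE);
        return;
    }
    for (size_t i = 0; i < SKETCH_PTES; i++)
        account_entry(mss, pmd->pte_present[i], SKETCH_PAGE_SIZE);
}
```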
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Tue, 22 Mar 2011 23:32:59 +0000 (16:32 -0700)]
smaps: pass pte size argument in to smaps_pte_entry()
Add an argument to the new smaps_pte_entry() function to let it account in
things other than PAGE_SIZE units. I changed all of the PAGE_SIZE sites,
even though not all of them can be reached for transparent huge pages,
just so this will continue to work without changes as THPs are improved.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Tue, 22 Mar 2011 23:32:58 +0000 (16:32 -0700)]
smaps: break out smaps_pte_entry() from smaps_pte_range()
We will use smaps_pte_entry() in a moment to handle both small and
transparent large pages. But, we must break it out of smaps_pte_range()
first.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Tue, 22 Mar 2011 23:32:56 +0000 (16:32 -0700)]
pagewalk: only split huge pages when necessary
Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will
unconditionally split any transparent huge pages it runs in to. In
practice, that means that anyone doing a
cat /proc/$pid/smaps
will unconditionally break down every huge page in the process and depend
on khugepaged to re-collapse it later. This is fairly suboptimal.
This patch changes that behavior. It teaches each ->pmd_entry handler
(there are five) that they must break down the THPs themselves. Also, the
_generic_ code will never break down a THP unless a ->pte_entry handler is
actually set.
This means that the ->pmd_entry handlers can now choose to deal with THPs
without breaking them down.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Tue, 22 Mar 2011 23:32:54 +0000 (16:32 -0700)]
mm: reclaim invalidated page ASAP
invalidate_mapping_pages is a very big hint to the reclaimer. It means the
user doesn't want to use the page any more. So, in order to prevent working
set page eviction, this patch moves the page to the tail of the inactive
list by setting PG_reclaim.
Please remember that pages in the inactive list are part of the working set,
just like those in the active list. If we don't move the pages to the
inactive list's tail, pages near the tail of the inactive list can be evicted
even though we have a big clue about which pages are useless. That is bad.
Now PG_readahead/PG_reclaim is shared. fe3cba17 added ClearPageReclaim
into clear_page_dirty_for_io to prevent fast reclaim of the readahead
marker page.
In this series, PG_reclaim is used by invalidated pages, too. If the VM
finds that the page is invalidated and dirty, it sets PG_reclaim to reclaim
it ASAP. Then, when the dirty page is written back,
clear_page_dirty_for_io will clear PG_reclaim unconditionally. That
defeats this series' goal.
I think it's okay to clear PG_readahead when the page is dirtied, not at
writeback time. So this patch moves ClearPageReadahead. In v4,
ClearPageReadahead in set_page_dirty had a problem, which was reported by
Steven Barrett. It was due to compound pages. Some drivers (e.g. audio) call
set_page_dirty with a compound page which isn't on the LRU, but my patch did
ClearPageReclaim on the compound page. With non-CONFIG_PAGEFLAGS_EXTENDED,
that breaks the PageTail flag.
I think it doesn't affect THP, and it passes my test with THP enabled, but I
Cc'ed Andrea for a double check.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reported-by: Steven Barrett <damentz@liquorix.net> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Tue, 22 Mar 2011 23:32:53 +0000 (16:32 -0700)]
memcg: move memcg reclaimable page into tail of inactive list
The rotate_reclaimable_page function moves just written out pages, which
the VM wanted to reclaim, to the end of the inactive list. That way the
VM will find those pages first next time it needs to free memory.
This patch applies the same rule in memcg. It can help to prevent
unnecessary eviction of memcg working set pages.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Tue, 22 Mar 2011 23:32:52 +0000 (16:32 -0700)]
mm: deactivate invalidated pages
Recently, there have been reports of thrashing
(http://marc.info/?l=rsync&m=128885034930933&w=2). It is caused by backup
workloads (e.g. nightly rsync). That's because the workload creates
use-once pages but touches them twice. This promotes the pages to the
active list, which results in working set page eviction.
Some app developers want support for POSIX_FADV_NOREUSE, but other OSes
don't support it either.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)
As another approach, app developers use POSIX_FADV_DONTNEED. But it has a
problem. If the kernel encounters a page that is under writeback during
invalidate_mapping_pages, it can't invalidate it. This makes it hard for
application programmers to use, since they always have to sync data before
calling fadvise(..POSIX_FADV_DONTNEED) to make sure the pages are
discardable. As a result, they can't use the kernel's deferred write and may
see a performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)
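The userspace workaround described above looks roughly like this. It is a minimal sketch: sync_then_drop() is a hypothetical helper name, while fsync() and posix_fadvise() are the standard POSIX calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Flush dirty data first, then ask the kernel to drop the cached pages
 * of the whole file. Returns 0 on success, -1 on failure.
 */
int sync_then_drop(int fd)
{
    int err;

    assert(fd >= 0);

    /* Without this, dirty or in-flight pages cannot be discarded. */
    if (fsync(fd) != 0) {
        perror("fsync");
        return -1;
    }

    /* offset 0, len 0 means "from the start to the end of the file". */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (err != 0) {
        /* posix_fadvise() returns the error number, it does not set errno */
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
        return -1;
    }
    return 0;
}
```

The fsync() call is the cost the message refers to: the application has to wait for writeback instead of leaving it to the kernel's deferred write.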
In fact, invalidation is a very big hint to the reclaimer. It means we don't
use the page any more. So let's move a page that is still being written to
the inactive list's head if we can't truncate it right now.
Why move the page to the head of the LRU in this patch? A dirty/writeback
page will be flushed sooner or later. This can prevent writeout by pageout,
which is less effective than the flusher's writeout.
Originally, I reused Peter's lru_demote with some changes, so I added his
Signed-off-by.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reported-by: Ben Gamari <bgamari.foss@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Richard Kennedy [Tue, 22 Mar 2011 23:32:50 +0000 (16:32 -0700)]
mm: mm_struct: remove 16 bytes of alignment padding on 64 bit builds
Reorder mm_struct to remove 16 bytes of alignment padding on 64 bit
builds. On my config this shrinks mm_struct by enough to fit in one
fewer cache lines and allows more objects per slab in mm_struct
kmem_cache under SLUB.
slabinfo before patch :-
Sizes (bytes) Slabs
--------------------------------
Object : 848 Total : 9
SlabObj: 896 Full : 2
SlabSiz: 16384 Partial: 5
Loss : 48 CpuSlab: 2
Align : 64 Objects: 18
slabinfo after :-
Sizes (bytes) Slabs
--------------------------------
Object : 832 Total : 7
SlabObj: 832 Full : 2
SlabSiz: 16384 Partial: 3
Loss : 0 CpuSlab: 2
Align : 64 Objects: 19
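The effect of reordering can be reproduced with a small standalone example. The structures below are hypothetical and much smaller than mm_struct; exact sizes depend on the ABI, so the comments assume a typical 64-bit target where 8-byte members are 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Alternating 4- and 8-byte members: padding follows each 4-byte one. */
struct layout_before {
    uint32_t a;     /* followed by 4 bytes of padding on typical 64-bit ABIs */
    uint64_t b;
    uint32_t c;     /* followed by 4 bytes of padding on typical 64-bit ABIs */
    uint64_t d;
};

/* Same members, grouped so the two 4-byte ones share one 8-byte slot. */
struct layout_after {
    uint64_t b;
    uint64_t d;
    uint32_t a;
    uint32_t c;
};

/* Bytes saved by the reordering on the ABI this is compiled for. */
size_t padding_saved(void)
{
    assert(sizeof(struct layout_after) <= sizeof(struct layout_before));
    return sizeof(struct layout_before) - sizeof(struct layout_after);
}
```

On such a target the first layout takes 32 bytes and the second 24. The same kind of saving, applied to a slab object, is what moves SlabObj from 896 to 832 in the reports above.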
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
TestSetPageLocked() isn't being used anywhere. Also, using it would
likely be an error, since the proper interface trylock_page() provides
stronger ordering guarantees.
Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>