Wolfram Strepp [Tue, 16 Jun 2009 22:34:13 +0000 (15:34 -0700)]
rb_tree: remove redundant if()-condition in rb_erase()
Furthermore, notice that the initial checks:
if (!node->rb_left)
child = node->rb_right;
else if (!node->rb_right)
child = node->rb_left;
else
{
...
}
guarantee that old->rb_right is set in the final else branch, therefore
we can omit checking that again.
Signed-off-by: Wolfram Strepp <wstrepp@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wolfram Strepp [Tue, 16 Jun 2009 22:34:12 +0000 (15:34 -0700)]
rb_tree: make clear distinction between two different cases in rb_erase()
There are two cases when a node with two children is erased:
'normal case': the successor is not the right-hand-child of the node to be erased
'special case': the successor is the right-hand child of the node to be erased
Here is some ASCII art, using the following symbols (referring to the code):
O: node to be deleted
N: the successor of O
P: parent of N
C: child of N
L: some other node
normal case:
         O                         N
        / \                       / \
       /   \                     /   \
      L     \                   L     \
     / \     P      ---->      / \     P
            / \                       / \
           /                         /
          N                         C
           \                       / \
            \
             C
            / \
special case:
         O|P                       N
        / \                       / \
       /   \                     /   \
      L     \                   L     \
     / \     N      ---->      /       C
              \               / \
               \
                C
               / \
Notice that for the special case we don't have to reconnect C to N.
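The two cases can be sketched in plain C as follows. This is a minimal illustration only: the node layout and the names successor() and replace_with_successor() are assumptions made for this sketch, and colour handling and rebalancing are left out entirely.

```c
#include <stddef.h>

/* Hypothetical, simplified node: no colour bit, no rebalancing. */
struct node {
    struct node *left, *right, *parent;
};

/* Successor of a node with two children: leftmost node of its right subtree. */
struct node *successor(struct node *old)
{
    struct node *n = old->right;    /* non-NULL, the caller guarantees two children */

    while (n->left)
        n = n->left;
    return n;
}

/*
 * Put the successor N in the place of O (`old`), which must have two
 * children. Returns N so the caller can update the tree root if needed.
 */
struct node *replace_with_successor(struct node *old)
{
    struct node *n = successor(old);
    struct node *child = n->right;      /* C (N never has a left child) */
    struct node *parent = n->parent;    /* P */

    if (parent != old) {
        /* normal case: hand C over to P, then give N the right subtree of O */
        parent->left = child;
        if (child)
            child->parent = parent;
        n->right = old->right;
        old->right->parent = n;
    }
    /* special case (P == O): C stays attached to N, nothing to reconnect */

    n->left = old->left;
    old->left->parent = n;
    n->parent = old->parent;
    if (old->parent) {
        if (old->parent->left == old)
            old->parent->left = n;
        else
            old->parent->right = n;
    }
    return n;
}
```

In the normal case N is always the left child of P, because it is the leftmost node of the right subtree; that is why only parent->left needs to be rewritten.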
Signed-off-by: Wolfram Strepp <wstrepp@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Tue, 16 Jun 2009 22:34:09 +0000 (15:34 -0700)]
MAINTAINERS: add Paul McKenney to RCU and RCUTORTURE
Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Tue, 16 Jun 2009 22:34:06 +0000 (15:34 -0700)]
MAINTAINERS: add file patterns to "THE REST"
These file patterns match all sources.
By default, scripts/get_maintainers.pl excludes Linus Torvalds
from the CC: list. Option --git-chief-penguins will include him.
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zygo Blaxell [Tue, 16 Jun 2009 22:33:57 +0000 (15:33 -0700)]
lib/genalloc.c: remove unmatched write_lock() in gen_pool_destroy
There is a call to write_lock() in gen_pool_destroy which is not balanced
by any corresponding write_unlock(). This causes problems with preemption
because the preemption-disable counter is incremented in the write_lock()
call, but never decremented by any call to write_unlock(). This bug is
gen_pool_destroy, and one of them is non-x86 arch-specific code.
Florian Fainelli [Tue, 16 Jun 2009 22:33:53 +0000 (15:33 -0700)]
drivers: add support for the TI VLYNQ bus
Add support for the TI VLYNQ high-speed, serial and packetized bus.
This bus allows external devices to be connected to the System-on-Chip and
appear in the main system memory just like any memory mapped peripheral.
It is widely used in TI's networking and multimedia SoC, including the AR7
SoC.
Daniel Mack [Tue, 16 Jun 2009 22:33:52 +0000 (15:33 -0700)]
console: make blank timeout value a boot option
The console blank timer is currently hardcoded to 10*60 seconds which
might be annoying on systems with no input devices attached to wake up the
console again. Especially during development, disabling the screen saver
can be handy - for example when debugging the root fs mount mechanism or
other scenarios where no userspace program could be started to do that at
runtime from userspace.
This patch defines a core_param for the variable in charge, which allows
users to disable the blank feature entirely at boot time by setting it to 0.
The value can still be overwritten at runtime using the standard ioctl
call - this just allows the default to be changed conditionally.
Signed-off-by: Daniel Mack <daniel@caiaq.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add EISA IDs for Network Peripherals FDDI boards. Descriptions taken from
the respective EISA configuration files.
It's unlikely we'll ever support these cards, the problem being the lack
of documentation. Assuming the policy for the EISA ID database is the
same as for PCI I'm sending these entries for the sake of completeness.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: Marc Zyngier <maz@misterjones.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Documentation/accounting/getdelays.c initialize the variable before using it
Fix compilation warning:
Documentation/accounting/getdelays.c: In function `main':
Documentation/accounting/getdelays.c:249: warning: `cmd_type' may be used uninitialized in this function
I'd expect the output to be "[41 42]", but actually it's "[41 42 ]"
This patch also reduces the required buffer size to the minimum. To print
the hex format of "AB", a buffer of size 6 should be sufficient, but
hex_dump_to_buffer() required at least 8.
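The size arithmetic can be checked with a small userspace sketch. The helper hex_bytes() below is hypothetical and is not the kernel's hex_dump_to_buffer(); it only shows that two hex digits per byte, one separator between bytes and a terminating NUL add up to 6 bytes for "AB", with no trailing space.

```c
#include <stddef.h>

/*
 * Hypothetical helper: write each source byte as two hex digits separated
 * by single spaces, without a trailing space. Returns the number of
 * characters written (excluding the NUL), or -1 if buf is too small.
 */
int hex_bytes(const unsigned char *src, size_t len, char *buf, size_t buflen)
{
    static const char digits[] = "0123456789abcdef";
    /* 3 slots per byte: two digits plus a separator; the last slot holds the NUL */
    size_t need = len ? len * 3 : 1;
    size_t i, pos = 0;

    if (buflen < need)
        return -1;
    for (i = 0; i < len; i++) {
        if (i)
            buf[pos++] = ' ';
        buf[pos++] = digits[src[i] >> 4];
        buf[pos++] = digits[src[i] & 0x0f];
    }
    buf[pos] = '\0';
    return (int)pos;
}
```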
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Peterson [Tue, 16 Jun 2009 22:33:43 +0000 (15:33 -0700)]
slow-work: use round_jiffies() for thread pool's cull and OOM timers
Round the slow work queue's cull and OOM timeouts to a whole-second boundary
with round_jiffies(). The slow work queue uses a pair of timers to cull
idle threads and, after OOM, to delay new thread creation.
This patch also extracts the mod_timer() logic for the cull timer into a
separate helper function.
By rounding non-time-critical timers such as these to whole seconds, they
will be batched up to fire at the same time rather than being spread out.
This allows the CPU to wake up less often, which saves power.
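The batching effect can be illustrated with a simplified sketch. The function round_up_to_second() and the tick rate HZ_ASSUMED are assumptions made for this example; the kernel's round_jiffies() is more elaborate (it can round down and applies a per-CPU skew), so this only shows the basic idea of snapping expiry times to whole seconds.

```c
/* Ticks per second; an assumed value for illustration only. */
#define HZ_ASSUMED 250UL

/*
 * Round a jiffies-style tick count up to the next whole-second boundary.
 * Simplified stand-in, not the kernel's round_jiffies().
 */
unsigned long round_up_to_second(unsigned long ticks)
{
    unsigned long rem = ticks % HZ_ASSUMED;

    return rem ? ticks + (HZ_ASSUMED - rem) : ticks;
}
```

Two timers that would have expired at ticks 1001 and 1249 both end up on tick 1250, so a single wakeup can service both.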
Signed-off-by: Chris Peterson <cpeterso@cpeterso.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Tue, 16 Jun 2009 22:33:39 +0000 (15:33 -0700)]
remove put_cpu_no_resched()
put_cpu_no_resched() is an optimization of put_cpu() which unfortunately
can cause high latencies.
The nfs iostats code uses put_cpu_no_resched() in a code sequence where a
reschedule request caused by an interrupt between the get_cpu() and the
put_cpu_no_resched() can delay the reschedule for at least HZ.
The other users of put_cpu_no_resched() optimize correctly in interrupt
code, but there is no real harm in using the put_cpu() function which is
an alias for preempt_enable(). The extra check of the preempt count is
not as critical as the potential source of missing a reschedule.
Debugged in the preempt-rt tree and verified in mainline.
Impact: remove a high latency source
[akpm@linux-foundation.org: build fix] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Tony Luck <tony.luck@intel.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 16 Jun 2009 22:33:37 +0000 (15:33 -0700)]
headers: move module_bug_finalize()/module_bug_cleanup() definitions into module.h
They're in linux/bug.h at present, which causes include order tangles. In
particular, linux/bug.h cannot be used by linux/atomic.h because,
according to Nikanth:
Eric Dumazet [Tue, 16 Jun 2009 22:33:36 +0000 (15:33 -0700)]
poll: avoid extra wakeups in select/poll
After the introduction of keyed wakeups, which Davide Libenzi did for epoll,
we are able to avoid spurious wakeups in poll()/select() code too.
For example, a typical use of poll()/select() is to wait for incoming
network frames on many sockets. But TX completion for UDP/TCP frames calls
sock_wfree(), which in turn schedules the thread.
When scheduled, the thread does a full scan of all polled fds and can sleep
again, because nothing is really available. If the number of fds is large,
this causes significant load.
This patch makes select()/poll() aware of keyed wakeups so that useless
wakeups are avoided. This reduces the number of context switches by about 50%
on some setups, and also the work performed by softirq handlers.
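The filtering idea can be sketched in userspace C. The event bits, struct waiter and wake_if_interested() are hypothetical names for this illustration and do not reproduce the kernel's wait queue API; the point is only that a wakeup carrying a key is dropped when the key does not intersect the events the sleeper asked for.

```c
/* Illustrative event bits; the values are assumptions for this sketch. */
#define EV_IN  0x001u
#define EV_OUT 0x004u

struct waiter {
    unsigned int wanted;    /* events this poller is waiting for */
    int wakeups;            /* how many times it was actually woken */
};

/*
 * key == 0 models an unkeyed wakeup: wake unconditionally.
 * Otherwise wake only if the key overlaps the wanted events.
 */
int wake_if_interested(struct waiter *w, unsigned int key)
{
    if (key && !(key & w->wanted))
        return 0;           /* event is of no interest, stay asleep */
    w->wakeups++;
    return 1;
}
```

A thread waiting only for readable sockets is then left alone when a TX completion signals writability.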
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Robert P. J. Day [Tue, 16 Jun 2009 22:33:35 +0000 (15:33 -0700)]
ntfs: use is_power_of_2() function for clarity.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Cc: Anton Altaparmakov <aia21@cantab.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Robert P. J. Day [Tue, 16 Jun 2009 22:33:34 +0000 (15:33 -0700)]
kernel/kfifo.c: replace conditional test with is_power_of_2()
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
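For readers unfamiliar with the helper, the test it stands for is the usual bit trick: a power of two has exactly one bit set, so clearing the lowest set bit must yield zero. The following is a userspace sketch of that check, not a copy of the kernel header.

```c
#include <stdbool.h>

/* True if n has exactly one bit set; zero is not a power of two. */
bool is_power_of_2(unsigned long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}
```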
Jan Blunck [Tue, 16 Jun 2009 22:33:33 +0000 (15:33 -0700)]
atomic: only take lock when the counter drops to zero on UP as well
_atomic_dec_and_lock() should not unconditionally take the lock before
calling atomic_dec_and_test() in the UP case. For consistency reasons it
should behave exactly like in the SMP case.
Besides that this works around the problem that with CONFIG_DEBUG_SPINLOCK
this spins in __spin_lock_debug() if the lock is already taken even if the
counter doesn't drop to 0.
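The intended behaviour can be sketched with C11 atomics. The names dec_and_lock() and struct toy_lock are assumptions for this illustration; the toy lock merely records whether it is held and is not a real spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock that only records whether it is held; stand-in for a spinlock. */
struct toy_lock {
    bool held;
};

/*
 * Decrement *cnt and take the lock only if the counter reaches zero.
 * Returns true with the lock held when it hit zero; otherwise returns
 * false and leaves the lock untouched.
 */
bool dec_and_lock(atomic_int *cnt, struct toy_lock *lock)
{
    int old = atomic_load(cnt);

    /* Fast path: decrement without the lock while the result stays above 0. */
    while (old > 1) {
        if (atomic_compare_exchange_weak(cnt, &old, old - 1))
            return false;
    }

    /* Slow path: the counter may drop to zero, so take the lock first. */
    lock->held = true;
    if (atomic_fetch_sub(cnt, 1) == 1)
        return true;
    lock->held = false;
    return false;
}
```

With this shape, a caller that does not bring the counter to zero never touches the lock, so a lock that is already held elsewhere cannot make it spin.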
Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Nick Piggin <npiggin@suse.de> Cc: Valerie Aurora <vaurora@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Smith [Tue, 16 Jun 2009 22:33:33 +0000 (15:33 -0700)]
utsname.h: make new_utsname fields use the proper length constant
The members of the new_utsname structure are defined with magic numbers
that *should* correspond to the constant __NEW_UTS_LEN+1. Everywhere
else, code assumes this and uses the constant, so this patch makes the
structure match.
Roel Kluin [Tue, 16 Jun 2009 22:33:32 +0000 (15:33 -0700)]
uml: bad macro expansion, parameter is member
`ELF_CORE_COPY_REGS(x, y)' will make expansions like:
`(y)[0] = (x)->x.gp[0]' but correct is `(y)[0] = (x)->regs.gp[0]'
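The pitfall is that a macro parameter with the same name as a struct member is substituted in the member position too. A reduced sketch, with hypothetical struct and macro names (struct pt_regs_sketch, COPY_REG0) that are not the UML definitions:

```c
struct gp_regs {
    unsigned long gp[4];
};

struct pt_regs_sketch {
    struct gp_regs regs;
};

/*
 * Buggy form: the parameter is called `regs`, so the member name `regs`
 * in the body is replaced by the argument as well:
 *
 *   #define COPY_REG0_BAD(regs, dst) ((dst)[0] = (regs)->regs.gp[0])
 *   COPY_REG0_BAD(x, y)   expands to   ((y)[0] = (x)->x.gp[0])
 *
 * It only compiles when the caller's variable happens to be named `regs`.
 */

/* Fixed form: the parameter name no longer collides with the member. */
#define COPY_REG0(r, dst) ((dst)[0] = (r)->regs.gp[0])
```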
Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Cc: WANG Cong <amwang@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Amerigo Wang [Tue, 16 Jun 2009 22:33:30 +0000 (15:33 -0700)]
uml: fix a section warning
When compiling uml on x86_64:
MODPOST vmlinux.o
WARNING: vmlinux.o (.__syscall_stub.2): unexpected non-allocatable section.
Did you forget to use "ax"/"aw" in a .S file?
Note that for example <linux/init.h> contains
section definitions for use in .S files.
This happens because modpost checks for a missing SHF_ALLOC section flag,
so just add it.
Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Tue, 16 Jun 2009 22:33:29 +0000 (15:33 -0700)]
um: remove obsolete hw_interrupt_type
The defines and typedefs (hw_interrupt_type, no_irq_type, irq_desc_t) have
been kept around for migration reasons. After more than two years it's
time to remove them finally.
This patch cleans up one of the remaining users. When all such patches
hit mainline we can remove the defines and typedefs finally.
Impact: cleanup
Convert the last remaining users to struct irq_chip and remove the
define.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Alan Cox <alan@linux.intel.com> Reported-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Jeff Dike <jdike@addtoit.com> Cc: Roland Kletzing <devzero@web.de> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Tue, 16 Jun 2009 22:33:26 +0000 (15:33 -0700)]
m32r: remove obsolete hw_interrupt_type
The defines and typedefs (hw_interrupt_type, no_irq_type, irq_desc_t) have
been kept around for migration reasons. After more than two years it's
time to remove them finally.
This patch cleans up one of the remaining users. When all such patches
hit mainline we can remove the defines and typedefs finally.
Impact: cleanup
Convert the last remaining users to struct irq_chip and remove the
define.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Tue, 16 Jun 2009 22:33:25 +0000 (15:33 -0700)]
alpha: remove obsolete hw_interrupt_type
The defines and typedefs (hw_interrupt_type, no_irq_type, irq_desc_t) have
been kept around for migration reasons. After more than two years it's
time to remove them finally.
This patch cleans up one of the remaining users. When all such patches
hit mainline we can remove the defines and typedefs finally.
Impact: cleanup
Convert the last remaining users to struct irq_chip and remove the
define.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: fix lumpy reclaim lru handling at isolate_lru_pages
At lumpy reclaim, a page that __isolate_lru_page() failed to take can be
pushed back to the "src" list by list_move(). But the page may not be from
the "src" list, so this pushes the page back to the wrong LRU. And
list_move() itself is unnecessary because the page is not on top of the LRU.
So leave the page as it is if __isolate_lru_page() fails.
Mel Gorman [Tue, 16 Jun 2009 22:33:23 +0000 (15:33 -0700)]
vmscan: count the number of times zone_reclaim() scans and fails
On NUMA machines, the administrator can configure zone_reclaim_mode, which
is a more targeted form of direct reclaim. On machines with large NUMA
distances, for example, zone_reclaim_mode defaults to 1, meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.
There is a heuristic that determines if the scan is worthwhile but it is
possible that the heuristic will fail and the CPU gets tied up scanning
uselessly. Detecting the situation requires some guesswork and
experimentation so this patch adds a counter "zreclaim_failed" to
/proc/vmstat. If during high CPU utilisation this counter is increasing
rapidly, then the resolution to the problem may be to set
/proc/sys/vm/zone_reclaim_mode to 0.
[akpm@linux-foundation.org: name things consistently] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Tue, 16 Jun 2009 22:33:22 +0000 (15:33 -0700)]
vmscan: do not unconditionally treat zones that fail zone_reclaim() as full
On NUMA machines, the administrator can configure zone_reclaim_mode, which
is a more targeted form of direct reclaim. On machines with large NUMA
distances, for example, zone_reclaim_mode defaults to 1, meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met. The problem is that zone_reclaim() failing at all means the
zone gets marked full.
This can cause situations where a zone is usable, but is being skipped
because it has been considered full. Take a situation where a large tmpfs
mount is occupying a large percentage of memory overall. The pages do not
get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
and the zonelist cache considers them not worth trying in the future.
This patch makes zone_reclaim() return more fine-grained information about
what occurred when zone_reclaim() failed. The zone only gets marked full
if it really is unreclaimable. If the scan did not occur, or if not enough
pages were reclaimed with the limited reclaim_mode, then the zone is
simply skipped.
There is a side-effect to this patch. Currently, if zone_reclaim()
successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go
ahead. With this patch applied, zone watermarks are rechecked after
zone_reclaim() does some work.
This bug was introduced by commit 9276b1bc96a132f4068fdee00983c532f43d3a26
("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the
zonelist_cache was introduced. It was not intended that zone_reclaim()
aggressively consider the zone to be full when it failed as full direct
reclaim can still be an option. Due to the age of the bug, it should be
considered a -stable candidate.
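The decision described above can be sketched as a small mapping from a fine-grained result code to the action taken on the zone. All names and values below (enum zr_result, enum zone_action, after_reclaim()) are assumptions for this illustration and follow the wording of this changelog rather than the actual allocator code.

```c
/* Hypothetical result codes; names and values are assumptions. */
enum zr_result {
    ZR_NOSCAN  = -2,    /* the scan did not happen at all */
    ZR_FULL    = -1,    /* scanned, and the zone really is unreclaimable */
    ZR_SOME    =  0,    /* reclaimed something, but not enough */
    ZR_SUCCESS =  1     /* reclaimed enough pages */
};

enum zone_action {
    ZONE_SKIP,                  /* try the next zone, do not mark this one full */
    ZONE_MARK_FULL,             /* remember the zone as full */
    ZONE_RECHECK_WATERMARKS     /* re-test watermarks before allocating */
};

enum zone_action after_reclaim(enum zr_result r)
{
    switch (r) {
    case ZR_FULL:
        return ZONE_MARK_FULL;
    case ZR_SUCCESS:
        return ZONE_RECHECK_WATERMARKS;
    case ZR_NOSCAN:
    case ZR_SOME:
    default:
        return ZONE_SKIP;
    }
}
```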
Mel Gorman [Tue, 16 Jun 2009 22:33:20 +0000 (15:33 -0700)]
vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim
A bug was brought to my attention against a distro kernel but it affects
mainline and I believe problems like this have been reported in various
guises on the mailing lists although I don't have specific examples at the
moment.
The reported problem was that malloc() stalled for a long time (minutes in
some cases) if a large tmpfs mount was occupying a large percentage of
memory overall. The pages did not get cleaned or reclaimed by
zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
were uselessly scanned frequently, making the CPU spin at near 100%.
This patchset intends to address that bug and bring the behaviour of
zone_reclaim() more in line with expectations which were noticed during
investigation. It is based on top of mmotm and takes advantage of
Kosaki's work with respect to zone_reclaim().
Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
scan should go ahead. The broken heuristic is what was causing the
malloc() stall as it uselessly scanned the LRU constantly. Currently,
zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
could not deal with tmpfs pages at all. This fixes up the heuristic so
that an unnecessary scan is more likely to be correctly avoided.
Patch 2 notes that zone_reclaim() returning a failure automatically means
the zone is marked full. This is not always true. It could have
failed because the GFP mask or zone_reclaim_mode were unsuitable.
Patch 3 introduces a counter zreclaim_failed that will increment each
time the zone_reclaim scan-avoidance heuristics fail. If that
counter is rapidly increasing, then zone_reclaim_mode should be
set to 0 as a temporary resolution and a bug reported because
the scan-avoidance heuristic is still broken.
This patch:
On NUMA machines, the administrator can configure zone_reclaim_mode, which
is a more targeted form of direct reclaim. On machines with large NUMA
distances, for example, zone_reclaim_mode defaults to 1, meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.
There is a heuristic that determines if the scan is worthwhile but the
problem is that the heuristic is not being properly applied and is
basically assuming zone_reclaim_mode is 1 if it is enabled. The lack of
proper detection can manifest as high CPU usage as the LRU list is scanned
uselessly.
Historically, once enabled it was depending on NR_FILE_PAGES which may
include swapcache pages that the reclaim_mode cannot deal with. Patch
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
pages that were not file-backed such as swapcache and made a calculation
based on the inactive, active and mapped files. This is far superior when
zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
reasonable starting figure.
This patch alters how zone_reclaim() works out how many pages it might be
able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set in
the reclaim_mode it will either consider NR_FILE_PAGES as potential
candidates or else use NR_{IN,}ACTIVE_PAGES-NR_FILE_MAPPED to discount
swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
not set, then NR_FILE_MAPPED are not.
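The estimate described in this paragraph can be sketched as follows. The struct, the mode bit values and the function names are assumptions for this illustration; the counters simply stand in for the per-zone statistics named in the text.

```c
/* Mode bits; the values are assumptions for this sketch. */
#define MODE_WRITE 0x2u     /* dirty pages may be written out */
#define MODE_SWAP  0x4u     /* pages may be unmapped or swapped */

struct zone_counts {
    unsigned long file_pages;   /* stands in for NR_FILE_PAGES */
    unsigned long file_lru;     /* active + inactive file LRU pages */
    unsigned long file_mapped;  /* stands in for NR_FILE_MAPPED */
    unsigned long file_dirty;   /* stands in for NR_FILE_DIRTY */
};

/* File LRU pages that are not mapped, guarded against underflow. */
unsigned long unmapped_file_pages(const struct zone_counts *z)
{
    return z->file_lru > z->file_mapped ? z->file_lru - z->file_mapped : 0;
}

/* Pages that a reclaim pass might free under the given reclaim mode. */
unsigned long reclaimable_estimate(const struct zone_counts *z, unsigned int mode)
{
    unsigned long nr, delta = 0;

    /* With swap allowed, every file page is a candidate. */
    nr = (mode & MODE_SWAP) ? z->file_pages : unmapped_file_pages(z);

    /* Without writeback, dirty pages cannot be reclaimed. */
    if (!(mode & MODE_WRITE))
        delta += z->file_dirty;

    /* Never report a negative number of candidates. */
    if (delta > nr)
        delta = nr;
    return nr - delta;
}
```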
[kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
[fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wu Fengguang [Tue, 16 Jun 2009 22:33:17 +0000 (15:33 -0700)]
writeback: skip new or to-be-freed inodes
1) I_FREEING tests should be coupled with I_CLEAR
The two I_FREEING tests are racy because clear_inode() can set i_state to
I_CLEAR between the clear of I_SYNC and the test of I_FREEING.
2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
races with generic_forget_inode()
generic_forget_inode() sets I_WILL_FREE and calls writeback on its own, so
generic_sync_sb_inodes() shall not try to step in and create possible races:
generic_forget_inode
inode->i_state |= I_WILL_FREE;
spin_unlock(&inode_lock);
generic_sync_sb_inodes()
spin_lock(&inode_lock);
__iget(inode);
__writeback_single_inode
// see non zero i_count
may WARN here ==> WARN_ON(inode->i_state & I_WILL_FREE);
spin_unlock(&inode_lock);
may call generic_forget_inode again ==> iput(inode);
The above race and warning didn't turn up because writeback_inodes() holds
the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
early. But we are not sure the UBIFS calls and future callers will
guarantee that. So skip I_WILL_FREE inodes for the sake of safety.
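The two rules can be sketched as flag tests. The flag values and helper names below are assumptions for this illustration and do not reproduce the kernel's definitions.

```c
/* Illustrative state bits; the values are assumptions for this sketch. */
#define ST_NEW       0x01u
#define ST_WILL_FREE 0x02u
#define ST_FREEING   0x04u
#define ST_CLEAR     0x08u

/* Rule 2: the sync loop leaves new and about-to-be-freed inodes alone. */
int should_skip_for_writeback(unsigned int i_state)
{
    return (i_state & (ST_NEW | ST_WILL_FREE)) != 0;
}

/*
 * Rule 1: test FREEING together with CLEAR, because the state may have
 * moved on to CLEAR by the time the test runs.
 */
int is_being_freed(unsigned int i_state)
{
    return (i_state & (ST_FREEING | ST_CLEAR)) != 0;
}
```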
Cc: Eric Sandeen <sandeen@sandeen.net> Acked-by: Jeff Layton <jlayton@redhat.com> Cc: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Tue, 16 Jun 2009 22:33:16 +0000 (15:33 -0700)]
oom: only oom kill exiting tasks with attached memory
When a task is chosen for oom kill and is found to be PF_EXITING,
__oom_kill_task() is called to elevate the task's timeslice and give it
access to memory reserves so that it may quickly exit.
This privilege is unnecessary, however, if the task has already detached
its mm. Although it's possible for the mm to become detached later since
task_lock() is not held, __oom_kill_task() will simply be a no-op in such
circumstances.
Subsequently, it is no longer necessary to warn about killing mm-less
tasks since it is a no-op.
Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
But the result of get_scan_ratio() is ignored when priority == 0, so anon
lru is scanned even if may_swap == 0 or nr_swap_pages == 0. IMHO, this is
not an expected behavior.
For memcg especially, this behavior means that a great many pages are
swapped out in vain when oom is invoked by the mem+swap limit.
This patch handles the may_swap flag more strictly.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wu Fengguang [Tue, 16 Jun 2009 22:33:13 +0000 (15:33 -0700)]
vmscan: merge duplicate code in shrink_active_list()
The "move pages to active list" and "move pages to inactive list" code
blocks are mostly identical and can be served by a function.
Thanks to Andrew Morton for pointing this out.
Note that buffer_heads_over_limit check will also be carried out for
re-activated pages, which is slightly different from pre-2.6.28 kernels.
Also, Rik's "vmscan: evict use-once pages first" patch could totally stop
scans of active file list when memory pressure is low. So the net effect
could be, the number of buffer heads is now more likely to grow large.
However that's fine according to Johannes' comments:
I don't think that this could be harmful. We just preserve the buffer
mappings of what we consider the working set and with low memory
pressure, as you say, this set is not big.
As to stripping of reactivated pages: the only pages we re-activate
for now are those VM_EXEC mapped ones. Since we don't expect IO from
or to these pages, removing the buffer mappings in case they grow too
large should be okay, I guess.
Cc: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wu Fengguang [Tue, 16 Jun 2009 22:33:12 +0000 (15:33 -0700)]
vmscan: make mapped executable pages the first class citizen
Protect referenced PROT_EXEC mapped pages from being deactivated.
PROT_EXEC (or its internal representation, VM_EXEC) pages normally belong to
currently running executables and their linked libraries, so they should be
cached aggressively to provide a good user experience.
Thanks to Johannes Weiner for the advice to reuse the VMA walk in
page_referenced() to get the PROT_EXEC bit.
[more details]
( The consequences of this patch will have to be discussed together with
Rik van Riel's recent patch "vmscan: evict use-once pages first". )
( Some of the good points and insights are taken into this changelog.
Thanks to all the involved people for the great LKML discussions. )
the problem
===========
For a typical desktop, the most precious working set is composed of
*actively accessed*
(1) memory mapped executables
(2) and their anonymous pages
(3) and other files
(4) and the dcache/icache/.. slabs
while the least important data are
(5) infrequently used or use-once files
For a typical desktop, one major problem is a bursty and large amount of (5)
use-once files flushing out the working set.
Inside the working set, (4) dcache/icache have already been too sticky ;-)
So we only have to care about (2) anonymous and (1)(3) file pages.
anonymous pages
===============
Anonymous pages are effectively immune to the streaming IO attack, because we
now have separate file/anon LRU lists. When the use-once files crowd into the
file LRU, the list's "quality" is significantly lowered. Therefore the scan
balance policy in get_scan_ratio() will choose to scan the (low quality) file
LRU much more frequently than the anon LRU.
file pages
==========
Rik proposed to *not* scan the active file LRU when the inactive list grows
larger than the active list. This guarantees that when there is use-once
streaming IO, and the working set is not too large (so that
active_size < inactive_size), the active file LRU will *not* be scanned at
all. So the not-too-large working set can be well protected.
But there are also situations where the file working set is a bit large so that
(active_size >= inactive_size), or the streaming IOs are not purely use-once.
In these cases, the active list will be scanned slowly. Because the current
shrink_active_list() policy is to deactivate active pages regardless of their
referenced bits, the deactivated pages become susceptible to the streaming IO
attack: the inactive list could be scanned fast (500MB / 50MBps = 10s) so that
the deactivated pages don't have enough time to get re-referenced, because a
user tends to switch between windows at intervals from seconds to minutes.
This patch holds mapped executable pages in the active list as long as they
are referenced during each full scan of the active list. Because the active
list is normally scanned much slower, they get longer grace time (eg. 100s)
for further references, which better matches the pace of user operations.
Therefore this patch greatly prolongs the in-cache time of executable code,
when there are moderate memory pressures.
before patch: guaranteed to be cached if reference intervals < I
after patch: guaranteed to be cached if reference intervals < I+A
(except when randomly reclaimed by the lumpy reclaim)
where
A = time to fully scan the active file LRU
I = time to fully scan the inactive file LRU
Note that normally A >> I.
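As a worked example using the figures quoted earlier in this changelog (I around 10s, A around 100s), the guarantee can be written as a one-line predicate. The function name stays_cached() and its parameters are hypothetical, and the sketch ignores the lumpy reclaim exception.

```c
/*
 * Nonzero if a page re-referenced every `interval` seconds is guaranteed
 * to stay cached, given the full-scan times of the inactive list (I) and
 * the active list (A). Lumpy reclaim is ignored in this sketch.
 */
int stays_cached(double interval, double inactive_scan, double active_scan,
                 int patched)
{
    double limit = patched ? inactive_scan + active_scan : inactive_scan;

    return interval < limit;
}
```

With I = 10 and A = 100, a page touched once a minute falls outside the guarantee before the patch (60 >= 10) and inside it afterwards (60 < 110).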
side effects
============
This patch is safe in general; it restores the pre-2.6.28 mmap() behavior
but in a much smaller and well targeted scope.
One may worry about someone abusing the PROT_EXEC heuristic. But as
Andrew Morton stated, there are other tricks to getting that sort of boost.
Another concern is the PROT_EXEC mapped pages growing large in rare cases,
and therefore hurting reclaim efficiency. But a sane application targeted at
a large audience will never use PROT_EXEC for data mappings. If some home made
application tries to abuse that bit, it shall be aware of the consequences.
If it is abused to the scale of 2/3 of total memory, it gains nothing but overhead.
benchmarks
==========
1) memory tight desktop
1.1) brief summary
- clock time and major faults are reduced by 50%;
- pswpin numbers are reduced to ~1/3.
That means X desktop responsiveness is doubled under high memory/swap pressure.
1.2) test scenario
- nfsroot gnome desktop with 512M physical memory
- run some programs, and switch between the existing windows
after starting each new program.
1.3) progress timing (seconds)
before after programs
0.02 0.02 N xeyes
0.75 0.76 N firefox
2.02 1.88 N nautilus
3.36 3.17 N nautilus --browser
5.26 4.89 N gthumb
7.12 6.47 N gedit
9.22 8.16 N xpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
13.58 12.55 N xterm
15.87 14.57 N mlterm
18.63 17.06 N gnome-terminal
21.16 18.90 N urxvt
26.24 23.48 N gnome-system-monitor
28.72 26.52 N gnome-help
32.15 29.65 N gnome-dictionary
39.66 36.12 N /usr/games/sol
43.16 39.27 N /usr/games/gnometris
48.65 42.56 N /usr/games/gnect
53.31 47.03 N /usr/games/gtali
58.60 52.05 N /usr/games/iagno
65.77 55.42 N /usr/games/gnotravex
70.76 61.47 N /usr/games/mahjongg
76.15 67.11 N /usr/games/gnome-sudoku
86.32 75.15 N /usr/games/glines
92.21 79.70 N /usr/games/glchess
103.79 88.48 N /usr/games/gnomine
113.84 96.51 N /usr/games/gnotski
124.40 102.19 N /usr/games/gnibbles
137.41 114.93 N /usr/games/gnobots2
155.53 125.02 N /usr/games/blackjack
179.85 135.11 N /usr/games/same-gnome
224.49 154.50 N /usr/bin/gnome-window-properties
248.44 162.09 N /usr/bin/gnome-default-applications-properties
282.62 173.29 N /usr/bin/gnome-at-properties
323.72 188.21 N /usr/bin/gnome-typing-monitor
363.99 199.93 N /usr/bin/gnome-at-visual
394.21 206.95 N /usr/bin/gnome-sound-properties
435.14 224.49 N /usr/bin/gnome-at-mobility
463.05 234.11 N /usr/bin/gnome-keybinding-properties
503.75 248.59 N /usr/bin/gnome-about-me
554.00 276.27 N /usr/bin/gnome-display-properties
615.48 304.39 N /usr/bin/gnome-network-preferences
693.03 342.01 N /usr/bin/gnome-mouse-properties
759.90 388.58 N /usr/bin/gnome-appearance-properties
937.90 508.47 N /usr/bin/gnome-control-center
1109.75 587.57 N /usr/bin/gnome-keyboard-properties
1399.05 758.16 N : oocalc
1524.64 830.03 N : oodraw
1684.31 900.03 N : ooimpress
1874.04 993.91 N : oomath
2115.12 1081.89 N : ooweb
2369.02 1161.99 N : oowriter
Note that the last ": oo*" commands are actually commented out.
1.4) vmstat numbers (some relevant ones are marked with *)
before patch:
                    total   used   free  shared  buffers  cached
Mem:                  474    467      7       0        0     236
-/+ buffers/cache:          230    243
Swap:                1023    418    605
after patch:
                    total   used   free  shared  buffers  cached
Mem:                  474    457     16       0        0     236
-/+ buffers/cache:          221    253
Swap:                1023    404    619
2) memory flushing in a file server
2.1) brief summary
The number of major faults drops from 50 to 3 during 10% cache hot reads.
That means this patch successfully stops major faults when the active file
list is slowly scanned while there is partially cache hot streaming IO.
2.2) test scenario
Do 100000 pread(size=110 pages, offset=(i*100) pages), where 10% of the
pages will be activated:
for i in `seq 0 100 10000000`; do echo $i 110; done > pattern-hot-10
iotrace.rb --load pattern-hot-10 --play /b/sparse
vmmon nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree
and monitor /proc/vmstat during the time. The test box has 2G memory.
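The access pattern above can be modelled in a few lines of Python to see where the 10% figure comes from: each read covers 110 pages but the offset advances by only 100, so consecutive reads overlap by 10 pages. This is a sketch of the pattern file only, not of iotrace.rb; the helper names are made up for illustration.

```python
READ_PAGES = 110     # size of each pread(), in pages
STRIDE_PAGES = 100   # distance between consecutive read offsets, in pages

def pattern(last_offset=10000000, stride=STRIDE_PAGES):
    """Yield (offset, size) pairs like `seq 0 100 10000000` piped into echo."""
    for offset in range(0, last_offset + 1, stride):
        yield offset, READ_PAGES

def hot_fraction():
    """Fraction of each stride that is read a second time by the next read."""
    overlap = READ_PAGES - STRIDE_PAGES
    return overlap / STRIDE_PAGES

reads = sum(1 for _ in pattern())
```

Pages that are read twice get a second reference, which is what makes them candidates for activation.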
I carried out tests on a freshly booted console as well as an X desktop, and
fetched the vmstat numbers at
(1) begin: shortly after the big read IO starts;
(2) end: just before the big read IO stops;
(3) restore: the big read IO stops and the zsh working set restored
(4) restore X: after IO, switch back and forth between the urxvt and firefox
windows to restore their working set.
- The startup pgmajfault of 2.6.30-rc4-mm is merely 1/3 that of 2.6.29.
I'd attribute that improvement to the mmap readahead improvements :-)
- The pgmajfault increment during the file copy is 633-630=3 vs 260-210=50.
That's a huge improvement - which means with the VM_EXEC protection logic,
active mmap pages are pretty safe even under partially cache hot streaming IO.
- when the active:inactive file lru size reaches 1:1, their scan rate ratio is 1:20.8
under 10% cache hot IO. (computed with formula Dpgdeactivate:Dpgfree)
That roughly means the active mmap pages get 20.8 times more chances to get
re-referenced to stay in memory.
- The absolute nr_mapped drops considerably to 1/9 during the big IO, and the
dropped pages are mostly inactive ones. The patch has almost no impact in
this aspect, which means it won't unnecessarily increase memory pressure.
(In contrast, your 20% mmap protection ratio will keep them all, and
therefore eliminate the extra 41 major faults to restore working set
of zsh etc.)
The iotrace.rb read throughput is
151.194384MB/s 284.198252s 100001x 450560b --load pattern-hot-10 --play /b/sparse
which means the inactive list is rotated at the speed of 250MB/s,
so a full scan of which takes about 3.5 seconds, while a full scan
of active file list takes about 77 seconds.
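The reported throughput can be cross-checked from the other fields on the iotrace.rb line. The Python sketch below assumes those fields are elapsed seconds, number of reads and bytes per read, and that "MB" means 2^20 bytes; neither assumption is stated explicitly above.

```python
PAGE_SIZE = 4096   # bytes; assumed, consistent with 110 pages = 450560 bytes

def throughput_mib_per_s(elapsed_s, reads, bytes_per_read):
    """Average read throughput in MiB/s."""
    return reads * bytes_per_read / elapsed_s / (1024 * 1024)

rate = throughput_mib_per_s(284.198252, 100001, 450560)
print(f"{rate:.3f} MiB/s")
```

Under those assumptions the result agrees with the figure printed by iotrace.rb.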
2.4) X mode results
We can reach roughly the same conclusions for X desktop:
- the absolute nr_mapped drops considerably (to 1/13 of the original size)
during the streaming IO.
- the delta of pgmajfault is 3 vs 107 during IO, or 236 vs 393
during the whole process.
Cc: Elladan <elladan@eskimo.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wu Fengguang [Tue, 16 Jun 2009 22:33:05 +0000 (15:33 -0700)]
vmscan: report vm_flags in page_referenced()
Collect vma->vm_flags of the VMAs that actually referenced the page.
This is preparing for more informed reclaim heuristics, eg. to protect
executable file pages more aggressively. For now only the VM_EXEC bit
will be used by the caller.
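A minimal Python model of the idea: while walking the mappings of a page, OR together the flags of only those mappings that actually referenced it, and hand the result back to the caller alongside the reference count. The data layout and the function name are invented for illustration; only VM_EXEC comes from the text, with the value 0x4 taken from the kernel headers.

```python
VM_READ, VM_WRITE, VM_EXEC = 0x1, 0x2, 0x4

def page_referenced_model(mappings):
    """`mappings` is a list of (vm_flags, referenced) pairs, one per VMA
    mapping the page. Returns (reference_count, collected_vm_flags)."""
    referenced = 0
    vm_flags = 0
    for flags, was_referenced in mappings:
        if was_referenced:
            referenced += 1
            vm_flags |= flags   # only VMAs that referenced the page count
    return referenced, vm_flags

count, flags = page_referenced_model([(VM_READ | VM_EXEC, True),
                                      (VM_READ | VM_WRITE, False)])
protect = bool(flags & VM_EXEC)   # what a caller might test
```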
Thanks to Johannes, Peter and Minchan for all the good tips.
Acked-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Tue, 16 Jun 2009 22:33:04 +0000 (15:33 -0700)]
mm: add a gfp-translate script to help understand page allocation failure reports
The page allocation failure messages include a line that looks like
page allocation failure. order:1, mode:0x4020
The mode is easy to translate but irritating for the lazy and a bit error
prone. This patch adds a very simple helper script gfp-translate for the
mode: portion of the page allocation failure messages. An example usage
looks like
Yinghai Lu [Tue, 16 Jun 2009 22:33:00 +0000 (15:33 -0700)]
page-allocator: clear N_HIGH_MEMORY map before we set it again
SRAT tables may contain nodes of very small size. The arch code may
decide to not activate such a node. However, currently the early boot
code sets N_HIGH_MEMORY for such nodes. These nodes therefore seem to be
active although these nodes have no present pages.
On 64-bit, N_HIGH_MEMORY == N_NORMAL_MEMORY, so this works for 64-bit too.
Signed-off-by: Yinghai Lu <Yinghai@kernel.org> Tested-by: Jack Steiner <steiner@sgi.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Waychison [Tue, 16 Jun 2009 22:32:59 +0000 (15:32 -0700)]
mm: remove __invalidate_mapping_pages variant
Remove __invalidate_mapping_pages atomic variant now that its sole caller
can sleep (fixed in eccb95cee4f0d56faa46ef22fb94dd4a3578d3eb ("vfs: fix
lock inversion in drop_pagecache_sb()")).
This fixes softlockups that can occur while in the drop_caches path.
Signed-off-by: Mike Waychison <mikew@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Tue, 16 Jun 2009 22:32:58 +0000 (15:32 -0700)]
oom: invoke oom killer for __GFP_NOFAIL
The oom killer must be invoked regardless of the order if the allocation
is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to free
some memory.
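The decision can be sketched as follows. This is a simplified Python model, not the allocator code: it assumes that the OOM killer is normally skipped for orders above PAGE_ALLOC_COSTLY_ORDER, and the flag value is a placeholder rather than the real bit.

```python
PAGE_ALLOC_COSTLY_ORDER = 3   # kernel constant
GFP_NOFAIL = 0x800            # placeholder bit standing in for __GFP_NOFAIL

def should_invoke_oom_killer(order, gfp_mask):
    """Decide whether a failed allocation may fall back to the OOM killer."""
    if gfp_mask & GFP_NOFAIL:
        # The allocation is not allowed to fail, so retrying without ever
        # killing anything could spin forever.
        return True
    return order <= PAGE_ALLOC_COSTLY_ORDER
```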
Cc: Nick Piggin <npiggin@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Tue, 16 Jun 2009 22:32:57 +0000 (15:32 -0700)]
oom: avoid unnecessary mm locking and scanning for OOM_DISABLE
This moves the check for OOM_DISABLE to the badness heuristic so it is
only necessary to hold task_lock() once. If the mm is OOM_DISABLE, the
score is 0, which is also correctly exported via /proc/pid/oom_score.
This requires that tasks with badness scores of 0 are prohibited from
being oom killed, which makes sense since they would not allow for future
memory freeing anyway.
Since the oom_adj value is a characteristic of an mm and not a task, it is
no longer necessary to check the oom_adj value for threads sharing the
same memory (except when simply issuing SIGKILLs for threads in other
thread groups).
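A toy Python version of the selection logic described above, in which a score of 0 can never be chosen. The scoring itself is a placeholder and the tuple layout is invented; only OOM_DISABLE and the "0 means unkillable" rule come from the text.

```python
OOM_DISABLE = -17   # value the kernel uses for "never kill"

def badness(total_vm_pages, oom_adj):
    """Toy badness score: 0 means 'must not be killed'."""
    if oom_adj == OOM_DISABLE:
        return 0
    return max(total_vm_pages, 1)   # placeholder for the real heuristic

def select_bad_process(tasks):
    """Pick the task name with the highest non-zero score, or None."""
    best, best_points = None, 0
    for name, total_vm, oom_adj in tasks:
        points = badness(total_vm, oom_adj)
        if points > best_points:    # a score of 0 can never win
            best, best_points = name, points
    return best
```

Because the OOM_DISABLE test lives inside badness(), the caller needs no separate check for it.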
Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Tue, 16 Jun 2009 22:32:56 +0000 (15:32 -0700)]
oom: move oom_adj value from task_struct to mm_struct
The per-task oom_adj value is a characteristic of its mm more than the
task itself since it's not possible to oom kill any thread that shares the
mm. If a task were to be killed while attached to an mm that could not be
freed because another thread was set to OOM_DISABLE, it would have
been needlessly terminated since there is no potential for future memory
freeing.
This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct mm_struct. This requires task_lock() on a
task to check its oom_adj value to protect against exec, but it's already
necessary to take the lock when dereferencing the mm to find the total VM
size for the badness heuristic.
This fixes a livelock if the oom killer chooses a task and another thread
sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs
because oom_kill_task() repeatedly returns 1 and refuses to kill the
chosen task while select_bad_process() will repeatedly choose the same
task during the next retry.
Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
oom_kill_task() to check for threads sharing the same memory will be
removed in the next patch in this series where it will no longer be
necessary.
Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
these threads are immune from oom killing already. They simply report an
oom_adj value of OOM_DISABLE.
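To see why storing the value in the mm removes the disagreement between threads, consider this toy Python model. The class and field names merely mirror the kernel structures; nothing else is taken from the patch.

```python
from dataclasses import dataclass

OOM_DISABLE = -17

@dataclass
class MM:
    total_vm: int
    oom_adj: int = 0      # lives here now, shared by every user of the mm

@dataclass
class Task:
    name: str
    mm: MM

def killable(task):
    """A task is a valid victim only if its mm is not OOM_DISABLE."""
    return task.mm.oom_adj != OOM_DISABLE

shared = MM(total_vm=5000)
a, b = Task("worker-a", shared), Task("worker-b", shared)
shared.oom_adj = OOM_DISABLE   # set through either task, seen by both
```

With a per-task value, worker-a could look killable while worker-b pinned the memory; with a per-mm value the two can no longer disagree.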
Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently we can tell from swap_map whether a swap entry is used only as
SwapCache, without looking up the swap cache.
That gives us a chance to reuse swap-cache-only swap entries in
get_swap_pages().
This patch tries to free swap-cache-only swap entries when swap space runs
short.
Note: We hit the following path when the swap_cluster code cannot find a free
cluster. So vm_swap_full() is not the only condition that allows the kernel
to reclaim unused swap.
This is a part of the patches for fixing memcg's swap accounting leak.
But, IMHO, not a bad patch even if no memcg.
There are 2 kinds of references to swap.
- reference from swap entry
- reference from swap cache
Then,
- If there is swap cache && swap's refcnt is 1, there is only swap cache.
(*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL
This counting logic has worked well for a long time. But considering
that we cannot know from swap_map[] whether there is a _real_ reference or not,
the current usage of the counter is not very good.
This patch adds a flag SWAP_HAS_CACHE and records whether a swap
entry has a cache or not. This will remove the -1 magic used in swapfile.c
and help to avoid unnecessary find_get_page() calls.
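One way to picture such an encoding is a per-entry value with a reference count in the low bits and a single flag bit for the swap cache. The bit position and the helper names below are assumptions made for this Python sketch, not values taken from swapfile.c.

```python
SWAP_HAS_CACHE = 0x8000               # assumed flag bit
SWAP_COUNT_MASK = SWAP_HAS_CACHE - 1  # low bits hold the real reference count

def swap_count(ent):
    """Number of real references to the entry, excluding the swap cache."""
    return ent & SWAP_COUNT_MASK

def has_cache(ent):
    return bool(ent & SWAP_HAS_CACHE)

def cache_only(ent):
    """True when the swap cache is the only user, which can be decided
    from the map value alone, without a find_get_page() lookup."""
    return has_cache(ent) and swap_count(ent) == 0

entry = 2                     # two real references, no cache
entry |= SWAP_HAS_CACHE       # page added to the swap cache
entry -= 2                    # both real references dropped
```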
In a following patch, the usage of swap cache is recorded into swap_map.
This patch is for necessary interface changes to do that.
2 interfaces:
- swapcache_prepare()
- swapcache_free()
are added for allocating/freeing the swap-cache refcnt on existing swap
entries. The implementation itself is not changed by this patch. With the
addition of swapcache_free(), memcg's hook code is moved into
swapcache_free(). This is better than using scattered hooks.
Minchan Kim [Tue, 16 Jun 2009 22:32:49 +0000 (15:32 -0700)]
page-allocator: add inactive ratio calculation function of each zone
Factor the per-zone arithmetic inside setup_per_zone_inactive_ratio()'s
loop into a separate function, calculate_zone_inactive_ratio(). This
function will be used in a later patch.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Tue, 16 Jun 2009 22:32:48 +0000 (15:32 -0700)]
page-allocator: clean up functions related to pages_min
Change the names of two functions. It doesn't affect behavior.
Presently, setup_per_zone_pages_min() changes the low and high watermarks of
a zone as well as min. So a better name is setup_per_zone_wmarks(). That's
because Mel changed zone->pages_[high/low/min] to the zone->watermark array in
"page allocator: replace the watermark-related union in struct zone with a
watermark[] array".
page-allocator: use integer fields lookup for gfp_zone and check for errors in flags passed to the page allocator
This simplifies the code in gfp_zone() and also keeps the ability of the
compiler to use constant folding to get rid of gfp_zone processing.
The lookup of the zone is done using a bitfield stored in an integer. So
the code in gfp_zone is a simple extraction of bits from a constant
bitfield. The compiler is generating a load of a constant into a register
and then performs a shift and mask operation to get the zone from a gfp_t.
No cachelines are touched and no branches have to be predicted by the
processor.
We are doing some macro tricks here to convince the compiler to always do
the constant folding if possible.
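The lookup technique can be sketched in Python as follows. The zone numbering, flag bits and table layout here are a simplified, hypothetical model chosen for readability; they are not the definitions from gfp.h, and Python of course does no constant folding.

```python
# Hypothetical zone numbering and zone-modifier bits for this sketch.
ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM = 0, 1, 2, 3
GFP_DMA, GFP_HIGHMEM, GFP_DMA32 = 0x1, 0x2, 0x4
GFP_ZONEMASK = GFP_DMA | GFP_HIGHMEM | GFP_DMA32
ZONES_SHIFT = 2   # bits needed to store one zone number

# Each valid flag combination indexes a ZONES_SHIFT-wide slot in one integer.
GFP_ZONE_TABLE = (
    (ZONE_NORMAL << (0 * ZONES_SHIFT))
    | (ZONE_DMA << (GFP_DMA * ZONES_SHIFT))
    | (ZONE_HIGHMEM << (GFP_HIGHMEM * ZONES_SHIFT))
    | (ZONE_DMA32 << (GFP_DMA32 * ZONES_SHIFT))
)

# One bit per flag combination that names more than one zone.
GFP_ZONE_BAD = sum(1 << combo for combo in range(GFP_ZONEMASK + 1)
                   if bin(combo).count("1") > 1)

def gfp_zone(flags):
    """Extract the zone for `flags` with one shift and one mask."""
    bits = flags & GFP_ZONEMASK
    if (GFP_ZONE_BAD >> bits) & 1:
        raise ValueError(f"invalid zone modifier combination: {bits:#x}")
    return (GFP_ZONE_TABLE >> (bits * ZONES_SHIFT)) & ((1 << ZONES_SHIFT) - 1)
```

The second constant plays the role of the error check: a combination that sets two zone bits is rejected instead of silently mapping to some zone.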
Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Tue, 16 Jun 2009 22:32:45 +0000 (15:32 -0700)]
mm: check the argument of kunmap on architectures without highmem
If you're using a non-highmem architecture, passing an argument with the
wrong type to kunmap() doesn't give you a warning because the ifdef
doesn't check the type.
Using a static inline function solves the problem nicely.
Reported-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MinChan Kim [Tue, 16 Jun 2009 22:32:44 +0000 (15:32 -0700)]
vmscan: prevent shrinking of active anon lru list in case of no swap space V3
shrink_zone() can deactivate active anon pages even if we don't have a
swap device. Many embedded products don't have a swap device. So the
deactivation of anon pages is unnecessary.
This patch prevents unnecessary deactivation of anon lru pages. However, it
doesn't prevent the aging of anon pages for swap out when swap is available.
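In condensed form, the check amounts to something like the following Python sketch. The function and parameter names are hypothetical stand-ins for the kernel's own state, and the real condition in shrink_zone() may be structured differently.

```python
def should_shrink_active_anon(inactive_anon_is_low, nr_swap_pages):
    """Deactivate active anon pages only when it can lead to swap-out."""
    # Without any swap space, moving anon pages to the inactive list
    # cannot free memory, so the work is skipped entirely.
    return inactive_anon_is_low and nr_swap_pages > 0
```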
Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Brice Goglin [Tue, 16 Jun 2009 22:32:43 +0000 (15:32 -0700)]
migration: only migrate_prep() once per move_pages()
migrate_prep() is fairly expensive (72us on 16-core barcelona 1.9GHz).
Commit 3140a2273009c01c27d316f35ab76a37e105fdd8 improved move_pages()
throughput by breaking it into chunks, but it also made migrate_prep() be
called once per chunk (every 128pages or so) instead of once per
move_pages().
This patch reverts to calling migrate_prep() only once per move_pages() call,
as we did before 2.6.29. It is also a followup to commit
0aedadf91a70a11c4a3e7c7d99b21e5528af8d5d ("mm: move migrate_prep out from
under mmap_sem").
This improves migration throughput on the above machine from 600MB/s to
750MB/s.
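A back-of-the-envelope Python sketch of the overhead involved, using the 72us cost and the 128-page chunk size quoted above. The request size is a made-up example and the function name is hypothetical.

```python
import math

CHUNK_PAGES = 128        # "every 128 pages or so"
MIGRATE_PREP_US = 72     # measured cost quoted above

def prep_overhead_us(nr_pages, once_per_call):
    """Total time spent in migrate_prep() for one move_pages() call."""
    calls = 1 if once_per_call else math.ceil(nr_pages / CHUNK_PAGES)
    return calls * MIGRATE_PREP_US

pages = 1 << 18   # hypothetical request: 1 GiB worth of 4 KiB pages
per_chunk = prep_overhead_us(pages, once_per_call=False)
per_call = prep_overhead_us(pages, once_per_call=True)
```

For this example the per-chunk variant spends 2048 times as long in migrate_prep() as the once-per-call variant.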
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, PM/Freezer: Disable OOM killer when tasks are frozen
Currently, the following scenario appears to be possible in theory:
* Tasks are frozen for hibernation or suspend.
* Free pages are almost exhausted.
* Certain piece of code in the suspend code path attempts to allocate
some memory using GFP_KERNEL and allocation order less than or
equal to PAGE_ALLOC_COSTLY_ORDER.
* __alloc_pages_internal() cannot find a free page so it invokes the
OOM killer.
* The OOM killer attempts to kill a task, but the task is frozen, so
it doesn't die immediately.
* __alloc_pages_internal() jumps to 'restart', unsuccessfully tries
to find a free page and invokes the OOM killer.
* No progress can be made.
Although it is now hard to trigger during hibernation due to the memory
shrinking carried out by the hibernation code, it is theoretically
possible to trigger during suspend after the memory shrinking has been
removed from that code path. Moreover, since memory allocations are
going to be used for the hibernation memory shrinking, it will be even
more likely to happen during hibernation.
To prevent it from happening, introduce the oom_killer_disabled switch
that will cause __alloc_pages_internal() to fail in the situations in
which the OOM killer would have been called and make the freezer set
this switch after tasks have been successfully frozen.
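The scenario and the fix can be replayed with a small Python model. It is a deliberately crude sketch: the function, its parameters and the retry cap are invented, and "freeing one page" stands in for a victim task exiting.

```python
def alloc_page(free_pages, tasks_frozen, oom_killer_disabled, max_retries=1000):
    """Return 'ok', 'failed' or 'livelock' for one allocation attempt loop."""
    for _ in range(max_retries):
        if free_pages > 0:
            return "ok"
        if oom_killer_disabled:
            return "failed"          # fail the allocation instead of killing
        if not tasks_frozen:
            free_pages += 1          # the victim exits and releases memory
        # A frozen victim cannot exit, so nothing is freed and we retry.
    return "livelock"
```

With tasks frozen and the switch off the loop never makes progress; with the switch on the allocation fails cleanly and the caller can handle the error.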
[akpm@linux-foundation.org: be nicer to the namespace] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Fengguang Wu <fengguang.wu@gmail.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Tue, 16 Jun 2009 22:32:38 +0000 (15:32 -0700)]
mm: madvise(): correct return code
The posix_madvise() function succeeds (and does nothing) when called with
parameters (NULL, 0, -1); according to LSB tests, it should fail with
EINVAL because -1 is not a valid flag.
When called with a valid address and size, it correctly fails.
So perform an initial check for valid flags first.
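The ordering of the checks is the whole point, as this Python sketch shows. It models only the argument validation; the set of accepted advice values is a small illustrative subset and the function name is hypothetical.

```python
import errno

# Illustrative subset of advice values accepted by this model.
VALID_ADVICE = {0, 1, 2, 3, 4}

def madvise_model(start, length, advice):
    """Return 0 on success or a negative errno, checking `advice` first."""
    if advice not in VALID_ADVICE:
        return -errno.EINVAL      # rejected even for a zero-length range
    if length == 0:
        return 0                  # nothing to do
    return 0                      # stand-in for applying the advice

result = madvise_model(0, 0, -1)
```

If the zero-length shortcut came first, the call (NULL, 0, -1) would return success, which is the behavior the LSB test complains about.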
Reported-by: Jiri Dluhos <jdluhos@novell.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-and-Tested-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 16 Jun 2009 22:32:37 +0000 (15:32 -0700)]
page-allocator: warn if __GFP_NOFAIL is used for a large allocation
__GFP_NOFAIL is a bad fiction. Allocations _can_ fail, and callers should
detect and suitably handle this (and not by lamely moving the infinite
loop up to the caller level either).
Attempting to use __GFP_NOFAIL for a higher-order allocation is even
worse, so add a once-off runtime check for this to slap people around for
even thinking about trying it.
Cc: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Magnus Damm [Tue, 16 Jun 2009 22:32:36 +0000 (15:32 -0700)]
videobuf-dma-contig: zero copy USERPTR support
Since videobuf-dma-contig is designed to handle physically contiguous
memory, this patch modifies the videobuf-dma-contig code to only accept a
user space pointer to physically contiguous memory. For now only
VM_PFNMAP vmas are supported, so forget hotplug.
On SuperH Mobile we use this with our sh_mobile_ceu_camera driver together
with various multimedia accelerator blocks that are exported to user space
using UIO. The UIO kernel code exports physically contiguous memory to
user space and lets the user space application mmap() this memory and pass
a pointer using the USERPTR interface for V4L2 zero copy operation.
With this approach we support zero copy capture, hardware scaling and
various forms of hardware encoding and decoding.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Hans Verkuil <hverkuil@xs4all.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wu Fengguang [Tue, 16 Jun 2009 22:32:31 +0000 (15:32 -0700)]
vmscan: ZVC updates in shrink_active_list() can be done once
This effectively lifts the unit of updates to nr_inactive_* and
pgdeactivate from PAGEVEC_SIZE=14 to SWAP_CLUSTER_MAX=32, or
MAX_ORDER_NR_PAGES=1024 for reclaim_zone().
Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wu Fengguang [Tue, 16 Jun 2009 22:32:30 +0000 (15:32 -0700)]
vmscan: don't export nr_saved_scan in /proc/zoneinfo
The lru->nr_saved_scan's are not meaningful counters, even for kernel
developers. They typically are smaller than 32 and are always 0 for large
lists. So remove them from /proc/zoneinfo.
Hopefully this interface change won't break too many scripts.
/proc/zoneinfo is too unstructured to be script friendly, and I wonder whether
the affected scripts - if there are any - are still bleeding since the
recent commit "vmscan: split LRU lists into anon & file sets", which
also touched the "scanned" line :)
If we are to re-export accumulated vmscan counts in the future, they can
go to new lines in /proc/zoneinfo instead of the current form, or to
/sys/devices/system/node/node0/meminfo?
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rik van Riel [Tue, 16 Jun 2009 22:32:28 +0000 (15:32 -0700)]
vmscan: evict use-once pages first
When the file LRU lists are dominated by streaming IO pages, evict those
pages first, before considering evicting other pages.
This should be safe from deadlocks or performance problems
because only three things can happen to an inactive file page:
1) referenced twice and promoted to the active list
2) evicted by the pageout code
3) under IO, after which it will get evicted or promoted
The pages freed in this way can either be reused for streaming IO, or
allocated for something else. If the pages are used for streaming IO,
this pageout pattern continues. Otherwise, we will fall back to the
normal pageout pattern.
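As a rough Python sketch of the policy: the threshold used here (inactive file list larger than the active one) is an assumption standing in for "dominated by streaming IO", and the function name and list labels are invented for illustration.

```python
def lists_to_scan(nr_active_file, nr_inactive_file):
    """Choose which file LRU lists to scan in this simplified policy."""
    if nr_inactive_file > nr_active_file:
        # Streaming IO dominates: reclaim use-once pages only and leave
        # the active (frequently used) file pages alone.
        return ["inactive_file"]
    # Otherwise fall back to the normal pattern and age both lists.
    return ["active_file", "inactive_file"]
```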
Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Elladan <elladan@eskimo.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wu Fengguang [Tue, 16 Jun 2009 22:32:27 +0000 (15:32 -0700)]
pagemap: add page-types tool
Add page-types, a handy tool for querying page flags.
It will expand some of the overloaded flags:
PG_slob_free = PG_private
PG_slub_frozen = PG_active
PG_slub_debug = PG_error
PG_readahead = PG_reclaim
and mask out obscure flags except in -raw mode:
PG_reserved
PG_mlocked
PG_mappedtodisk
PG_private
PG_private_2
PG_owner_priv_1
PG_arch_1
PG_uncached
PG_compound* for non hugeTLB pages
Wu Fengguang [Tue, 16 Jun 2009 22:32:22 +0000 (15:32 -0700)]
mm: introduce PageHuge() for testing huge/gigantic pages
A series of patches to enhance the /proc/pagemap interface and to add a
userspace executable which can be used to present the pagemap data.
Export 10 more flags to end users (and more for kernel developers):
11. KPF_MMAP (pseudo flag) memory mapped page
12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
13. KPF_SWAPCACHE page is in swap cache
14. KPF_SWAPBACKED page is swap/RAM backed
15. KPF_COMPOUND_HEAD (*)
16. KPF_COMPOUND_TAIL (*)
17. KPF_HUGE hugeTLB pages
18. KPF_UNEVICTABLE page is in the unevictable LRU list
19. KPF_HWPOISON hardware detected corruption
20. KPF_NOPAGE (pseudo flag) no page frame at the address
(*) For compound pages, exporting _both_ head/tail info enables
users to tell where a compound page starts/ends, and its order.
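For instance, a user could recover the extent and order of each compound page from the per-page flag words roughly like this. The Python sketch assumes the numbers in the list above are bit positions and that the input is a plain list of flag words for consecutive page frames; reading the flags from the kernel is left out.

```python
KPF_COMPOUND_HEAD = 1 << 15
KPF_COMPOUND_TAIL = 1 << 16

def compound_pages(flags_by_pfn):
    """Yield (first_pfn, nr_pages, order) for each compound page found in a
    list of per-page flag words indexed by consecutive page frame number."""
    pfn = 0
    while pfn < len(flags_by_pfn):
        if flags_by_pfn[pfn] & KPF_COMPOUND_HEAD:
            end = pfn + 1
            while end < len(flags_by_pfn) and flags_by_pfn[end] & KPF_COMPOUND_TAIL:
                end += 1
            nr = end - pfn
            yield pfn, nr, nr.bit_length() - 1   # order = log2(nr_pages)
            pfn = end
        else:
            pfn += 1

H, T = KPF_COMPOUND_HEAD, KPF_COMPOUND_TAIL
found = list(compound_pages([0, H, T, T, T, 0, H, T]))
```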
a simple demo of the page-types tool
# ./page-types -h
page-types [options]
-r|--raw Raw mode, for kernel developers
-a|--addr addr-spec Walk a range of pages
-b|--bits bits-spec Walk pages with specified bits
-l|--list Show page details in ranges
-L|--list-each Show page details one by one
-N|--no-summary Don't show summary info
-h|--help Show this usage message
addr-spec:
N one page at offset N (unit: pages)
N+M pages range from N to N+M-1
N,M pages range from N to M-1
N, pages range from N to end
,M pages range from 0 to M
bits-spec:
bit1,bit2 (flags & (bit1|bit2)) != 0
bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1
bit1,~bit2 (flags & (bit1|bit2)) == bit1
=bit1,bit2 flags == (bit1|bit2)
bit-names:
locked error referenced uptodate
dirty lru active slab
writeback reclaim buddy mmap
anonymous swapcache swapbacked compound_head
compound_tail huge unevictable hwpoison
nopage reserved(r) mlocked(r) mappedtodisk(r)
private(r) private_2(r) owner_private(r) arch(r)
uncached(r) readahead(o) slob_free(o) slub_frozen(o)
slub_debug(o)
(r) raw mode bits (o) overloaded bits
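The addr-spec forms listed in the usage text can be read as follows. This Python sketch is not the tool's own parser: the function name is made up, ranges are returned as inclusive (first, last) pairs exactly as the help text words them, and None stands for "to end".

```python
def parse_addr_spec(spec):
    """Return (first_page, last_page); last_page is None for 'to end'."""
    if "+" in spec:
        start, count = (int(x, 0) for x in spec.split("+", 1))
        return start, start + count - 1        # "N+M": N to N+M-1
    if "," in spec:
        start, end = spec.split(",", 1)
        first = int(start, 0) if start else 0
        if not end:
            return first, None                 # "N,": N to end
        if not start:
            return 0, int(end, 0)              # ",M" is documented as 0 to M
        return first, int(end, 0) - 1          # "N,M": N to M-1
    page = int(spec, 0)
    return page, page                          # "N": a single page
```

For example, parse_addr_spec("0x100+16") describes the 16 pages starting at page offset 0x100.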