Rik van Riel [Mon, 24 Oct 2011 14:54:31 +0000 (01:54 +1100)]
vmscan: limit direct reclaim for higher order allocations
When suffering from memory fragmentation due to unfreeable pages, THP page
faults will repeatedly try to compact memory. Due to the unfreeable
pages, compaction fails.
Needless to say, at that point page reclaim also fails to create free
contiguous 2MB areas. However, that doesn't stop the current code from
trying, over and over again, and freeing a minimum of 4MB (2UL <<
sc->order pages) at every single invocation.
This resulted in my 12GB system having 2-3GB free memory, a corresponding
amount of used swap and very sluggish response times.
This can be avoided by having the direct reclaim code not reclaim from
zones that already have plenty of free memory available for compaction.
If compaction still fails due to unmovable memory, doing additional
reclaim will only hurt the system, not help.
Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Mon, 24 Oct 2011 14:54:30 +0000 (01:54 +1100)]
vmscan: add barrier to prevent evictable page in unevictable list
When a race between putback_lru_page() and shmem_lock with lock=0 happens,
progrom execution order is as follows, but clear_bit in processor #1 could
be reordered right before spin_unlock of processor #1. Then, the page
would be stranded on the unevictable list.
spin_lock
SetPageLRU
spin_unlock
clear_bit(AS_UNEVICTABLE)
spin_lock
if PageLRU()
if !test_bit(AS_UNEVICTABLE)
move evictable list
smp_mb
if !test_bit(AS_UNEVICTABLE)
move evictable list
spin_unlock
But, pagevec_lookup() in scan_mapping_unevictable_pages() has
rcu_read_[un]lock() so it could protect reordering before reaching
test_bit(AS_UNEVICTABLE) on processor #1 so this problem never happens.
But it's a unexpected side effect and we should solve this problem
properly.
This patch adds a barrier after mapping_clear_unevictable.
I didn't meet this problem but just found during review.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
warning: symbol 'default_policy' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Stephen Wilson <wilsons@start.ca> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
warning: symbol 'memblock_overlaps_region' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers,com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Tomi Valkeinen <tomi.valkeinen@nokia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Mon, 24 Oct 2011 14:54:28 +0000 (01:54 +1100)]
mm: disable user interface to manually rescue unevictable pages
At one point, anonymous pages were supposed to go on the unevictable list
when no swap space was configured, and the idea was to manually rescue
those pages after adding swap and making them evictable again. But
nowadays, swap-backed pages on the anon LRU list are not scanned without
available swap space anyway, so there is no point in moving them to a
separate list anymore.
The manual rescue could also be used in case pages were stranded on the
unevictable list due to race conditions. But the code has been around for
a while now and newly discovered bugs should be properly reported and
dealt with instead of relying on such a manual fixup.
In addition to the lack of a usecase, the sysfs interface to rescue pages
from a specific NUMA node has been broken since its introduction, so it's
unlikely that anybody ever relied on that.
This patch removes the functionality behind the sysctl and the
node-interface and emits a one-time warning when somebody tries to access
either of them.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reported-by: Kautuk Consul <consul.kautuk@gmail.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kautuk Consul [Mon, 24 Oct 2011 14:54:28 +0000 (01:54 +1100)]
vmscan.c: fix invalid strict_strtoul() check in write_scan_unevictable_node()
write_scan_unevictable_node() checks the value req returned by
strict_strtoul() and returns 1 if req is 0.
However, when strict_strtoul() returns 0, it means successful conversion
of buf to unsigned long.
Due to this, the function was not proceeding to scan the zones for
unevictable pages even though we write a valid value to the
scan_unevictable_pages sys file.
Change this check slightly to check for invalid value in buf as well as 0
value stored in res after successful conversion via strict_strtoul. In
both cases, we do not perform the scanning of this node's zones.
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kyungmin Park [Mon, 24 Oct 2011 14:54:27 +0000 (01:54 +1100)]
mm: compaction: make compact_zone_order() static
There's no compact_zone_order() user outside file scope, so make it static.
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dean Nelson [Mon, 24 Oct 2011 14:54:27 +0000 (01:54 +1100)]
HWPOISON: convert pr_debug()s to pr_info()s
Commit fb46e73520940b ("HWPOISON: Convert pr_debugs to pr_info) authored
by Andi Kleen converted a number of pr_debug()s to pr_info()s.
About the same time additional code with pr_debug()s was added by two
other commits 8c6c2ecb4466 ("HWPOSION, hugetlb: recover from free hugepage
error when !MF_COUNT_INCREASED") and d950b95882f3d ("HWPOISON, hugetlb:
soft offlining for hugepage"). And these pr_debug()s failed to get
converted to pr_info()s.
This patch converts them as well. And does some minor related whitespace
cleanup.
Signed-off-by: Dean Nelson <dnelson@redhat.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tao Ma [Mon, 24 Oct 2011 14:54:26 +0000 (01:54 +1100)]
fs/buffer.c: add device information for error output in __find_get_block_slow()
On the ext4 mailing list[1], we got some report about errors in
__find_get_block_slow(), but the information is very limited.
If the device information is given, we can know the name of the sick
volume. Futhermore, we can get the corresponding status of that
block(group, inode block etc) by analyzing the disk layout.
Signed-off-by: Tao Ma <boyu.mt@taobao.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kautuk Consul [Mon, 24 Oct 2011 14:54:26 +0000 (01:54 +1100)]
mm/mmap.c: eliminate the ret variable from mm_take_all_locks()
The ret variable is really not needed in mm_take_all_locks().
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 24 Oct 2011 14:54:23 +0000 (01:54 +1100)]
s390: gup_huge_pmd() return 0 if pte changes
s390 didn't return 0 in that case, if it's rolling back the *nr pointer it
should also return zero to avoid adding pages to the array at the wrong
offset.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 24 Oct 2011 14:54:23 +0000 (01:54 +1100)]
powerpc: gup_huge_pmd() return 0 if pte changes
powerpc didn't return 0 in that case, if it's rolling back the *nr pointer
it should also return zero to avoid adding pages to the array at the wrong
offset.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 24 Oct 2011 14:54:22 +0000 (01:54 +1100)]
powerpc: get_hugepte() don't put_page() the wrong page
"page" may have changed to point to the next hugepage after the loop
completed, The references have been taken on the head page, so the
put_page must happen there too.
This is a longstanding issue pre-thp inclusion.
It's totally unclear how these page_cache_add_speculative and pte_val(pte)
!= pte_val(*ptep) checks are necessary across all the powerpc gup_fast
code, when x86 doesn't need any of that: there's no way the page can be
freed with irq disabled so we're guaranteed the atomic_inc will happen on
a page with page_count > 0 (so not needing the speculative check). The
pte check is also meaningless on x86: no need to rollback on x86 if the
pte changed, because the pte can still change a CPU tick after the check
succeeded and it won't be rolled back in that case. The important thing
is we got a reference on a valid page that was mapped there a CPU tick
ago. So not knowing the soft tlb refill code of ppc64 in great detail I'm
not removing the "speculative" page_count increase and the pte checks
across all the code, but unless there's a strong reason for it they should
be later cleaned up too.
If a pte can change from huge to non-huge (like it could happen with THP)
passing a pte_t *ptep to gup_hugepte() would also require to repeat the
is_hugepd in gup_hugepte(), but that shouldn't happen with hugetlbfs only
so I'm not altering that.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 24 Oct 2011 14:54:21 +0000 (01:54 +1100)]
powerpc: remove superfluous PageTail checks on the pte gup_fast
This part of gup_fast doesn't seem capable of handling hugetlbfs ptes,
those should be handled by gup_hugepd only, so these checks are
superfluous.
Plus if this wasn't a noop, it would have oopsed because, the insistence
of using the speculative refcounting would trigger a VM_BUG_ON if a tail
page was encountered in the page_cache_get_speculative().
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 24 Oct 2011 14:54:21 +0000 (01:54 +1100)]
mm: thp: tail page refcounting fix
Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't
safe, if the pfn ended up being a tail page of a transparent hugepage
under splitting by __split_huge_page_refcount(). He then found the
problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().
So the best way to fix the problem is to keep page_tail->_count zero at
all times. This will guarantee that get_page_unless_zero() can never
succeed on any tail page. page_tail->_mapcount is guaranteed zero and is
unused for all tail pages of a compound page, so we can simply account the
tail page references there and transfer them to tail_page->_count in
__split_huge_page_refcount() (in addition to the head_page->_mapcount).
While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic. As
opposed other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns. It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()). get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge). The standard get_page() as invoked by direct-io instead
will now take the compound_lock but still only for tail pages. The
direct-io paths are usually I/O bound and the compound_lock is per THP so
very finegrined, so there's no risk of scalability issues with it. A
simple direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead. So it's worth it. Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages(). The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation and
usually only run in I/O paths.
This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Michel Lespinasse <walken@google.com> Reviewed-by: Michel Lespinasse <walken@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Mon, 24 Oct 2011 14:54:20 +0000 (01:54 +1100)]
mm-add-extra-free-kbytes-tunable-update
All the fixes suggested by Andrew Morton. Not much of a changelog
since the patch should probably be folded into
mm-add-extra-free-kbytes-tunable.patch
Thank you for pointing these out, Andrew.
Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Mon, 24 Oct 2011 14:54:20 +0000 (01:54 +1100)]
mm: add extra free kbytes tunable
Add a userspace visible knob to tell the VM to keep an extra amount of
memory free, by increasing the gap between each zone's min and low
watermarks.
This is useful for realtime applications that call system calls and have a
bound on the number of allocations that happen in any short time period.
In this application, extra_free_kbytes would be left at an amount equal to
or larger than than the maximum number of allocations that happen in any
burst.
It may also be useful to reduce the memory use of virtual machines
(temporarily?), in a way that does not cause memory fragmentation like
ballooning does.
Testing results from Satoru Moriya:
: I ran some sample workloads and measure memory allocation latency
: (latency of __alloc_page_nodemask()).
: The test is like following:
:
: - CPU: 1 socket, 4 core
: - Memory: 4GB
:
: - Background load:
: $ dd if=3D/dev/zero of=3D/tmp/tmp1
: $ dd if=3D/dev/zero of=3D/tmp/tmp2
: $ dd if=3D/dev/zero of=3D/tmp/tmp3
:
: - Main load:
: $ mapped-file-stream 1 $((1024 * 1024 * 640)) --(*)
:
: (*) This is made by Johannes Weiner
: https://lkml.org/lkml/2010/8/30/226
:
: It allocates/access 640MByte memory at a burst.
:
: The result is follwoing:
:
: | | extra |
: | default | kbytes |
: --------------------------------------------------------------
: min_free_kbytes | 8113 | 8113 |
: extra_free_kbytes | 0 | 640*1024 | (KB)
: --------------------------------------------------------------
: worst latency | 517.762 | 20.775 | (usec)
: --------------------------------------------------------------
: vmstat result | | |
: nr_vmscan_write | 0 | 0 |
: pgsteal_dma | 0 | 0 |
: pgsteal_dma32 | 143667 | 144882 |
: pgsteal_normal | 31486 | 27001 |
: pgsteal_movable | 0 | 0 |
: pgscan_kswapd_dma | 0 | 0 |
: pgscan_kswapd_dma32 | 138617 | 156351 |
: pgscan_kswapd_normal | 30593 | 27955 |
: pgscan_kswapd_movable | 0 | 0 |
: pgscan_direct_dma | 0 | 0 |
: pgscan_direct_dma32 | 5050 | 0 |
: pgscan_direct_normal | 896 | 0 |
: pgscan_direct_movable | 0 | 0 |
: kswapd_steal | 169207 | 171883 |
: kswapd_inodesteal | 0 | 0 |
: kswapd_low_wmark_hit_quickly | 43 | 45 |
: kswapd_high_wmark_hit_quickly | 1 | 0 |
: allocstall | 32 | 0 |
:
:
: As you can see, in the default case there were 32 direct reclaim
: (allocstal= l) and its worst latency was 517.762 usecs. This value may be
: larger if a process would sleep or issue I/O in the direct reclaim path.
: OTOH, ii the other case where I add extra free bytes, there were no direct
: reclaim and its worst latency was 20.775 usecs.
:
: In this test case, we can avoid direct reclaim and keep a latency low.
Signed-off-by: Rik van Riel<riel@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Tested-by: Satoru Moriya <satoru.moriya@hds.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After commit v2.6.36-5896-gd065bd8 "mm: retry page fault when blocking on
disk transfer" we usually wait in page-faults without mmap_sem held, so
all swap-token logic was broken, because it based on using
rwsem_is_locked(&mm->mmap_sem) as sign of in progress page-faults.
Add an atomic counter of in progress page-faults for mm to the mm_struct
with swap-token.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alex,Shi [Mon, 24 Oct 2011 14:54:18 +0000 (01:54 +1100)]
kswapd: assign new_order and new_classzone_idx after wakeup in sleeping
There 2 places to read pgdat in kswapd. One is return from a successful
balance, another is waked up from kswapd sleeping. The new_order and
new_classzone_idx represent the balance input order and classzone_idx.
But current new_order and new_classzone_idx are not assigned after
kswapd_try_to_sleep(), that will cause a bug in the following scenario.
1: after a successful balance, kswapd goes to sleep, and new_order = 0;
new_classzone_idx = __MAX_NR_ZONES - 1;
2: kswapd waked up with order = 3 and classzone_idx = ZONE_NORMAL
3: in the balance_pgdat() running, a new balance wakeup happened with
order = 5, and classzone_idx = ZONE_NORMAL
4: the first wakeup(order = 3) finished successufly, return order = 3
but, the new_order is still 0, so, this balancing will be treated as a
failed balance. And then the second tighter balancing will be missed.
So, to avoid the above problem, the new_order and new_classzone_idx need
to be assigned for later successful comparison.
Signed-off-by: Alex Shi <alex.shi@intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Tested-by: Pádraig Brady <P@draigBrady.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch adds helper free_hot_cold_page_list() to free list of 0-order
pages. It frees pages directly from list without temporary page-vector.
It also calls trace_mm_pagevec_free() to simulate pagevec_free()
behaviour.
Alex,Shi [Mon, 24 Oct 2011 14:54:17 +0000 (01:54 +1100)]
kswapd: avoid unnecessary rebalance after an unsuccessful balancing
In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing. But in the following scenario, kswapd do something that
is not matched our expectation. The patch fixes this issue.
1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
While the expectation behavior of kswapd should try to sleep.
Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Pádraig Brady <P@draigBrady.com> Cc: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joe Perches [Mon, 24 Oct 2011 14:54:16 +0000 (01:54 +1100)]
mm: neaten warn_alloc_failed
Add __attribute__((format (printf...) to the function to validate format
and arguments. Use vsprintf extension %pV to avoid any possible message
interleaving. Coalesce format string. Convert printks/pr_warning to
pr_warn.
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jeff Layton [Mon, 24 Oct 2011 14:54:15 +0000 (01:54 +1100)]
mm: iov_iter: have iov_iter_advance() decrement nr_segs appropriately
Currently, when you call iov_iter_advance, then the pointer to the iovec
array can be incremented, but it does not decrement the nr_segs value in
the iov_iter struct. The result is a iov_iter struct with a nr_segs value
that goes beyond the end of the array.
While I'm not aware of anything that's specifically broken by this, it
seems odd and a bit dangerous not to decrement that value. If someone
were to trust the nr_segs value to be correct, then they could end up
walking off the end of the array.
Changing this might also provide some micro-optimization when dealing with
the last iovec in an array. Many of the other routines that deal with
iov_iter have optimized codepaths when nr_segs == 1.
Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sonic Zhang [Mon, 24 Oct 2011 14:54:15 +0000 (01:54 +1100)]
include/asm-generic/page.h: calculate virt_to_page and page_to_virt via predefined macro
On NOMMU architectures, if physical memory doesn't start from 0,
ARCH_PFN_OFFSET is defined to generate page index in mem_map array.
Because virtual address is equal to physical address, PAGE_OFFSET is
always 0. virt_to_page and page_to_virt should not index page by
PAGE_OFFSET directly.
Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
thp_split 0 confirms no thp split despite plenty of hugepages allocated.
The measurement of only the mremap time (so excluding the 3 long
memset and final long 10GB memory accessing memcmp):
THP on
usec 14824
usec 14862
usec 14859
THP off
usec 256416
usec 255981
usec 255847
With an older kernel without the mremap optimizations (the below patch
optimizes the non THP version too).
THP on
usec 392107
usec 390237
usec 404124
THP off
usec 444294
usec 445237
usec 445820
I guess with a threaded program that sends more IPI on large SMP it'd
create an even larger difference.
All debug options are off except DEBUG_VM to avoid skewing the
results.
The only problem for native 2M mremap like it happens above both the
source and destination address must be 2M aligned or the hugepmd can't be
moved without a split but that is an hardware limitation.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 24 Oct 2011 14:54:13 +0000 (01:54 +1100)]
mremap: avoid sending one IPI per page
This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
flush_tlb_range() at the end of the loop, to avoid sending one IPI for
each page.
The mmu_notifier_invalidate_range_start/end section is enlarged
accordingly but this is not going to fundamentally change things. It was
more by accident that the region under mremap was for the most part still
available for secondary MMUs: the primary MMU was never allowed to
reliably access that region for the duration of the mremap (modulo
trapping SIGSEGV on the old address range which sounds unpractical and
flakey). If users wants secondary MMUs not to lose access to a large
region under mremap they should reduce the mremap size accordingly in
userland and run multiple calls. Overall this will run faster so it's
actually going to reduce the time the region is under mremap for the
primary MMU which should provide a net benefit to apps.
For KVM this is a noop because the guest physical memory is never
mremapped, there's just no point it ever moving it while guest runs. One
target of this optimization is JVM GC (so unrelated to the mmu notifier
logic).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 24 Oct 2011 14:54:13 +0000 (01:54 +1100)]
mremap: check for overflow using deltas
Using "- 1" relies on the old_end to be page aligned and PAGE_SIZE > 1,
those are reasonable requirements but the check remains obscure and it
looks more like an off by one error than an overflow check. This I feel
will improve readability.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sam Ravnborg [Mon, 24 Oct 2011 14:54:12 +0000 (01:54 +1100)]
memblock: add memblock_start_of_DRAM()
SPARC32 require access to the start address. Add a new helper
memblock_start_of_DRAM() to give access to the address of the first
memblock - which contains the lowest address.
The awkward name was chosen to match the already present
memblock_end_of_DRAM().
Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
vmscan: activate executable pages after first usage
Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages
the first class citizen") was noticeably weakened in commit 645747462435d84 ("vmscan: detect mapped file pages used only once").
Currently these pages can become "first class citizens" only after second
usage. After this patch page_check_references() will activate they after
first usage, and executable code gets yet better chance to stay in memory.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 645747462435 ("vmscan: detect mapped file pages used only once")
greatly decreases lifetime of single-used mapped file pages.
Unfortunately it also decreases life time of all shared mapped file pages.
Because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed
in fault path") page-fault handler does not mark page active or even
referenced.
Thus page_check_references() activates file page only if it was used twice
while it stays in inactive list, meanwhile it activates anon pages after
first access. Inactive list can be small enough, this way reclaimer can
accidentally throw away any widely used page if it wasn't used twice in
short period.
After this patch page_check_references() also activate file mapped page at
first inactive list scan if this page is already used multiple times via
several ptes.
I found this while trying to fix degragation in rhel6 (~2.6.32) from rhel5
(~2.6.18). There a complete mess with >100 web/mail/spam/ftp containers,
they share all their files but there a lot of anonymous pages: ~500mb
shared file mapped memory and 15-20Gb non-shared anonymous memory. In
this situation major-pagefaults are very costly, because all containers
share the same page. In my load kernel created a disproportionate
pressure on the file memory, compared with the anonymous, they equaled
only if I raise swappiness up to 150 =)
These patches actually wasn't helped a lot in my problem, but I saw
noticable (10-20 times) reduce in count and average time of
major-pagefault in file-mapped areas.
Actually both patches are fixes for commit v2.6.33-5448-g6457474, because
it was aimed at one scenario (singly used pages), but it breaks the logic
in other scenarios (shared and/or executable pages)
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mitsuo Hayasaka [Mon, 24 Oct 2011 14:54:11 +0000 (01:54 +1100)]
mm: avoid null pointer access in vm_struct via /proc/vmallocinfo
The /proc/vmallocinfo shows information about vmalloc allocations in
vmlist that is a linklist of vm_struct. It, however, may access pages
field of vm_struct where a page was not allocated. This results in a null
pointer access and leads to a kernel panic.
Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
some fields of vm_struct such as nr_pages and pages are set at
__vmalloc_area_node(). In other words, it is added to vmlist before it is
fully initialized. At the same time, when the /proc/vmallocinfo is read,
it accesses the pages field of vm_struct according to the nr_pages field
at show_numa_info(). Thus, a null pointer access happens.
The patch adds the newly allocated vm_struct to the vmlist *after* it is
fully initialized. So, it can avoid accessing the pages field with
unallocated page when show_numa_info() is called.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use atomic-long operations instead of looping around cmpxchg().
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A shrinker function can return -1, means that it cannot do anything
without a risk of deadlock. For example prune_super() does this if it
cannot grab a superblock refrence, even if nr_to_scan=0. Currently we
interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan'
according to this. So the next time around this shrinker can cause really
big pressure. Let's skip such shrinkers instead.
Also make total_scan signed, otherwise the check (total_scan < 0) below
never works.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Mon, 24 Oct 2011 14:54:09 +0000 (01:54 +1100)]
lib/string.c: introduce memchr_inv()
memchr_inv() is mainly used to check whether the whole buffer is filled
with just a specified byte.
The function name and prototype are stolen from logfs and the
implementation is from SLUB.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Acked-by: Joern Engel <joern@logfs.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Mon, 24 Oct 2011 14:54:08 +0000 (01:54 +1100)]
vmscan: count pages into balanced for zone with good watermark
It's possible a zone watermark is ok when entering the balance_pgdat()
loop, while the zone is within the requested classzone_idx. Count pages
from this zone into `balanced'. In this way, we can skip shrinking zones
too much for high order allocation.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 24 Oct 2011 14:54:08 +0000 (01:54 +1100)]
mm: vmscan: immediately reclaim end-of-LRU dirty pages when writeback completes
When direct reclaim encounters a dirty page, it gets recycled around the
LRU for another cycle. This patch marks the page PageReclaim similar to
deactivate_page() so that the page gets reclaimed almost immediately after
the page gets cleaned. This is to avoid reclaiming clean pages that are
younger than a dirty page encountered at the end of the LRU that might
have been something like a use-once page.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <jweiner@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch expands on a comment on how we throttle from reclaim context.
It should be merged with
mm-vmscan-throttle-reclaim-if-encountering-too-many-dirty-pages-under-writeback.patch
Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 24 Oct 2011 14:54:07 +0000 (01:54 +1100)]
mm: vmscan: throttle reclaim if encountering too many dirty pages under writeback
Workloads that are allocating frequently and writing files place a large
number of dirty pages on the LRU. With use-once logic, it is possible for
them to reach the end of the LRU quickly requiring the reclaimer to scan
more to find clean pages. Ordinarily, processes that are dirtying memory
will get throttled by dirty balancing but this is a global heuristic and
does not take into account that LRUs are maintained on a per-zone basis.
This can lead to a situation whereby reclaim is scanning heavily, skipping
over a large number of pages under writeback and recycling them around the
LRU consuming CPU.
This patch checks how many of the number of pages isolated from the LRU
were dirty and under writeback. If a percentage of them under writeback,
the process will be throttled if a backing device or the zone is
congested. Note that this applies whether it is anonymous or file-backed
pages that are under writeback meaning that swapping is potentially
throttled. This is intentional due to the fact if the swap device is
congested, scanning more pages and dispatching more IO is not going to
help matters.
The percentage that must be in writeback depends on the priority. At
default priority, all of them must be dirty. At DEF_PRIORITY-1, 50% of
them must be, DEF_PRIORITY-2, 25% etc. i.e. as pressure increases the
greater the likelihood the process will get throttled to allow the flusher
threads to make some progress.
Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 24 Oct 2011 14:54:06 +0000 (01:54 +1100)]
mm: vmscan: do not writeback filesystem pages in kswapd except in high priority
It is preferable that no dirty pages are dispatched for cleaning from the
page reclaim path. At normal priorities, this patch prevents kswapd
writing pages.
However, page reclaim does have a requirement that pages be freed in a
particular zone. If it is failing to make sufficient progress (reclaiming
< SWAP_CLUSTER_MAX at any priority priority), the priority is raised to
scan more pages. A priority of DEF_PRIORITY - 3 is considered to be the
point where kswapd is getting into trouble reclaiming pages. If this
priority is reached, kswapd will dispatch pages for writing.
Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 24 Oct 2011 14:54:05 +0000 (01:54 +1100)]
mm: vmscan: remove dead code related to lumpy reclaim waiting on pages under writeback
Lumpy reclaim worked with two passes - the first which queued pages for IO
and the second which waited on writeback. As direct reclaim can no longer
write pages there is some dead code. This patch removes it but direct
reclaim will continue to wait on pages under writeback while in
synchronous reclaim mode.
Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 24 Oct 2011 14:54:05 +0000 (01:54 +1100)]
mm: vmscan: do not writeback filesystem pages in direct reclaim
Testing from the XFS folk revealed that there is still too much I/O from
the end of the LRU in kswapd. Previously it was considered acceptable by
VM people for a small number of pages to be written back from reclaim with
testing generally showing about 0.3% of pages reclaimed were written back
(higher if memory was low). That writing back a small number of pages is
ok has been heavily disputed for quite some time and Dave Chinner
explained it well;
It doesn't have to be a very high number to be a problem. IO
is orders of magnitude slower than the CPU time it takes to
flush a page, so the cost of making a bad flush decision is
very high. And single page writeback from the LRU is almost
always a bad flush decision.
To complicate matters, filesystems respond very differently to requests
from reclaim according to Christoph Hellwig;
xfs tries to write it back if the requester is kswapd
ext4 ignores the request if it's a delayed allocation
btrfs ignores the request
As a result, each filesystem has different performance characteristics
when under memory pressure and there are many pages being dirtied. In
some cases, the request is ignored entirely so the VM cannot depend on the
IO being dispatched.
The objective of this series is to reduce writing of filesystem-backed
pages from reclaim, play nicely with writeback that is already in progress
and throttle reclaim appropriately when writeback pages are encountered.
The assumption is that the flushers will always write pages faster than if
reclaim issues the IO.
A secondary goal is to avoid the problem whereby direct reclaim splices
two potentially deep call stacks together.
There is a potential new problem as reclaim has less control over how long
before a page in a particularly zone or container is cleaned and direct
reclaimers depend on kswapd or flusher threads to do the necessary work.
However, as filesystems sometimes ignore direct reclaim requests already,
it is not expected to be a serious issue.
Patch 1 disables writeback of filesystem pages from direct reclaim
entirely. Anonymous pages are still written.
Patch 2 removes dead code in lumpy reclaim as it is no longer able
to synchronously write pages. This hurts lumpy reclaim but
there is an expectation that compaction is used for hugepage
allocations these days and lumpy reclaim's days are numbered.
Patches 3-4 add warnings to XFS and ext4 if called from
direct reclaim. With patch 1, this "never happens" and is
intended to catch regressions in this logic in the future.
Patch 5 disables writeback of filesystem pages from kswapd unless
the priority is raised to the point where kswapd is considered
to be in trouble.
Patch 6 throttles reclaimers if too many dirty pages are being
encountered and the zones or backing devices are congested.
Patch 7 invalidates dirty pages found at the end of the LRU so they
are reclaimed quickly after being written back rather than
waiting for a reclaimer to find them
I consider this series to be orthogonal to the writeback work but it is
worth noting that the writeback work affects the viability of patch 8 in
particular.
I tested this on ext4 and xfs using fs_mark, a simple writeback test based
on dd and a micro benchmark that does a streaming write to a large mapping
(exercises use-once LRU logic) followed by streaming writes to a mix of
anonymous and file-backed mappings. The command line for fs_mark when
botted with 512M looked something like
The number of files was adjusted depending on the amount of available
memory so that the files created was about 3xRAM. For multiple threads,
the -d switch is specified multiple times.
The test machine is x86-64 with an older generation of AMD processor with
4 cores. The underlying storage was 4 disks configured as RAID-0 as this
was the best configuration of storage I had available. Swap is on a
separate disk. Dirty ratio was tuned to 40% instead of the default of
20%.
Testing was run with and without monitors to both verify that the patches
were operating as expected and that any performance gain was real and not
due to interference from monitors.
Here is a summary of results based on testing XFS.
Up until 512-4X, the FSmark improvements were statistically significant.
For the 4X and 16X tests the results were within standard deviations but
just barely. The time to completion for all tests is improved which is an
important result. In general, kswapd efficiency is not affected by
skipping dirty pages.
All FSMark tests up to 16X had statistically significant improvements.
For the most part, tests are completing faster with the exception of the
streaming writes to a mixture of anonymous and file-backed mappings which
were slower in two cases
In the cases where the mmap-strm tests were slower, there was more
swapping due to dirty pages being skipped. The number of additional pages
swapped is almost identical to the fewer number of pages written from
reclaim. In other words, roughly the same number of pages were reclaimed
but swapping was slower. As the test is a bit unrealistic and stresses
memory heavily, the small shift is acceptable.
Unlike the other tests, the fsmark results are not statistically
significant but the min and max times are both improved and for the most
part, tests completed faster.
There are other indications that this is an improvement as well. For
example, in the vast majority of cases, there were fewer pages scanned by
direct reclaim implying in many cases that stalls due to direct reclaim
are reduced. KSwapd is scanning more due to skipping dirty pages which is
unfortunate but the CPU usage is still acceptable
In an earlier set of tests, I used blktrace and in almost all cases
throughput throughout the entire test was higher. However, I ended up
discarding those results as recording blktrace data was too heavy for my
liking.
On a laptop, I plugged in a USB stick and ran a similar tests of tests
using it as backing storage. A desktop environment was running and for
the entire duration of the tests, firefox and gnome terminal were
launching and exiting to vaguely simulate a user.
1024M-xfs Files/s mean 0.41 ( 0.00%) 0.44 ( 6.82%)
1024M-xfs Elapsed Time fsmark 2053.52 1641.03
1024M-xfs Elapsed Time simple-wb 1229.53 768.05
1024M-xfs Elapsed Time mmap-strm 4126.44 4597.03
1024M-xfs Kswapd efficiency fsmark 84% 85%
1024M-xfs Kswapd efficiency simple-wb 92% 81%
1024M-xfs Kswapd efficiency mmap-strm 60% 51%
1024M-xfs Avg wait ms fsmark 5404.53 4473.87
1024M-xfs Avg wait ms simple-wb 2541.35 1453.54
1024M-xfs Avg wait ms mmap-strm 3400.25 3852.53
The mmap-strm results were hurt because firefox launching had a tendency
to push the test out of memory. On the postive side, firefox launched
marginally faster with the patches applied. Time to completion for many
tests was faster but more importantly - the "Avg wait" time as measured by
iostat was far lower implying the system would be more responsive. It was
also the case that "Avg wait ms" on the root filesystem was lower. I
tested it manually and while the system felt slightly more responsive
while copying data to a USB stick, it was marginal enough that it could be
my imagination.
This patch: do not writeback filesystem pages in direct reclaim.
When kswapd is failing to keep zones above the min watermark, a process
will enter direct reclaim in the same manner kswapd does. If a dirty page
is encountered during the scan, this page is written to backing storage
using mapping->writepage.
This causes two problems. First, it can result in very deep call stacks,
particularly if the target storage or filesystem are complex. Some
filesystems ignore write requests from direct reclaim as a result. The
second is that a single-page flush is inefficient in terms of IO. While
there is an expectation that the elevator will merge requests, this does
not always happen. Quoting Christoph Hellwig;
The elevator has a relatively small window it can operate on,
and can never fix up a bad large scale writeback pattern.
This patch prevents direct reclaim writing back filesystem pages by
checking if current is kswapd. Anonymous pages are still written to swap
as there is not the equivalent of a flusher thread for anonymous pages.
If the dirty pages cannot be written back, they are placed back on the LRU
lists. There is now a direct dependency on dirty page balancing to
prevent too many pages in the system being dirtied which would prevent
reclaim making forward progress.
Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Some kernel components pin user space memory (infiniband and perf) (by
increasing the page count) and account that memory as "mlocked".
The difference between mlocking and pinning is:
A. mlocked pages are marked with PG_mlocked and are exempt from
swapping. Page migration may move them around though.
They are kept on a special LRU list.
B. Pinned pages cannot be moved because something needs to
directly access physical memory. They may not be on any
LRU list.
I recently saw an mlockalled process where mm->locked_vm became
bigger than the virtual size of the process (!) because some
memory was accounted for twice:
Once when the page was mlocked and once when the Infiniband
layer increased the refcount because it needt to pin the RDMA
memory.
This patch introduces a separate counter for pinned pages and
accounts them seperately.
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Mike Marciniszyn <infinipath@qlogic.com> Cc: Roland Dreier <roland@kernel.org> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Mon, 24 Oct 2011 14:54:03 +0000 (01:54 +1100)]
mm: vmscan: drop nr_force_scan[] from get_scan_count
The nr_force_scan[] tuple holds the effective scan numbers for anon and
file pages in case the situation called for a forced scan and the
regularly calculated scan numbers turned out zero.
However, the effective scan number can always be assumed to be
SWAP_CLUSTER_MAX right before the division into anon and file. The
numerators and denominator are properly set up for all cases, be it force
scan for just file, just anon, or both, to do the right thing.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Robert P. J. Day [Mon, 24 Oct 2011 14:54:03 +0000 (01:54 +1100)]
tmpfs: add "tmpfs" to the Kconfig prompt to make it obvious.
Add the leading word "tmpfs" to the Kconfig string to make it blindingly
obvious that this selection refers to tmpfs.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Mon, 24 Oct 2011 14:54:02 +0000 (01:54 +1100)]
oom: fix race while temporarily setting current's oom_score_adj
test_set_oom_score_adj() was introduced in 72788c385604 ("oom: replace
PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
current's oom_score_adj for ksm and swapoff without requiring an
additional per-process flag.
Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
then reinstate the previous value is racy since it's possible that
userspace can set the value to something else itself before the old value
is reinstated. That results in userspace setting current's oom_score_adj
to a different value and then the kernel immediately setting it back to
its previous value without notification.
To fix this, a new compare_swap_oom_score_adj() function is introduced
with the same semantics as the compare and swap CAS instruction, or
CMPXCHG on x86. It is used to reinstate the previous value of
oom_score_adj if and only if the present value is the same as the old
value.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Mon, 24 Oct 2011 14:54:02 +0000 (01:54 +1100)]
oom: remove oom_disable_count
This removes mm->oom_disable_count entirely since it's unnecessary and
currently buggy. The counter was intended to be per-process but it's
currently decremented in the exit path for each thread that exits, causing
it to underflow.
The count was originally intended to prevent oom killing threads that
share memory with threads that cannot be killed since it doesn't lead to
future memory freeing. The counter could be fixed to represent all
threads sharing the same mm, but it's better to remove the count since:
- it is possible that the OOM_DISABLE thread sharing memory with the
victim is waiting on that thread to exit and will actually cause
future memory freeing, and
- there is no guarantee that a thread is disabled from oom killing just
because another thread sharing its mm is oom disabled.
Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Mon, 24 Oct 2011 14:54:02 +0000 (01:54 +1100)]
oom: avoid killing kthreads if they assume the oom killed thread's mm
After selecting a task to kill, the oom killer iterates all processes and
kills all other threads that share the same mm_struct in different thread
groups. It would not otherwise be helpful to kill a thread if its memory
would not be subsequently freed.
A kernel thread, however, may assume a user thread's mm by using
use_mm(). This is only temporary and should not result in sending a
SIGKILL to that kthread.
This patch ensures that only user threads and not kthreads are sent a
SIGKILL if they share the same mm_struct as the oom killed task.
Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Mon, 24 Oct 2011 14:54:01 +0000 (01:54 +1100)]
vmscan: add block plug for page reclaim
per-task block plug can reduce block queue lock contention and increase
request merge. Currently page reclaim doesn't support it. I originally
thought page reclaim doesn't need it, because kswapd thread count is
limited and file cache write is done at flusher mostly.
When I test a workload with heavy swap in a 4-node machine, each CPU is
doing direct page reclaim and swap. This causes block queue lock
contention. In my test, without below patch, the CPU utilization is about
2% ~ 7%. With the patch, the CPU utilization is about 1% ~ 3%. Disk
throughput isn't changed. This should improve normal kswapd write and
file cache write too (increase request merge for example), but might not
be so obvious as I explain above.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Mon, 24 Oct 2011 14:53:59 +0000 (01:53 +1100)]
mm: zone_reclaim: make isolate_lru_page() filter-aware
In __zone_reclaim case, we don't want to shrink mapped page. Nonetheless,
we have isolated mapped page and re-add it into LRU's head. It's
unnecessary CPU overhead and makes LRU churning.
Of course, when we isolate the page, the page might be mapped but when we
try to migrate the page, the page would be not mapped. So it could be
migrated. But race is rare and although it happens, it's no big deal.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Mon, 24 Oct 2011 14:53:58 +0000 (01:53 +1100)]
mm: compaction: make isolate_lru_page() filter-aware
In async mode, compaction doesn't migrate dirty or writeback pages. So,
it's meaningless to pick the page and re-add it to lru list.
Of course, when we isolate the page in compaction, the page might be dirty
or writeback but when we try to migrate the page, the page would be not
dirty, writeback. So it could be migrated. But it's very unlikely as
isolate and migration cycle is much faster than writeout.
So, this patch helps cpu overhead and prevent unnecessary LRU churning.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Mon, 24 Oct 2011 14:53:58 +0000 (01:53 +1100)]
mm: change isolate mode from #define to bitwise type
Change ISOLATE_XXX macro with bitwise isolate_mode_t type. Normally,
macro isn't recommended as it's type-unsafe and making debugging harder as
symbol cannot be passed throught to the debugger.
Quote from Johannes
" Hmm, it would probably be cleaner to fully convert the isolation mode
into independent flags. INACTIVE, ACTIVE, BOTH is currently a
tri-state among flags, which is a bit ugly."
This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h
Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Mon, 24 Oct 2011 14:53:57 +0000 (01:53 +1100)]
mm: compaction: trivial clean up in acct_isolated()
acct_isolated of compaction uses page_lru_base_type which returns only
base type of LRU list so it never returns LRU_ACTIVE_ANON or
LRU_ACTIVE_FILE. In addtion, cc->nr_[anon|file] is used in only
acct_isolated so it doesn't have fields in conpact_control.
This patch removes fields from compact_control and makes clear function of
acct_issolated which counts the number of anon|file pages isolated.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christopher Yeoh [Mon, 24 Oct 2011 14:53:57 +0000 (01:53 +1100)]
cross-memory-attach-v4
> You might get some speed benefit by optimising for the small copies
> here. Define a local on-stack array of N page*'s and point
> process_pages at that if the number of pages is <= N. Saves a
> malloc/free and is more cache-friendly. But only if the result is
> measurable!
I have done some benchmarking on this, and it gains about 5-7% on a
microbenchmark with 4kb size copies and about a 1% gain with a more
realistic (but modified for smaller copies) hpcc benchmark. The
performance gain disappears into the noise by about 64kb sized copies.
No measurable overhead for larger copies. So I think its worth including
Included below is the patch (based on v4) - for ease of review the first diff
is just against the latest version of CMA which has been posted here previously.
The second is the entire CMA patch.
Signed-off-by: Chris Yeoh <cyeoh@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <linux-man@vger.kernel.org> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christopher Yeoh [Mon, 24 Oct 2011 14:53:57 +0000 (01:53 +1100)]
cross-memory-attach-update
- Add x86_64 specific wire up
- Change behaviour so process_vm_readv and process_vm_writev return
the number of bytes successfully read or written even if an error
occurs
- Add more kernel doc interface comments
- rename some internal functions (process_vm_rw_check_iovecs,
process_vm_rw) so they make more sense.
- Add licence message
- Fix kernel-doc comment format
Still need to do benchmarking to see if the optimisation for small copies
using a local on-stack array in process_vm_rw_core is worth it.
Signed-off-by: Chris Yeoh <cyeoh@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christopher Yeoh [Mon, 24 Oct 2011 14:53:56 +0000 (01:53 +1100)]
Cross Memory Attach
The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.
The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call. There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.
- Use of /proc/pid/mem has been considered, but there are issues with
using it:
- Does not allow for specifying iovecs for both src and dest, assuming
preadv or pwritev was implemented either the area read from or
written to would need to be contiguous.
- Currently mem_read allows only processes who are currently
ptrace'ing the target and are still able to ptrace the target to read
from the target. This check could possibly be moved to the open call,
but its not clear exactly what race this restriction is stopping
(reason appears to have been lost)
- Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
domain socket is a bit ugly from a userspace point of view,
especially when you may have hundreds if not (eventually) thousands
of processes that all need to do this with each other
- Doesn't allow for some future use of the interface we would like to
consider adding in the future (see below)
- Interestingly reading from /proc/pid/mem currently actually
involves two copies! (But this could be fixed pretty easily)
As mentioned previously use of vmsplice instead was considered, but has
problems. Since you need the reader and writer working co-operatively if
the pipe is not drained then you block. Which requires some wrapping to
do non blocking on the send side or polling on the receive. In all to all
communication it requires ordering otherwise you can deadlock. And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.
There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could. For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy. We don't need to keep a copy of the data
from the source. I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.
Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI). This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication. And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.
There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:
There is a basic man page for the proposed interface here:
http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.
For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:
Greg Lee [Mon, 24 Oct 2011 14:53:54 +0000 (01:53 +1100)]
drivers/watchdog/w83627hf_wdt.c: implement WDIOC_GETTIMELEFT ioctl support
Implement the WDIOC_GETTIMELEFT ioctl, allowing you to check how much time
is left on the watchdog counter before a reset occurs. A few additional
naming clean-ups requested by Padraig Brady as well.
Signed-off-by: Greg Lee <glee@swspec.com> Cc: Wim Van Sebroeck <wim@iguana.be> Cc: Padraig Brady <P@draigbrady.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Carpenter [Mon, 24 Oct 2011 14:53:54 +0000 (01:53 +1100)]
paride: fix potential information leak in pg_read()
Smatch has a new check for Rosenberg type information leaks where structs
are copied to the user with uninitialized stack data in them. i In this
case, the pg_write_hdr struct has a hole in it.
Dan Carpenter [Mon, 24 Oct 2011 14:53:54 +0000 (01:53 +1100)]
bio: change some signed vars to unsigned
This is just a cleanup patch to silence a static checker warning.
The problem is that we cap "nr_iovecs" so it can't be larger than
"UIO_MAXIOV" but we don't check for negative values. It turns out this is
prevented at other layers, but logically it doesn't make sense to have
negative nr_iovecs so making it unsigned is nicer.
Signed-off-by: Dan Carpenter <error27@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stephen Rothwell [Mon, 24 Oct 2011 14:53:53 +0000 (01:53 +1100)]
include/linux/bio.h: use a static inline function for bio_integrity_clone()
When CONFIG_BLK_DEV_INTEGRITY is not set, we get these warnings:
drivers/md/dm.c: In function 'split_bvec':
drivers/md/dm.c:1061:3: warning: statement with no effect
drivers/md/dm.c: In function 'clone_bio':
drivers/md/dm.c:1088:3: warning: statement with no effect
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>