git.karo-electronics.de Git - karo-tx-linux.git/log
8 years ago  mm-use-watermak-checks-for-__gfp_repeat-high-order-allocations-checkpatch-fixes
Andrew Morton [Tue, 9 Feb 2016 23:13:07 +0000 (10:13 +1100)]
mm-use-watermak-checks-for-__gfp_repeat-high-order-allocations-checkpatch-fixes

ERROR: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit 0123456789ab ("commit description")'
#9:
The first condition was added by a41f24ea9fd6 ("page allocator: smarter

WARNING: line over 80 characters
#47: FILE: mm/page_alloc.c:3005:
+ * the last reclaim round) and no_progress_loops (number of reclaim rounds without

total: 1 errors, 1 warnings, 68 lines checked

./patches/mm-use-watermak-checks-for-__gfp_repeat-high-order-allocations.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm: use watermark checks for __GFP_REPEAT high order allocations
Michal Hocko [Tue, 9 Feb 2016 23:13:06 +0000 (10:13 +1100)]
mm: use watermark checks for __GFP_REPEAT high order allocations

__alloc_pages_slowpath retries costly allocations until at least order worth
of pages has been reclaimed, or until the watermark check for at least one
zone would succeed after reclaiming all remaining pages, if the reclaim
hasn't made any progress.

The first condition was added by a41f24ea9fd6 ("page allocator: smarter
retry of costly-order allocations") and it assumed that lumpy reclaim could
have created a page of sufficient order.  Lumpy reclaim was removed quite
some time ago, so the assumption no longer holds.  It would be more
appropriate to check the compaction progress instead, but this patch simply
removes the check and relies solely on the watermark check.

To prevent too many retries, no_progress_loops is not reset after a reclaim
round which made progress, because we cannot assume it helped the high order
situation.  Only costly allocation requests depended on pages_reclaimed, so
we can drop it.
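
A minimal sketch of the resulting retry decision in __alloc_pages_slowpath
(labels and the argument list are illustrative, not the exact hunk):

/* costly requests without __GFP_REPEAT still fail right away */
if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
        goto nopage;

/* pages_reclaimed bookkeeping is gone: whether another reclaim round
 * makes sense is decided purely by the watermark based heuristic */
if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                         did_some_progress > 0, no_progress_loops))
        goto retry;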

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm: throttle on IO only when there are too many dirty and writeback pages
Michal Hocko [Tue, 9 Feb 2016 23:13:06 +0000 (10:13 +1100)]
mm: throttle on IO only when there are too many dirty and writeback pages

wait_iff_congested has been used to throttle the allocator before it retried
another round of direct reclaim, to allow the writeback to make some
progress and to prevent reclaim from looping over dirty/writeback pages
without making any progress.  We used to do congestion_wait before
0e093d99763e ("writeback: do not sleep on the congestion queue if there
are no congested BDIs or if significant congestion is not being
encountered in the current zone") but that led to undesirable stalls and
sleeping for the full timeout even when the BDI wasn't congested.  Hence
wait_iff_congested was used instead.  But it seems that even
wait_iff_congested doesn't work as expected.  We might have a small file
LRU list with all pages dirty/writeback, and yet the bdi is not congested,
so this ends up being just a cond_resched and can trigger a premature OOM.

This patch replaces the unconditional wait_iff_congested by
congestion_wait, which is executed only if we _know_ that the last round of
direct reclaim didn't make any progress and dirty+writeback pages are more
than half of the reclaimable pages on the zone which might be usable for
our target allocation.  This shouldn't reintroduce the stalls fixed by
0e093d99763e because congestion_wait is called only when we are getting
hopeless and sleeping is a better choice than OOMing with many pages under
IO.

We have to preserve the logic introduced by "mm, vmstat: allow WQ concurrency
to discover memory reclaim doesn't make any progress" in
__alloc_pages_slowpath now that wait_iff_congested is not used anymore.  As
the only remaining user of wait_iff_congested is shrink_inactive_list, we
can remove the WQ specific short sleep from wait_iff_congested because that
sleep only needs to be done once per allocation retry cycle.
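
A hedged sketch of the new throttling decision (variable names are
illustrative):

/* Sleep on congestion only when the last reclaim round made no progress
 * and dirty+writeback pages exceed half of the reclaimable pages usable
 * for this allocation; otherwise just give other tasks a chance to run. */
if (!did_some_progress && 2 * (dirty + writeback) > reclaimable)
        congestion_wait(BLK_RW_ASYNC, HZ/10);
else
        cond_resched();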

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm-oom-rework-oom-detection-checkpatch-fixes
Andrew Morton [Tue, 9 Feb 2016 23:13:06 +0000 (10:13 +1100)]
mm-oom-rework-oom-detection-checkpatch-fixes

Cc: David Rientjes <rientjes@google.com>
WARNING: line over 80 characters
#99: FILE: mm/page_alloc.c:2965:
+ * zone list (with a backoff mechanism which is a function of no_progress_loops).

WARNING: line over 80 characters
#129: FILE: mm/page_alloc.c:2995:
+  * Keep reclaiming pages while there is a chance this will lead somewhere.

WARNING: line over 80 characters
#134: FILE: mm/page_alloc.c:3000:
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {

WARNING: line over 80 characters
#138: FILE: mm/page_alloc.c:3004:
+ available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES);

WARNING: line over 80 characters
#142: FILE: mm/page_alloc.c:3008:
+  * Would the allocation succeed if we reclaimed the whole available?

WARNING: line over 80 characters
#146: FILE: mm/page_alloc.c:3012:
+ /* Wait for some write requests to complete then retry */

total: 0 errors, 6 warnings, 202 lines checked

./patches/mm-oom-rework-oom-detection.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm, oom: rework oom detection
Michal Hocko [Tue, 9 Feb 2016 23:13:05 +0000 (10:13 +1100)]
mm, oom: rework oom detection

As pointed out by Linus [2][3], relying on zone_reclaimable as a way to
communicate reclaim progress is rather dubious.  I tend to agree: not only
is it really obscure, it is not hard to imagine cases where a single page
freed in the loop keeps all the reclaimers looping without making any
progress because their gfp_mask wouldn't allow them to get that page anyway
(e.g.  a single GFP_ATOMIC alloc and free loop).  This is rather rare, so
it doesn't happen in practice, but the current logic is obscure, hard to
follow and also non-deterministic.

This is an attempt to make the OOM detection more deterministic and easier
to follow, because each reclaimer basically tracks its own progress, and
this is implemented at the page allocator layer rather than spread out
between the allocator and the reclaim.  More on the implementation is
described in the first patch.

I have tested several different scenarios, but it should be clear that
testing the OOM killer in a representative way is quite hard.  There is
usually a tiny gap between almost OOM and full blown OOM which is often
time sensitive.  Anyway, I have tested the following 3 scenarios and I
would appreciate suggestions for more to test.

Testing environment: a virtual machine with 2G of RAM and 2 CPUs, without
any swap to make the OOM more deterministic.
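
For reference, a minimal sketch of the kind of mem_eater used in the
scenarios below (an anon private populated mmap which waits for a signal);
the real helper may differ:

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        /* size in MB, e.g. 80 as in the first scenario */
        size_t size = (argc > 1 ? atol(argv[1]) : 80) << 20;
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        pause();        /* sit on the memory until killed */
        return 0;
}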

1) 2 writers (each doing dd with 4M blocks to an xfs partition of 1G size,
   removing the files and starting over again) running in parallel for 10s
   to build up a lot of dirty pages; then 100 parallel mem_eaters (anon
   private populated mmap which waits until it gets a signal), 80M each,
   are started.

   This causes an OOM flood of course, and I have compared both patched
   and unpatched kernels. The test is considered finished once no more OOM
   conditions are detected. This should tell us whether there are any
   excessive kills or whether some of them are premature:

I have performed two runs this time each after a fresh boot.

* base kernel
$ grep "Killed process" base-oom-run1.log | tail -n1
[  211.824379] Killed process 3086 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:332kB, shmem-rss:0kB
$ grep "Killed process" base-oom-run2.log | tail -n1
[  157.188326] Killed process 3094 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:368kB, shmem-rss:0kB

$ grep "invoked oom-killer" base-oom-run1.log | wc -l
78
$ grep "invoked oom-killer" base-oom-run2.log | wc -l
76

The number of OOM invocations is consistent with my last measurements but
the runtime is way too different (it took 800+s).  One thing that could
have skewed the results was that I was running tail -f on the serial log
on the host system to see the progress.  I have stopped doing that.  The
results are more consistent now but still too different from the last
time.  This is really weird, so I've retested with the last 4.2 mmotm
again and I am getting a consistent ~220s which is really close to the
above.  If I apply the WQ vmstat patch on top I am getting close to 160s,
so the stale vmstat counters made a difference, which is to be expected.
I have a new SSD in my laptop which might have made a difference but I
wouldn't expect it to be that large.

$ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
4
$ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
1

* patched kernel
$ grep "Killed process" patched-oom-run1.log | tail -n1
[  341.164930] Killed process 3099 (mem_eater) total-vm:85852kB, anon-rss:82000kB, file-rss:336kB, shmem-rss:0kB
$ grep "Killed process" patched-oom-run2.log | tail -n1
[  349.111539] Killed process 3082 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:4kB, shmem-rss:0kB

$ grep "invoked oom-killer" patched-oom-run1.log | wc -l
78
$ grep "invoked oom-killer" patched-oom-run2.log | wc -l
77

$ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
1
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
0

So the number of OOM killer invocations is the same, but the overall
runtime of the test was much longer with the patched kernel.  This can be
attributed to more retries in general.  The results from the base kernel
are quite inconsistent and I think that consistency is better here.

2) 2 writers again running for 10s, and then 10 mem_eaters to consume as much
   memory as possible without triggering the OOM killer. This required a lot
   of tuning, but I've considered 3 consecutive runs without OOM as a success.

* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(15*1024)}' /proc/meminfo)

* patched kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(9*1024)}' /proc/meminfo)

It was -14M for the base 4.2 kernel and -7500M for the patched 4.2 kernel in
my last measurements.
The patched kernel handled the low memory conditions better and fired the
OOM killer later.

3) Costly high-order allocations with a limited amount of memory.
   Start 10 memeaters in parallel, each with
   size=$(awk '/MemTotal/{printf "%d\n", $2/10}' /proc/meminfo)
   This will invoke the OOM killer, which will kill one of them and free up
   200M; then try to use all the remaining space for hugetlb pages.  See how
   many of them pass, kill everything, wait 2s and try again.
   This tests whether we do not fail __GFP_REPEAT costly allocations too
   early now.
* base kernel
$ sort base-hugepages.log | uniq -c
      1 64
     13 65
      6 66
     20 Trying to allocate 73

* patched kernel
$ sort patched-hugepages.log | uniq -c
     17 65
      3 66
     20 Trying to allocate 73

This also doesn't look very bad but this particular test is quite timing
sensitive.

The above results do seem optimistic, but more loads should obviously be
tested.  I would really appreciate feedback on the approach I have chosen
before I go into more tuning.  Is this a viable way to go?

[1] http://lkml.kernel.org/r/1448974607-10208-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/CA+55aFwapaED7JV6zm-NVkP-jKie+eQ1vDXWrKD=SkbshZSgmw@mail.gmail.com
[3] http://lkml.kernel.org/r/CA+55aFxwg=vS2nrXsQhAUzPQDGb8aQpZi0M7UUh21ftBo-z46Q@mail.gmail.com

This patch (of 3):

__alloc_pages_slowpath has traditionally relied on direct reclaim and
did_some_progress as an indicator that it makes sense to retry the
allocation rather than declaring OOM.  shrink_zones had to rely on
zone_reclaimable if shrink_zone didn't make any progress, to prevent a
premature OOM killer invocation - the LRU might be full of dirty or
writeback pages and direct reclaim cannot clean those up.

zone_reclaimable allows rescanning the reclaimable lists several times and
restarting if a page is freed.  This is really subtle behavior and it might
lead to a livelock when a single freed page keeps the allocator looping but
the current task will not be able to allocate that single page.  The OOM
killer would be more appropriate than looping without any progress for an
unbounded amount of time.

This patch changes the OOM detection logic and pulls it out of shrink_zone,
which is at too low a level for high level decisions such as OOM, which is
a per-zonelist property.  It is __alloc_pages_slowpath which knows how many
attempts have been made and what progress there was so far, therefore it is
the more appropriate place to implement this logic.

The new heuristic is implemented in should_reclaim_retry helper called
from __alloc_pages_slowpath.  It tries to be more deterministic and easier
to follow.  It builds on an assumption that retrying makes sense only if
the currently reclaimable memory + free pages would allow the current
allocation request to succeed (as per __zone_watermark_ok) at least for
one zone in the usable zonelist.

This alone wouldn't be sufficient, though, because the writeback might get
stuck and reclaimable pages might be pinned for a really long time or even
depend on the current allocation context.  Therefore there is a feedback
mechanism implemented which reduces the reclaim target after each reclaim
round without any progress.  This means that we should eventually converge
to only NR_FREE_PAGES as the target and fail on the wmark check and
proceed to OOM.  The backoff is simple and linear with 1/16 of the
reclaimable pages for each round without any progress.  We are optimistic
and reset the counter after successful reclaim rounds.

Costly high order requests mostly preserve their semantics: those without
__GFP_REPEAT fail right away, while those which have the flag set will back
off once the amount of reclaimable pages reaches the equivalent of the
requested order.  The only difference is that if there was no progress
during the reclaim we rely on the zone watermark check.  This is a more
logical thing to do than the previous 1<<order attempts, which were a
result of zone_reclaimable faking the progress.
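
A hedged sketch of the heuristic (names follow the changelog; the real
should_reclaim_retry differs in detail):

#define MAX_RECLAIM_RETRIES 16

static bool should_reclaim_retry_sketch(unsigned int order, struct zone *zone,
                                        int classzone_idx,
                                        unsigned int alloc_flags,
                                        int no_progress_loops)
{
        unsigned long available;

        /* too many rounds without any progress: give up and go OOM */
        if (no_progress_loops > MAX_RECLAIM_RETRIES)
                return false;

        available = zone_reclaimable_pages(zone) +
                    zone_page_state_snapshot(zone, NR_FREE_PAGES);
        /* linear backoff: trust 1/16 less of the reclaimable memory for
         * each round that made no progress */
        available -= DIV_ROUND_UP(no_progress_loops * available,
                                  MAX_RECLAIM_RETRIES);

        /* retry only if reclaiming everything could still satisfy the
         * watermark for at least this zone */
        return __zone_watermark_ok(zone, order, min_wmark_pages(zone),
                                   classzone_idx, alloc_flags, available);
}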

[hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry]
[rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES]
[rientjes@google.com: shrink_zones doesn't need to return anything]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm,oom: do not loop !__GFP_FS allocation if the OOM killer is disabled.
Tetsuo Handa [Tue, 9 Feb 2016 23:13:05 +0000 (10:13 +1100)]
mm,oom: do not loop !__GFP_FS allocation if the OOM killer is disabled.

After the OOM killer is disabled during a suspend operation, any
!__GFP_NOFAIL && __GFP_FS allocations are forced to fail.  Thus, any
!__GFP_NOFAIL && !__GFP_FS allocations should be forced to fail as well.
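
A minimal sketch of the resulting rule (illustrative, not the exact hunk):

/* With the OOM killer disabled for suspend, fail every allocation that is
 * not __GFP_NOFAIL instead of retrying forever, regardless of __GFP_FS. */
if (oom_killer_disabled && !(gfp_mask & __GFP_NOFAIL))
        goto nopage;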

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm,oom: make oom_killer_disable() killable.
Tetsuo Handa [Tue, 9 Feb 2016 23:13:04 +0000 (10:13 +1100)]
mm,oom: make oom_killer_disable() killable.

While oom_killer_disable() is called by freeze_processes() after all user
threads except the current thread are frozen, it is possible that kernel
threads invoke the OOM killer and send SIGKILL to the current thread because
it shares the thawed victim's memory.  Therefore, checking for SIGKILL is
preferable to checking TIF_MEMDIE.
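
A hedged sketch of the check this implies in oom_killer_disable()'s wait
loop:

/* If the freezing thread itself received SIGKILL (e.g. because it shares
 * memory with an OOM victim), back out instead of waiting forever. */
if (fatal_signal_pending(current)) {
        oom_killer_enable();    /* undo the disable and fail the freeze */
        return false;
}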

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  zram-export-the-number-of-available-comp-streams-fix
Andrew Morton [Tue, 9 Feb 2016 23:13:04 +0000 (10:13 +1100)]
zram-export-the-number-of-available-comp-streams-fix

tweak documentation text

Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  zram: export the number of available comp streams
Sergey Senozhatsky [Tue, 9 Feb 2016 23:13:04 +0000 (10:13 +1100)]
zram: export the number of available comp streams

I've been asked several very simple questions:
a) How can I ensure that zram uses (or used) several compression
   streams?
b) What is the current number of comp streams (how much memory
   does zram *actually* use for compression streams, if there are
   more than one stream)?

zram, indeed, does not provide any such info and does not answer these
questions.  Reading from `max_comp_streams' only lets one estimate the
theoretical comp streams memory consumption, which assumes that zram will
allocate max_comp_streams streams.  However, it's possible that the real
number of compression streams will never reach that max value, for various
reasons, e.g.  max_comp_streams is too high, etc.

The patch adds `avail_streams' column to the /sys/block/zram<id>/mm_stat
device file.  For a single compression stream backend it's always 1, for a
multi stream backend - it shows the actual ->avail_strm value.

The number of allocated compression streams answers several
questions:
a) the current `level of concurrency' that the device has
   experienced
b) the amount of memory used by compression streams (by multiplying
   the `avail_streams' column value, the ->buffer size and the
   algorithm's specific scratch buffer size; the latter two are easy
   to find out, unlike `avail_streams').

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  thp: do not hold anon_vma lock during swap in
Kirill A. Shutemov [Tue, 9 Feb 2016 23:13:03 +0000 (10:13 +1100)]
thp: do not hold anon_vma lock during swap in

khugepaged does swap in during collapse under the anon_vma lock.  This
causes a complaint from lockdep.  The trace below shows the following
scenario:

 - khugepaged tries to swap in a page under mmap_sem and anon_vma lock;
 - do_swap_page() calls swapin_readahead() with GFP_HIGHUSER_MOVABLE;
 - __read_swap_cache_async() tries to allocate the page for swap in;
 - lockdep_trace_alloc() in __alloc_pages_nodemask() notices that with the
   given gfp_mask we could end up in direct reclaim;
 - Lockdep already knows that reclaim sometimes (e.g. in case of
   split_huge_page()) wants to take the anon_vma lock on its own.

Therefore deadlock is possible.

The fix is to take anon_vma lock after swap in.

[18344.236625] =================================
[18344.236628] [ INFO: inconsistent lock state ]
[18344.236633] 4.3.0-rc1-next-20150918-dbg-00014-ge5128d0-dirty #361 Not tainted
[18344.236636] ---------------------------------
[18344.236640] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[18344.236645] khugepaged/32 [HC0[0]:SC0[0]:HE1:SE1] takes:
[18344.236648]  (&anon_vma->rwsem){++++?.}, at: [<ffffffff81134403>] khugepaged+0x8b0/0x1987
[18344.236662] {IN-RECLAIM_FS-W} state was registered at:
[18344.236666]   [<ffffffff8107d747>] __lock_acquire+0x8e2/0x1183
[18344.236673]   [<ffffffff8107e7ac>] lock_acquire+0x10b/0x1a6
[18344.236678]   [<ffffffff8150a367>] down_write+0x3b/0x6a
[18344.236686]   [<ffffffff811360d8>] split_huge_page_to_list+0x5b/0x61f
[18344.236689]   [<ffffffff811224b3>] add_to_swap+0x37/0x78
[18344.236691]   [<ffffffff810fd650>] shrink_page_list+0x4c2/0xb9a
[18344.236694]   [<ffffffff810fe47c>] shrink_inactive_list+0x371/0x5d9
[18344.236696]   [<ffffffff810fee2f>] shrink_lruvec+0x410/0x5ae
[18344.236698]   [<ffffffff810ff024>] shrink_zone+0x57/0x140
[18344.236700]   [<ffffffff810ffc79>] kswapd+0x6a5/0x91b
[18344.236702]   [<ffffffff81059588>] kthread+0x107/0x10f
[18344.236706]   [<ffffffff8150c7bf>] ret_from_fork+0x3f/0x70
[18344.236708] irq event stamp: 6517947
[18344.236709] hardirqs last  enabled at (6517947): [<ffffffff810f2d0c>] get_page_from_freelist+0x362/0x59e
[18344.236713] hardirqs last disabled at (6517946): [<ffffffff8150ba41>] _raw_spin_lock_irqsave+0x18/0x51
[18344.236715] softirqs last  enabled at (6507072): [<ffffffff81041cb0>] __do_softirq+0x2df/0x3f5
[18344.236719] softirqs last disabled at (6507055): [<ffffffff81041fb5>] irq_exit+0x40/0x94
[18344.236722]
               other info that might help us debug this:
[18344.236723]  Possible unsafe locking scenario:

[18344.236724]        CPU0
[18344.236725]        ----
[18344.236726]   lock(&anon_vma->rwsem);
[18344.236728]   <Interrupt>
[18344.236729]     lock(&anon_vma->rwsem);
[18344.236731]
                *** DEADLOCK ***

[18344.236733] 2 locks held by khugepaged/32:
[18344.236733]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff81134122>] khugepaged+0x5cf/0x1987
[18344.236738]  #1:  (&anon_vma->rwsem){++++?.}, at: [<ffffffff81134403>] khugepaged+0x8b0/0x1987
[18344.236741]
               stack backtrace:
[18344.236744] CPU: 3 PID: 32 Comm: khugepaged Not tainted 4.3.0-rc1-next-20150918-dbg-00014-ge5128d0-dirty #361
[18344.236747]  0000000000000000 ffff880132827a00 ffffffff81230867 ffffffff8237ba90
[18344.236750]  ffff880132827a38 ffffffff810ea9b9 000000000000000a ffff8801333b52e0
[18344.236753]  ffff8801333b4c00 ffffffff8107b3ce 000000000000000a ffff880132827a78
[18344.236755] Call Trace:
[18344.236758]  [<ffffffff81230867>] dump_stack+0x4e/0x79
[18344.236761]  [<ffffffff810ea9b9>] print_usage_bug.part.24+0x259/0x268
[18344.236763]  [<ffffffff8107b3ce>] ? print_shortest_lock_dependencies+0x180/0x180
[18344.236765]  [<ffffffff8107c7fc>] mark_lock+0x381/0x567
[18344.236766]  [<ffffffff8107ca40>] mark_held_locks+0x5e/0x74
[18344.236768]  [<ffffffff8107ee9f>] lockdep_trace_alloc+0xb0/0xb3
[18344.236771]  [<ffffffff810f30cc>] __alloc_pages_nodemask+0x99/0x856
[18344.236772]  [<ffffffff810ebaf9>] ? find_get_entry+0x14b/0x17a
[18344.236774]  [<ffffffff810ebb16>] ? find_get_entry+0x168/0x17a
[18344.236777]  [<ffffffff811226d9>] __read_swap_cache_async+0x7b/0x1aa
[18344.236778]  [<ffffffff8112281d>] read_swap_cache_async+0x15/0x2d
[18344.236780]  [<ffffffff8112294f>] swapin_readahead+0x11a/0x16a
[18344.236783]  [<ffffffff81112791>] do_swap_page+0xa7/0x36b
[18344.236784]  [<ffffffff81112791>] ? do_swap_page+0xa7/0x36b
[18344.236787]  [<ffffffff8113444c>] khugepaged+0x8f9/0x1987
[18344.236790]  [<ffffffff810772f3>] ? wait_woken+0x88/0x88
[18344.236792]  [<ffffffff81133b53>] ? maybe_pmd_mkwrite+0x1a/0x1a
[18344.236794]  [<ffffffff81059588>] kthread+0x107/0x10f
[18344.236797]  [<ffffffff81059481>] ? kthread_create_on_node+0x1ea/0x1ea
[18344.236799]  [<ffffffff8150c7bf>] ret_from_fork+0x3f/0x70
[18344.236801]  [<ffffffff81059481>] ? kthread_create_on_node+0x1ea/0x1ea

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  khugepaged: __collapse_huge_page_swapin(): drop unused 'pte' parameter
Kirill A. Shutemov [Tue, 9 Feb 2016 23:13:03 +0000 (10:13 +1100)]
khugepaged: __collapse_huge_page_swapin(): drop unused 'pte' parameter

The 'pte' dereference is eliminated by the compiler, since nobody uses the
pteval until it's overwritten inside the loop.

The value of 'pte' itself is overwritten by pte_offset_map() before its
first real use.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix
Kirill A. Shutemov [Tue, 9 Feb 2016 23:13:02 +0000 (10:13 +1100)]
mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix

trivial cleanup of exit path of the function

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm: make swapin readahead to improve thp collapse rate
Ebru Akagunduz [Tue, 9 Feb 2016 23:13:02 +0000 (10:13 +1100)]
mm: make swapin readahead to improve thp collapse rate

This patch adds swapin readahead to improve the THP collapse rate.  When
khugepaged scans pages, a few of them can be in the swap area.

With the patch, khugepaged can collapse 4kB pages into a THP when there are
up to max_ptes_swap swap ptes in a 2MB range.

The patch was tested with a test program that allocates 400B of memory,
writes to it, and then sleeps.  I force the system to swap out all of it.
Afterwards, the test program touches the area by writing to it, skipping
one page in each 20 pages of the area.

Without the patch, the system did not do swapin readahead.  The THP rate
was 65% of the program's memory and it did not change over time.

With this patch, after 10 minutes of waiting khugepaged had collapsed 99%
of the program's memory.
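
A rough sketch of such a test program (sizes and sleep times are
illustrative):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        size_t page = sysconf(_SC_PAGESIZE);
        size_t size = 400UL << 20;      /* illustrative allocation size */
        char *area = malloc(size);
        size_t i;

        if (!area)
                return 1;
        memset(area, 1, size);          /* write to the whole area */
        sleep(600);                     /* let the system swap it all out */

        /* touch the area by writing, skipping one page in each 20 */
        for (i = 0; i < size / page; i++)
                if (i % 20)
                        area[i * page] = 2;

        pause();
        return 0;
}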

Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm/khugepaged: fix scan not aborted on SCAN_EXCEED_SWAP_PTE
Vladimir Davydov [Tue, 9 Feb 2016 23:13:02 +0000 (10:13 +1100)]
mm/khugepaged: fix scan not aborted on SCAN_EXCEED_SWAP_PTE

This patch fixes a typo in khugepaged_scan_pmd(): instead of setting
"result" to SCAN_EXCEED_SWAP_PTE we set "ret".  Setting "ret" results in
an attempt to collapse a huge page although we meant to abort the scan.
As a result, we can call khugepaged_find_target_node() with all entries
in the khugepaged_node_load array being zeros.  The latter is not ready
for that and might return an offline node for such input.  This leads to
a warning followed by a kernel panic:

  WARNING: CPU: 1 PID: 40 at include/linux/gfp.h:314 khugepaged_alloc_page+0xd4/0xf0()
  CPU: 1 PID: 40 Comm: khugepaged Not tainted 4.3.0-rc1-mm1+ #102
   000000000000013a ffff88010ae77b58 ffffffff813270d4 ffffffff818cda31
   0000000000000000 ffff88010ae77b98 ffffffff8107c9f5 dead000000000100
   ffff88010ae77e70 0000000000c752da 0000000000000001 0000000000000000
  Call Trace:
   [<ffffffff813270d4>] dump_stack+0x48/0x64
   [<ffffffff8107c9f5>] warn_slowpath_common+0x95/0xe0
   [<ffffffff8107ca5a>] warn_slowpath_null+0x1a/0x20
   [<ffffffff811ec124>] khugepaged_alloc_page+0xd4/0xf0
   [<ffffffff811f15c8>] collapse_huge_page+0x58/0x550
   [<ffffffff810b38e6>] ? account_entity_dequeue+0xb6/0xd0
   [<ffffffff810b5289>] ? idle_balance+0x79/0x2b0
   [<ffffffff811f1f5e>] khugepaged_scan_pmd+0x49e/0x710
   [<ffffffff810e1f3a>] ? lock_timer_base+0x5a/0x80
   [<ffffffff810e1fbb>] ? try_to_del_timer_sync+0x5b/0x70
   [<ffffffff810e214c>] ? del_timer_sync+0x4c/0x60
   [<ffffffff8168242f>] ? schedule_timeout+0x11f/0x200
   [<ffffffff811f2330>] khugepaged_scan_mm_slot+0x160/0x2a0
   [<ffffffff811f255f>] khugepaged_do_scan+0xef/0x160
   [<ffffffff810bcdb0>] ? wait_woken+0x80/0x80
   [<ffffffff811f25d0>] ? khugepaged_do_scan+0x160/0x160
   [<ffffffff811f25f8>] khugepaged+0x28/0x80
   [<ffffffff8109ab1c>] kthread+0xcc/0xf0
   [<ffffffff810a667e>] ? schedule_tail+0x1e/0xc0
   [<ffffffff8109aa50>] ? kthread_freezable_should_stop+0x70/0x70
   [<ffffffff8168371f>] ret_from_fork+0x3f/0x70
   [<ffffffff8109aa50>] ? kthread_freezable_should_stop+0x70/0x70

  BUG: unable to handle kernel paging request at 0000000000014028
  IP: [<ffffffff81185eb2>] __alloc_pages_nodemask+0xc2/0x2c0
  PGD aaac7067 PUD aaac6067 PMD 0
  Oops: 0000 [#1] SMP
  CPU: 1 PID: 40 Comm: khugepaged Tainted: G        W       4.3.0-rc1-mm1+ #102
  task: ffff88010ae16400 ti: ffff88010ae74000 task.ti: ffff88010ae74000
  RIP: 0010:[<ffffffff81185eb2>]  [<ffffffff81185eb2>] __alloc_pages_nodemask+0xc2/0x2c0
  RSP: 0018:ffff88010ae77ad8  EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000014020 RCX: 0000000000000014
  RDX: 0000000000000000 RSI: 0000000000000009 RDI: 0000000000c752da
  RBP: ffff88010ae77ba8 R08: 0000000000000000 R09: 0000000000000001
  R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000000
  R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000c752da
  FS:  0000000000000000(0000) GS:ffff88010be40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000014028 CR3: 00000000aaac4000 CR4: 00000000000006e0
  Stack:
   ffff88010ae77ae8 ffffffff810d0b3b ffff88010ae77b48 ffffffff81179e73
   0000000000000010 ffff88010ae77b58 ffff88010ae77b18 ffffffff811ec124
   ffff88010ae77b38 00000009a6e3aff4 0000000000000000 0000000000000000
  Call Trace:
   [<ffffffff810d0b3b>] ? vprintk_default+0x2b/0x40
   [<ffffffff81179e73>] ? printk+0x46/0x48
   [<ffffffff811ec124>] ? khugepaged_alloc_page+0xd4/0xf0
   [<ffffffff8107ca04>] ? warn_slowpath_common+0xa4/0xe0
   [<ffffffff811ec0cd>] khugepaged_alloc_page+0x7d/0xf0
   [<ffffffff811f15c8>] collapse_huge_page+0x58/0x550
   [<ffffffff810b38e6>] ? account_entity_dequeue+0xb6/0xd0
   [<ffffffff810b5289>] ? idle_balance+0x79/0x2b0
   [<ffffffff811f1f5e>] khugepaged_scan_pmd+0x49e/0x710
   [<ffffffff810e1f3a>] ? lock_timer_base+0x5a/0x80
   [<ffffffff810e1fbb>] ? try_to_del_timer_sync+0x5b/0x70
   [<ffffffff810e214c>] ? del_timer_sync+0x4c/0x60
   [<ffffffff8168242f>] ? schedule_timeout+0x11f/0x200
   [<ffffffff811f2330>] khugepaged_scan_mm_slot+0x160/0x2a0
   [<ffffffff811f255f>] khugepaged_do_scan+0xef/0x160
   [<ffffffff810bcdb0>] ? wait_woken+0x80/0x80
   [<ffffffff811f25d0>] ? khugepaged_do_scan+0x160/0x160
   [<ffffffff811f25f8>] khugepaged+0x28/0x80
   [<ffffffff8109ab1c>] kthread+0xcc/0xf0
   [<ffffffff810a667e>] ? schedule_tail+0x1e/0xc0
   [<ffffffff8109aa50>] ? kthread_freezable_should_stop+0x70/0x70
   [<ffffffff8168371f>] ret_from_fork+0x3f/0x70
   [<ffffffff8109aa50>] ? kthread_freezable_should_stop+0x70/0x70
  RIP  [<ffffffff81185eb2>] __alloc_pages_nodemask+0xc2/0x2c0
   RSP <ffff88010ae77ad8>
  CR2: 0000000000014028

Fixes: acc067d59a1f9 ("mm: make optimistic check for swapin readahead")
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm: make optimistic check for swapin readahead
Ebru Akagunduz [Tue, 9 Feb 2016 23:13:01 +0000 (10:13 +1100)]
mm: make optimistic check for swapin readahead

Introduce a new sysfs integer knob,
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap, which adds an
optimistic check for swapin readahead to increase the THP collapse rate.
Before bringing swapped-out pages back to memory, khugepaged checks them
and allows the collapse to proceed for up to that number of swap ptes.  It
also reports the number of unmapped ptes via tracepoints.
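
A hedged sketch of the check khugepaged performs during the pmd scan
(names follow the knob; the real code differs in detail):

/* Tolerate up to max_ptes_swap swapped-out ptes per 2MB range so they can
 * be swapped in before the collapse; otherwise abort this range. */
if (is_swap_pte(pteval)) {
        if (++unmapped <= khugepaged_max_ptes_swap)
                continue;
        result = SCAN_EXCEED_SWAP_PTE;
        goto out_unmap;
}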

Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  ksm: validate STABLE_NODE_DUP_HEAD conditional to gcc version
Andrea Arcangeli [Tue, 9 Feb 2016 23:13:01 +0000 (10:13 +1100)]
ksm: validate STABLE_NODE_DUP_HEAD conditional to gcc version

BUILD_BUG_ON can only validate this successfully on recent gcc.
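
A sketch of what this means in practice (the exact version cutoff is an
assumption):

/* Older gcc cannot evaluate these pointer comparisons at compile time, so
 * only enforce the BUILD_BUG_ON on recent enough compilers. */
#if defined(GCC_VERSION) && GCC_VERSION >= 40903
        BUILD_BUG_ON(STABLE_NODE_DUP_HEAD <= &migrate_nodes);
        BUILD_BUG_ON(STABLE_NODE_DUP_HEAD >= &migrate_nodes + 1);
#endif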

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  ksm-introduce-ksm_max_page_sharing-per-page-deduplication-limit-fix-2
Andrew Morton [Tue, 9 Feb 2016 23:13:00 +0000 (10:13 +1100)]
ksm-introduce-ksm_max_page_sharing-per-page-deduplication-limit-fix-2

fix non-NUMA build

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  ksm: introduce ksm_max_page_sharing per page deduplication limit
Andrea Arcangeli [Tue, 9 Feb 2016 23:13:00 +0000 (10:13 +1100)]
ksm: introduce ksm_max_page_sharing per page deduplication limit

Without a max deduplication limit for each KSM page, the list of the
rmap_items associated to each stable_node can grow infinitely large.

During the rmap walk each entry can take up to ~10usec to process because
of IPIs for the TLB flushing (both for the primary MMU and the secondary
MMUs with the MMU notifier).  With only 16GB of address space shared in
the same KSM page, that would amount to dozens of seconds of kernel
runtime.

A ~256 max deduplication factor will reduce the latencies of the rmap walks
on KSM pages to the order of a few msec.  Just doing the cond_resched()
during the rmap walks is not enough: the list size must have a limit too,
otherwise the caller could get blocked in (schedule friendly) kernel
computations for seconds, unexpectedly.

There's room for optimization to significantly reduce the IPI delivery
cost during page_referenced(), but at least for page migration in the KSM
case (used by hard NUMA bindings, compaction and NUMA balancing) it may be
inevitable to send lots of IPIs if each rmap_item->mm is active on a
different CPU and there are lots of CPUs.  Even if we ignore the IPI
delivery cost, we still have to walk the whole KSM rmap list, so we can't
allow millions or billions (i.e. an unlimited number) of entries in the
KSM stable_node rmap_item lists.

The limit is enforced efficiently by adding a second dimension to the
stable rbtree.  So there are three types of stable_nodes: the regular ones
(identical as before, living in the first flat dimension of the stable
rbtree), the "chains" and the "dups".

Every "chain" and all "dups" linked into a "chain" enforce the invariant
that they represent the same write protected memory content, even if each
"dup" will be pointed by a different KSM page copy of that content.  This
way the stable rbtree lookup computational complexity is unaffected if
compared to an unlimited max_sharing_limit.  It is still enforced that
there cannot be KSM page content duplicates in the stable rbtree itself.

Adding the second dimension to the stable rbtree only after the
max_page_sharing limit hits provides for a zero memory footprint increase
on 64bit archs.  The memory overhead of the per-KSM page stable_tree and
per virtual mapping rmap_item is unchanged.  Only after the
max_page_sharing limit is hit do we need to allocate a stable_tree "chain"
and rb_replace() the "regular" stable_node with the newly allocated
stable_node "chain".  After that we simply add the "regular" stable_node
to the chain as a stable_node "dup" by linking hlist_dup into the
stable_node_chain->hlist.  This way the "regular" (flat) stable_node is
converted to a stable_node "dup" living in the second dimension of the
stable rbtree.

During stable rbtree lookups the stable_node "chain" is identified as
stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
is_stable_node_chain()).

When dropping stable_nodes, the stable_node "dup" is identified as
stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).
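
In code, the two identity checks look roughly like this (the constant
value is illustrative):

#define STABLE_NODE_CHAIN -1024 /* a value rmap_hlist_len can never reach */

static bool is_stable_node_chain(struct stable_node *chain)
{
        return chain->rmap_hlist_len == STABLE_NODE_CHAIN;
}

static bool is_stable_node_dup(struct stable_node *dup)
{
        /* STABLE_NODE_DUP_HEAD is derived from &migrate_nodes, see below */
        return dup->head == STABLE_NODE_DUP_HEAD;
}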

The STABLE_NODE_DUP_HEAD must be a unique valid pointer, never used
elsewhere in any stable_node->head/node, to avoid clashes with the
stable_node->node.rb_parent_color pointer, and it must be different from
&migrate_nodes.  So the second field of &migrate_nodes is picked and
verified as always safe with a BUILD_BUG_ON in case the list_head
implementation changes in the future.

The STABLE_NODE_CHAIN marker is picked as a random negative value in
stable_node->rmap_hlist_len.  rmap_hlist_len cannot become negative when
it's a "regular" stable_node or a stable_node "dup".

The stable_node_chain->nid is irrelevant.  The stable_node_chain->kpfn is
aliased in a union with a time field used to rate limit the
stable_node_chain->hlist prunes.

The garbage collection of the stable_node_chain happens lazily during
stable rbtree lookups (as for all other kinds of stable_nodes), or while
disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run", while collecting the
entire stable rbtree.

While the "regular" stable_nodes and the stable_node "dups" must wait for
their underlying tree_page to be freed before they can be freed
themselves, the stable_node "chains" can be freed immediately if the
stable_node->hlist turns empty.  This is because the "chains" are never
pointed to by any page->mapping and they're effectively stable rbtree KSM
self contained metadata.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Petr Holasek <pholasek@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  thp, vmstats: count deferred split events
Kirill A. Shutemov [Tue, 9 Feb 2016 23:13:00 +0000 (10:13 +1100)]
thp, vmstats: count deferred split events

Count how many times we put a THP into the split queue.  Currently, this
happens on partial unmap of a THP.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm: workingset: make shadow node shrinker memcg aware
Vladimir Davydov [Tue, 9 Feb 2016 23:12:59 +0000 (10:12 +1100)]
mm: workingset: make shadow node shrinker memcg aware

Workingset code was recently made memcg aware, but shadow node shrinker is
still global.  As a result, one small cgroup can consume all memory
available for shadow nodes, possibly hurting other cgroups by reclaiming
their shadow nodes, even though reclaim distances stored in its shadow
nodes have no effect.  To avoid this, we need to make shadow node shrinker
memcg aware.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm: workingset: size shadow nodes lru basing on file cache size
Vladimir Davydov [Tue, 9 Feb 2016 23:12:59 +0000 (10:12 +1100)]
mm: workingset: size shadow nodes lru basing on file cache size

A page is activated on refault if the refault distance stored in the
corresponding shadow entry is less than the number of active file pages.
Since active file pages can't occupy more than half memory, we assume that
the maximal effective refault distance can't be greater than half the
number of present pages and size the shadow nodes lru list appropriately.
Generally speaking, this assumption is correct, but it can result in
wasting a considerable chunk of memory on stale shadow nodes in case the
portion of file pages is small, e.g.  if a workload mostly uses anonymous
memory.

To sort this out, we need to compute the size of the shadow nodes lru based
not on the maximal possible, but on the current size of the file cache.  We
could take the size of the active file lru for the maximal refault
distance, but the active lru is pretty unstable - it can shrink
dramatically at runtime, possibly disrupting the workingset detection
logic.

Instead we assume that the maximal refault distance equals half the total
number of file cache pages.  This will protect us against active file lru
size fluctuations while still being correct, because size of active lru is
normally maintained lower than size of inactive lru.
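
A minimal sketch of the sizing rule (helper and names are hypothetical):

/* Cap the shadow node LRU so that the refault distances it can represent
 * cover half of the current file cache rather than half of all present
 * pages. */
static unsigned long shadow_lru_max_nodes(unsigned long file_pages,
                                          unsigned long entries_per_node)
{
        unsigned long max_refault_distance = file_pages / 2;

        return max_refault_distance / entries_per_node;
}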

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  radix-tree: account radix_tree_node to memory cgroup
Vladimir Davydov [Tue, 9 Feb 2016 23:12:58 +0000 (10:12 +1100)]
radix-tree: account radix_tree_node to memory cgroup

Allocation of radix_tree_node objects can be easily triggered from
userspace, so we should account them to memory cgroup.  Besides, we need
them accounted for making shadow node shrinker per memcg (see
mm/workingset.c).

A tricky thing about accounting radix_tree_node objects is that they are
mostly allocated through radix_tree_preload(), so we can't just set
SLAB_ACCOUNT for radix_tree_node_cachep - that would likely result in a
lot of unrelated cgroups using objects from each other's caches.

One way to overcome this would be making radix tree preloads per memcg,
but that would probably look cumbersome and overcomplicated.

Instead, we make radix_tree_node_alloc() first try to allocate from the
cache with __GFP_ACCOUNT, no matter if the caller has preloaded or not,
and only if it fails fall back on using per cpu preloads.  This should
make most allocations accounted.
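
A hedged sketch of that allocation order (the preload fallback helper is
hypothetical):

static struct radix_tree_node *node_alloc(gfp_t gfp_mask)
{
        struct radix_tree_node *node = NULL;

        /* try an accounted slab allocation first, preloaded or not ... */
        if (gfpflags_allow_blocking(gfp_mask))
                node = kmem_cache_alloc(radix_tree_node_cachep,
                                        gfp_mask | __GFP_ACCOUNT | __GFP_NOWARN);
        /* ... and only fall back to the per cpu preload on failure */
        if (!node)
                node = take_preloaded_node();   /* hypothetical helper */

        return node;
}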

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm: memcontrol: zap memcg_kmem_online helper
Vladimir Davydov [Tue, 9 Feb 2016 23:12:58 +0000 (10:12 +1100)]
mm: memcontrol: zap memcg_kmem_online helper

As kmem accounting is now either enabled for all cgroups or disabled
system-wide, there's no point in having memcg_kmem_online() helper -
instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
shrink_slab() now does.

There are only two places left where this helper is used -
__memcg_kmem_charge() and memcg_create_kmem_cache().  The former can only
be called if memcg_kmem_enabled() returned true.  Since the cgroup it
operates on is online, mem_cgroup_is_root() check will be enough.

memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead of
memcg_kmem_online(), because it relies on the fact that in
memcg_offline_kmem() memcg->kmem_state is changed before
memcg_deactivate_kmem_caches() is called, but there we can just open-code
the check.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm: vmscan: pass root_mem_cgroup instead of NULL to memcg aware shrinker
Vladimir Davydov [Tue, 9 Feb 2016 23:12:58 +0000 (10:12 +1100)]
mm: vmscan: pass root_mem_cgroup instead of NULL to memcg aware shrinker

It's just convenient to implement a memcg aware shrinker when you know
that shrink_control->memcg != NULL unless memcg_kmem_enabled() returns
false.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm: memcontrol: enable kmem accounting for all cgroups in the legacy hierarchy
Vladimir Davydov [Tue, 9 Feb 2016 23:12:57 +0000 (10:12 +1100)]
mm: memcontrol: enable kmem accounting for all cgroups in the legacy hierarchy

Workingset code was recently made memcg aware, but shadow node shrinker is
still global.  As a result, one small cgroup can consume all memory
available for shadow nodes, possibly hurting other cgroups by reclaiming
their shadow nodes, even though reclaim distances stored in its shadow
nodes have no effect.  To avoid this, we need to make shadow node shrinker
memcg aware.

The actual work is done in patch 6 of the series.  Patches 1 and 2 prepare
memcg/shrinker infrastructure for the change.  Patch 3 is just a
collateral cleanup.  Patch 4 makes radix_tree_node accounted, which is
necessary for making shadow node shrinker memcg aware.  Patch 5 reduces
shadow nodes overhead in case workload mostly uses anonymous pages.

This patch:

Currently, in the legacy hierarchy kmem accounting is off for all cgroups
by default and must be enabled explicitly by writing something to
memory.kmem.limit_in_bytes.  Since we don't support reclaim on hitting the
kmem limit, nor do we have any plans to implement it, the value written is
likely to be -1, just to enable kmem accounting and limit kernel memory
consumption by memory.limit_in_bytes along with user memory.

This user API was introduced when the implementation of kmem accounting
lacked slab shrinker support and hence was useless in practice.  Things
have changed since then - slab shrinkers were made memcg aware, the
accounting overhead seems to be negligible, and a failure to charge a kmem
allocation should not have critical consequences, because we only account
those kernel objects that should be safe to fail.  That's why kmem
accounting is enabled by default for all cgroups in the default hierarchy,
which will eventually replace the legacy one.

The ability to enable kmem accounting for some cgroups while keeping it
disabled for others is getting difficult to maintain.  E.g.  to make
shadow node shrinker memcg aware (see mm/workingset.c), we need to know
the relationship between the number of shadow nodes allocated for a cgroup
and the size of its lru list.  If kmem accounting is enabled for all
cgroups there is no problem, but what should we do if kmem accounting is
enabled only for half of the cgroups?  We've no other choice but to use
global lru stats while scanning the root cgroup's shadow nodes, but that
would be wrong if kmem accounting were enabled for all cgroups (which is
the case if the unified hierarchy is used), in which case we should use the
lru stats of the root cgroup's lruvec.

That being said, let's enable kmem accounting for all memory cgroups by
default.  If one finds it unstable or too costly, it can always be
disabled system-wide by passing cgroup.memory=nokmem to the kernel at boot
time.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  include/linux/page-flags.h: force inlining of selected page flag modifications
Denys Vlasenko [Tue, 9 Feb 2016 23:12:57 +0000 (10:12 +1100)]
include/linux/page-flags.h: force inlining of selected page flag modifications

Sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
the following functions get deinlined many times.
Examples of disassembly:

<SetPageUptodate> (43 copies, 141 calls):
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       f0 80 0f 08             lock orb $0x8,(%rdi)
       5d                      pop    %rbp
       c3                      retq

<PagePrivate> (10 copies, 134 calls):
       48 8b 07                mov    (%rdi),%rax
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       48 c1 e8 0b             shr    $0xb,%rax
       83 e0 01                and    $0x1,%eax
       5d                      pop    %rbp
       c3                      retq

This patch fixes this via s/inline/__always_inline/.

Code size decrease after the patch is ~7k:

    text     data      bss       dec     hex filename
92125002 20826048 36417536 149368586 8e72f0a vmlinux
92118087 20826112 36417536 149361735 8e71447 vmlinux7_pageops_after
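
The change is mechanical; for a page flag helper it looks like this (PG_foo
is a stand-in, not a real flag):

/* before: under CONFIG_OPTIMIZE_INLINING + -Os gcc may emit dozens of
 * out-of-line copies of this one-instruction helper */
static inline void SetPageFoo(struct page *page)
{
        set_bit(PG_foo, &page->flags);
}

/* after: force gcc to always inline it */
static __always_inline void SetPageFoo(struct page *page)
{
        set_bit(PG_foo, &page->flags);
}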

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  bufferhead: Force inlining of buffer head flag operations
Denys Vlasenko [Tue, 9 Feb 2016 23:12:56 +0000 (10:12 +1100)]
bufferhead: Force inlining of buffer head flag operations

With both gcc 4.7.2 and 4.9.2, sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
set_buffer_foo(), clear_buffer_foo() and similar functions get deinlined
about 60 times. Examples of disassembly:

<set_buffer_mapped> (14 copies, 43 calls):
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       f0 80 0f 20             lock orb $0x20,(%rdi)
       5d                      pop    %rbp
       c3                      retq
<buffer_mapped> (3 copies, 34 calls):
       48 8b 07                mov    (%rdi),%rax
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       48 c1 e8 05             shr    $0x5,%rax
       83 e0 01                and    $0x1,%eax
       5d                      pop    %rbp
       c3                      retq
<set_buffer_new> (5 copies, 13 calls):
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       f0 80 0f 40             lock orb $0x40,(%rdi)
       5d                      pop    %rbp
       c3                      retq

This patch fixes this via s/inline/__always_inline/.
This decreases vmlinux by about 3 kbytes.

    text     data      bss       dec     hex filename
88200439 19905208 36421632 144527279 89d4faf vmlinux2
88197239 19905240 36421632 144524111 89d434f vmlinux

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  tools/vm/page-types.c: add memory cgroup dumping and filtering
Vladimir Davydov [Tue, 9 Feb 2016 23:12:56 +0000 (10:12 +1100)]
tools/vm/page-types.c: add memory cgroup dumping and filtering

On Sat, Feb 06, 2016 at 01:06:29PM +0300, Konstantin Khlebnikov wrote:
...
>  static int opt_list; /* list pages (in ranges) */
>  static int opt_no_summary; /* don't show summary */
>  static pid_t opt_pid; /* process to walk */
> -const char * opt_file;
> +const char * opt_file; /* file or directory path */
> +static int64_t opt_cgroup = -1;/* cgroup inode */

ino should be a positive number, so we could use uint64_t here. Of
course, ino=0 could be used for filtering pages not charged to any
cgroup (as it is in this patch), but I doubt this would be useful.

Also, this patch conflicts with the recent change by Naoya introducing
support of dumping swap entries - https://lkml.org/lkml/2016/2/4/50

I attached a fixlet that addresses these two issues. What do you think
about it?

Other than that the patch looks good to me,

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  tools/vm/page-types.c: add memory cgroup dumping and filtering
Konstantin Khlebnikov [Tue, 9 Feb 2016 23:12:55 +0000 (10:12 +1100)]
tools/vm/page-types.c: add memory cgroup dumping and filtering

This adds two command line keys:

 -c|--cgroup path|@inode Walk only pages owned by this memory cgroup
 -C|--list-cgroup Show memory cgroup inodes

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm-config_nr_zones_extended-fix
Andrew Morton [Tue, 9 Feb 2016 23:12:55 +0000 (10:12 +1100)]
mm-config_nr_zones_extended-fix

fix 80-cols mess

Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm: CONFIG_NR_ZONES_EXTENDED
Dan Williams [Tue, 9 Feb 2016 23:12:55 +0000 (10:12 +1100)]
mm: CONFIG_NR_ZONES_EXTENDED

ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new mm
zones that are bumping up against the current maximum limit of 4 zones,
i.e.  2 bits in page->flags.  When adding a zone this equation still needs
to be satisfied:

    SECTIONS_WIDTH + ZONES_WIDTH + NODES_SHIFT + LAST_CPUPID_SHIFT
  <= BITS_PER_LONG - NR_PAGEFLAGS

ZONE_DEVICE currently tries to satisfy this equation by requiring that
ZONE_DMA be disabled, but this is untenable given generic kernels want to
support ZONE_DEVICE and ZONE_DMA simultaneously.  ZONE_CMA would like to
increase the amount of memory covered per section, but that limits the
minimum granularity at which consecutive memory ranges can be added via
devm_memremap_pages().

The trade-off of what is acceptable to sacrifice depends heavily on the
platform.  For example, ZONE_CMA is targeted for 32-bit platforms where
page->flags is constrained, but those platforms likely do not care about
the minimum granularity of memory hotplug.  A big iron machine with 1024
numa nodes can likely sacrifice ZONE_DMA where a general purpose
distribution kernel can not.

CONFIG_NR_ZONES_EXTENDED is a configuration symbol that gets selected when
the number of configured zones exceeds 4.  It documents the configuration
symbols and definitions that get modified when ZONES_WIDTH is greater than
2.

For now, it steals a bit from NODES_SHIFT.  Later on it can be used to
document the definitions that get modified when a 32-bit configuration
wants more zone bits.

Note that GFP_ZONE_TABLE poses an interesting constraint since
include/linux/gfp.h gets included by the 32-bit portion of a 64-bit build.
We need to be careful to only build the table for zones that have a
corresponding gfp_t flag.  GFP_ZONES_SHIFT is introduced for this purpose.
This patch does not attempt to solve the problem of adding a new zone
that also has a corresponding GFP_ flag.
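
A simplified sketch of the effect on the zone bits (values illustrative):

/* With a fifth zone configured, zones need a third page->flags bit and
 * NR_ZONES_EXTENDED documents what gives way (here NODES_SHIFT loses one
 * bit). */
#ifdef CONFIG_NR_ZONES_EXTENDED
#define ZONES_SHIFT 3
#else
#define ZONES_SHIFT 2
#endif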

Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Mark <markk@clara.co.uk>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago  mm, compaction: adapt isolation_suitable flushing to kcompactd
Vlastimil Babka [Tue, 9 Feb 2016 23:12:54 +0000 (10:12 +1100)]
mm, compaction: adapt isolation_suitable flushing to kcompactd

Compaction maintains a pageblock_skip bitmap to record pageblocks where
isolation recently failed.  This bitmap can be reset in three ways:

1) direct compaction is restarting after going through the full deferred cycle

2) kswapd goes to sleep, and some other direct compaction has previously
   finished scanning the whole zone and set zone->compact_blockskip_flush.
   Note that a successful direct compaction clears this flag.

3) compaction was invoked manually via trigger in /proc

The case 2) is somewhat fuzzy to begin with, but after introducing
kcompactd we should update it.  The check for direct compaction in 1), and
the check for setting the flush flag in 2), use current_is_kswapd(), which
doesn't work for kcompactd.  Thus, this patch adds bool direct_compaction to
compact_control to use in 2).  For the case 1) we remove the check
completely - unlike the former kswapd compaction, kcompactd does use the
deferred compaction functionality, so flushing tied to restarting from
deferred compaction makes sense here.
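
As a rough sketch of the idea (using the names described above, not the
literal diff), case 2) ends up keyed on the new flag instead of on
current_is_kswapd():

  /* in struct compact_control: set only by direct compaction callers */
  bool direct_compaction;

  /* when compaction finished scanning the whole zone: */
  if (cc->direct_compaction)
      zone->compact_blockskip_flush = true;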

Note that when kswapd goes to sleep, kcompactd is woken up, so it will see
the flushed pageblock_skip bits.  This is different from when the former
kswapd compaction observed the bits and I believe it makes more sense.
Kcompactd can afford to be more thorough than a direct compaction trying
to limit allocation latency, or kswapd whose primary goal is to reclaim.

To sum up, after this patch, the pageblock_skip flushing makes intuitively
more sense for kcompactd.  Practically, the differences are minimal.
Stress-highalloc results with order-9 allocations, without direct
reclaim/compaction:

stress-highalloc
                              4.5-rc1               4.5-rc1
                               4-test                5-test
Success 1 Min          3.00 (  0.00%)        5.00 (-66.67%)
Success 1 Mean         4.00 (  0.00%)        6.20 (-55.00%)
Success 1 Max          6.00 (  0.00%)        7.00 (-16.67%)
Success 2 Min          3.00 (  0.00%)        5.00 (-66.67%)
Success 2 Mean         4.20 (  0.00%)        6.40 (-52.38%)
Success 2 Max          6.00 (  0.00%)        7.00 (-16.67%)
Success 3 Min         63.00 (  0.00%)       62.00 (  1.59%)
Success 3 Mean        64.60 (  0.00%)       63.80 (  1.24%)
Success 3 Max         67.00 (  0.00%)       65.00 (  2.99%)

             4.5-rc1     4.5-rc1
              4-test      5-test
User         3088.82     3181.09
System       1142.01     1158.25
Elapsed      1780.91     1799.37

                                  4.5-rc1     4.5-rc1
                                   4-test      5-test
Minor Faults                    106582816   107907437
Major Faults                          813         734
Swap Ins                              311         235
Swap Outs                            5598        5485
Allocation stalls                     184         207
DMA allocs                             32          31
DMA32 allocs                     74843238    75757965
Normal allocs                    25886668    26130990
Movable allocs                          0           0
Direct pages scanned                31429       32797
Kswapd pages scanned              2185293     2202613
Kswapd pages reclaimed            2134389     2143524
Direct pages reclaimed              31234       32545
Kswapd efficiency                     97%         97%
Kswapd velocity                  1228.666    1218.536
Direct efficiency                     99%         99%
Direct velocity                    17.671      18.144
Percentage direct scans                1%          1%
Zone normal velocity              291.409     286.309
Zone dma32 velocity               954.928     950.371
Zone dma velocity                   0.000       0.000
Page writes by reclaim           5598.600    5485.600
Page writes file                        0           0
Page writes anon                     5598        5485
Page reclaim immediate                 96          60
Sector Reads                      4307161     4293509
Sector Writes                    11053091    11072127
Page rescued immediate                  0           0
Slabs scanned                     1555770     1549506
Direct inode steals                  2025        7018
Kswapd inode steals                 45418       40265
Kswapd skipped wait                     0           0
THP fault alloc                       614         612
THP collapse alloc                    324         316
THP splits                              0           0
THP fault fallback                    730         778
THP collapse fail                      14          16
Compaction stalls                     959        1007
Compaction success                     69          67
Compaction failures                   890         939
Page migrate success               662054      721374
Page migrate failure                32846       23469
Compaction pages isolated         1370326     1479924
Compaction migrate scanned        7025772     8812554
Compaction free scanned          73302642    84327916
Compaction cost                       762         838

With direct reclaim/compaction:

stress-highalloc
/home/vbabka/labs/mmtests-results/storm/2016-02-02_16-37/test2/1
                              4.5-rc1               4.5-rc1
                              4-test2               5-test2
Success 1 Min          6.00 (  0.00%)        9.00 (-50.00%)
Success 1 Mean         8.40 (  0.00%)       10.00 (-19.05%)
Success 1 Max         13.00 (  0.00%)       11.00 ( 15.38%)
Success 2 Min          6.00 (  0.00%)        9.00 (-50.00%)
Success 2 Mean         8.60 (  0.00%)       10.00 (-16.28%)
Success 2 Max         12.00 (  0.00%)       11.00 (  8.33%)
Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
Success 3 Max         76.00 (  0.00%)       76.00 (  0.00%)

             4.5-rc1     4.5-rc1
             4-test2     5-test2
User         3258.62     3246.04
System       1177.92     1172.29
Elapsed      1837.02     1836.76

                                  4.5-rc1     4.5-rc1
                                  4-test2     5-test2
Minor Faults                    109392253   109773220
Major Faults                          755         864
Swap Ins                              155         262
Swap Outs                            5790        5871
Allocation stalls                    4562        4540
DMA allocs                             34          39
DMA32 allocs                     76901680    77122082
Normal allocs                    26587089    26748274
Movable allocs                          0           0
Direct pages scanned               108854      120966
Kswapd pages scanned              2131589     2135012
Kswapd pages reclaimed            2090937     2108388
Direct pages reclaimed             108699      120577
Kswapd efficiency                     98%         98%
Kswapd velocity                  1160.870    1170.537
Direct efficiency                     99%         99%
Direct velocity                    59.283      66.321
Percentage direct scans                4%          5%
Zone normal velocity              294.389     293.821
Zone dma32 velocity               925.764     943.036
Zone dma velocity                   0.000       0.000
Page writes by reclaim           5790.600    5871.200
Page writes file                        0           0
Page writes anon                     5790        5871
Page reclaim immediate                218         225
Sector Reads                      4376989     4428264
Sector Writes                    11102113    11110668
Page rescued immediate                  0           0
Slabs scanned                     1692486     1709123
Direct inode steals                 16266        6898
Kswapd inode steals                 28364       38351
Kswapd skipped wait                     0           0
THP fault alloc                       567         652
THP collapse alloc                    326         354
THP splits                              0           0
THP fault fallback                    805         793
THP collapse fail                      18          16
Compaction stalls                    2070        2025
Compaction success                    527         518
Compaction failures                  1543        1507
Page migrate success              2423657     2360608
Page migrate failure                28790       40852
Compaction pages isolated         4916017     4802025
Compaction migrate scanned       19370264    21750613
Compaction free scanned         360662356   344372001
Compaction cost                      2745        2694

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm, kswapd: replace kswapd compaction with waking up kcompactd
Vlastimil Babka [Tue, 9 Feb 2016 23:12:54 +0000 (10:12 +1100)]
mm, kswapd: replace kswapd compaction with waking up kcompactd

Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim
and compaction to make a memory allocation of a given order available.
The details differ from direct reclaim, e.g.  in having the high watermark
as a goal.  The code involved in kswapd's reclaim/compaction decisions has
evolved to be quite complex.  Testing reveals that it doesn't actually
work in at least one scenario, and closer inspection suggests that it
could be greatly simplified without compromising on the goal (make a
high-order page available) or efficiency (don't reclaim too much).  The
simplification relies on doing all compaction in kcompactd, which is
simply woken up when the high watermarks are reached by kswapd's reclaim.

The scenario where kswapd compaction doesn't work was found with mmtests
test stress-highalloc configured to attempt order-9 allocations without
direct reclaim, just waking up kswapd.  There was no compaction attempt
from kswapd during the whole test.  Some added instrumentation shows what
happens:

- balance_pgdat() sets end_zone to Normal, as it's not balanced
- reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it
  cannot reclaim anything, so sc.nr_reclaimed is 0
- for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it
  merely checks if high watermarks were reached for base pages. This is true,
  so no reclaim is attempted. For DMA, testorder=0 wasn't used, as
  compaction_suitable() returned COMPACT_SKIPPED
- even though the pgdat_needs_compaction flag wasn't set to false, no
  compaction happens due to the condition sc.nr_reclaimed > nr_attempted
  being false (as 0 < 99)
- priority-- due to nr_reclaimed being 0, repeat until priority reaches 0
  pgdat_balanced() is false as only the small zone DMA appears balanced
  (curiously in that check, watermark appears OK and compaction_suitable()
  returns COMPACT_PARTIAL, because a lower classzone_idx is used there)

Now, even if it was decided that reclaim shouldn't be attempted on the DMA
zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
nr_attempted=0) is also false.  The condition really should use >= as the
comment suggests.  Then there is a mismatch in the check for setting
pgdat_needs_compaction to false using low watermark, while the rest uses
high watermark, and who knows what other subtlety.  Hopefully this
demonstrates that this is unsustainable.

Luckily we can simplify this a lot.  The reclaim/compaction decisions make
sense for direct reclaim scenario, but in kswapd, our primary goal is to
reach high watermark in order-0 pages.  Afterwards we can attempt
compaction just once.  Unlike direct reclaim, we don't reclaim extra pages
(over the high watermark), the current code already disallows it for good
reasons.

After this patch, we simply wake up kcompactd to process the pgdat, after
we have either succeeded or failed to reach the high watermarks in kswapd,
which goes to sleep.  We pass kswapd's order and classzone_idx, so
kcompactd can apply the same criteria to determine which zones are worth
compacting.  Note that we use the classzone_idx from wakeup_kswapd(), not
balanced_classzone_idx which can include higher zones that kswapd tried to
balance too, but didn't consider them in pgdat_balanced().

Since kswapd now cannot create high-order pages itself, we need to adjust
how it determines the zones to be balanced.  The key element here is
adding a "highorder" parameter to zone_balanced, which, when set to false,
makes it consider only order-0 watermark instead of the desired higher
order (this was done previously by kswapd_shrink_zone(), but not
elsewhere).  The false value is passed, for example, in pgdat_balanced().
Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
kcompactd are woken up for a high-order allocation failure.
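
A simplified sketch of the zone_balanced() change described above (not the
exact hunk; the watermark helpers are the existing ones):

  static bool zone_balanced(struct zone *zone, int order, bool highorder,
                            unsigned long balance_gap, int classzone_idx)
  {
      unsigned long mark = high_wmark_pages(zone) + balance_gap;

      /* most callers only care about order-0 pages being balanced */
      if (!highorder)
          order = 0;

      return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
  }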

For testing, I used stress-highalloc configured to do order-9 allocations
with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on
kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
phases 1 and 2 work as usual):

stress-highalloc
                              4.5-rc1               4.5-rc1
                               3-test                4-test
Success 1 Min          1.00 (  0.00%)        3.00 (-200.00%)
Success 1 Mean         1.40 (  0.00%)        4.00 (-185.71%)
Success 1 Max          2.00 (  0.00%)        6.00 (-200.00%)
Success 2 Min          1.00 (  0.00%)        3.00 (-200.00%)
Success 2 Mean         1.80 (  0.00%)        4.20 (-133.33%)
Success 2 Max          3.00 (  0.00%)        6.00 (-100.00%)
Success 3 Min         34.00 (  0.00%)       63.00 (-85.29%)
Success 3 Mean        41.80 (  0.00%)       64.60 (-54.55%)
Success 3 Max         53.00 (  0.00%)       67.00 (-26.42%)

             4.5-rc1     4.5-rc1
              3-test      4-test
User         3166.67     3088.82
System       1153.37     1142.01
Elapsed      1768.53     1780.91

                                  4.5-rc1     4.5-rc1
                                   3-test      4-test
Minor Faults                    106940795   106582816
Major Faults                          829         813
Swap Ins                              482         311
Swap Outs                            6278        5598
Allocation stalls                     128         184
DMA allocs                            145          32
DMA32 allocs                     74646161    74843238
Normal allocs                    26090955    25886668
Movable allocs                          0           0
Direct pages scanned                32938       31429
Kswapd pages scanned              2183166     2185293
Kswapd pages reclaimed            2152359     2134389
Direct pages reclaimed              32735       31234
Kswapd efficiency                     98%         97%
Kswapd velocity                  1243.877    1228.666
Direct efficiency                     99%         99%
Direct velocity                    18.767      17.671
Percentage direct scans                1%          1%
Zone normal velocity              299.981     291.409
Zone dma32 velocity               962.522     954.928
Zone dma velocity                   0.142       0.000
Page writes by reclaim           6278.800    5598.600
Page writes file                        0           0
Page writes anon                     6278        5598
Page reclaim immediate                 93          96
Sector Reads                      4357114     4307161
Sector Writes                    11053628    11053091
Page rescued immediate                  0           0
Slabs scanned                     1592829     1555770
Direct inode steals                  1557        2025
Kswapd inode steals                 46056       45418
Kswapd skipped wait                     0           0
THP fault alloc                       579         614
THP collapse alloc                    304         324
THP splits                              0           0
THP fault fallback                    793         730
THP collapse fail                      11          14
Compaction stalls                    1013         959
Compaction success                     92          69
Compaction failures                   920         890
Page migrate success               238457      662054
Page migrate failure                23021       32846
Compaction pages isolated          504695     1370326
Compaction migrate scanned         661390     7025772
Compaction free scanned          13476658    73302642
Compaction cost                       262         762

After this patch we see improvements in allocation success rate
(especially for phase 3) along with increased compaction activity.  The
compaction stalls (direct compaction) in the interfering kernel builds
(probably THPs) also decreased somewhat thanks to kcompactd activity, yet
THP allocation successes improved a bit.

We can also configure stress-highalloc to perform both direct
reclaim/compaction and wake up kswapd/kcompactd, by using
GFP_KERNEL|__GFP_HIGH|__GFP_COMP:

stress-highalloc
                              4.5-rc1               4.5-rc1
                              3-test2               4-test2
Success 1 Min          4.00 (  0.00%)        6.00 (-50.00%)
Success 1 Mean         8.00 (  0.00%)        8.40 ( -5.00%)
Success 1 Max         12.00 (  0.00%)       13.00 ( -8.33%)
Success 2 Min          4.00 (  0.00%)        6.00 (-50.00%)
Success 2 Mean         8.20 (  0.00%)        8.60 ( -4.88%)
Success 2 Max         13.00 (  0.00%)       12.00 (  7.69%)
Success 3 Min         75.00 (  0.00%)       75.00 (  0.00%)
Success 3 Mean        75.60 (  0.00%)       75.60 (  0.00%)
Success 3 Max         77.00 (  0.00%)       76.00 (  1.30%)

             4.5-rc1     4.5-rc1
             3-test2     4-test2
User         3344.73     3258.62
System       1194.24     1177.92
Elapsed      1838.04     1837.02

                                  4.5-rc1     4.5-rc1
                                  3-test2     4-test2
Minor Faults                    111269736   109392253
Major Faults                          806         755
Swap Ins                              671         155
Swap Outs                            5390        5790
Allocation stalls                    4610        4562
DMA allocs                            250          34
DMA32 allocs                     78091501    76901680
Normal allocs                    27004414    26587089
Movable allocs                          0           0
Direct pages scanned               125146      108854
Kswapd pages scanned              2119757     2131589
Kswapd pages reclaimed            2073183     2090937
Direct pages reclaimed             124909      108699
Kswapd efficiency                     97%         98%
Kswapd velocity                  1161.027    1160.870
Direct efficiency                     99%         99%
Direct velocity                    68.545      59.283
Percentage direct scans                5%          4%
Zone normal velocity              296.678     294.389
Zone dma32 velocity               932.841     925.764
Zone dma velocity                   0.053       0.000
Page writes by reclaim           5392.000    5790.600
Page writes file                        1           0
Page writes anon                     5390        5790
Page reclaim immediate                104         218
Sector Reads                      4350232     4376989
Sector Writes                    11126496    11102113
Page rescued immediate                  0           0
Slabs scanned                     1705294     1692486
Direct inode steals                  8700       16266
Kswapd inode steals                 36352       28364
Kswapd skipped wait                     0           0
THP fault alloc                       599         567
THP collapse alloc                    323         326
THP splits                              0           0
THP fault fallback                    806         805
THP collapse fail                      17          18
Compaction stalls                    2457        2070
Compaction success                    906         527
Compaction failures                  1551        1543
Page migrate success              2031423     2423657
Page migrate failure                32845       28790
Compaction pages isolated         4129761     4916017
Compaction migrate scanned       11996712    19370264
Compaction free scanned         214970969   360662356
Compaction cost                      2271        2745

Here, this patch doesn't change the success rate as direct compaction
already tries what it can.  There's however significant reduction in
direct compaction stalls, made entirely of the successful stalls.  This
means the offload to kcompactd is working as expected, and direct
compaction is reduced either due to detecting contention, or compaction
deferred by kcompactd.  In the previous version of this patchset there was
some apparent reduction of success rate, but the changes in this version
(such as using sync compaction only), new baseline kernel, and/or
averaging results from 5 executions (my bet), made this go away.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm, memory hotplug: small cleanup in online_pages()
Vlastimil Babka [Tue, 9 Feb 2016 23:12:53 +0000 (10:12 +1100)]
mm, memory hotplug: small cleanup in online_pages()

We can reuse the nid we've determined instead of repeated pfn_to_nid() usages.
Also zone_to_nid() should be a bit cheaper in general than pfn_to_nid().

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm, compaction: fix build errors with kcompactd
Arnd Bergmann [Tue, 9 Feb 2016 23:12:53 +0000 (10:12 +1100)]
mm, compaction: fix build errors with kcompactd

The newly added kcompactd code introduces multiple build errors:

include/linux/compaction.h:91:12: error: 'kcompactd_run' defined but not used [-Werror=unused-function]
mm/compaction.c:1953:2: error: implicit declaration of function 'hotcpu_notifier' [-Werror=implicit-function-declaration]

This marks the new empty wrapper functions as 'inline' to avoid unused-function warnings,
and includes linux/cpu.h to get the hotcpu_notifier declaration.
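
In other words, a sketch of the shape of the fix (stub signatures assumed,
not the exact diff):

  /* include/linux/compaction.h, !CONFIG_COMPACTION stubs */
  static inline int kcompactd_run(int nid)
  {
      return 0;
  }
  static inline void kcompactd_stop(int nid)
  {
  }

  /* mm/compaction.c */
  #include <linux/cpu.h>    /* hotcpu_notifier() */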

Fixes: 8364acdfa45a ("mm, compaction: introduce kcompactd")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm, compaction: introduce kcompactd
Vlastimil Babka [Tue, 9 Feb 2016 23:12:52 +0000 (10:12 +1100)]
mm, compaction: introduce kcompactd

Memory compaction can be currently performed in several contexts:

- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP page
  fault attempts
- khugepaged trying to collapse a hugepage
- manually from /proc

The purpose of compaction is two-fold.  The obvious purpose is to satisfy
a (pending or future) high-order allocation, and is easy to evaluate.  The
other purpose is to keep overall memory fragmentation low and help the
anti-fragmentation mechanism.  The success wrt the latter purpose is more
difficult to evaluate.

The current situation wrt the purposes has a few drawbacks:

- compaction is invoked only when a high-order page or hugepage is not
  available (or manually). This might be too late for the purposes of keeping
  memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would be
  better if compaction was performed asynchronously to keep fragmentation low,
  before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP page
  faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in some
  scenarios. It could also end up compacting for a high-order allocation
  request when it should be reclaiming memory for a later order-0 request.

To improve the situation, we should be able to benefit from an equivalent
of kswapd, but for compaction - i.e.  a background thread which responds
to fragmentation and the need for high-order allocations (including
hugepages) somewhat proactively.

One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much.  It should be better to let kswapd
handle reclaim, as order-0 allocations are often more critical than
high-order ones.

Another possibility is to extend khugepaged, but this kthread is a single
instance and tied to THP configs.

This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new tunables.
The lifecycle mimics kswapd kthreads, including the memory hotplug hooks.

For compaction, kcompactd uses the standard compaction_suitable() and
compact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
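
The main loop is therefore expected to look roughly like the sketch below
(simplified; kcompactd_work_requested()/kcompactd_do_work() are assumed
helper names, details may differ from the actual patch):

  static int kcompactd(void *p)
  {
      pg_data_t *pgdat = (pg_data_t *)p;

      set_freezable();

      while (!kthread_should_stop()) {
          /* sleep until kswapd (or hotplug code) requests compaction */
          wait_event_freezable(pgdat->kcompactd_wait,
                               kcompactd_work_requested(pgdat));

          /* compact the node, honouring the deferral logic */
          kcompactd_do_work(pgdat);
      }

      return 0;
  }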

This patch doesn't yet add a call to wakeup_kcompactd.  The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with the
old approach.

Waking up of the kcompactd threads is also tied to kswapd activity and follows
these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from the
  slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so don't
  invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd

Possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are not
available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e.  __rmqueue_fallback()) occurs.  It's also
possible to perform periodic compaction with kcompactd.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm, kswapd: remove bogus check of balance_classzone_idx
Vlastimil Babka [Tue, 9 Feb 2016 23:12:51 +0000 (10:12 +1100)]
mm, kswapd: remove bogus check of balance_classzone_idx

During work on kcompactd integration I have spotted a confusing check of
balance_classzone_idx, which I believe is bogus.

The balanced_classzone_idx is filled by balance_pgdat() as the highest
zone it attempted to balance.  This was introduced by commit dc83edd941f4
("mm: kswapd: use the classzone idx that kswapd was using for
sleeping_prematurely()").  The intention is that (as expressed in today's
function names), the value used for kswapd_shrink_zone() calls in
balance_pgdat() is the same as for the decisions in kswapd_try_to_sleep().
An unwanted side-effect of that commit was breaking the checks in
kswapd() whether there was another kswapd_wakeup with a tighter (=lower)
classzone_idx.  Commits 215ddd6664ce ("mm: vmscan: only read
new_classzone_idx from pgdat when reclaiming successfully") and
d2ebd0f6b895 ("kswapd: avoid unnecessary rebalance after an unsuccessful
balancing") tried to fixed, but apparently introduced a bogus check that
this patch removes.

Consider zone indexes X < Y < Z, where:
- Z is the value used for the first kswapd wakeup.
- Y is returned as balanced_classzone_idx, which means zones with index higher
  than Y (including Z) were found to be unreclaimable.
- X is the value used for the second kswapd wakeup

The new wakeup with value X means that kswapd is now supposed to balance
harder all zones with index <= X.  But instead, due to Y < Z, it will go
to sleep and won't read the new value X.  This is subtly wrong.

The effect of this patch is that kswapd will react better in some
situations, where e.g.  the first wakeup is for ZONE_DMA32, the second is
for ZONE_DMA, and ZONE_NORMAL is unreclaimable.  Before this patch,
kswapd would go to sleep instead of reclaiming ZONE_DMA harder.  I expect
these situations are very rare, and more value is in the better
maintainability due to the removal of the confusing and bogus check.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agotile: query dynamic DEBUG_PAGEALLOC setting
Joonsoo Kim [Tue, 9 Feb 2016 23:12:50 +0000 (10:12 +1100)]
tile: query dynamic DEBUG_PAGEALLOC setting

We can disable debug_pagealloc processing even if the code is compiled
with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query whether
it is enabled or not in runtime.
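
The runtime query uses the existing debug_pagealloc_enabled() helper, so
the pattern in the arch code is simply (sketch):

  if (!debug_pagealloc_enabled())
      return;    /* disabled at boot: skip the extra map/unmap work */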

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Takashi Iwai <tiwai@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agosound-query-dynamic-debug_pagealloc-setting-fix
Andrew Morton [Tue, 9 Feb 2016 23:12:49 +0000 (10:12 +1100)]
sound-query-dynamic-debug_pagealloc-setting-fix

export _debug_pagealloc_enabled to modules

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Takashi Iwai <tiwai@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agosound: query dynamic DEBUG_PAGEALLOC setting
Joonsoo Kim [Tue, 9 Feb 2016 23:12:48 +0000 (10:12 +1100)]
sound: query dynamic DEBUG_PAGEALLOC setting

We can disable debug_pagealloc processing even if the code is compiled
with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
whether it is enabled or not in runtime.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Takashi Iwai <tiwai@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm-slub-query-dynamic-debug_pagealloc-setting-fix
Andrew Morton [Tue, 9 Feb 2016 23:12:47 +0000 (10:12 +1100)]
mm-slub-query-dynamic-debug_pagealloc-setting-fix

clean up code, per Christian

Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Takashi Iwai <tiwai@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm/slub: query dynamic DEBUG_PAGEALLOC setting
Joonsoo Kim [Tue, 9 Feb 2016 23:12:46 +0000 (10:12 +1100)]
mm/slub: query dynamic DEBUG_PAGEALLOC setting

We can disable debug_pagealloc processing even if the code is compiled
with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
whether it is enabled or not in runtime.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Takashi Iwai <tiwai@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm-vmalloc-query-dynamic-debug_pagealloc-setting-fix
Andrew Morton [Tue, 9 Feb 2016 23:12:45 +0000 (10:12 +1100)]
mm-vmalloc-query-dynamic-debug_pagealloc-setting-fix

update comment, per David.  Adjust comment to use 80 cols

Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Takashi Iwai <tiwai@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm/vmalloc: query dynamic DEBUG_PAGEALLOC setting
Joonsoo Kim [Tue, 9 Feb 2016 23:12:44 +0000 (10:12 +1100)]
mm/vmalloc: query dynamic DEBUG_PAGEALLOC setting

As CONFIG_DEBUG_PAGEALLOC can be enabled/disabled via kernel parameters we
can optimize some cases by checking the enablement state.

This is follow-up work for Christian's Optimize CONFIG_DEBUG_PAGEALLOC:
https://lkml.org/lkml/2016/1/27/194

Remaining work is to make sparc aware of this, but that does not look
easy to me, so I skip it in this series.

This patch (of 5):

We can disable debug_pagealloc processing even if the code is compiled
with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
whether it is enabled or not in runtime.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agotools/vm/page-types.c: support swap entry
Naoya Horiguchi [Tue, 9 Feb 2016 23:12:43 +0000 (10:12 +1100)]
tools/vm/page-types.c: support swap entry

/proc/pid/pagemap (pte_to_pagemap_entry() internally) already reports
about swap entry, so let's make the in-kernel utility aware of it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago/proc/kpageflags: return KPF_SLAB for slab tail pages
Naoya Horiguchi [Tue, 9 Feb 2016 23:12:43 +0000 (10:12 +1100)]
/proc/kpageflags: return KPF_SLAB for slab tail pages

Currently /proc/kpageflags returns just KPF_COMPOUND_TAIL for slab tail
pages, which is inconvenient when grasping how slab pages are distributed
(userspace always needs to check which kind of tail pages by itself).
This patch sets KPF_SLAB for such pages.

With this patch:

  $ grep Slab /proc/meminfo ; tools/vm/page-types -b slab
  Slab:              64880 kB
               flags      page-count       MB  symbolic-flags                     long-symbolic-flags
  0x0000000000000080           16220       63  _______S__________________________________ slab
               total           16220       63

16220 pages equal 64880 kB, so the returned result is consistent with the
global counter.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agoproc-kpageflags-return-kpf_buddy-for-tail-buddy-pages-fix-fix
Andrew Morton [Tue, 9 Feb 2016 23:12:42 +0000 (10:12 +1100)]
proc-kpageflags-return-kpf_buddy-for-tail-buddy-pages-fix-fix

update comment, per Naoya

Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agoproc: kpageflags: do not report buddy and balloon pages as mapped
Vladimir Davydov [Tue, 9 Feb 2016 23:12:41 +0000 (10:12 +1100)]
proc: kpageflags: do not report buddy and balloon pages as mapped

PageBuddy and PageBalloon are not usual page flags - they are identified
by a special negative (so as not to confuse with mapped pages) value of
page->_mapcount.  Since /proc/kpageflags uses page_mapcount helper to
check if a page is mapped, it reports pages of these kinds as being
mapped, which is confusing.  Fix that by replacing page_mapcount with
page_mapped.
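
The /proc/kpageflags code lives in fs/proc/page.c; the change boils down
to using page_mapped() for the KPF_MMAP bit (sketch, not the literal
hunk):

  /* was: if (page_mapcount(page)) */
  if (page_mapped(page))
      u |= 1 << KPF_MMAP;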

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years ago/proc/kpageflags: return KPF_BUDDY for "tail" buddy pages
Naoya Horiguchi [Tue, 9 Feb 2016 23:12:41 +0000 (10:12 +1100)]
/proc/kpageflags: return KPF_BUDDY for "tail" buddy pages

Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
is inconvenient when grasping how free pages are distributed.  This patch
sets KPF_BUDDY for such pages.

With this patch:

  $ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
  MemFree:         3134992 kB
               flags      page-count       MB  symbolic-flags                     long-symbolic-flags
  0x0000000000000400          779272     3044  __________B_______________________________ buddy
  0x0000000000000c00            4385       17  __________BM______________________________ buddy,mmap
               total          783657     3061

783657 pages is 3134628 kB (roughly consistent with the global counter),
so it's OK.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: ASLR: use get_random_long()
Daniel Cashman [Tue, 9 Feb 2016 23:12:40 +0000 (10:12 +1100)]
mm: ASLR: use get_random_long()

Replace calls to get_random_int() followed by a cast to (unsigned long)
with calls to get_random_long().  Also address a shifting bug which, in
the case of x86, removed the entropy mask for mmap_rnd_bits values > 31
bits.

Signed-off-by: Daniel Cashman <dcashman@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agodrivers: char: random: add get_random_long()
Daniel Cashman [Tue, 9 Feb 2016 23:12:40 +0000 (10:12 +1100)]
drivers: char: random: add get_random_long()

d07e22597d1d355 ("mm: mmap: add new /proc tunable for mmap_base ASLR")
added the ability to choose from a range of values to use for entropy
count in generating the random offset to the mmap_base address.  The
maximum value on this range was set to 32 bits for 64-bit x86 systems, but
this value could be increased further, requiring more than the 32 bits of
randomness provided by get_random_int(), as is already possible for arm64.
Add a new function: get_random_long() which more naturally fits with the
mmap usage of get_random_int() but operates exactly the same as
get_random_int().

Also, fix the shifting constant in mmap_rnd() to be an unsigned long so
that values greater than 31 bits generate an appropriate mask without
overflow.  This is especially important on x86, as its shift instruction
uses a 5-bit mask for the shift operand, which meant that any value for
mmap_rnd_bits over 31 acts as a no-op and effectively disables mmap_base
randomization.
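
Concretely, the mask computation in the x86 mmap_rnd() needs an unsigned
long shift, along these lines (sketch of the fix):

  /* 1UL instead of 1, so the mask is computed in unsigned long and
   * mmap_rnd_bits > 31 no longer overflows to a zero mask */
  rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);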

Finally, replace calls to get_random_int() with get_random_long() where
appropriate.

This patch (of 2):

Add get_random_long().

Signed-off-by: Daniel Cashman <dcashman@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm-memcontrol-report-kernel-stack-usage-in-cgroup2-memorystat-v2
Vladimir Davydov [Tue, 9 Feb 2016 23:12:39 +0000 (10:12 +1100)]
mm-memcontrol-report-kernel-stack-usage-in-cgroup2-memorystat-v2

> The only thing that strikes me is that you
> appended the new stat items to the enum, but then prepended them to
> the doc and stat file sections. Why is that?

No reason. Let's rearrange the enum fields to be consistent with the
stat file output.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: memcontrol: report kernel stack usage in cgroup2 memory.stat
Vladimir Davydov [Tue, 9 Feb 2016 23:12:39 +0000 (10:12 +1100)]
mm: memcontrol: report kernel stack usage in cgroup2 memory.stat

Show how much memory is allocated to kernel stacks.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: memcontrol: report slab usage in cgroup2 memory.stat
Vladimir Davydov [Tue, 9 Feb 2016 23:12:38 +0000 (10:12 +1100)]
mm: memcontrol: report slab usage in cgroup2 memory.stat

Show how much memory is used for storing reclaimable and unreclaimable
in-kernel data structures allocated from slab caches.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm-memcontrol-make-tree_statevents-fetch-all-stats-fix
Vladimir Davydov [Tue, 9 Feb 2016 23:12:38 +0000 (10:12 +1100)]
mm-memcontrol-make-tree_statevents-fetch-all-stats-fix

> This looks much better, and most of the callstacks involved here are
> very flat, so the increased stack consumption should be alright.
>
> The only exception there is the threshold code, which can happen from
> the direct reclaim path and thus with a fairly deep stack already.

Yeah, I missed this case. Thought mem_cgroup_usage is only used for
reporting to userspace. Thanks for catching this.

>
> Would it be better to leave mem_cgroup_usage() alone, open-code it,
> and then use tree_stat() and tree_events() only for v2 memory.stat?
>

Definitely.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: memcontrol: make tree_{stat,events} fetch all stats
Vladimir Davydov [Tue, 9 Feb 2016 23:12:38 +0000 (10:12 +1100)]
mm: memcontrol: make tree_{stat,events} fetch all stats

Currently, tree_{stat,events} helpers can only get one stat index at a
time, so when there are a lot of stats to be reported one has to call it
over and over again (see memory_stat_show).  This is neither effective,
nor does it look good.  Instead, let's make these helpers take a snapshot
of all available counters.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: memcontrol: do not bypass slab charge if memcg is offline
Vladimir Davydov [Tue, 9 Feb 2016 23:12:37 +0000 (10:12 +1100)]
mm: memcontrol: do not bypass slab charge if memcg is offline

Slab pages are charged in two steps.  First, an appropriate per memcg
cache is selected (see memcg_kmem_get_cache) based on the current
context, then the new slab page is charged to the memory cgroup which the
selected cache was created for (see memcg_charge_slab ->
__memcg_kmem_charge_memcg).  It is OK to bypass kmemcg charge at step 1,
but if step 1 succeeded and we successfully allocated a new slab page,
step 2 must be performed, otherwise we would get a per memcg kmem cache
which contains a slab that does not hold a reference to the memory cgroup
owning the cache.  Since per memcg kmem caches are destroyed on memcg css
free, this could result in freeing a cache while there are still active
objects in it.

However, currently we will bypass slab page charge if the memory cgroup
owning the cache is offline (see __memcg_kmem_charge_memcg).  This is very
unlikely to occur in practice, because for this to happen a process must
be migrated to a different cgroup and the old cgroup must be removed while
the process is in kmalloc somewhere between steps 1 and 2 (e.g.  trying to
allocate a new page).  Nevertheless, it's still better to eliminate such a
possibility.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: memcontrol: drop unnecessary lru locking from mem_cgroup_migrate()
Johannes Weiner [Tue, 9 Feb 2016 23:12:37 +0000 (10:12 +1100)]
mm: memcontrol: drop unnecessary lru locking from mem_cgroup_migrate()

Migration accounting in the memory controller used to have to handle both
oldpage and newpage being on the LRU already; fuse's page cache
replacement used to pass a recycled newpage that had been uncharged but
not freed and removed from the LRU, and the memcg migration code used to
uncharge oldpage to "pass on" the existing charge to newpage.

Nowadays, pages are no longer uncharged when truncated from the page
cache, but rather only at free time, so if a LRU page is recycled in page
cache replacement it'll also still be charged.  And we bail out of the
charge transfer altogether in that case.  Tell commit_charge() that we
know newpage is not on the LRU, to avoid taking the zone->lru_lock
unnecessarily from the migration path.

But also, oldpage is no longer uncharged inside migration.  We only use
oldpage for its page->mem_cgroup and page size, so we don't care about its
LRU state anymore either.  Remove any mention from the kernel doc.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: migrate: consolidate mem_cgroup_migrate() calls
Johannes Weiner [Tue, 9 Feb 2016 23:12:36 +0000 (10:12 +1100)]
mm: migrate: consolidate mem_cgroup_migrate() calls

Rather than scattering mem_cgroup_migrate() calls all over the place, have
a single call from a safe place where every migration operation eventually
ends up in - migrate_page_copy().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm-compaction-speed-up-pageblock_pfn_to_page-when-zone-is-contiguous-v3
Joonsoo Kim [Tue, 9 Feb 2016 23:12:36 +0000 (10:12 +1100)]
mm-compaction-speed-up-pageblock_pfn_to_page-when-zone-is-contiguous-v3

v3
o remove pfn_valid_within() check for all pages in the pageblock
because pageblock_pfn_to_page() is only called with pageblock aligned pfn.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Aaron Lu <aaron.lu@intel.com>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous
Joonsoo Kim [Tue, 9 Feb 2016 23:12:36 +0000 (10:12 +1100)]
mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous

There is a report of a performance drop due to hugepage allocation, where
half of the cpu time is spent on pageblock_pfn_to_page() in compaction [1].
In that workload, compaction is triggered to make a hugepage, but most of
the pageblocks are unavailable for compaction due to the pageblock type
and skip bit, so compaction usually fails.  The most costly operation in
this case is finding a valid pageblock while scanning the whole zone
range.  To check whether a pageblock is valid to compact, a valid pfn
within the pageblock is required, and we can obtain it by calling
pageblock_pfn_to_page().  This function checks whether the pageblock is in
a single zone and returns a valid pfn if possible.  The problem is that we
need to perform this check every time before scanning a pageblock, even if
we re-visit it, and this turns out to be very expensive in this workload.

Although we have no way to skip this pageblock check on a system where
holes can exist at arbitrary positions, we can use a cached value for zone
contiguity and just do pfn_to_page() on a system where no hole exists.
This optimization considerably speeds up the above workload.
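
The idea is a cached per-zone flag plus a fast path around the existing
check, along these lines (sketch; the flag name follows the "zone
contiguity" wording above):

  static inline struct page *
  pageblock_pfn_to_page(unsigned long start_pfn, unsigned long end_pfn,
                        struct zone *zone)
  {
      /* no holes in this zone: every pfn in the range is valid */
      if (zone->contiguous)
          return pfn_to_page(start_pfn);

      /* slow path: validate pfns and check for a single zone */
      return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
  }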

Before vs After
Max: 1096 MB/s vs 1325 MB/s
Min: 635 MB/s vs 1015 MB/s
Avg: 899 MB/s vs 1194 MB/s

Avg is improved by roughly 30% [2].

[1]: http://www.spinics.net/lists/linux-mm/msg97378.html
[2]: https://lkml.org/lkml/2015/12/9/23

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Aaron Lu <aaron.lu@intel.com>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm/compaction: pass only pageblock aligned range to pageblock_pfn_to_page
Joonsoo Kim [Tue, 9 Feb 2016 23:12:35 +0000 (10:12 +1100)]
mm/compaction: pass only pageblock aligned range to pageblock_pfn_to_page

pageblock_pfn_to_page() is used to check there is valid pfn and all pages
in the pageblock is in a single zone.  If there is a hole in the
pageblock, passing arbitrary position to pageblock_pfn_to_page() could
cause to skip whole pageblock scanning, instead of just skipping the hole
page.  For deterministic behaviour, it's better to always pass pageblock
aligned range to pageblock_pfn_to_page().  It will also help further
optimization on pageblock_pfn_to_page() in the following patch.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm/compaction: fix invalid free_pfn and compact_cached_free_pfn
Joonsoo Kim [Tue, 9 Feb 2016 23:12:35 +0000 (10:12 +1100)]
mm/compaction: fix invalid free_pfn and compact_cached_free_pfn

free_pfn and compact_cached_free_pfn are the pointers that remember the
restart position of the freepage scanner.  When they are reset or invalid,
we set them to zone_end_pfn because the freepage scanner works in reverse
direction.  But, because the zone range is defined as [zone_start_pfn,
zone_end_pfn), zone_end_pfn is invalid to access.  Therefore, we should
not store it to free_pfn and compact_cached_free_pfn.  Instead, we need to
store zone_end_pfn - 1 to them.  There is one more thing we should
consider.  The freepage scanner scans in reverse by pageblock unit.  If
free_pfn and compact_cached_free_pfn are set to the middle of a pageblock,
it regards that situation as if it had already scanned the front part of
the pageblock, so we lose the opportunity to scan there.  To fix this up,
this patch does round_down() to guarantee that the reset position will be
pageblock aligned.
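
In other words, the reset becomes something like (sketch):

  /* last valid pfn, rounded down to the start of its pageblock */
  zone->compact_cached_free_pfn =
          round_down(zone_end_pfn(zone) - 1, pageblock_nr_pages);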

Note that thanks to the current pageblock_pfn_to_page() implementation,
actual access to zone_end_pfn doesn't happen until now.  But, the
following patch will change pageblock_pfn_to_page(), so this fix is needed
from now on.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm/memblock.c: remove unnecessary memblock_type variable
Alexander Kuleshov [Tue, 9 Feb 2016 23:12:34 +0000 (10:12 +1100)]
mm/memblock.c: remove unnecessary memblock_type variable

We define struct memblock_type *type in the memblock_add_region() and
memblock_reserve_region() functions only for passing it to the
memblock_add_range() and memblock_reserve_range() functions.  Let's remove
these variables and pass the type directly.
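
So, e.g., memblock_add_region() ends up as little more than the sketch
below (the exact signature is recalled from memory and may differ):

  static int __init_memblock memblock_add_region(phys_addr_t base,
                                                 phys_addr_t size,
                                                 int nid, unsigned long flags)
  {
      /* no local memblock_type variable, pass &memblock.memory directly */
      return memblock_add_range(&memblock.memory, base, size, nid, flags);
  }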

Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm, oom_reaper: implement OOM victims queuing
Michal Hocko [Tue, 9 Feb 2016 23:12:34 +0000 (10:12 +1100)]
mm, oom_reaper: implement OOM victims queuing

wake_oom_reaper has allowed only 1 oom victim to be queued.  The main
reason for that was the simplicity as other solutions would require some
way of queuing.  The current approach is racy and that was deemed
sufficient as the oom_reaper is considered a best effort approach to help
with oom handling when the OOM victim cannot terminate in a reasonable
time.  The race could lead to missing an oom victim which can get stuck:

out_of_memory
  wake_oom_reaper
    cmpxchg // OK
     oom_reaper
  oom_reap_task
    __oom_reap_task
oom_victim terminates
      atomic_inc_not_zero // fail
out_of_memory
  wake_oom_reaper
    cmpxchg // fails
  task_to_reap = NULL

This race requires 2 OOM invocations in a short time period which is not
very likely but certainly not impossible.  E.g.  the original victim might
have not released a lot of memory for some reason.

The situation would improve considerably if wake_oom_reaper used a more
robust queuing.  This is what this patch implements.  This means adding
oom_reaper_list list_head into task_struct (eat a hole before the embedded
thread_struct for that purpose) and an oom_reaper_lock spinlock for queuing
synchronization.  wake_oom_reaper will then add the task on the queue and
oom_reaper will dequeue it.
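
A rough sketch of the queuing, using the names from this changelog (not
necessarily the literal patch; the real code also has to avoid queuing the
same task twice):

  static LIST_HEAD(oom_reaper_list);
  static DEFINE_SPINLOCK(oom_reaper_lock);

  static void wake_oom_reaper(struct task_struct *tsk)
  {
      if (!oom_reaper_th)
          return;

      get_task_struct(tsk);

      spin_lock(&oom_reaper_lock);
      list_add_tail(&tsk->oom_reaper_list, &oom_reaper_list);
      spin_unlock(&oom_reaper_lock);

      wake_up(&oom_reaper_wait);
  }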

Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm-oom_reaper-report-success-failure-fix
Michal Hocko [Tue, 9 Feb 2016 23:12:34 +0000 (10:12 +1100)]
mm-oom_reaper-report-success-failure-fix

update the log message to be more specific

Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm-oom_reaper-report-success-failure-fix
Andrew Morton [Tue, 9 Feb 2016 23:12:33 +0000 (10:12 +1100)]
mm-oom_reaper-report-success-failure-fix

fix CONFIG_MMU=n build:

mm/oom_kill.c: In function 'oom_kill_process':
mm/oom_kill.c:753: error: implicit declaration of function 'K'
mm/oom_kill.c:753: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'int'
mm/oom_kill.c:753: warning: format '%lu' expects type 'long unsigned int', but argument 5 has type 'int'
mm/oom_kill.c:753: warning: format '%lu' expects type 'long unsigned int', but argument 6 has type 'int'
mm/oom_kill.c:753: warning: format '%lu' expects type 'long unsigned int', but argument 7 has type 'int'
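The implicit declaration comes from the K() pages-to-kilobytes helper not
being visible in this configuration; a sketch of the kind of fix (the exact
placement in the patch may differ):

/* make the K() helper available regardless of CONFIG_MMU */
#define K(x) ((x) << (PAGE_SHIFT - 10))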

Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm, oom_reaper: report success/failure
Michal Hocko [Tue, 9 Feb 2016 23:12:33 +0000 (10:12 +1100)]
mm, oom_reaper: report success/failure

Inform about successful/failed oom_reaper attempts and dump all the held
locks to tell us more about who is blocking the progress.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agooom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
Michal Hocko [Tue, 9 Feb 2016 23:12:32 +0000 (10:12 +1100)]
oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space

When oom_reaper manages to unmap all the eligible vmas there shouldn't be
much of the freeable memory held by the oom victim left anymore, so it makes
sense to clear the TIF_MEMDIE flag for the victim and allow the OOM killer
to select another task.

The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore, but that shouldn't be a problem because it would get the
access again if it needs to allocate and hits the OOM killer again due to the
fatal_signal_pending resp.  PF_EXITING check.  We can safely hide the task
from the OOM killer because it is clearly not a good candidate anymore, as
everything reclaimable has been torn down already.

This patch will allow capping the time an OOM victim can keep TIF_MEMDIE and
thus hold off further global OOM killer actions, granted the oom reaper is
able to take mmap_sem for the associated mm struct.  This is not guaranteed
now, but further steps should make sure that waiting for mmap_sem for write
is killable, which will help to reduce such lock contention.  This is not
done by this patch.

Note that exit_oom_victim might be called on a remote task from
__oom_reap_task now so we have to check and clear the flag atomically
otherwise we might race and underflow oom_victims or wake up waiters too
early.
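A sketch of the atomic test-and-clear mentioned above (the oom_victims
counter and wait queue names are assumptions for illustration, not the exact
patch):

void exit_oom_victim(struct task_struct *tsk)
{
	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
		return;		/* already cleared, e.g. by the remote caller */

	/* assumed bookkeeping: drop the victim count and wake up waiters */
	if (!atomic_dec_return(&oom_victims))
		wake_up_all(&oom_victims_wait);
}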

Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agooom reaper: handle mlocked pages
Michal Hocko [Tue, 9 Feb 2016 23:12:32 +0000 (10:12 +1100)]
oom reaper: handle mlocked pages

__oom_reap_vmas currently skips over all mlocked vmas because they need
special treatment before they are unmapped.  This is primarily done for
simplicity.  There is no reason to skip over them and reduce the amount of
reclaimed memory, though.  This is safe from the semantic point of view
because try_to_unmap_one during the rmap walk would keep telling the reclaim
to cull the page back and mlock it again.

munlock_vma_pages_all is also safe to be called from the oom reaper
context because it doesn't sit on any locks but mmap_sem (for read).
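In code, the change amounts to something like the following inside the
reaper's vma walk (sketch only, surrounding code elided):

	if (vma->vm_flags & VM_LOCKED)
		munlock_vma_pages_all(vma);	/* drop the mlock before reaping */
	/* ... then unmap the vma as for any other anonymous mapping ... */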

Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm, oom: introduce oom reaper
Michal Hocko [Tue, 9 Feb 2016 23:12:32 +0000 (10:12 +1100)]
mm, oom: introduce oom reaper

This patch (of 5):

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently allows killing only a single task, in the good hope
that the task will terminate in a reasonable time and free up its memory.
Such a task (oom victim) gets access to memory reserves via mark_oom_victim
to allow forward progress should there be a need for additional memory during
the exit path.

It has been shown (e.g.  by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above, and the
OOM victim might take an unbounded amount of time to exit because it might be
blocked in the uninterruptible state waiting for an event (e.g.  a lock)
which is blocked by another task looping in the page allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory owned by
the oom victim, under the assumption that such memory won't be needed when
its owner is killed and kicked out of userspace anyway.  There is one notable
exception to this, though: if the OOM victim was in the process of
coredumping the result would be incomplete.  This is considered a reasonable
constraint because the overall system health is more important than the
debuggability of a particular application.

A kernel thread has been chosen because we need a reliable way of invocation,
so workqueue context is not appropriate because all the workers might be busy
(e.g.  allocating memory).  Kswapd, which sounds like another good fit, is
not appropriate either because it might get blocked on locks during reclaim.

oom_reaper has to take mmap_sem of the target task for reading, so the
solution is not 100% reliable because the semaphore might be held or blocked
for write, but the probability is reduced considerably compared to basically
any lock blocking forward progress as described above.  In order to prevent
blocking on the lock without any forward progress we use only a trylock and
retry 10 times with a short sleep in between.  Users of mmap_sem which need
it for write should be carefully reviewed to use _killable waiting as much as
possible and to reduce allocation requests done with the lock held to the
absolute minimum, to reduce the risk even further.

The API between the oom killer and the oom reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only a NULL->mm
transition and oom_reaper clears this atomically once it is done with the
work.  This means that only a single mm_struct can be reaped at a time.  As
the operation is potentially disruptive we try to limit it to the necessary
minimum and the reaper blocks any updates while it operates on an mm.
mm_struct is pinned by mm_count to allow parallel exit_mmap, and a race is
detected by atomic_inc_not_zero(mm_users).
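A minimal sketch of the trylock-and-retry policy described above, using the
schedule_timeout_idle() helper added below (helper name and retry count per
the text; the sleep length and __oom_reap_task() semantics are illustrative
assumptions):

static void oom_reap_task(struct task_struct *tsk)
{
	int attempts = 0;

	/* assume __oom_reap_task() returns false when mmap_sem could not
	 * be taken for read (trylock failed) */
	while (attempts++ < 10 && !__oom_reap_task(tsk))
		schedule_timeout_idle(HZ / 10);
}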

Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agosched: add schedule_timeout_idle()
Andrew Morton [Tue, 9 Feb 2016 23:12:31 +0000 (10:12 +1100)]
sched: add schedule_timeout_idle()

This will be needed in the patch "mm, oom: introduce oom reaper".
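A plausible minimal implementation, mirroring schedule_timeout_uninterruptible()
(sketch, not necessarily the exact patch):

signed long __sched schedule_timeout_idle(signed long timeout)
{
	__set_current_state(TASK_IDLE);
	return schedule_timeout(timeout);
}
EXPORT_SYMBOL(schedule_timeout_idle);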

Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agox86: also use debug_pagealloc_enabled() for free_init_pages
Christian Borntraeger [Tue, 9 Feb 2016 23:12:31 +0000 (10:12 +1100)]
x86: also use debug_pagealloc_enabled() for free_init_pages

We want to couple all debugging features with debug_pagealloc_enabled() and
not with the config option CONFIG_DEBUG_PAGEALLOC.
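The pattern is a runtime check instead of a compile-time #ifdef; roughly
(sketch, the actual hunk in free_init_pages() may differ):

	if (debug_pagealloc_enabled()) {
		/* mark the pages not-present so stray accesses fault */
		set_memory_np(begin, (end - begin) >> PAGE_SHIFT);
	} else {
		free_reserved_area((void *)begin, (void *)end,
				   POISON_FREE_INITMEM, what);
	}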

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Laura Abbott <labbott@fedoraproject.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agos390: query dynamic DEBUG_PAGEALLOC setting
Christian Borntraeger [Tue, 9 Feb 2016 23:12:30 +0000 (10:12 +1100)]
s390: query dynamic DEBUG_PAGEALLOC setting

We can use debug_pagealloc_enabled() to check if we can map the identity
mapping with 1MB/2GB pages as well as to print the current setting in
dump_stack.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Laura Abbott <labbott@fedoraproject.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agox86: query dynamic DEBUG_PAGEALLOC setting
Christian Borntraeger [Tue, 9 Feb 2016 23:12:30 +0000 (10:12 +1100)]
x86: query dynamic DEBUG_PAGEALLOC setting

We can use debug_pagealloc_enabled() to check if we can map the identity
mapping with 2MB pages.  We can also add the state into the dump_stack
output.

The patch does not touch the code for the 1GB pages, which ignored
CONFIG_DEBUG_PAGEALLOC.  Do we need to fence this as well?

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Laura Abbott <labbott@fedoraproject.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agothp: cleanup split_huge_page()
Kirill A. Shutemov [Tue, 9 Feb 2016 23:12:30 +0000 (10:12 +1100)]
thp: cleanup split_huge_page()

After one of the bugfixes to freeze_page(), we no longer have frozen pages in
the rmap, therefore the mapcount of all subpages of a frozen THP is zero.
And we have an assert for that.

Let's drop the code which deals with a non-zero mapcount of subpages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: use linear_page_index() in do_fault()
Matthew Wilcox [Tue, 9 Feb 2016 23:12:29 +0000 (10:12 +1100)]
mm: use linear_page_index() in do_fault()

do_fault() assumes that PAGE_SIZE is the same as PAGE_CACHE_SIZE.  Use
linear_page_index() to calculate pgoff in the correct units.
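A sketch of the fix in do_fault() (surrounding code elided):

	/* was: pgoff derived with PAGE_SHIFT, assuming PAGE_SIZE == PAGE_CACHE_SIZE;
	 * linear_page_index() yields the index in page-cache units, including
	 * vma->vm_pgoff */
	pgoff_t pgoff = linear_page_index(vma, address);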

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: remove unnecessary uses of lock_page_memcg()
Johannes Weiner [Tue, 9 Feb 2016 23:12:29 +0000 (10:12 +1100)]
mm: remove unnecessary uses of lock_page_memcg()

We have customer demand to use 1GB pages to map DAX files.  Unlike the 2MB
page support, the Linux MM does not currently support PUD pages, so I have
attempted to add support for the necessary pieces for DAX huge PUD pages.

Filesystems still need work to allocate 1GB pages.  With ext4, I can only
get 16MB of contiguous space, although it is aligned.  With XFS, I can get
80MB less than 1GB, and it's not aligned.  The XFS problem may be due to
the small amount of RAM in my test machine.

This patch set is against something approximately current -mm.  I'd like
to thank Dave Chinner & Kirill Shutemov for their reviews of v1.  The
conversion of pmd_fault & pud_fault to huge_fault is thanks to Dave's
poking, and Kirill spotted a couple of problems in the MM code.  Version 2
of the patch set is about 200 lines smaller (1016 insertions, 23 deletions
in v1).

I've done some light testing using a program to mmap a block device with
DAX enabled, calling mincore() and examining /proc/smaps and
/proc/pagemap.

This patch (of 8):

There are several users that nest lock_page_memcg() inside lock_page() to
prevent page->mem_cgroup from changing.  But the page lock prevents pages
from moving between cgroups, so that is unnecessary overhead.

Remove lock_page_memcg() in contexts where the page is already locked, and
fix the debug code in the page stat functions to be okay with the page lock.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm-simplify-lock_page_memcg-fix
Andrew Morton [Tue, 9 Feb 2016 23:12:28 +0000 (10:12 +1100)]
mm-simplify-lock_page_memcg-fix

In file included from include/linux/swap.h:8,
                 from include/linux/suspend.h:4,
                 from arch/x86/kernel/asm-offsets.c:12:
include/linux/memcontrol.h: In function 'lock_page_memcg':
include/linux/memcontrol.h:666: warning: 'return' with a value, in function returning void
In file included from include/linux/swap.h:8,
                 from mm/mlock.c:11:
include/linux/memcontrol.h: In function 'lock_page_memcg':
include/linux/memcontrol.h:666: warning: 'return' with a value, in function returning void

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: simplify lock_page_memcg()
Johannes Weiner [Tue, 9 Feb 2016 23:12:28 +0000 (10:12 +1100)]
mm: simplify lock_page_memcg()

Now that migration doesn't clear page->mem_cgroup of live pages anymore,
it's safe to make lock_page_memcg() and the memcg stat functions take
pages, and spare the callers from memcg objects.
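After the change, a caller looks roughly like this (the stat update in the
middle is illustrative):

	lock_page_memcg(page);
	/* page->mem_cgroup is stable here; update per-memcg page statistics */
	unlock_page_memcg(page);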

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm-migrate-do-not-touch-page-mem_cgroup-of-live-pages-fix-2
Vladimir Davydov [Tue, 9 Feb 2016 23:12:28 +0000 (10:12 +1100)]
mm-migrate-do-not-touch-page-mem_cgroup-of-live-pages-fix-2

It must be safe to move these calls outside tree_lock.

This will hopefully fix the lockdep splat:

[ 3587.997451] =================================
[ 3587.997453] [ INFO: inconsistent lock state ]
[ 3587.997456] 4.5.0-rc2-next-20160203-dbg-00007-g37a0a9d-dirty #377 Not tainted
[ 3587.997457] ---------------------------------
[ 3587.997459] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[ 3587.997462] cc1plus/22766 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 3587.997464]  (&(&mapping->tree_lock)->rlock){+.?...}, at: [<ffffffff8113aaee>] migrate_page_move_mapping+0xbd/0x33f
[ 3587.997474] {IN-SOFTIRQ-W} state was registered at:
[ 3587.997476]   [<ffffffff81082515>] __lock_acquire+0x973/0x18fc
[ 3587.997481]   [<ffffffff81083c77>] lock_acquire+0x10d/0x1a8
[ 3587.997484]   [<ffffffff813c3486>] _raw_spin_lock_irqsave+0x3d/0x51
[ 3587.997489]   [<ffffffff81100552>] test_clear_page_writeback+0x75/0x1b4
[ 3587.997493]   [<ffffffff810f53cb>] end_page_writeback+0x29/0x4a
[ 3587.997497]   [<ffffffff8117ffd5>] end_buffer_async_write+0xfb/0x176
[ 3587.997501]   [<ffffffff8117fb53>] end_bio_bh_io_sync+0x2c/0x37
[ 3587.997503]   [<ffffffff811d0a43>] bio_endio+0x53/0x5b
[ 3587.997508]   [<ffffffff811d8548>] blk_update_request+0x1fb/0x34d
[ 3587.997512]   [<ffffffffa00ed512>] scsi_end_request+0x31/0x182 [scsi_mod]
[ 3587.997522]   [<ffffffffa00eec2d>] scsi_io_completion+0x186/0x46e [scsi_mod]
[ 3587.997530]   [<ffffffffa00e7aa2>] scsi_finish_command+0xd4/0xdd [scsi_mod]
[ 3587.997537]   [<ffffffffa00ee51c>] scsi_softirq_done+0xe0/0xe7 [scsi_mod]
[ 3587.997544]   [<ffffffff811df3ef>] blk_done_softirq+0x84/0x8b
[ 3587.997548]   [<ffffffff8104625c>] __do_softirq+0x196/0x3f5
[ 3587.997552]   [<ffffffff810466aa>] irq_exit+0x40/0x94
[ 3587.997554]   [<ffffffff813c6041>] do_IRQ+0x101/0x119
[ 3587.997558]   [<ffffffff813c4689>] ret_from_intr+0x0/0x19
[ 3587.997561]   [<ffffffff812d9906>] cpuidle_enter+0x17/0x19
[ 3587.997565]   [<ffffffff8107c32a>] call_cpuidle+0x3e/0x40
[ 3587.997569]   [<ffffffff8107c65b>] cpu_startup_entry+0x242/0x35e
[ 3587.997572]   [<ffffffff813bd4fa>] rest_init+0x131/0x137
[ 3587.997575]   [<ffffffff816d4ec1>] start_kernel+0x3dd/0x3ea
[ 3587.997579]   [<ffffffff816d42f1>] x86_64_start_reservations+0x2a/0x2c
[ 3587.997582]   [<ffffffff816d445d>] x86_64_start_kernel+0x16a/0x178
[ 3587.997586] irq event stamp: 191930
[ 3587.997587] hardirqs last  enabled at (191929): [<ffffffff810fb0da>] free_hot_cold_page+0x166/0x179
[ 3587.997591] hardirqs last disabled at (191930): [<ffffffff813c34ad>] _raw_spin_lock_irq+0x13/0x47
[ 3587.997594] softirqs last  enabled at (191758): [<ffffffff810463a5>] __do_softirq+0x2df/0x3f5
[ 3587.997597] softirqs last disabled at (191741): [<ffffffff810466aa>] irq_exit+0x40/0x94
[ 3587.997600]
               other info that might help us debug this:
[ 3587.997602]  Possible unsafe locking scenario:

[ 3587.997604]        CPU0
[ 3587.997605]        ----
[ 3587.997607]   lock(&(&mapping->tree_lock)->rlock);
[ 3587.997610]   <Interrupt>
[ 3587.997611]     lock(&(&mapping->tree_lock)->rlock);
[ 3587.997614]
                *** DEADLOCK ***

[ 3587.997617] 2 locks held by cc1plus/22766:
[ 3587.997618]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff81036ee0>] __do_page_fault+0x140/0x35a
[ 3587.997626]  #1:  (&(&mapping->tree_lock)->rlock){+.?...}, at: [<ffffffff8113aaee>] migrate_page_move_mapping+0xbd/0x33f
[ 3587.997633]
               stack backtrace:
[ 3587.997636] CPU: 7 PID: 22766 Comm: cc1plus Not tainted 4.5.0-rc2-next-20160203-dbg-00007-g37a0a9d-dirty #377
[ 3587.997638]  0000000000000000 ffff88010f73f818 ffffffff811f1b02 ffffffff81f35160
[ 3587.997643]  ffff88013813e900 ffff88010f73f850 ffffffff810f3000 0000000000000006
[ 3587.997647]  ffff88013813f058 ffff88013813e900 ffffffff8107f99d 0000000000000006
[ 3587.997651] Call Trace:
[ 3587.997656]  [<ffffffff811f1b02>] dump_stack+0x67/0x90
[ 3587.997660]  [<ffffffff810f3000>] print_usage_bug.part.24+0x259/0x268
[ 3587.997663]  [<ffffffff8107f99d>] ? check_usage_forwards+0x11c/0x11c
[ 3587.997666]  [<ffffffff81080579>] mark_lock+0x381/0x567
[ 3587.997670]  [<ffffffff810807bd>] mark_held_locks+0x5e/0x74
[ 3587.997673]  [<ffffffff813c363f>] ? _raw_spin_unlock_irq+0x2c/0x4a
[ 3587.997676]  [<ffffffff8108093f>] trace_hardirqs_on_caller+0x16c/0x188
[ 3587.997679]  [<ffffffff81080968>] trace_hardirqs_on+0xd/0xf
[ 3587.997682]  [<ffffffff813c363f>] _raw_spin_unlock_irq+0x2c/0x4a
[ 3587.997686]  [<ffffffff81148bef>] unlock_page_lru+0x11f/0x12a
[ 3587.997689]  [<ffffffff8114a90f>] mem_cgroup_migrate+0x196/0x1d9
[ 3587.997692]  [<ffffffff8113abf3>] migrate_page_move_mapping+0x1c2/0x33f
[ 3587.997696]  [<ffffffff8113b669>] buffer_migrate_page+0x47/0x102
[ 3587.997699]  [<ffffffff8113b4e4>] move_to_new_page+0x56/0x194
[ 3587.997702]  [<ffffffff8113bb6b>] migrate_pages+0x447/0x978
[ 3587.997705]  [<ffffffff811171f0>] ? isolate_freepages_block+0x353/0x353
[ 3587.997708]  [<ffffffff81115cd6>] ? pageblock_pfn_to_page+0xbf/0xbf
[ 3587.997711]  [<ffffffff81118834>] compact_zone+0x690/0x92e
[ 3587.997714]  [<ffffffff81118b40>] compact_zone_order+0x6e/0x8a
[ 3587.997717]  [<ffffffff81118dc3>] try_to_compact_pages+0x151/0x28f
[ 3587.997720]  [<ffffffff81118dc3>] ? try_to_compact_pages+0x151/0x28f
[ 3587.997723]  [<ffffffff8108111a>] ? __lock_is_held+0x3c/0x57
[ 3587.997726]  [<ffffffff810fc471>] __alloc_pages_direct_compact+0x3e/0xeb
[ 3587.997729]  [<ffffffff810fc965>] __alloc_pages_nodemask+0x447/0xb8b
[ 3587.997732]  [<ffffffff81120a18>] ? handle_mm_fault+0x8b4/0x16bf
[ 3587.997737]  [<ffffffff8106409f>] ? __might_sleep+0x75/0x7c
[ 3587.997740]  [<ffffffff81140704>] do_huge_pmd_anonymous_page+0x1d1/0x3fc
[ 3587.997744]  [<ffffffff811202ae>] ? handle_mm_fault+0x14a/0x16bf
[ 3587.997747]  [<ffffffff81120623>] handle_mm_fault+0x4bf/0x16bf
[ 3587.997750]  [<ffffffff8108111a>] ? __lock_is_held+0x3c/0x57
[ 3587.997754]  [<ffffffff81036f7f>] __do_page_fault+0x1df/0x35a
[ 3587.997757]  [<ffffffff8103712c>] do_page_fault+0xc/0xe
[ 3587.997760]  [<ffffffff813c5892>] page_fault+0x22/0x30

Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm-migrate-do-not-touch-page-mem_cgroup-of-live-pages-fix
Andrew Morton [Tue, 9 Feb 2016 23:12:27 +0000 (10:12 +1100)]
mm-migrate-do-not-touch-page-mem_cgroup-of-live-pages-fix

hrm

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: migrate: do not touch page->mem_cgroup of live pages
Johannes Weiner [Tue, 9 Feb 2016 23:12:27 +0000 (10:12 +1100)]
mm: migrate: do not touch page->mem_cgroup of live pages

Changing a page's memcg association complicates dealing with the page, so
we want to limit this as much as possible.  Page migration e.g.  does not
have to do that.  Just like page cache replacement, it can forcibly charge
a replacement page, and then uncharge the old page when it gets freed.
Temporarily overcharging the cgroup by a single page is not an issue in
practice, and charging is so cheap nowadays that this is much preferable to
the headache of messing with live pages.

The only place that still changes the page->mem_cgroup binding of live
pages is when pages move along with a task to another cgroup.  But that
path isolates the page from the LRU, takes the page lock, and the move
lock (lock_page_memcg()).  That means page->mem_cgroup is always stable in
callers that have the page isolated from the LRU or locked.  Lighter
unlocked paths, like writeback accounting, can use lock_page_memcg().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm-workingset-per-cgroup-cache-thrash-detection-fix
Sergey Senozhatsky [Tue, 9 Feb 2016 23:12:26 +0000 (10:12 +1100)]
mm-workingset-per-cgroup-cache-thrash-detection-fix

Do not return from workingset_activation() with locked rcu and page.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: workingset: per-cgroup cache thrash detection
Johannes Weiner [Tue, 9 Feb 2016 23:12:26 +0000 (10:12 +1100)]
mm: workingset: per-cgroup cache thrash detection

Cache thrash detection (see a528910e12ec "mm: thrash detection-based file
cache sizing" for details) currently only works on the system level, not
inside cgroups.  Worse, as the refaults are compared to the global number
of active cache, cgroups might wrongfully get all their refaults activated
when their pages are hotter than those of others.

Move the refault machinery from the zone to the lruvec, and then tag
eviction entries with the memcg ID.  This makes the thrash detection work
correctly inside cgroups.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: workingset: eviction buckets for bigmem/lowbit machines
Johannes Weiner [Tue, 9 Feb 2016 23:12:26 +0000 (10:12 +1100)]
mm: workingset: eviction buckets for bigmem/lowbit machines

For per-cgroup thrash detection, we need to store the memcg ID inside the
radix tree cookie as well.  However, on 32 bit that doesn't leave enough
bits for the eviction timestamp to cover the necessary range of recently
evicted pages.  The radix tree entry would look like this:

[ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ]

12 bits means 4096 pages, i.e.  16M worth of recently evicted pages.  But
refaults are actionable up to distances covering half of memory.  To not
miss refaults, we have to stretch out the range at the cost of how
precisely we can tell when a page was evicted.  This way we can shave off
lower bits from the eviction timestamp until the necessary range is
covered.  E.g.  grouping evictions into 1M buckets (256 pages) will
stretch the longest representable refault distance to 4G.

This patch implements eviction buckets that are automatically sized
according to the available bits and the necessary refault range, in
preparation for per-cgroup thrash detection.

The maximum actionable distance is currently half of memory, but to
support memory hotplug of up to 200% of boot-time memory, we size the
buckets to cover double the distance.  Beyond that, thrashing won't be
detectable anymore.

During boot, the kernel will print out the exact parameters, like so:

[    0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6

In this example, there are 12 radix entry bits available for the eviction
timestamp, to cover a maximum distance of 2^18 pages (this is a 1G
machine).  Consequently, evictions must be grouped into buckets of 2^6
pages, or 256K.
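A sketch of how the bucket size can be derived at boot, matching the printout
above (identifier names such as EVICTION_SHIFT and bucket_order are
assumptions based on the description, not the exact patch):

static unsigned int bucket_order __read_mostly;

static int __init workingset_init(void)
{
	unsigned int timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
	unsigned int max_order = fls_long(totalram_pages - 1);

	/* shave low eviction bits until the timestamp can cover all of RAM */
	if (max_order > timestamp_bits)
		bucket_order = max_order - timestamp_bits;

	pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
		timestamp_bits, max_order, bucket_order);
	return 0;
}
module_init(workingset_init);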

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: workingset: separate shadow unpacking and refault calculation
Johannes Weiner [Tue, 9 Feb 2016 23:12:25 +0000 (10:12 +1100)]
mm: workingset: separate shadow unpacking and refault calculation

Per-cgroup thrash detection will need to derive a live memcg from the
eviction cookie, and doing that inside unpack_shadow() will get nasty with
the reference handling spread over two functions.

In preparation, make unpack_shadow() clearly about extracting static data,
and let workingset_refault() do all the higher-level handling.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: workingset: #define radix entry eviction mask
Johannes Weiner [Tue, 9 Feb 2016 23:12:25 +0000 (10:12 +1100)]
mm: workingset: #define radix entry eviction mask

This is a compile-time constant, no need to calculate it on refault.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: memcontrol: generalize locking for the page->mem_cgroup binding
Johannes Weiner [Tue, 9 Feb 2016 23:12:24 +0000 (10:12 +1100)]
mm: memcontrol: generalize locking for the page->mem_cgroup binding

These patches tag the page cache radix tree eviction entries with the
memcg an evicted page belonged to, thus making per-cgroup LRU reclaim work
properly and be as adaptive to new cache workingsets as global reclaim
already is.

This should have been part of the original thrash detection patch series,
but was deferred due to the complexity of those patches.

This patch (of 5):

So far the only sites that needed to exclude charge migration to stabilize
page->mem_cgroup have been per-cgroup page statistics, hence the name
mem_cgroup_begin_page_stat().  But per-cgroup thrash detection will add
another site that needs to ensure page->mem_cgroup lifetime.

Rename these locking functions to the more generic lock_page_memcg() and
unlock_page_memcg().  Since charge migration is a cgroup1 feature only, we
might be able to delete it at some point, and these now easy-to-identify
locking sites along with it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm, vmscan: make zone_reclaimable_pages more precise
Michal Hocko [Tue, 9 Feb 2016 23:12:24 +0000 (10:12 +1100)]
mm, vmscan: make zone_reclaimable_pages more precise

zone_reclaimable_pages() is used by should_reclaim_retry() to calculate the
target for the watermark check.  This means that precise
numbers are important for the correct decision.  zone_reclaimable_pages
uses zone_page_state which can contain stale data with per-cpu diffs not
synced yet (the last vmstat_update might have run 1s in the past).

Use zone_page_state_snapshot() in zone_reclaimable_pages() instead.  None
of the current callers is in a hot path where getting the precise value
(which involves per-cpu iteration) would cause an unreasonable overhead.
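The change boils down to switching the stat accessor (sketch; the exact list
of counters is abridged):

static unsigned long zone_reclaimable_pages(struct zone *zone)
{
	unsigned long nr;

	/* fold in pending per-cpu deltas instead of reading stale totals */
	nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
	     zone_page_state_snapshot(zone, NR_INACTIVE_FILE);

	if (get_nr_swap_pages() > 0)
		nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
		      zone_page_state_snapshot(zone, NR_INACTIVE_ANON);

	return nr;
}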

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm-madvise-update-comment-on-sys_madvise-fix
Andrew Morton [Tue, 9 Feb 2016 23:12:24 +0000 (10:12 +1100)]
mm-madvise-update-comment-on-sys_madvise-fix

modifications suggested by Michal

Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm/madvise: update comment on sys_madvise()
Naoya Horiguchi [Tue, 9 Feb 2016 23:12:23 +0000 (10:12 +1100)]
mm/madvise: update comment on sys_madvise()

Some new MADV_* advices are not documented in the sys_madvise() comment, so
let's update it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm: vmscan: do not clear SHRINKER_NUMA_AWARE if nr_node_ids == 1
Vladimir Davydov [Tue, 9 Feb 2016 23:12:23 +0000 (10:12 +1100)]
mm: vmscan: do not clear SHRINKER_NUMA_AWARE if nr_node_ids == 1

Currently, on shrinker registration we clear SHRINKER_NUMA_AWARE if there is
only one NUMA node present.  The comment states that this will allow us to
save some small loop time later.  It used to be true when this code was added
(see commit 1d3d4437eae1b ("vmscan: per-node deferred work")), but since
commit 6b4f7799c6a57 ("mm: vmscan: invoke slab shrinkers from shrink_zone()")
it doesn't make any difference.  Anyway, running on a non-NUMA machine
shouldn't make a shrinker NUMA unaware, so zap this hunk.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agoxen_balloon: support memory auto onlining policy
Vitaly Kuznetsov [Tue, 9 Feb 2016 23:12:22 +0000 (10:12 +1100)]
xen_balloon: support memory auto onlining policy

Add support for the newly added kernel memory auto-onlining policy to the Xen
balloon driver.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Suggested-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Acked-by: David Vrabel <david.vrabel@citrix.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomemory-hotplug: add automatic onlining policy for the newly added memory
Vitaly Kuznetsov [Tue, 9 Feb 2016 23:12:22 +0000 (10:12 +1100)]
memory-hotplug: add automatic onlining policy for the newly added memory

Currently, all newly added memory blocks remain in the 'offline' state unless
someone onlines them; some Linux distributions carry special udev rules like:

SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

to make this happen automatically.  This is not a great solution for virtual
machines where memory hotplug is being used to address high memory pressure
situations, as such onlining is slow and the userspace process doing it
(udev) has a chance of being killed by the OOM killer because it will
probably need to allocate some memory.

Introduce a default policy for the newly added memory blocks in the
/sys/devices/system/memory/auto_online_blocks file, with two possible values:
"offline", which preserves the current behavior, and "online", which causes
all newly added memory blocks to go online as soon as they're added.  The
default is "offline".
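As a usage sketch, the policy can be switched at runtime by writing one of
the two values to that file, e.g. echo online >
/sys/devices/system/memory/auto_online_blocks.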

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm/memory.c: make apply_to_page_range() more robust
Mika Penttilä [Tue, 9 Feb 2016 23:12:22 +0000 (10:12 +1100)]
mm/memory.c: make apply_to_page_range() more robust

Arm and arm64 used to trigger this BUG_ON() - this has now been fixed.

But a WARN_ON() here is sufficient to catch future buggy callers.

Signed-off-by: Mika Penttilä <mika.penttila@nextfour.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm/mempolicy.c: skip VM_HUGETLB and VM_MIXEDMAP VMA for lazy mbind
Liang Chen [Tue, 9 Feb 2016 23:12:21 +0000 (10:12 +1100)]
mm/mempolicy.c: skip VM_HUGETLB and VM_MIXEDMAP VMA for lazy mbind

VM_HUGETLB and VM_MIXEDMAP vmas need to be excluded to avoid compound pages
being marked for migration and unexpected COWs when handling a hugetlb
fault.

Thanks to Naoya Horiguchi for reminding me on these checks.

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm/memory-failure.c: remove useless "undef"s
Wang Xiaoqiang [Tue, 9 Feb 2016 23:12:21 +0000 (10:12 +1100)]
mm/memory-failure.c: remove useless "undef"s

Remove the useless #undef, since the corresponding #define has already
been removed.

Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm/madvise: pass return code of memory_failure() to userspace
Naoya Horiguchi [Tue, 9 Feb 2016 23:12:20 +0000 (10:12 +1100)]
mm/madvise: pass return code of memory_failure() to userspace

Currently the return value of memory_failure() is not passed to userspace
when madvise(MADV_HWPOISON) is used.  This is inconvenient for test
programs that want to know the result of error handling.  So let's return
it to the caller as we already do in the MADV_SOFT_OFFLINE case.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 years agomm, debug: move bad flags printing to bad_page()
Vlastimil Babka [Tue, 9 Feb 2016 23:12:20 +0000 (10:12 +1100)]
mm, debug: move bad flags printing to bad_page()

Since bad_page() is the only user of the badflags parameter of
dump_page_badflags(), we can move the code to bad_page() and simplify a
bit.

The dump_page_badflags() function is renamed to __dump_page() and can
still be called separately from dump_page() for temporary debug prints
where page_owner info is not desired.

The only user-visible change is that page->mem_cgroup is printed before
the bad flags.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>