Andi Kleen [Thu, 3 May 2012 05:44:09 +0000 (15:44 +1000)]
futex: use lockdep_assert_held() for lock checking
Use lockdep_assert_held() for lock checking instead of a strange homegrown
variant. This removes the return for this case, but that is unlikely to
be useful anyway.
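As an illustration of the pattern (a minimal sketch with hypothetical names, not the actual futex.c hunk), the change swaps an open-coded check-and-return for the lockdep annotation:

	#include <linux/lockdep.h>
	#include <linux/spinlock.h>

	struct some_queue {			/* hypothetical */
		spinlock_t lock;
	};

	static void queue_op(struct some_queue *q)
	{
		/*
		 * Before: a homegrown check that silently bailed out:
		 *	if (!spin_is_locked(&q->lock))
		 *		return;
		 * After: assert that the caller holds the lock; this compiles
		 * to nothing unless CONFIG_LOCKDEP is enabled.
		 */
		lockdep_assert_held(&q->lock);

		/* ... work that requires q->lock ... */
	}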
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Darren Hart <dvhart@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kautuk Consul [Thu, 3 May 2012 05:44:07 +0000 (15:44 +1000)]
um/kernel/trap.c: port OOM changes to handle_page_fault()
Commits d065bd810b6deb6 ("mm: retry page fault when blocking on disk
transfer") and 37b23e0525 ("x86,mm: make pagefault killable") introduced
changes into the x86 page fault handler to make it retryable as well as
killable.
These changes reduce the mmap_sem hold time, which is crucial
during OOM killer invocation.
Port these changes to um.
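A sketch of the retry/killable pattern those commits introduced on x86, simplified and with placeholder names (not the exact um handle_page_fault() code):

	#include <linux/mm.h>
	#include <linux/sched.h>

	static int fault_one_page(struct mm_struct *mm, struct vm_area_struct *vma,
				  unsigned long address, int is_write)
	{
		unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
				     (is_write ? FAULT_FLAG_WRITE : 0);
		int fault;

	retry:
		fault = handle_mm_fault(mm, vma, address, flags);

		if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
			return 0;	/* killed while blocked on I/O; mmap_sem is gone */

		if (fault & VM_FAULT_RETRY) {
			/*
			 * handle_mm_fault() released mmap_sem before sleeping;
			 * retake it and retry exactly once without ALLOW_RETRY.
			 * (The real handler also re-validates the vma here.)
			 */
			flags &= ~FAULT_FLAG_ALLOW_RETRY;
			down_read(&mm->mmap_sem);
			goto retry;
		}
		return fault;
	}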
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Thu, 3 May 2012 05:44:07 +0000 (15:44 +1000)]
frv: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()"), the direct modification of current->blocked is
incorrect because we need to check whether the signal we're about to block
is pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
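For reference, the helper added in that commit does roughly the following (paraphrased sketch, not a verbatim copy):

	#include <linux/signal.h>
	#include <linux/sched.h>

	void block_sigmask(struct k_sigaction *ka, int signr)
	{
		sigset_t blocked;

		/*
		 * Block the handler's sa_mask plus, unless SA_NODEFER is set,
		 * the signal being delivered, going through the helper that
		 * also retargets shared pending signals.
		 */
		sigorsets(&blocked, &current->blocked, &ka->sa.sa_mask);
		if (!(ka->sa.sa_flags & SA_NODEFER))
			sigaddset(&blocked, signr);
		set_current_blocked(&blocked);
	}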
Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This allocation may be large. The code is probing to see if it will
succeed and if not, it falls back to vmalloc(). We should suppress any
page-allocation failure messages when the fallback happens.
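The shape of the fix, as a minimal sketch (hypothetical helper name, not the actual hunk):

	#include <linux/slab.h>
	#include <linux/vmalloc.h>

	static void *alloc_payload(size_t size)		/* hypothetical */
	{
		void *buf;

		/*
		 * Probe the page allocator quietly; failure is expected for
		 * large sizes and is handled by the vmalloc() fallback, so
		 * don't spew an allocation-failure warning.
		 */
		buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
		if (!buf)
			buf = vmalloc(size);
		return buf;		/* caller checks for NULL */
	}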
Reported-by: Dave Jones <davej@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/vmscan: replace zone_nr_lru_pages() with get_lruvec_size()
If memory cgroup is enabled we always use lruvecs which are embedded into
struct mem_cgroup_per_zone, so we can reach lru_size counters via
container_of().
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/vmscan: push zone pointer into shrink_page_list()
It doesn't need a pointer to the cgroup - a pointer to the zone is enough.
This patch also kills the "mz" argument of page_check_references() - it is
unused after "mm: memcg: count pte references from every member of the
reclaimed hierarchy".
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/vmscan: store "priority" in struct scan_control
In memory reclaim some functions have too many arguments - "priority" is
one of them. It can be stored in struct scan_control - we construct these
on the same level. Instead of an open-coded loop we set the initial
sc.priority, and do_try_to_free_pages() decreases it down to zero.
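A simplified sketch of the resulting loop shape in do_try_to_free_pages() (illustrative, not the exact vmscan.c code):

	struct scan_control sc = {
		.gfp_mask	= gfp_mask,
		.nr_to_reclaim	= SWAP_CLUSTER_MAX,
		.priority	= DEF_PRIORITY,	/* replaces the "priority" argument */
	};

	do {
		shrink_zones(zonelist, &sc);
		if (sc.nr_reclaimed >= sc.nr_to_reclaim)
			break;
	} while (--sc.priority >= 0);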
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ying Han [Thu, 3 May 2012 05:44:01 +0000 (15:44 +1000)]
memcg: add mlock statistic in memory.stat
We have the nr_mlock stat in both meminfo and vmstat system-wide; this
patch adds an mlock field to the per-memcg memory stats. The stat itself
enhances the metrics exported by memcg, since the unevictable lru includes
more than mlock()'d pages (SHM_LOCK'd pages, for example).
Why do we need to count mlock'd pages when they are unevictable and we
cannot do much with them anyway?
This is true. The mlock stat I am proposing is more helpful for system
admins and kernel developers to understand the system workload. The same
information should be helpful to add to the OOM log as well. Many times in
the past we have needed to read the mlock stat from the per-container
meminfo for different reasons. After all, we do have the ability to read
mlock from meminfo, and this patch fills in that info for memcg.
Note:
Here are the places where I didn't add the hook:
1. in mlock_migrate_page(), since the owner of the oldpage and newpage
is the same.
2. in the freeing path, since the page shouldn't get there in the first
place.
1. Run memtoy in B and mlock 512m file pages:
memtoy>file /export/hda3/file_512m private
memtoy>map file_512m 0 512m
memtoy>lock file_512m
memtoy: mlock of file_512m [131072 pages] took 5.296secs.
2. Create a 20g memcg and run a single-thread page fault test (pft) with
10g of mlocked memory; it measures faults/cpu/second:
x before.txt
+ after.txt
N Min Max Median Avg Stddev
x 10 346345.92 349113.01 347470.52 347651.93 819.71411
+ 10 345934.67 348973.58 347677.9 347495.33 833.58657
No difference proven at 95.0% confidence
Signed-off-by: Ying Han <yinghan@google.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sha Zhengju [Thu, 3 May 2012 05:43:59 +0000 (15:43 +1000)]
memcg: make threshold index in the right position
The index current_threshold may point to a threshold that is exactly equal
to the usage after the last call of __mem_cgroup_threshold(). But after
registering a new event, it will change (pointing to the threshold just
below the usage). So make it consistent here.
mm: remove lru type checks from __isolate_lru_page()
After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can
completely remove anon/file and active/inactive lru type filters from
__isolate_lru_page(), because isolation for 0-order reclaim always
isolates pages from right lru list. And pages-isolation for lumpy
shrink_inactive_list() or memory-compaction anyway allowed to isolate
pages from all evictable lru lists.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
GCC sometimes ignores "inline" directives even for small and simple
functions. This is supposed to be fixed in gcc 4.7, but it was released
only yesterday.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Thu, 3 May 2012 05:43:56 +0000 (15:43 +1000)]
mm/memcg: move reclaim_stat into lruvec
With mem_cgroup_disabled() now explicit, it becomes clear that the
zone_reclaim_stat structure actually belongs in lruvec, per-zone when
memcg is disabled but per-memcg per-zone when it's enabled.
We can delete mem_cgroup_get_reclaim_stat(), and change
update_page_reclaim_stat() to update just the one set of stats, the one
which get_scan_count() will actually use.
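After the move, struct lruvec looks roughly like this (illustrative):

	struct lruvec {
		struct list_head		lists[NR_LRU_LISTS];
		struct zone_reclaim_stat	reclaim_stat;	/* moved from zone / mem_cgroup_per_zone */
	};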
Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Thu, 3 May 2012 05:43:55 +0000 (15:43 +1000)]
mm/memcg: scanning_global_lru means mem_cgroup_disabled
Although one has to admire the skill with which it has been concealed,
scanning_global_lru(mz) is actually just an interesting way to test
mem_cgroup_disabled(). Too many developer hours have been wasted on
confusing it with global_reclaim(): just use mem_cgroup_disabled().
Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Glauber Costa <glommer@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
memcg: fix/change behavior of shared anon at moving task
This patch changes memcg's behavior at task_move().
At task_move(), the kernel scans a task's page table and move the changes
for mapped pages from source cgroup to target cgroup. There has been a
bug at handling shared anonymous pages for a long time.
Before patch:
- The spec says 'shared anonymous pages are not moved.'
- The implementation was 'shared anonymous pages may be moved'.
If page_mapcount() <= 2, a shared anonymous page's charge was moved.
After patch:
- The spec says 'all anonymous pages are moved'.
- The implementation is 'all anonymous pages are moved'.
Considering the usage of memcg, this will not affect the user's experience.
'Shared anonymous' pages only exist within a tree of processes which don't
do exec(). Moving one process of such a tree without exec() does not seem
sane. For example, libcgroup will not be affected by this change. (Anyway,
no one noticed the implementation's behaviour for a long time...)
Below is a discussion log:
- current spec/implementation are complex
- Now, shared file caches are moved
- It adds an unclear check on page_mapcount(). To do a correct check,
we should check swap users, etc.
- No one noticed this implementation behaviour, so no one benefits
from the design.
- In general, once a task is moved to a cgroup for running, it will not
be moved....
- Finally, we have control knob as memory.move_charge_at_immigrate.
Here is a patch to allow moving shared pages completely. This makes
memcg simpler and fixes the currently broken code.
Suggested-by: Hugh Dickins <hughd@google.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
is_vma_temporary_stack() is referenced from huge_mm.h but not declared
there, so users of huge_mm.h can hit compile errors:
arch/x86/mm/tlb.c: In function `flush_tlb_range':
arch/x86/mm/tlb.c:324:4: error: implicit declaration of function `is_vma_temporary_stack' [-Werror=implicit-function-declaration]
Since is_vma_temporary_stack() is only used in rmap.c and huge_memory.c, it
is better to move its declaration from rmap.h to huge_mm.h to avoid such
errors.
Signed-off-by: Alex Shi <alex.shi@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ulrich Drepper [Thu, 3 May 2012 05:43:54 +0000 (15:43 +1000)]
tools/vm/page-types.c: cleanups
Compiling page-types.c with a recent compiler produces many warnings,
mostly related to signed/unsigned comparisons. This patch cleans up most
of them.
One remaining warning is about an unused parameter. The <compiler.h> file
doesn't define a __unused macro (or the like) yet. This can be addressed
later.
Ulrich Drepper [Thu, 3 May 2012 05:43:53 +0000 (15:43 +1000)]
kbuild: install kernel-page-flags.h
Programs using /proc/kpageflags need to know about the various flags. The
<linux/kernel-page-flags.h> header provides them, and the comments in the
file indicate that it is supposed to be used by user-level code. But the
file is not installed.
Install the header and mark the unstable flags as out-of-bounds. The
page-types tool is also adjusted to not duplicate the definitions.
bug: completely remove code generated by disabled VM_BUG_ON()
Even if CONFIG_DEBUG_VM=n, gcc generates code for some VM_BUG_ON()
invocations. For example, VM_BUG_ON(!PageCompound(page) || !PageHead(page));
in do_huge_pmd_wp_page() generates 114 bytes of code.
But it mostly disappears when I split this VM_BUG_ON() into two:
-VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+VM_BUG_ON(!PageCompound(page));
+VM_BUG_ON(!PageHead(page));
Weird... but anyway, after this patch the code disappears completely.
Sometimes we want to check some expressions correctness at compile time.
"(void)(e);" or "if (e);" can be dangerous if the expression has
side-effects, and gcc sometimes generates a lot of code, even if the
expression has no effect.
This patch introduces the macro BUILD_BUG_ON_INVALID() for such checks; it
forces a compilation error if the expression is invalid, without emitting
any extra code.
[Cast to "long" required because sizeof does not work for bit-fields.]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Thu, 3 May 2012 05:43:50 +0000 (15:43 +1000)]
mm, thp: drop page_table_lock to uncharge memcg pages
mm->page_table_lock is hotly contested for page fault tests and isn't
necessary to do mem_cgroup_uncharge_page() in do_huge_pmd_wp_page().
Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 3 May 2012 05:43:49 +0000 (15:43 +1000)]
mm: memcg: count pte references from every member of the reclaimed hierarchy
The rmap walker checking page table references has historically
ignored references from VMAs that were not part of the memcg that was
being reclaimed during memcg hard limit reclaim.
When transitioning global reclaim to memcg hierarchy reclaim, I missed
that bit and now references from outside a memcg are ignored even
during global reclaim.
Reverting back to traditional behaviour - count all references during
global reclaim and only mind references of the memcg being reclaimed
during limit reclaim would be one option.
However, the more generic idea is to ignore references exactly then
when they are outside the hierarchy that is currently under reclaim;
because only then will their reclamation be of any use to help the
pressure situation. It makes no sense to ignore references from a
sibling memcg and then evict a page that will be immediately refaulted
by that sibling which contributes to the same usage of the common
ancestor under reclaim.
The solution: make the rmap walker ignore references from VMAs that
are not part of the hierarchy that is being reclaimed.
Flat limit reclaim will stay the same, hierarchical limit reclaim will
mind the references only to pages that the hierarchy owns. Global
reclaim, since it reclaims from all memcgs, will be fixed to regard
all references.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Konstantin Khlebnikov<khlebnikov@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 3 May 2012 05:43:48 +0000 (15:43 +1000)]
kernel: cgroup: push rcu read locking from css_is_ancestor() to callsite
Library functions should not grab locks when the callsites can do it, even
if the lock nests like the rcu read-side lock does.
Push the rcu_read_lock() from css_is_ancestor() to its single user,
mem_cgroup_same_or_subtree() in preparation for another user that may
already hold the rcu read-side lock.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Larry Woodman [Thu, 3 May 2012 05:43:47 +0000 (15:43 +1000)]
mm: do_migrate_pages() calls migrate_to_node() even if task is already on a correct node
While running an application that moves tasks from one cpuset to another I
noticed that it takes much longer and moves many more pages than expected.
The reason for this is do_migrate_pages() does its best to preserve the
relative node differential from the first node of the cpuset because the
application may have been written with that in mind. If memory was
interleaved on the nodes of the source cpuset by an application
do_migrate_pages() will try its best to maintain that interleaving on the
nodes of the destination cpuset. This means copying the memory from all
source nodes to the destination nodes even if the source and destination
nodes overlap.
This is a problem for userspace NUMA placement tools. The amount of time
spent doing extra memory moves cancels out some of the NUMA performance
improvements. Furthermore, if the numbers of source and destination nodes
differ, there is no way to maintain the previous interleaving layout
anyway.
This patch changes do_migrate_pages() to only preserve the relative layout
inside the program if the number of NUMA nodes in the source and
destination mask are the same. If the number is different, we do a much
more efficient migration by not touching memory that is in an allowed
node.
This preserves the old behaviour for programs that want it, while allowing
a userspace NUMA placement tool to use the new, faster migration. This
improves performance in our tests by up to a factor of 7.
Without this change, migrating tasks from a cpuset containing nodes 0-7 to
a cpuset containing nodes 3-4, we migrate from ALL the nodes, even if they
are in both the source and destination nodesets:
Migrating 7 to 4
Migrating 6 to 3
Migrating 5 to 4
Migrating 4 to 3
Migrating 1 to 4
Migrating 3 to 4
Migrating 0 to 3
Migrating 2 to 3
With this change we only migrate from nodes that are not in the
destination nodesets:
Migrating 7 to 4
Migrating 6 to 3
Migrating 5 to 4
Migrating 2 to 3
Migrating 1 to 4
Migrating 0 to 3
Yet if we move from a cpuset containing nodes 2,3,4 to a cpuset
containing 3,4,5 we still
do move everything so that we preserve the desired NUMA offsets:
Migrating 4 to 5
Migrating 3 to 4
Migrating 2 to 3
As far as performance is concerned this simple patch improves the time it
takes to move 14, 20 and 26 large tasks from a cpuset containing nodes 0-7
to a cpuset containing nodes 1 & 3 by up to a factor of 7. Here are the
timings with and without the patch:
BEFORE PATCH -- Move times: 59, 140, 651 seconds
============
Moving 14 tasks from nodes (0-7) to nodes (1,3)
numad(8780) do_migrate_pages (mm=0xffff88081d414400
from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x7 dest=0x3
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x6 dest=0x1
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x5 dest=0x3
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x4 dest=0x1
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x2 dest=0x1
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x1 dest=0x3
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x0 dest=0x1
flags=0x4)
(Above moves repeated for each of the 14 tasks...)
PID 8890 moved to node(s) 1,3 in 59.2 seconds
Moving 20 tasks from nodes (0-7) to nodes (1,4-5)
numad(8780) do_migrate_pages (mm=0xffff88081d88c700
from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x7 dest=0x4
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x6 dest=0x1
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x3 dest=0x1
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x2 dest=0x5
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x1 dest=0x4
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x0 dest=0x1
flags=0x4)
(Above moves repeated for each of the 20 tasks...)
PID 8962 moved to node(s) 1,4-5 in 139.88 seconds
Moving 26 tasks from nodes (0-7) to nodes (1-3,5)
numad(8780) do_migrate_pages (mm=0xffff88081d5bc740
from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x7 dest=0x5
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x6 dest=0x3
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x5 dest=0x2
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x3 dest=0x5
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x2 dest=0x3
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x1 dest=0x2
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x0 dest=0x1
flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x4 dest=0x1
flags=0x4)
(Above moves repeated for each of the 26 tasks...)
PID 9058 moved to node(s) 1-3,5 in 651.45 seconds
AFTER PATCH -- Move times: 42, 56, 93 seconds
===========
Moving 14 tasks from nodes (0-7) to nodes (5,7)
numad(33209) do_migrate_pages (mm=0xffff88101d5ff140
from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x6 dest=0x5
flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x4 dest=0x5
flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x3 dest=0x7
flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x2 dest=0x5
flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x1 dest=0x7
flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x0 dest=0x5
flags=0x4)
(Above moves repeated for each of the 14 tasks...)
PID 33221 moved to node(s) 5,7 in 41.67 seconds
Moving 20 tasks from nodes (0-7) to nodes (1,3,5)
numad(33209) do_migrate_pages (mm=0xffff88101d6c37c0
from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x7 dest=0x3
flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x6 dest=0x1
flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x4 dest=0x3
flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x2 dest=0x5
flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x0 dest=0x1
flags=0x4)
(Above moves repeated for each of the 20 tasks...)
PID 33289 moved to node(s) 1,3,5 in 56.3 seconds
Moving 26 tasks from nodes (0-7) to nodes (1,3,5,7)
numad(33209) do_migrate_pages (mm=0xffff88101d924400
from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x6 dest=0x5
flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x4 dest=0x1
flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x2 dest=0x5
flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x0 dest=0x1
flags=0x4)
(Above moves repeated for each of the 26 tasks...)
PID 33372 moved to node(s) 1,3,5,7 in 92.67 seconds
Signed-off-by: Larry Woodman <lwoodman@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <mkosaki@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Thu, 3 May 2012 05:43:47 +0000 (15:43 +1000)]
thp, memcg: split hugepage for memcg oom on cow
On COW, a new hugepage is allocated and charged to the memcg. If the
system is oom or the charge to the memcg fails, however, the fault handler
will return VM_FAULT_OOM which results in an oom kill.
Instead, it's possible to fall back to splitting the hugepage so that the
COW results only in an order-0 page being allocated and charged to the
memcg, which has a higher likelihood of succeeding. This is expensive
because the hugepage must be split in the page fault handler, but it is
much better than unnecessarily oom killing a process.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: correctly synchronize rss-counters at exit/exec
mm->rss_stat counters have per-task delta: task->rss_stat. Before
changing task->mm pointer the kernel must flush this delta with
sync_mm_rss().
do_exit() already calls sync_mm_rss() to flush the rss-counters before
committing the rss statistics into task->signal->maxrss, taskstats, audit
and other stuff. Unfortunately the kernel does this before calling
mm_release(), which can call put_user() for processing
task->clear_child_tid. So at this point we can trigger page-faults and
task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes
inconsistent and check_mm() will print something like this:
This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
out of do_exit() and calls it earlier. After mm_release() there should be
no pagefaults.
[akpm@linux-foundation.org: tweak comment] Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sasikantha babu [Thu, 3 May 2012 05:43:46 +0000 (15:43 +1000)]
mm/vmstat.c: remove debugfs entries on failure of file creation and make extfrag_debug_root dentry local
Remove the debugfs files and directory on failure. Since no one uses the
"extfrag_debug_root" dentry outside of extfrag_debug_init(), make it local
to the function.
Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/fork: fix overflow in vma length when copying mmap on clone
The vma length in dup_mmap() is calculated and stored in an unsigned int,
which is insufficient and hence overflows for very large maps (beyond
16TB). The following program demonstrates this:
The if(mm) check is not required in find_vma, as the kernel code calls
find_vma only when it is absolutely sure that the mm_struct arg to it is
non-NULL.
Remove the if(mm) check and add a WARN_ONCE(!mm) for now. This will serve
the purpose of mandating that the execution context (user-mode/kernel-mode)
be known before find_vma is called. Also fix 2 checkpatch.pl errors in the
declaration of the rb_node and vma_tmp local variables.
I was browsing through the internet and read a discussion at
https://lkml.org/lkml/2012/3/27/342 which discusses removal of the
validation check within find_vma. Since no-one responded, I decided to
send this patch with Andrew's suggestions.
Thomas Meyer [Thu, 3 May 2012 05:43:44 +0000 (15:43 +1000)]
mm: use kcalloc() instead of kzalloc() to allocate array
The advantage of kcalloc() is that it prevents integer overflows that could
result from the multiplication of the number of elements and the element
size, and it is also a bit nicer to read.
The semantic patch that makes this change is available at
https://lkml.org/lkml/2011/11/25/107
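The transformation itself, in sketch form (nr_items and ptr are placeholder names):

	/* before: the caller has to get the overflow-prone multiplication right */
	ptr = kzalloc(nr_items * sizeof(*ptr), GFP_KERNEL);

	/* after: kcalloc() does the same zeroed allocation but checks
	 * nr_items * sizeof(*ptr) for overflow internally */
	ptr = kcalloc(nr_items, sizeof(*ptr), GFP_KERNEL);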
Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryota Ozaki [Thu, 3 May 2012 05:43:44 +0000 (15:43 +1000)]
mm: fix off-by-one bug in print_nodes_state()
/sys/devices/system/node/{online,possible} outputs a garbage byte because
print_nodes_state() returns content size + 1. To fix the bug, the patch
changes the use of cpuset_sprintf_cpulist to follow the use at other
places, which is clearer and safer.
hugetlb: migrate memcg info from oldpage to new page during migration
With HugeTLB pages, memcg is uncharged in the compound page destructor.
Since we are holding a hugepage reference, we can be sure that the old page
won't get uncharged till the last put_page(). On successful migration, we
can move the memcg information to the new page's page_cgroup and mark the
old page's page_cgroup unused.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
include/linux/memcontrol.h:504: warning: 'struct cgroup' declared inside parameter list
include/linux/memcontrol.h:504: warning: its scope is only this definition or declaration, which is probably not what you want
include/linux/memcontrol.h:509: warning: 'struct cgroup' declared inside parameter list
Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlbfs: add a list for tracking in-use HugeTLB pages
hugepage_activelist will be used to track currently used HugeTLB pages.
We need to find the in-use HugeTLB pages to support memcg removal. On
memcg removal we update the page's memory cgroup to point to parent
cgroup.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch adds a new charge type _MEMHUGETLB for tracking hugetlb
resources. We also use cftype to encode the hugetlb resource index. This
helps in using the same memcg callbacks for hugetlb control files.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb: add charge/uncharge calls for HugeTLB alloc/free
This adds the necessary charge/uncharge calls in the HugeTLB code. We do
the memcg charge in page alloc and the uncharge in the compound page
destructor. We also need to ignore HugeTLB pages in
__mem_cgroup_uncharge_common because that gets called from
delete_from_page_cache().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch implements a memcg extension that allows us to control HugeTLB
allocations via the memory controller. The extension allows limiting the
HugeTLB usage per control group and enforces the controller limit during
page fault. Since HugeTLB doesn't support page reclaim, enforcing the
limit at page fault time implies that the application will get a SIGBUS
signal if it tries to access HugeTLB pages beyond its limit. This
requires the application to know beforehand how many HugeTLB pages it
would require for its use.
The charge/uncharge calls will be added to HugeTLB code in later patch.
Support for memcg removal will be added in later patches.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since we migrate only one hugepage, don't use a linked list for passing the
page around. Directly pass the page that needs to be migrated as an
argument. This also removes the use of page->lru in the migrate path.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb: avoid taking i_mmap_mutex in unmap_single_vma() for hugetlb
The i_mmap_mutex lock was added in unmap_single_vma by 502717f4e ("hugetlb:
fix linked list corruption in unmap_hugepage_range()") but we don't use
page->lru in unmap_hugepage_range any more. Also, the lock is taken
higher up in the stack in some code paths, which would result in a
deadlock.
For shared pagetable support for huge pages, since pagetable pages are
refcounted we don't need any lock during huge_pmd_unshare. We do take
i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in the
mapping.
(39dde65c9940c97f ("shared page table for hugetlb page")).
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In file included from fs/proc/meminfo.c:2:
include/linux/hugetlb.h:126: warning: 'struct mmu_gather' declared inside parameter list
include/linux/hugetlb.h:126: warning: its scope is only this definition or declaration, which is probably not what you want
what a mess :(
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages
Use an mmu_gather instead of a temporary linked list for accumulating
pages when we unmap a hugepage range. This also allows us to get rid of
i_mmap_mutex in unmap_hugepage_range in the following patch.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlbfs: don't use ERR_PTR with VM_FAULT* values
The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the
VM_FAULT_* values from MAX_ERRNO.
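The hazard being removed, in sketch form (illustrative of the pattern, not an exact hunk):

	/*
	 * Smuggling a VM_FAULT_* code through ERR_PTR() only works while the
	 * value stays below MAX_ERRNO (4095); anything larger would not be
	 * recognised by IS_ERR() on the way back out.
	 */
	page = ERR_PTR(-VM_FAULT_SIGBUS);	/* producer side */

	if (IS_ERR(page))			/* consumer side */
		return -PTR_ERR(page);		/* recover the VM_FAULT_* code */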
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patchset implements a memory controller extension to control HugeTLB
allocations. The extension allows limiting the HugeTLB usage per control
group and enforces the controller limit during page fault. Since HugeTLB
doesn't support page reclaim, enforcing the limit at page fault time
implies that the application will get a SIGBUS signal if it tries to access
HugeTLB pages beyond its limit. This requires the application to know
beforehand how much HugeTLB memory it would require for its use.
The goal is to control how many HugeTLB pages a group of tasks can
allocate. It can be looked at as an extension of the existing quota
interface which limits the number of HugeTLB pages per hugetlbfs
superblock. An HPC job scheduler requires jobs to specify their resource
requirements in the job file. Once their requirements can be met, job
schedulers (such as SLURM) will schedule the job. We need to make sure
that the jobs won't consume more resources than requested. If they do we
should either error out or kill the application.
This patch:
Rename max_hstate to hugetlb_max_hstate. We will be using this from other
subsystems like memcg in later patches.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Thu, 3 May 2012 05:43:33 +0000 (15:43 +1000)]
mm: vmscan: remove reclaim_mode_t
There is little motivation for reclaim_mode_t once RECLAIM_MODE_[A]SYNC
and lumpy reclaim have been removed. This patch gets rid of
reclaim_mode_t as well and improves the documentation about what
reclaim/compaction is and when it is triggered.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Thu, 3 May 2012 05:43:32 +0000 (15:43 +1000)]
mm: vmscan: do not stall on writeback during memory compaction
This patch stops reclaim/compaction entering sync reclaim, as this was only
intended for lumpy reclaim and was an oversight. Page migration has its own
logic for stalling on writeback pages if necessary and memory compaction
is already using it.
Waiting on page writeback is bad for a number of reasons but the primary
one is that waiting on writeback to a slow device like USB can take a
considerable length of time. Page reclaim instead uses
wait_iff_congested() to throttle if too many dirty pages are being
scanned.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Thu, 3 May 2012 05:43:32 +0000 (15:43 +1000)]
mm: vmscan: remove lumpy reclaim
This series removes lumpy reclaim and some stalling logic that was
unintentionally being used by memory compaction. The end result is that
stalling on dirty pages during page reclaim now depends on
wait_iff_congested().
Four kernels were compared
3.3.0 vanilla
3.4.0-rc2 vanilla
3.4.0-rc2 lumpyremove-v2 is patch one from this series
3.4.0-rc2 nosync-v2r3 is the full series
Removing lumpy reclaim saves almost 900 bytes of text whereas the full
series removes 1200 bytes.
There are behaviour changes in the series and so tests were run with
monitoring of ftrace events. This disrupts results so the performance
results are distorted but the new behaviour should be clearer.
fs-mark running in a threaded configuration showed little of interest as
it did not push reclaim aggressively
ftrace showed that there was no stalling on writeback or pages submitted
for IO from reclaim context.
postmark was similar and while it was more interesting, it also did not
push reclaim heavily.
POSTMARK
3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3
Transactions per second: 16.00 ( 0.00%) 20.00 (25.00%) 18.00 (12.50%) 17.00 ( 6.25%)
Data megabytes read per second: 18.80 ( 0.00%) 24.27 (29.10%) 22.26 (18.40%) 20.54 ( 9.26%)
Data megabytes written per second: 35.83 ( 0.00%) 46.25 (29.08%) 42.42 (18.39%) 39.14 ( 9.24%)
Files created alone per second: 28.00 ( 0.00%) 38.00 (35.71%) 34.00 (21.43%) 30.00 ( 7.14%)
Files create/transact per second: 8.00 ( 0.00%) 10.00 (25.00%) 9.00 (12.50%) 8.00 ( 0.00%)
Files deleted alone per second: 556.00 ( 0.00%) 1224.00 (120.14%) 3062.00 (450.72%) 6124.00 (1001.44%)
Files delete/transact per second: 8.00 ( 0.00%) 10.00 (25.00%) 9.00 (12.50%) 8.00 ( 0.00%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 113.34 107.99 109.73 108.72
User+Sys Time Running Test (seconds) 145.51 139.81 143.32 143.55
Total Elapsed Time (seconds) 1159.16 899.23 980.17 1062.27
It looks like here that the full series regresses performance but as
ftrace showed no usage of wait_iff_congested() or sync reclaim I am
assuming it's a disruption due to monitoring. Other data such as memory
usage, page IO, swap IO all looked similar.
Running a benchmark with a plain DD showed nothing very interesting. The
full series stalled in wait_iff_congested() slightly less but stall times
on vanilla kernels were marginal.
Running a benchmark that hammered on file-backed mappings showed stalls
due to congestion but not in sync writebacks
MICRO
3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3
MMTests Statistics: duration
Sys Time Running Test (seconds) 308.13 294.50 298.75 299.53
User+Sys Time Running Test (seconds) 330.45 316.28 318.93 320.79
Total Elapsed Time (seconds) 1814.90 1833.88 1821.14 1832.91
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 0 0 0 0
Direct time congest waited 0ms 0ms 0ms 0ms
Direct full congest waited 0 0 0 0
Direct number conditional waited 900 870 754 789
Direct time conditional waited 0ms 0ms 0ms 20ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 2106 2308 2116 1915
KSwapd time congest waited 139924ms 157832ms 125652ms 132516ms
KSwapd full congest waited 1346 1530 1202 1278
KSwapd number conditional waited 12922 16320 10943 14670
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
Reclaim statistics are not radically changed. The stall times in kswapd
are massive but it is clear that it is due to calls to congestion_wait()
and that is almost certainly the call in balance_pgdat(). Otherwise
stalls due to dirty pages are non-existent.
I ran a benchmark that stressed high-order allocation. This is a very
artificial load but was used in the past to evaluate lumpy reclaim and
compaction. Generally I look at allocation success rates and latency
figures.
MMTests Statistics: duration
Sys Time Running Test (seconds) 740.93 681.42 685.14 684.87
User+Sys Time Running Test (seconds) 2922.65 3269.52 3281.35 3279.44
Total Elapsed Time (seconds) 1161.73 1152.49 1159.55 1161.44
Success rates are completely hosed for 3.4-rc2 which is almost certainly
due to [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled].
I expected this would happen for kswapd and impair allocation success
rates (https://lkml.org/lkml/2012/1/25/166) but I did not anticipate this
much of a difference: 80% less scanning, 37% less reclaim by kswapd.
In comparison, reclaim/compaction is not aggressive and gives up easily
which is the intended behaviour. hugetlbfs uses __GFP_REPEAT and would be
much more aggressive about reclaim/compaction than THP allocations are.
The stress test above is allocating like neither THP nor hugetlbfs but is
much closer to THP.
Mainline is now impaired in terms of high order allocation under heavy
load although I do not know to what degree as I did not test with
__GFP_REPEAT. Keep this in mind for bugs related to hugepage pool
resizing, THP allocation and high order atomic allocation failures from
network devices.
In terms of congestion throttling, I see the following for this test
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 3 0 0 0
Direct time congest waited 0ms 0ms 0ms 0ms
Direct full congest waited 0 0 0 0
Direct number conditional waited 957 512 1081 1075
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 36 4 3 5
KSwapd time congest waited 3148ms 400ms 300ms 500ms
KSwapd full congest waited 30 4 3 5
KSwapd number conditional waited 88514 197 332 542
KSwapd time conditional waited 4980ms 0ms 0ms 0ms
KSwapd full conditional waited 49 0 0 0
The "conditional waited" times are the most interesting as this is
directly impacted by the number of dirty pages encountered during scan.
As lumpy reclaim is no longer scanning contiguous ranges, it is finding
fewer dirty pages. This brings wait times from about 5 seconds to 0.
kswapd itself is still calling congestion_wait() so it'll still stall but
it's a lot less.
In terms of the type of IO we were doing, I see this
In 3.2, kswapd was doing a bunch of async writes of pages but
reclaim/compaction was never reaching a point where it was doing sync IO.
This does not guarantee that reclaim/compaction was not calling
wait_on_page_writeback() but I would consider it unlikely. It indicates
that merging patches 2 and 3 to stop reclaim/compaction calling
wait_on_page_writeback() should be safe.
This patch:
Lumpy reclaim had a purpose but in the mind of some, it was to kick the
system so hard it trashed. For others the purpose was to complicate
vmscan.c. Over time it was giving softer shoes and a nicer attitude but
memory compaction needs to step up and replace it so this patch sends
lumpy reclaim to the farm.
The tracepoint format changes for isolating LRU pages with this patch
applied. Furthermore reclaim/compaction can no longer queue dirty pages
in pageout() if the underlying BDI is congested. Lumpy reclaim used this
logic and reclaim/compaction was using it in error.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Thu, 3 May 2012 05:43:32 +0000 (15:43 +1000)]
mm: remove swap token code
The swap token code no longer fits in with the current VM model. It does
not play well with cgroups or the better NUMA placement code in
development, since we have only one swap token globally.
It also has the potential to mess with scalability of the system, by
increasing the number of non-reclaimable pages on the active and inactive
anon LRU lists.
Last but not least, the swap token code has been broken for a year without
complaints, as reported by Konstantin Khlebnikov. This suggests we no
longer have much use for it.
The days of sub-1G memory systems with heavy use of swap are over. If we
ever need thrashing reducing code in the future, we will have to implement
something that does scale.
Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hughd@google.com> Acked-by: Bob Picco <bpicco@meloft.net> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Thu, 3 May 2012 05:43:31 +0000 (15:43 +1000)]
mm, thp: allow fallback when pte_alloc_one() fails for huge pmd
The transparent hugepages feature is careful to not invoke the oom killer
when a hugepage cannot be allocated.
pte_alloc_one() failing in __do_huge_pmd_anonymous_page(), however,
currently results in VM_FAULT_OOM which invokes the pagefault oom killer
to kill a memory-hogging task.
This is unnecessary since it's possible to drop the reference to the
hugepage and fall back to allocating a small page.
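The fallback shape, sketched at the do_huge_pmd_anonymous_page() level (illustrative; the real hunk differs in detail, and "out" is a placeholder label):

	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
		/*
		 * Rather than reporting VM_FAULT_OOM, drop the charged
		 * hugepage and fall through to the normal small-page
		 * fault path.
		 */
		mem_cgroup_uncharge_page(page);
		put_page(page);
		goto out;
	}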
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Thu, 3 May 2012 05:43:31 +0000 (15:43 +1000)]
mm, thp: remove unnecessary ret variable
The "ret" variable is unnecessary in __do_huge_pmd_anonymous_page(), so
remove it.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wang Sheng-Hui [Thu, 3 May 2012 05:43:30 +0000 (15:43 +1000)]
mm/mempolicy.c: use enum value MPOL_REBIND_ONCE in mpol_rebind_policy()
We have an enum definition in mempolicy.h: MPOL_REBIND_ONCE. It should
replace the magic number 0 used for the step comparison in
mpol_rebind_policy().
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>