Michal Hocko [Thu, 8 Dec 2011 04:32:05 +0000 (15:32 +1100)]
procfs: do not overflow get_{idle,iowait}_time for nohz
Since a25cac51 ("proc: Consider NO_HZ when printing idle and iowait
times") we are reporting idle/io_wait time also while a CPU is tickless.
We rely on get_{idle,iowait}_time functions to retrieve proper data.
These functions, however, use usecs_to_cputime to translate microsecond
time to cputime64_t. This is just an alias to usecs_to_jiffies which
reduces the data type from u64 to unsigned int and also checks whether the
given parameter overflows jiffies_to_usecs(MAX_JIFFY_OFFSET) and returns
MAX_JIFFY_OFFSET in that case.
The point at which we overflow depends on CONFIG_HZ, but especially for CONFIG_HZ_300
it is quite low (1431649781) so we are getting MAX_JIFFY_OFFSET for
>3000s! until we overflow unsigned int. Just for reference CONFIG_HZ_100
has an overflow window around 20s, CONFIG_HZ_250 ~8s and CONFIG_HZ_1000
~2s.
This resulted in a bug where people saw [h]top going mad, reporting 100% CPU
usage even though there was basically no CPU load. The reason was simply
that /proc/stat stopped reporting idle/io_wait changes (and reported
MAX_JIFFY_OFFSET) and so the only change happening was for user and system
time.
Let's use nsecs_to_jiffies64 instead, which doesn't reduce the precision to
a 32b type and is much more appropriate for cumulative time values
(unlike usecs_to_jiffies, which is intended for timeout calculations).
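The truncate-then-clamp behaviour described above can be illustrated with a small userspace sketch. This is not the kernel code: the `sketch_` names are hypothetical, HZ is fixed at 300, the clamp threshold is the 1431649781 value quoted above, `UINT64_MAX` merely stands in for MAX_JIFFY_OFFSET, and the wide conversion is a simplified form of what nsecs_to_jiffies64 does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants for this sketch (not kernel definitions). */
#define SKETCH_HZ          300u
#define SKETCH_CLAMP_USECS 1431649781u  /* HZ=300 threshold quoted in the text */
#define SKETCH_MAX_OFFSET  UINT64_MAX   /* stand-in for MAX_JIFFY_OFFSET */

/* Narrow path: the 64-bit input is truncated to 32 bits, then clamped. */
static uint64_t sketch_usecs_to_jiffies(uint64_t usecs)
{
	uint32_t u = (uint32_t)usecs;	/* upper 32 bits are lost here */

	if (u > SKETCH_CLAMP_USECS)
		return SKETCH_MAX_OFFSET;
	return (uint64_t)u * SKETCH_HZ / 1000000u;
}

/* Wide path: all arithmetic stays in 64 bits, so nothing is truncated. */
static uint64_t sketch_nsecs_to_jiffies64(uint64_t nsecs)
{
	const uint64_t nsec_per_sec = 1000000000u;

	return nsecs / nsec_per_sec * SKETCH_HZ +
	       nsecs % nsec_per_sec * SKETCH_HZ / nsec_per_sec;
}

/* Usage example: 1s converts correctly, 2000s is clamped, 5000s wraps. */
static void sketch_time_demo(void)
{
	assert(sketch_usecs_to_jiffies(1000000ULL) == 300);
	assert(sketch_usecs_to_jiffies(2000000000ULL) == SKETCH_MAX_OFFSET);
	assert(sketch_usecs_to_jiffies(5000000000ULL) !=
	       sketch_nsecs_to_jiffies64(5000000000000ULL));
}
```

In this model a cumulative idle time of 2000s is reported as the clamp value, and 5000s wraps around to a much smaller number, while the 64-bit path keeps counting.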
Signed-off-by: Michal Hocko <mhocko@suse.cz> Tested-by: Artem S. Tashkinov <t.artem@mailcity.com> Cc: Dave Jones <davej@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Thu, 8 Dec 2011 04:32:04 +0000 (15:32 +1100)]
mm: vmalloc: check for page allocation failure before vmlist insertion
Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
/proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after it
is fully initialised. Unfortunately, it did not check that
__vmalloc_area_node() successfully populated the area. In the event of
allocation failure, the vmalloc area is freed but the pointer to the freed
memory is inserted into the vmlist, leading to a crash later in
get_vmalloc_info().
This patch adds a check for __vmalloc_area_node() failure within
__vmalloc_node_range. It does not use "goto fail" as in the previous
error path as a warning was already displayed by __vmalloc_area_node()
before it called vfree in its failure path.
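A minimal userspace sketch of the pattern, assuming a hypothetical singly linked list and a hypothetical `sketch_populate()` that, like __vmalloc_area_node() as described above, frees the area itself on failure. None of these names are kernel APIs; the point is only that the caller must return before publishing the pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a vm_struct on a global list. */
struct sketch_area {
	struct sketch_area *next;
	size_t size;
	void *pages;
};

static struct sketch_area *sketch_list;

/* On failure this frees the area itself and returns NULL. */
static void *sketch_populate(struct sketch_area *a, bool simulate_failure)
{
	a->pages = simulate_failure ? NULL : malloc(a->size);
	if (!a->pages) {
		free(a);
		return NULL;
	}
	return a->pages;
}

static void *sketch_alloc(size_t size, bool simulate_failure)
{
	struct sketch_area *a = calloc(1, sizeof(*a));
	void *addr;

	if (!a)
		return NULL;
	a->size = size;

	addr = sketch_populate(a, simulate_failure);
	if (!addr)
		return NULL;	/* 'a' is already freed: do not insert it */

	/* Only a fully populated area is published on the list. */
	a->next = sketch_list;
	sketch_list = a;
	return addr;
}

static size_t sketch_list_len(void)
{
	size_t n = 0;

	for (struct sketch_area *a = sketch_list; a; a = a->next)
		n++;
	return n;
}

/* Usage example: a failed allocation leaves the list untouched. */
static void sketch_vmalloc_demo(void)
{
	assert(sketch_alloc(64, true) == NULL);
	assert(sketch_list_len() == 0);
}
```

Without the early return, the freed pointer would sit on the list and any later walker would dereference it.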
Credit goes to Luciano Chavez for doing all the real work of identifying
exactly where the problem was.
If accepted, this should be considered a -stable candidate.
Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com> Tested-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [3.1.x+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Thu, 8 Dec 2011 04:32:04 +0000 (15:32 +1100)]
cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
nodes from changing for a thread. This causes any update to a set of
allowed nodes to stall until put_mems_allowed() is called.
This stall is unnecessary, however, if at least one node remains unchanged
in the update to the set of allowed nodes. This was addressed by
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
node remains set"), but it's still possible that an empty nodemask may be
read from a mempolicy because the old nodemask may be remapped to the new
nodemask during rebind. To prevent this, only avoid the stall if there is
no mempolicy for the thread being changed.
This is a temporary solution until all reads from mempolicy nodemasks can
be guaranteed to not be empty without the get_mems_allowed()
synchronization.
Also moves the check for nodemask intersection inside task_lock() so that
tsk->mems_allowed cannot change. This ensures that nothing can set this
tsk's mems_allowed out from under us and also protects tsk->mempolicy.
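The resulting decision can be sketched as a single predicate. This is an illustration only: `struct sketch_task` and `sketch_need_stall()` are hypothetical, a 64-bit word stands in for a nodemask, and the caller is assumed to hold the task lock so that neither field can change during the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task model: one bit per memory node. */
struct sketch_task {
	uint64_t mems_allowed;
	bool has_mempolicy;
};

/*
 * Returns true when the update must stall until readers are done.
 * The stall is skipped only if the task has no mempolicy AND the old
 * and new masks share at least one node.
 * Assumption: the caller holds the task lock around this call.
 */
static bool sketch_need_stall(const struct sketch_task *tsk, uint64_t new_mems)
{
	bool intersects = (tsk->mems_allowed & new_mems) != 0;

	return tsk->has_mempolicy || !intersects;
}

/* Usage example with nodes {0,1} as the old mask. */
static void sketch_cpuset_demo(void)
{
	struct sketch_task plain = { .mems_allowed = 0x3, .has_mempolicy = false };
	struct sketch_task bound = { .mems_allowed = 0x3, .has_mempolicy = true };

	assert(!sketch_need_stall(&plain, 0x6));	/* node 1 stays set */
	assert(sketch_need_stall(&plain, 0xc));		/* disjoint masks */
	assert(sketch_need_stall(&bound, 0x6));		/* mempolicy forces stall */
}
```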
Reported-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Thu, 8 Dec 2011 04:32:03 +0000 (15:32 +1100)]
mm: Ensure that pfn_valid() is called once per pageblock when reserving pageblocks
setup_zone_migrate_reserve expects that zone->start_pfn starts
at a pageblock_nr_pages-aligned pfn; otherwise we could access
beyond an existing memblock, resulting in the following panic if
CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:
We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
highstart_pfn = 0x36ffe.
The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page in
a pageblock is reserved before marking it MIGRATE_RESERVE").
Make sure that start_pfn is always aligned to pageblock_nr_pages to ensure
that pfn_valid is always called at the start of each pageblock.
Architectures with holes in pageblocks will be correctly handled by
pfn_valid_within in pageblock_is_reserved.
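The alignment step can be sketched in userspace as follows. The `sketch_` helpers are hypothetical, and the pageblock size of 1024 pages is an assumption chosen for illustration; the real value depends on the architecture and configuration.

```c
#include <assert.h>

/* Assumed pageblock size for this sketch only. */
#define SKETCH_PAGEBLOCK_NR_PAGES 1024UL

/* Round x up to the next multiple of step (step must be non-zero). */
static unsigned long sketch_roundup(unsigned long x, unsigned long step)
{
	assert(step != 0);
	return (x + step - 1) / step * step;
}

/*
 * Walk [start_pfn, end_pfn) one pageblock at a time, starting from an
 * aligned pfn, and count the pageblock starts visited. In the real code
 * each visited pfn is where the validity check would be made.
 */
static unsigned long sketch_count_pageblocks(unsigned long start_pfn,
					     unsigned long end_pfn)
{
	unsigned long pfn, n = 0;

	start_pfn = sketch_roundup(start_pfn, SKETCH_PAGEBLOCK_NR_PAGES);
	for (pfn = start_pfn; pfn < end_pfn; pfn += SKETCH_PAGEBLOCK_NR_PAGES) {
		assert(pfn % SKETCH_PAGEBLOCK_NR_PAGES == 0);
		n++;
	}
	return n;
}

/* Usage example: an unaligned start such as 0x36ffe is moved to 0x37000. */
static void sketch_pageblock_demo(void)
{
	assert(sketch_roundup(0x36ffeUL, SKETCH_PAGEBLOCK_NR_PAGES) == 0x37000UL);
	assert(sketch_count_pageblocks(0x36ffeUL, 0x38000UL) == 4);
}
```

Without the rounding, every step of the loop would land two pages before a pageblock boundary, so a check made once per step would not cover the start of each pageblock.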
Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Dang Bo <bdang@vmware.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> [3.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Youquan Song [Thu, 8 Dec 2011 04:32:03 +0000 (15:32 +1100)]
thp: set compound tail page _count to zero
70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
page_tail->_count zero at all times. But the current kernel does not set
page_tail->_count to zero if a 1GB page is utilized. So when an IOMMU 1GB
page is used in KVM, it will result in a kernel oops because a tail page's
_count does not equal zero.
Signed-off-by: Youquan Song <youquan.song@intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Youquan Song [Thu, 8 Dec 2011 04:32:02 +0000 (15:32 +1100)]
thp: add compound tail page _mapcount when mapped
With the 3.2-rc kernel, the IOMMU 2M page in KVM works. But when I try to
use an IOMMU 1GB page in KVM, I encounter an oops and the 1GB page totally
fails to be used. The root cause is that 1GB page allocation calls
gup_huge_pud() while 2M page allocation calls gup_huge_pmd(). If compound
pages are used and the page is a tail page, gup_huge_pmd() increases
_mapcount to record that the tail page is mapped, while gup_huge_pud() does
not include this step. So when the mapped page is released, it results in
a kernel oops because the page is not marked as mapped.
This patch adds tail page handling for compound pages to the 1GB huge page
path, keeping it consistent with the 2M page path.
Signed-off-by: Youquan Song <youquan.song@intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Zijlstra [Thu, 8 Dec 2011 04:32:02 +0000 (15:32 +1100)]
printk: avoid double lock acquire
Commit 4f2a8d3cf5e ("printk: Fix console_sem vs logbuf_lock unlock race")
introduced another silly bug where we would want to acquire an already
held lock. Avoid this.
Reported-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
More players have joined memory cgroup development, and Johannes' great work
changed the internal design of the memory cgroup dramatically. And he will do
more work. Michal Hocko did many bug fixes and knows the memory cgroup very
well. Daisuke Nishimura helped us very much but he seems busy now.
Thanks for his work.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use wait_event_freezable_timeout() instead of
schedule_timeout_interruptible() to avoid missing freezer wakeups. A
try_to_freeze() would have been needed in the khugepaged_alloc_hugepage
tight loop too in case of the allocation failing repeatedly, and
wait_event_freezable_timeout will provide it too.
khugepaged would still freeze just fine by trying again the next minute
but it's better if it freezes immediately.
Reported-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Jiri Slaby <jslaby@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A shrinker function can return -1, meaning that it cannot do anything
without a risk of deadlock. For example, prune_super() does this if it
cannot grab a superblock reference, even if nr_to_scan=0. Currently we
interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan'
according to this. So the next time around this shrinker can cause really
big pressure. Let's skip such shrinkers instead.
Also make total_scan signed, otherwise the check (total_scan < 0) below
never works.
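Both points can be illustrated with a short sketch. The helper names are hypothetical and the functions only model the two checks described above: skipping a shrinker whose size query is not positive, and keeping total_scan signed so that a negative result can actually be detected.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* A size query of -1 (or 0) means there is nothing safe to do: skip. */
static bool sketch_should_skip(long max_pass)
{
	return max_pass <= 0;
}

/*
 * With a signed total_scan the (total_scan < 0) check can fire. With an
 * unsigned type the same comparison would be always false, so a wrapped
 * value would be used as a huge scan target.
 */
static long sketch_fix_total_scan(long total_scan, long max_pass)
{
	if (total_scan < 0)
		total_scan = max_pass;
	return total_scan;
}

/* Usage example, including why -1 reads as ULONG_MAX when unsigned. */
static void sketch_shrinker_demo(void)
{
	assert((unsigned long)-1L == ULONG_MAX);
	assert(sketch_should_skip(-1));
	assert(!sketch_should_skip(128));
	assert(sketch_fix_total_scan(-5, 100) == 100);
}
```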
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>