Daniel Mack [Fri, 21 Sep 2012 01:01:57 +0000 (11:01 +1000)]
lis3: lis3lv02d_spi.c: include linux/of.h
This include is needed to define of_match:ptr() for !CONFIG_OF &&
1CONFIG_DTC.
Signed-off-by: Daniel Mack <zonque@gmail.com> Reported-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
orderly_poweroff is trying to poweroff platform by two steps:
step 1: Call userspace application to poweroff
step 2: If userspace poweroff fail, then do a force power off if force
param is set.
The bug here is, step 1 is always successful with param UMH_NO_WAIT,
should change to UMH_WAIT_EXEC which will monitor whether user application
successful run.
Signed-off-by: Feng Hong <hongfeng@marvell.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()
As kernel_power_off() calls disable_nonboot_cpus(), we may also want to
have kernel_restart() call disable_nonboot_cpus(). Doing so can help
machines that require boot cpu be the last alive cpu during reboot to
survive with kernel restart.
This fixes one reboot issue seen on imx6q (Cortex-A9 Quad). The machine
requires that the restart routine be run on the primary cpu rather than
secondary ones. Otherwise, the secondary core running the restart routine
will fail to come to online after reboot.
Signed-off-by: Shawn Guo <shawn.guo@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Historically, the top three bytes of personality have been used for things
such as ADDR_NO_RANDOMIZE, which made sense only for specific
architectures.
We now however have a flag there that is general no matter the
architecture (UNAME26); generally we have to be careful to preserve the
personality flags across exec().
This patch fixes tile architecture not to forcefully overwrite personality
flags during exec().
In addition to that, we fix two other things along the way:
- exec_domain switching is fixed -- set_personality() should always
be used instead of directly assigning to current->personality.
- as pointed out by Arnd Bergmann, PER_LINUX_32BIT is not used anywhere
by tile, so let's just drop that in favor of PER_LINUX
Signed-off-by: Jiri Kosina <jkosina@suse.cz> Acked-by: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cross-arch: don't corrupt personality flags upon exec()
Historically, the top three bytes of personality have been used for things
such as ADDR_NO_RANDOMIZE, which made sense only for specific
architectures.
We now however have a flag there that is general no matter the
architecture (UNAME26); generally we have to be careful to preserve the
personality flags across exec().
This patch tries to fix all architectures that forcefully overwrite
personality flags during exec() (ppc32 and s390 have been fixed recently
by commits f9783ec86 and 59e4c3a2f in a similar way already).
Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Mark Salter <msalter@redhat.com> Cc: Mikael Starvik <starvik@axis.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Chen Liqin <liqin.chen@sunplusct.com> Cc: Lennox Wu <lennox.wu@gmail.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 6afe1a1fe8ff83f6a ("PM: Remove legacy PM") removed the
initialization of retval, causing:
arch/frv/kernel/pm.c: In function 'sysctl_pm_do_suspend':
arch/frv/kernel/pm.c:165:5: warning: 'retval' may be used uninitialized in this function [-Wuninitialized]
Remove the variable completely to fix this, and convert to a proper
switch (...) { ... } construct to improve readability.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joe Mario [Fri, 21 Sep 2012 01:01:54 +0000 (11:01 +1000)]
sectons: fix const sections for crc32 table
Fix the const sections for the code generated by crc32 table. There's no
ro version of the cacheline aligned section, so we cannot put in const
data without a conflict Just don't make the crc tables const for now.
[ak@linux.intel.com: some fixes and new description] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
sections: fix section conflicts in drivers/net/wan
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Krzysztof Halasa <khc@pm.waw.pl> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The PA-RISC tool chain seems to have some problem with correct read/write
attributes on sections. This causes problems when the const sections
are fixed up for other architecture to only contain truly read-only
data.
Minchan Kim [Fri, 21 Sep 2012 01:01:44 +0000 (11:01 +1000)]
memory-hotplug: fix zone stat mismatch
During memory-hotplug, I found NR_ISOLATED_[ANON|FILE] are increasing,
causing the kernel to hang. When the system doesn't have enough free
pages, it enters reclaim but never reclaim any pages due to
too_many_isolated()==true and loops forever.
The cause is that when we do memory-hotadd after memory-remove,
__zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
although the vm_stat_diff of all CPUs still have values.
In addtion, when we offline all pages of the zone, we reset them in
zone_pcp_reset without draining so we loss some zone stat item.
Minchan Kim [Fri, 21 Sep 2012 01:01:43 +0000 (11:01 +1000)]
mm: revert 0def08e3 ("mm/mempolicy.c: check return code of check_range")
Revert 0def08e3 because check_range can't fail in migrate_to_node with
considering current usecases.
Quote from Johannes
: I think it makes sense to revert. Not because of the semantics, but I
: just don't see how check_range() could even fail for this callsite:
:
: 1. we pass mm->mmap->vm_start in there, so we should not fail due to
: find_vma()
:
: 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
: and so can not fail
:
: 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
: continue until addr == end, so we never fail with -EIO
And I added a new VM_BUG_ON for checking migrate_to_node's future usecase
which might pass to MPOL_MF_STRICT.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vasiliy Kulikov <segooon@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Haggai Eran [Fri, 21 Sep 2012 01:01:43 +0000 (11:01 +1000)]
mm: call invalidate_range_end in do_wp_page even for zero pages
The previous patch "mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end" only called the
invalidate_range_end mmu notifier function in do_wp_page when the new_page
variable wasn't NULL. This was done in order to only call
invalidate_range_end after invalidate_range_start was called.
Unfortunately, there are situations where new_page is NULL and
invalidate_range_start is called. This caused invalidate_range_start to
be called without a matching invalidate_range_end, causing kvm to loop
indefinitely on the first page fault.
This patch adds a flag variable to do_wp_page that marks whether the
invalidate_range_start notifier was called. invalidate_range_end is then
called if the flag is true.
Reported-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Cc: Andrea Arcangeli <andrea@qumranet.com> Cc: Sagi Grimberg <sagig@mellanox.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Haggai Eran <haggaie@mellanox.com> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Liran Liss <liranl@mellanox.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Haggai Eran [Fri, 21 Sep 2012 01:01:43 +0000 (11:01 +1000)]
mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end
In order to allow sleeping during invalidate_page mmu notifier calls, we
need to avoid calling when holding the PT lock. In addition to its direct
calls, invalidate_page can also be called as a substitute for a change_pte
call, in case the notifier client hasn't implemented change_pte.
This patch drops the invalidate_page call from change_pte, and instead
wraps all calls to change_pte with invalidate_range_start and
invalidate_range_end calls.
Note that change_pte still cannot sleep after this patch, and that clients
implementing change_pte should not take action on it in case the number of
outstanding invalidate_range_start calls is larger than one, otherwise
they might miss a later invalidation.
Signed-off-by: Haggai Eran <haggaie@mellanox.com> Cc: Andrea Arcangeli <andrea@qumranet.com> Cc: Sagi Grimberg <sagig@mellanox.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Haggai Eran <haggaie@mellanox.com> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Liran Liss <liranl@mellanox.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: move all mmu notifier invocations to be done outside the PT lock
In order to allow sleeping during mmu notifier calls, we need to avoid
invoking them under the page table spinlock. This patch solves the
problem by calling invalidate_page notification after releasing the lock
(but before freeing the page itself), or by wrapping the page invalidation
with calls to invalidate_range_begin and invalidate_range_end.
To prevent accidental changes to the invalidate_range_end arguments after
the call to invalidate_range_begin, the patch introduces a convention of
saving the arguments in consistently named locals:
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
The patch changes code to use this convention for all calls to
mmu_notifier_invalidate_range_start/end, except those where the calls are
close enough so that anyone who glances at the code can see the values
aren't changing.
This patchset is a preliminary step towards on-demand paging design to be
added to the RDMA stack.
Why do we want on-demand paging for Infiniband?
Applications register memory with an RDMA adapter using system calls,
and subsequently post IO operations that refer to the corresponding
virtual addresses directly to HW. Until now, this was achieved by
pinning the memory during the registration calls. The goal of on demand
paging is to avoid pinning the pages of registered memory regions (MRs).
This will allow users the same flexibility they get when swapping any
other part of their processes address spaces. Instead of requiring the
entire MR to fit in physical memory, we can allow the MR to be larger,
and only fit the current working set in physical memory.
Why should anyone care? What problems are users currently experiencing?
This can make programming with RDMA much simpler. Today, developers
that are working with more data than their RAM can hold need either to
deregister and reregister memory regions throughout their process's
life, or keep a single memory region and copy the data to it. On demand
paging will allow these developers to register a single MR at the
beginning of their process's life, and let the operating system manage
which pages needs to be fetched at a given time. In the future, we
might be able to provide a single memory access key for each process
that would provide the entire process's address as one large memory
region, and the developers wouldn't need to register memory regions at
all.
Is there any prospect that any other subsystems will utilise these
infrastructural changes? If so, which and how, etc?
As for other subsystems, I understand that XPMEM wanted to sleep in
MMU notifiers, as Christoph Lameter wrote at
http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
perhaps Andrea knows about other use cases.
Scheduling in mmu notifications is required since we need to sync the
hardware with the secondary page tables change. A TLB flush of an IO
device is inherently slower than a CPU TLB flush, so our design works by
sending the invalidation request to the device, and waiting for an
interrupt before exiting the mmu notifier handler.
Avi said:
kvm may be a buyer. kvm::mmu_lock, which serializes guest page
faults, also protects long operations such as destroying large ranges.
It would be good to convert it into a spinlock, but as it is used inside
mmu notifiers, this cannot be done.
(there are alternatives, such as keeping the spinlock and using a
generation counter to do the teardown in O(1), which is what the "may"
is doing up there).
[akpm@linux-foundation.orgpossible speed tweak in hugetlb_cow(), cleanups] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Haggai Eran <haggaie@mellanox.com> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Liran Liss <liranl@mellanox.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Fri, 21 Sep 2012 01:01:42 +0000 (11:01 +1000)]
hugetlb: do not use vma_hugecache_offset() for vma_prio_tree_foreach
0c176d5 ("mm: hugetlb: fix pgoff computation when unmapping page from
vma") fixed pgoff calculation but it has replaced it by
vma_hugecache_offset() which is not approapriate for offsets used for
vma_prio_tree_foreach() because that one expects index in page units
rather than in huge_page_shift.
Using vma_hugecache_offset() is not incorrect because the pgoff will fit
into the same vmas but it is confusing so the standard PAGE_SHIFT based
index calculation is used instead.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Fri, 21 Sep 2012 01:01:42 +0000 (11:01 +1000)]
mm: thp: fix pmd_present for split_huge_page and PROT_NONE with THP
In many places !pmd_present has been converted to pmd_none. For pmds
that's equivalent and pmd_none is quicker so using pmd_none is better.
However (unless we delete pmd_present) we should provide an accurate
pmd_present too. This will avoid the risk of code thinking the pmd is non
present because it's under __split_huge_page_map, see the pmd_mknotpresent
there and the comment above it.
If the page has been mprotected as PROT_NONE, it would also lead to a
pmd_present false negative in the same way as the race with
split_huge_page.
Because the PSE bit stays on at all times (both during split_huge_page and
when the _PAGE_PROTNONE bit get set), we could only check for the PSE bit,
but checking the PROTNONE bit too is still good to remember pmd_present
must always keep PROT_NONE into account.
This explains a not reproducible BUG_ON that was seldom reported on the
lists.
The same issue is in pmd_large, it would go wrong with both PROT_NONE and
if it races with split_huge_page.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andi removed some outedated documentation from Documentation/memory.txt
back in 2009 by 3b2b9a875dd ("Documentation/memory.txt: remove some very
outdated recommendations"), but the resulting document is not in a nice
shape either.
It seems to me like we are not losing anything by completely removing the
file now.
Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Fri, 21 Sep 2012 01:01:41 +0000 (11:01 +1000)]
mm, numa: reclaim from all nodes within reclaim distance
RECLAIM_DISTANCE represents the distance between nodes at which it is
deemed too costly to allocate from; it's preferred to try to reclaim from
a local zone before falling back to allocating on a remote node with such
a distance.
To do this, zone_reclaim_mode is set if the distance between any two
nodes on the system is greather than this distance. This, however, ends
up causing the page allocator to reclaim from every zone regardless of
its affinity.
What we really want is to reclaim only from zones that are closer than
RECLAIM_DISTANCE. This patch adds a nodemask to each node that
represents the set of nodes that are within this distance. During the
zone iteration, if the bit for a zone's node is set for the local node,
then reclaim is attempted; otherwise, the zone is skipped.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So
remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
checking it, reporting "BUG: Bad page state" if it's ever found set.
Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We had thought that pages could no longer get freed while still marked as
mlocked; but Johannes Weiner posted this program to demonstrate that
truncating an mlocked private file mapping containing COWed pages is still
mishandled:
The anon COWed pages are not caught by truncation's clear_page_mlock() of
the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
look out for them there in page_remove_rmap(). Indeed, why should
truncation or invalidation be doing the clear_page_mlock() when removing
from pagecache? mlock is a property of mapping in userspace, not a
property of pagecache: an mlocked unmapped page is nonsensical.
Reported-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
page_evictable(page, vma) is an irritant: almost all its callers pass
NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
explicitly in the couple of places it's needed. But in those places we
don't even need page_evictable() itself! They're dealing with a freshly
allocated anonymous page, which has no "mapping" and cannot be mlocked yet.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In fuzzing with trinity, lockdep protested "possible irq lock inversion
dependency detected" when isolate_lru_page() reenabled interrupts while
still holding the supposedly irq-safe tree_lock:
isolate_lru_page() is correct to enable interrupts unconditionally:
invalidate_complete_page2() is incorrect to call clear_page_mlock() while
holding tree_lock, which is supposed to nest inside lru_lock.
Both truncate_complete_page() and invalidate_complete_page() call
clear_page_mlock() before taking tree_lock to remove page from radix_tree.
I guess invalidate_complete_page2() preferred to test PageDirty (again)
under tree_lock before committing to the munlock; but since the page has
already been unmapped, its state is already somewhat inconsistent, and no
worse if clear_page_mlock() moved up.
Reported-by: Sasha Levin <levinsasha928@gmail.com> Deciphered-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Fri, 21 Sep 2012 01:01:39 +0000 (11:01 +1000)]
memcg: move mem_cgroup_is_root upwards
kmem code uses this function and it is better to not use forward
declarations for static inline functions as some (older) compilers don't
like it:
gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux)
mm/memcontrol.c:421: warning: `mem_cgroup_is_root' declared inline after being called
mm/memcontrol.c:421: warning: previous declaration of `mem_cgroup_is_root' was here
Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Cc: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Fri, 21 Sep 2012 01:01:39 +0000 (11:01 +1000)]
memcg: cleanup kmem tcp ifdefs
TCP kmem accounting is currently guarded by CONFIG_MEMCG_KMEM ifdefs but
the code is not used if !CONFIG_INET so we should rather test for both.
The same applies to net/sock.h, net/ip.h and net/tcp_memcontrol.h but
let's keep those outside of any ifdefs because it is considered safer wrt.
future maintainability.
Michael Kerrisk [Fri, 21 Sep 2012 01:01:39 +0000 (11:01 +1000)]
memcg: trivial fixes for Documentation/cgroups/memory.txt
While reading through Documentation/cgroups/memory.txt, I found a number
of minor wordos and typos. The patch below is a conservative handling of
some of these: it provides just a number of "obviously correct" fixes to
the English that improve the readability of the document somewhat.
Obviously some more significant fixes need to be made to the document, but
some of those may not be in the "obvious correct" category.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
but is now:
zone->present_pages = spanned pages - absent pages - memmap pages.
spanned pages: total size, including holes.
absent pages: holes.
bootmem pages: pages used in system boot, managed by bootmem allocator.
memmap pages: pages used by page structs.
This may cause zone->present_pages less than it should be. For example,
numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other
bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's
present_pages should be spanned pages - absent pages, but now it also
minus memmap pages(free_area_init_core), which are actually allocated from
ZONE_MOVABLE. When offlining all memory of a zone, this will cause
zone->present_pages less than 0, because present_pages is unsigned long
type, it is actually a very large integer, it indirectly caused
zone->watermark[WMARK_MIN] becomes a large
integer(setup_per_zone_wmarks()), than cause totalreserve_pages become a
large integer(calculate_totalreserve_pages()), and finally cause memory
allocating failure when fork process(__vm_enough_memory()).
Now that lumpy reclaim has been removed, compaction is the only way left
to free up contiguous memory areas. It is time to just enable
CONFIG_COMPACTION by default.
Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: thp: fix the update_mmu_cache() last argument passing in mm/huge_memory.c
The update_mmu_cache() takes a pointer (to pte_t by default) as the last
argument but the huge_memory.c passes a pmd_t value. The patch changes
the argument to the pmd_t * pointer.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Steve Capper <steve.capper@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: compaction: move fatal signal check out of compact_checklock_irqsave
c67fe3752 ("mm: compaction: Abort async compaction if locks are contended
or taking too long") addressed a lock contention problem in compaction by
introducing compact_checklock_irqsave() that effecively aborting async
compaction in the event of compaction.
To preserve existing behaviour it also moved a fatal_signal_pending()
check into compact_checklock_irqsave() but that is very misleading. It
"hides" the check within a locking function but has nothing to do with
locking as such. It just happens to work in a desirable fashion.
This patch moves the fatal_signal_pending() check to
isolate_migratepages_range() where it belongs. Arguably the same check
should also happen when isolating pages for freeing but it's overkill.
Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Shaohua Li <shli@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Fri, 21 Sep 2012 01:00:26 +0000 (11:00 +1000)]
memory-hotplug: don't replace lowmem pages with highmem
The changelog for 6a6dccba2 ("mm: cma: don't replace lowmem pages with
highmem") mentioned that lowmem pages can be replaced by highmem pages
during CMA migration. 6a6dccba2 fixed that issue.
Quote from that changelog:
: The filesystem layer expects pages in the block device's mapping to not
: be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
: currently replace lowmem pages with highmem pages, leading to crashes in
: filesystem code such as the one below:
:
: Unable to handle kernel NULL pointer dereference at virtual address 00000400
: pgd = c0c98000
: [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
: Internal error: Oops: 817 [#1] PREEMPT SMP ARM
: CPU: 0 Not tainted (3.5.0-rc5+ #80)
: PC is at __memzero+0x24/0x80
: ...
: Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
: Backtrace:
: [<c010e3f0>] (ext4_getblk+0x0/0x180) from [<c010e58c>] (ext4_bread+0x1c/0x98)
: [<c010e570>] (ext4_bread+0x0/0x98) from [<c0117944>] (ext4_mkdir+0x160/0x3bc)
: r4:c15337f0
: [<c01177e4>] (ext4_mkdir+0x0/0x3bc) from [<c00c29e0>] (vfs_mkdir+0x8c/0x98)
: [<c00c2954>] (vfs_mkdir+0x0/0x98) from [<c00c2a60>] (sys_mkdirat+0x74/0xac)
: r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
: [<c00c29ec>] (sys_mkdirat+0x0/0xac) from [<c00c2ab8>] (sys_mkdir+0x20/0x24)
: r6:beccdcf0 r5:00074000 r4:beccdbbc
: [<c00c2a98>] (sys_mkdir+0x0/0x24) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30)
Memory-hotplug has same problem as CMA has so the same fix can be applied
to memory-hotplug as well.
Fix it by reusing.
Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Fri, 21 Sep 2012 01:00:25 +0000 (11:00 +1000)]
mm: compaction: check lock contention first before taking lock
isolate_migratepages_range will take zone->lru_lock first and check if the
lock is contented, if yes, it will release the lock. This isn't
efficient. If the lock is truly contented, a lock/unlock pair will
increase the lock contention. We'd better check if the lock is contended
first. compact_trylock_irqsave perfectly meets the requirement.
Signed-off-by: Shaohua Li <shli@fusionio.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
make compact_zone_order() require non-NULL arg `contended'
Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fusionio.com> Cc: Shaohua Li <shli@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Fri, 21 Sep 2012 01:00:24 +0000 (11:00 +1000)]
mm: compaction: abort compaction loop if lock is contended or run too long
isolate_migratepages_range() might isolate no pages if for example when
zone->lru_lock is contended and running asynchronous compaction. In this
case, we should abort compaction, otherwise, compact_zone will run a
useless loop and make zone->lru_lock is even contended.
An additional check is added to ensure that cc.migratepages and
cc.freepages get properly drained whan compaction is aborted.
[minchan@kernel.org: Putback pages isolated for migration if aborting]
[akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wanpeng Li [Fri, 21 Sep 2012 01:00:23 +0000 (11:00 +1000)]
mm/memblock: reduce overhead in binary search
When checking that the indicated address belongs to the memory region, the
memory regions are checked one by one through a binary search, which will
be time consuming.
If the indicated address isn't in the memory region, then we needn't do
the time-consuming search. Add a check on the indicated address for that
purpose.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Fri, 21 Sep 2012 00:57:50 +0000 (10:57 +1000)]
swap: add a simple detector for inappropriate swapin readahead
The swapin readahead does a blind readahead whether or not the swapin is
sequential. This is ok for harddisk because large reads have relatively
small costs and if the readahead pages are unneeded they can be reclaimed
easily. But for SSD devices large reads are more expensive than small
one. If readahead pages are unneeded, reading them in caused significant
overhead
This patch addes a simple random read detection similar to file mmap
readahead. If a random read is detected, swapin readahead will be
skipped. This improves a lot for a swap workload with random IO in a fast
SSD.
I run anonymous mmap write micro benchmark, which will triger swapin/swapout.
For both harddisk and SSD, the randwrite swap workload run time is reduced
significantly. Sequential write swap workload hasn't chanage.
Interestingly, the randwrite harddisk test is improved too. This might be
because swapin readahead needs to allocate extra memory, which further
tights memory pressure, so more swapout/swapin.
Signed-off-by: Shaohua Li <shli@fusionio.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
do the "#define foo foo" trick in the conventional manner
Cc: "David S. Miller" <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Michal Simek <monstr@monstr.eu> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fusionio.com> Cc: Shaohua Li <shli@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The x86 implementation of atomic_dec_if_positive is quite generic, so make
it available to all architectures.
This is needed for "swap: add a simple detector for inappropriate swapin
readahead".
Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Fri, 21 Sep 2012 00:57:48 +0000 (10:57 +1000)]
memory-hotplug: fix pages missed by race rather than failing
If race between allocation and isolation in memory-hotplug offline
happens, some pages could be in MIGRATE_MOVABLE of free_list although the
pageblock's migratetype of the page is MIGRATE_ISOLATE.
The race could be detected by get_freepage_migratetype in
__test_page_isolated_in_pageblock. If it is detected, now EBUSY gets
bubbled all the way up and the hotplug operations fails.
But better idea is instead of returning and failing memory-hotremove, move
the free page to the correct list at the time it is detected. It could
enhance memory-hotremove operation success ratio although the race is
really rare.
free_hot_cold_page(Page A)
/* without zone->lock */
migratetype = get_pageblock_migratetype(Page A);
/*
* Page could be moved into MIGRATE_MOVABLE
* of per_cpu_pages
*/
list_add_tail(&page->lru, &pcp->lists[migratetype]);
/* Page A could be in MIGRATE_MOVABLE of free_list. */
check_pages_isolated
__test_page_isolated_in_pageblock
/*
* We can't catch freed page which
* is free_list[MIGRATE_MOVABLE]
*/
if (PageBuddy(page A))
pfn += 1 << page_order(page A);
/* So, Page A could be allocated */
__offline_isolated_pages
/*
* BUG_ON hit or offline page
* which is used by someone
*/
BUG_ON(!PageBuddy(page A));
This patch checks page's migratetype in freelist in
__test_page_isolated_in_pageblock. So now
__test_page_isolated_in_pageblock can check the page caused by above race
and can fail of memory offlining.
Minchan Kim [Fri, 21 Sep 2012 00:57:48 +0000 (10:57 +1000)]
mm: remain migratetype in freed page
The page allocator caches the pageblock information in page->private while
it is in the PCP freelists but this is overwritten with the order of the
page when freed to the buddy allocator. This patch stores the migratetype
of the page in the page->index field so that it is available at all times
when the page remain in free_list.
This patch adds a new call site in __free_pages_ok so it might be overhead
a bit but it's for high order allocation. So I believe damage isn't hurt.
Minchan Kim [Fri, 21 Sep 2012 00:57:47 +0000 (10:57 +1000)]
mm: page_alloc: use get_freepage_migratetype() instead of page_private()
The page allocator uses set_page_private and page_private for handling
migratetype when it frees page. Let's replace them with [set|get]
_freepage_migratetype to make it more clear.
* Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok()
(from Minchan Kim).
* During watermark check decrease available free pages number by
free CMA pages number if necessary (unmovable allocations cannot
use pages from CMA areas).
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in
__zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this
counter always available (not only when CONFIG_CMA=y).
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Isolated free pages shouldn't be accounted to NR_FREE_PAGES counter. Fix
it by properly decreasing/increasing NR_FREE_PAGES counter in
set_migratetype_isolate()/unset_migratetype_isolate() and removing counter
adjustment for isolated pages from free_one_page() and split_free_page().
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
page->private gets re-used in __free_one_page() to store page order
(so trace_mm_page_pcpu_drain() may print order instead of migratetype)
thus migratetype value must be cached locally.
Fixes regression introduced in a701623 ("mm: fix migratetype bug which
slowed swapping"). This caused incorrect data to be attached to the
mm_page_pcpu_drain trace event.
Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It is possible for pages to be dirty after the check
in reclaim_clean_pages_from_list so that it ends up
paging out the pages, which is never what we want for speed up.
This patch fixes it.
Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Rik van Riel <riel@redhat.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Fri, 21 Sep 2012 00:57:44 +0000 (10:57 +1000)]
mm: cma: discard clean pages during contiguous allocation instead of migration
Drop clean cache pages instead of migration during alloc_contig_range() to
minimise allocation latency by reducing the amount of migration that is
necessary. It's useful for CMA because latency of migration is more
important than evicting the background process's working set. In
addition, as pages are reclaimed then fewer free pages for migration
targets are required so it avoids memory reclaiming to get free pages,
which is a contributory factor to increased latency.
I measured elapsed time of __alloc_contig_migrate_range() which migrates
10M in 40M movable zone in QEMU machine.
Before - 146ms, After - 7ms
Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Mel Gorman <mgorman@suse.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Rik van Riel <riel@redhat.com> Tested-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
memory-hotplug: build zonelists when offlining pages
online_pages() does build_all_zonelists() and zone_pcp_update(), I think
offline_pages() should do it too.
When the zone has no memory to allocate, remove it from other nodes'
zonelists. zone_batchsize() depends on zone's present pages, if zone's
present pages are changed, zone's pcp should be updated.
During mremap(), the destination VMA is generally placed after the
original vma in rmap traversal order: in move_vma(), we always have
new_pgoff >= vma->vm_pgoff, and as a result new_vma->vm_pgoff >=
vma->vm_pgoff unless vma_merge() merged the new vma with an adjacent one.
When the destination VMA is placed after the original in rmap traversal
order, we can avoid taking the rmap locks in move_ptes().
Essentially, this reintroduces the optimization that had been disabled in
"mm anon rmap: remove anon_vma_moveto_tail". The difference is that we
don't try to impose the rmap traversal order; instead we just rely on
things being in the desired order in the common case and fall back to
taking locks in the uncommon case. Also we skip the i_mmap_mutex in
addition to the anon_vma lock: in both cases, the vmas are traversed in
increasing vm_pgoff order with ties resolved in tree insertion order.
Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a CONFIG_DEBUG_VM_RB build option for the previously existing
DEBUG_MM_RB code. Now that Andi Kleen modified it to avoid using
recursive algorithms, we can expose it a bit more.
Also extend this code to validate_mm() after stack expansion, and to check
that the vma's start and last pgoffs have not changed since the nodes were
inserted on the anon vma interval tree (as it is important that the nodes
be reindexed after each such update).
Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>