Hugh Dickins [Fri, 6 Nov 2015 02:50:02 +0000 (18:50 -0800)]
mm: page migration avoid touching newpage until no going back
We have had trouble in the past from the way in which page migration's
newpage is initialized in dribs and drabs - see commit 8bdd63809160 ("mm:
fix direct reclaim writeback regression") which proposed a cleanup.
We have no actual problem now, but I think the procedure would be clearer
(and alternative get_new_page pools safer to implement) if we assert that
newpage is not touched until we are sure that it's going to be used -
except for taking the trylock on it in __unmap_and_move().
So shift the early initializations from move_to_new_page() into
migrate_page_move_mapping(), mapping and NULL-mapping paths. Similarly
migrate_huge_page_move_mapping(), but its NULL-mapping path can just be
deleted: you cannot reach hugetlbfs_migrate_page() with a NULL mapping.
Adjust stages 3 to 8 in the Documentation file accordingly.
Hugh Dickins [Fri, 6 Nov 2015 02:49:59 +0000 (18:49 -0800)]
mm: page migration use migration entry for swapcache too
Hitherto page migration has avoided using a migration entry for a
swapcache page mapped into userspace, apparently for historical reasons.
So any page blessed with swapcache would entail a minor fault when it's
next touched, which page migration otherwise tries to avoid. Swapcache in
an mlocked area is rare, so won't often matter, but still better fixed.
Just rearrange the block in try_to_unmap_one(), to handle TTU_MIGRATION
before checking PageAnon, that's all (apart from some reindenting).
Well, no, that's not quite all: doesn't this by the way fix a soft_dirty
bug, that page migration of a file page was forgetting to transfer the
soft_dirty bit? Probably not a serious bug: if I understand correctly,
soft_dirty aficionados usually have to handle file pages separately
anyway; but we publish the bit in /proc/<pid>/pagemap on file mappings as
well as anonymous, so page migration ought not to perturb it.
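The soft_dirty state is user-visible: each 64-bit entry in /proc/<pid>/pagemap carries the soft-dirty flag in bit 55. For illustration only (not part of the patch; file name and error handling are just an example), a minimal user-space sketch that reads the bit for one page:

    /* soft_dirty_check.c - read the soft-dirty bit (bit 55) of one page from
     * /proc/self/pagemap; needs CONFIG_MEM_SOFT_DIRTY, and on recent kernels
     * may need CAP_SYS_ADMIN to see the flag bits.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        char *buf = malloc(pagesize);

        if (!buf)
            return 1;
        buf[0] = 1;                     /* touch the page so it is mapped */

        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        uint64_t entry;
        off_t off = (off_t)((uintptr_t)buf / pagesize) * sizeof(entry);
        if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry)) {
            perror("pread"); return 1;
        }

        printf("present: %d soft-dirty: %d\n",
               (int)((entry >> 63) & 1), (int)((entry >> 55) & 1));
        close(fd);
        return 0;
    }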
Hugh Dickins [Fri, 6 Nov 2015 02:49:56 +0000 (18:49 -0800)]
mm: simplify page migration's anon_vma comment and flow
__unmap_and_move() contains a long stale comment on page_get_anon_vma()
and PageSwapCache(), with an odd control flow that's hard to follow.
Mostly this reflects our confusion about the lifetime of an anon_vma, in
the early days of page migration, before we could take a reference to one.
Nowadays this seems quite straightforward: cut it all down to essentials.
I cannot see the relevance of swapcache here at all, so don't treat it any
differently: I believe the old comment reflects in part our anon_vma
confusions, and in part the original v2.6.16 page migration technique,
which used actual swap to migrate anon instead of swap-like migration
entries. Why should a swapcache page not be migrated with the aid of
migration entry ptes like everything else? So lose that comment now, and
enable migration entries for swapcache in the next patch.
Hugh Dickins [Fri, 6 Nov 2015 02:49:53 +0000 (18:49 -0800)]
mm: page migration remove_migration_ptes at lock+unlock level
Clean up page migration a little more by calling remove_migration_ptes()
from the same level, on success or on failure, from __unmap_and_move() or
from unmap_and_move_huge_page().
Don't reset page->mapping of a PageAnon old page in move_to_new_page(),
leave that to when the page is freed. Except for here in page migration,
it has been an invariant that a PageAnon (bit set in page->mapping) page
stays PageAnon until it is freed, and I think we're safer to keep to that.
And with the above rearrangement, it's necessary because zap_pte_range()
wants to identify whether a migration entry represents a file or an anon
page, to update the appropriate rss stats without waiting on it.
Hugh Dickins [Fri, 6 Nov 2015 02:49:49 +0000 (18:49 -0800)]
mm: page migration trylock newpage at same level as oldpage
Clean up page migration a little by moving the trylock of newpage from
move_to_new_page() into __unmap_and_move(), where the old page has been
locked. Adjust unmap_and_move_huge_page() and balloon_page_migrate()
accordingly.
But make one kind-of-functional change on the way: whereas trylock of
newpage used to BUG() if it failed, now simply return -EAGAIN if so.
Cutting out BUG()s is good, right? But, to be honest, this is really to
extend the usefulness of the custom put_new_page feature, allowing a pool
of new pages to be shared perhaps with racing uses.
Hugh Dickins [Fri, 6 Nov 2015 02:49:46 +0000 (18:49 -0800)]
mm: page migration use the put_new_page whenever necessary
I don't know of any problem from the way it's used in our current tree,
but there is one defect in page migration's custom put_new_page feature.
An unused newpage is expected to be released with the put_new_page(), but
there was one MIGRATEPAGE_SUCCESS (0) path which released it with
putback_lru_page(): which can be very wrong for a custom pool.
Fixed more easily by resetting put_new_page once it won't be needed, than
by adding a further flag to modify the rc test.
Hugh Dickins [Fri, 6 Nov 2015 02:49:40 +0000 (18:49 -0800)]
mm: rename mem_cgroup_migrate to mem_cgroup_replace_page
After v4.3's commit 0610c25daa3e ("memcg: fix dirty page migration")
mem_cgroup_migrate() doesn't have much to offer in page migration: convert
migrate_misplaced_transhuge_page() to set_page_memcg() instead.
Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
remaining callers are replace_page_cache_page() and shmem_replace_page():
both of whom passed lrucare true, so just eliminate that argument.
Hugh Dickins [Fri, 6 Nov 2015 02:49:37 +0000 (18:49 -0800)]
mm: page migration fix PageMlocked on migrated pages
Commit e6c509f85455 ("mm: use clear_page_mlock() in page_remove_rmap()")
in v3.7 inadvertently made mlock_migrate_page() impotent: page migration
unmaps the page from userspace before migrating, and that commit clears
PageMlocked on the final unmap, leaving mlock_migrate_page() with
nothing to do. Not a serious bug, the next attempt at reclaiming the
page would fix it up; but a betrayal of page migration's intent - the
new page ought to emerge as PageMlocked.
I don't see how to fix it for mlock_migrate_page() itself; but easily
fixed in remove_migration_pte(), by calling mlock_vma_page() when the vma
is VM_LOCKED - under pte lock as in try_to_unmap_one().
Delete mlock_migrate_page()? Not quite, it does still serve a purpose for
migrate_misplaced_transhuge_page(): where we could replace it by a test,
clear_page_mlock(), mlock_vma_page() sequence; but would that be an
improvement? mlock_migrate_page() is fairly lean, and let's make it
leaner by skipping the irq save/restore now clearly not needed.
Hugh Dickins [Fri, 6 Nov 2015 02:49:33 +0000 (18:49 -0800)]
mm: rmap use pte lock not mmap_sem to set PageMlocked
KernelThreadSanitizer (ktsan) has shown that the down_read_trylock() of
mmap_sem in try_to_unmap_one() (when going to set PageMlocked on a page
found mapped in a VM_LOCKED vma) is ineffective against races with
exit_mmap()'s munlock_vma_pages_all(), because mmap_sem is not held when
tearing down an mm.
But that's okay, those races are benign; and although we've believed for
years in that ugly down_read_trylock(), it's unsuitable for the job, and
frustrates the good intention of setting PageMlocked when it fails.
It just doesn't matter if here we read vm_flags an instant before or after
a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED: the
syscalls (or exit) work their way up the address space (taking pt locks
after updating vm_flags) to establish the final state.
We do still need to be careful never to mark a page Mlocked (hence
unevictable) by any race that will not be corrected shortly after. The
page lock protects from many of the races, but not all (a page is not
necessarily locked when it's unmapped). But the pte lock we just dropped
is good to cover the rest (and serializes even with
munlock_vma_pages_all(), so no special barriers required): now hold on to
the pte lock while calling mlock_vma_page(). Is that lock ordering safe?
Yes, that's how follow_page_pte() calls it, and how page_remove_rmap()
calls the complementary clear_page_mlock().
This fixes the following case (though not a case which anyone has
complained of), which mmap_sem did not: truncation's preliminary
unmap_mapping_range() is supposed to remove even the anonymous COWs of
filecache pages, and that might race with try_to_unmap_one() on a
VM_LOCKED vma, so that mlock_vma_page() sets PageMlocked just after
zap_pte_range() unmaps the page, causing "Bad page state (mlocked)" when
freed. The pte lock protects against this.
You could say that it also protects against the more ordinary case, racing
with the preliminary unmapping of a filecache page itself: but in our
current tree, that's independently protected by i_mmap_rwsem; and that
race would be why "Bad page state (mlocked)" was seen before commit 48ec833b7851 ("Revert mm/memory.c: share the i_mmap_rwsem").
Vlastimil Babka points out another race which this patch protects against.
try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a
moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked:
leaving PageMlocked and unevictable when it should be evictable. mmap_sem
is ineffective because exit_mmap() does not hold it; page lock ineffective
because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock
is effective because __munlock_pagevec_fill() takes it to get the page,
after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one.
Kirill Shutemov points out that if the compiler chooses to implement a
"vma->vm_flags &= VM_WHATEVER" or "vma->vm_flags |= VM_WHATEVER" operation
with an intermediate store of unrelated bits set, since I'm here foregoing
its usual protection by mmap_sem, try_to_unmap_one() might catch sight of
a spurious VM_LOCKED in vm_flags, and make the wrong decision. This does
not appear to be an immediate problem, but we may want to define vm_flags
accessors in future, to guard against such a possibility.
While we're here, make a related optimization in try_to_unmap_one(): if
it's doing TTU_MUNLOCK, then there's no point at all in descending the
page tables and getting the pt lock, unless the vma is VM_LOCKED. Yes,
that can change racily, but it can change racily even without the
optimization: it's not critical. Far better not to waste time here.
Stopped short of separating try_to_munlock_one() from try_to_unmap_one()
on this occasion, but that's probably the sensible next step - with a
rename, given that try_to_munlock()'s business is to try to set Mlocked.
Updated the unevictable-lru Documentation, to remove its reference to mmap
semaphore, but found a few more updates needed in just that area.
Hugh Dickins [Fri, 6 Nov 2015 02:49:30 +0000 (18:49 -0800)]
mm Documentation: undoc non-linear vmas
While updating some mm Documentation, I came across a few straggling
references to the non-linear vmas which were happily removed in v4.0.
Delete them.
If ALLOC_SPLIT_PTLOCKS is defined, ptlock_init may fail, in which case we
shouldn't increment NR_PAGETABLE.
Since small allocations, such as ptlock, normally do not fail (currently
they can fail if kmemcg is used though), this patch does not really fix
anything and should be considered as a code cleanup.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
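The shape of the fix is simply to order the ptlock_init() check before the stat bump; roughly (a sketch of the idea, not the literal kernel code):

    /* sketch: only count the page table page once ptlock_init() succeeded */
    static inline bool pgtable_page_ctor(struct page *page)
    {
        if (!ptlock_init(page))         /* can fail with ALLOC_SPLIT_PTLOCKS */
            return false;               /* ...so don't touch NR_PAGETABLE yet */
        inc_zone_page_state(page, NR_PAGETABLE);
        return true;
    }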
Laurent Dufour [Fri, 6 Nov 2015 02:49:21 +0000 (18:49 -0800)]
mm: clear pte in clear_soft_dirty()
As mentioned in commit 56eecdb912b5 ("mm: Use ptep/pmdp_set_numa() for
updating _PAGE_NUMA bit"), architectures like ppc64 don't do a tlb flush
in their set_pte/pmd functions.
So when dealing with an existing pte in clear_soft_dirty(), the pte must
be cleared before being modified.
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
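For context, clear_soft_dirty() runs when userspace writes "4" to /proc/<pid>/clear_refs; a minimal sketch of that trigger (illustration only, not from the patch):

    /* clear_softdirty.c - ask the kernel to clear all soft-dirty bits for the
     * current process; each present pte then goes through clear_soft_dirty().
     */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/self/clear_refs", "w");

        if (!f) { perror("fopen"); return 1; }
        fputs("4", f);          /* 4 == clear soft-dirty bits */
        fclose(f);
        return 0;
    }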
get_mergeable_page() can only return NULL (also in case of errors) or the
pinned mergeable page. It never returns an error value other than NULL.
This optimizes away the unnecessary error check.
Add a return after the "out:" label in the callee to make it more
readable.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Petr Holasek <pholasek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ksm: use find_mergeable_vma in try_to_merge_with_ksm_page
Doing the VM_MERGEABLE check after the page == kpage check won't provide
any meaningful benefit. The !vma->anon_vma check of find_mergeable_vma is
the only superfluous bit in using find_mergeable_vma because the !PageAnon
check of try_to_merge_one_page() implicitly checks for that, but it still
looks cleaner to share the same find_mergeable_vma().
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Petr Holasek <pholasek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ksm: don't fail stable tree lookups if walking over stale stable_nodes
The stable_nodes can become stale at any time if the underlying pages get
freed. The stable_node gets collected and removed from the stable rbtree
if that is detected during the rbtree lookups.
Don't fail the lookup if running into stale stable_nodes; just restart the
lookup after collecting the stale stable_nodes. Otherwise the CPU spent
in the preparation stage is wasted and the lookup must be repeated at the
next loop, potentially failing a second time on a second stale stable_node.
If we don't prune aggressively we delay the merging of the unstable node
candidates and at the same time we delay the freeing of the stale
stable_nodes. Keeping stale stable_nodes around wastes memory and it
can't provide any benefit.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Petr Holasek <pholasek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before the previous patch ("memcg: unify slab and other kmem pages
charging"), __mem_cgroup_from_kmem had to handle two types of kmem - slab
pages and pages allocated with alloc_kmem_pages - since only the latter
stored the memcg in the page struct. Now that both do, we can unify it.
And since after that the function becomes tiny, we can fold it into
mem_cgroup_from_kmem.
[hughd@google.com: move mem_cgroup_from_kmem into list_lru.c] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
uncharging kmem pages to memcg, but currently they are not used for
charging slab pages (i.e. they are only used for charging pages allocated
with alloc_kmem_pages). The only reason why the slab subsystem uses
special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
needs to charge to the memcg of the kmem cache, while memcg_charge_kmem
charges to the memcg that the current task belongs to.
To remove this diversity, this patch adds an extra argument to
__memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is
not NULL, the function tries to charge to the memcg it points to,
otherwise it charges to the current context. Next, it makes the slab
subsystem use this function to charge slab pages.
Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since
__memcg_kmem_charge stores a pointer to the memcg in the page struct, we
don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
Besides, one can now detect which memcg a slab page belongs to by reading
/proc/kpagecgroup.
Note, this patch switches slab to charge-after-alloc design. Since this
design is already used for all other memcg charges, it should not make any
difference.
[hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
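For example, given a PFN (say, extracted from /proc/<pid>/pagemap), the owning memcg's inode number can be read back like this; a user-space sketch, not part of the patch (needs root):

    /* kpagecgroup.c - print the memcg inode number that owns a PFN.
     * /proc/kpagecgroup holds one 64-bit value per PFN.
     * Usage: ./kpagecgroup <pfn>
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s <pfn>\n", argv[0]); return 1; }
        uint64_t pfn = strtoull(argv[1], NULL, 0);

        int fd = open("/proc/kpagecgroup", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        uint64_t ino;
        if (pread(fd, &ino, sizeof(ino), pfn * sizeof(ino)) != sizeof(ino)) {
            perror("pread"); return 1;
        }
        printf("pfn %llu is charged to memcg inode %llu\n",
               (unsigned long long)pfn, (unsigned long long)ino);
        close(fd);
        return 0;
    }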
Charging kmem pages proceeds in two steps. First, we try to charge the
allocation size to the memcg the current task belongs to, then we allocate
a page and "commit" the charge storing the pointer to the memcg in the
page struct.
Such a design looks overcomplicated, because there is not much sense in
trying to charge the allocation before actually allocating a page: we
won't be able to consume much memory over the limit even if we charge
after doing the actual allocation; besides, we already charge user pages
post factum, so being pedantic with kmem pages just looks pointless.
So this patch simplifies the design by merging the "charge" and the
"commit" steps into the same function, which takes the allocated page.
Also, rename the charge and uncharge methods to memcg_kmem_charge and
memcg_kmem_uncharge and make the charge method return error code instead
of bool to conform to mem_cgroup_try_charge.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xishi Qiu [Fri, 6 Nov 2015 02:48:56 +0000 (18:48 -0800)]
mm/page_alloc.c: skip ZONE_MOVABLE if required_kernelcore is larger than totalpages
If kernelcore was not specified, or the kernelcore size is zero
(required_movablecore >= totalpages), or the kernelcore size is larger
than totalpages, there is no ZONE_MOVABLE. We should fill the zone with
both kernel memory and movable memory.
Davidlohr Bueso [Fri, 6 Nov 2015 02:48:52 +0000 (18:48 -0800)]
mm/vmacache: inline vmacache_valid_mm()
This function is called in very hot paths and merely does a few loads for
a validity check. Let's inline it so that we can save the function call
overhead.
(akpm: this is cosmetic - the compiler already inlines vmacache_valid_mm())
Laura Abbott [Fri, 6 Nov 2015 02:48:46 +0000 (18:48 -0800)]
mm: Don't offset memmap for flatmem
Srinivas Kandagatla reported bad page messages when trying to remove the
bottom 2MB on an ARM based IFC6410 board
BUG: Bad page state in process swapper pfn:fffa8
page:ef7fb500 count:0 mapcount:0 mapping: (null) index:0x0
flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
flags: 0x200041(locked|active|mlocked)
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816
Hardware name: Qualcomm (Flattened Device Tree)
unwind_backtrace
show_stack
dump_stack
bad_page
free_pages_prepare
free_hot_cold_page
__free_pages
free_highmem_page
mem_init
start_kernel
Disabling lock debugging due to kernel taint
Removing the lower 2MB caused the start of the lowmem zone to no longer be
page block aligned. IFC6410 uses CONFIG_FLATMEM, where alloc_node_mem_map
allocates memory for the mem_map. alloc_node_mem_map will offset for
unaligned nodes with the assumption that the pfn/page translation functions
will account for the offset. The functions for CONFIG_FLATMEM do not
offset, however, resulting in overrunning the memmap array. Just use the
allocated memmap without any offset when running with CONFIG_FLATMEM to
avoid the overrun.
Signed-off-by: Laura Abbott <laura@labbott.name> Signed-off-by: Laura Abbott <lauraa@codeaurora.org> Reported-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Tested-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Bjorn Andersson <bjorn.andersson@sonymobile.com> Cc: Santosh Shilimkar <ssantosh@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Kevin Hilman <khilman@linaro.org> Cc: Arnd Bergman <arnd@arndb.de> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Andy Gross <agross@codeaurora.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Fri, 6 Nov 2015 02:48:43 +0000 (18:48 -0800)]
mm/vmstat.c: uninline node_page_state()
With x86_64 (config http://ozlabs.org/~akpm/config-akpm2.txt) and old gcc
(4.4.4), drivers/base/node.c:node_read_meminfo() is using 2344 bytes of
stack. Uninlining node_page_state() reduces this to 440 bytes.
The stack consumption issue is fixed by newer gcc (4.8.4); however, with
that compiler this patch reduces the node.o text size from 7314 bytes to
4578.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chen Gang [Fri, 6 Nov 2015 02:48:41 +0000 (18:48 -0800)]
mm/mmap.c: change __install_special_mapping() args order
Make __install_special_mapping()'s argument order match the caller's, so
the caller can pass its register args straight through to the callee
untouched.
On most architectures, args (at least the first five) are passed in
registers, so this change will have an effect on most architectures.
At -O2, __install_special_mapping() may be inlined on most architectures,
but at -Os it should not be. So this change can give a little better
performance for -Os, at least.
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Geliang Tang [Fri, 6 Nov 2015 02:48:38 +0000 (18:48 -0800)]
mm/nommu.c: drop unlikely inside BUG_ON()
(1) For !CONFIG_BUG cases, the bug call is a no-op, so we couldn't
care less and the change is ok.
(2) ppc and mips, which HAVE_ARCH_BUG_ON, do not rely on branch
predictions as it seems to be pointless[1] and thus callers should not
be trying to push an optimization in the first place.
(3) For CONFIG_BUG and !HAVE_ARCH_BUG_ON cases, BUG_ON() contains an
unlikely compiler flag already.
Vineet Gupta [Fri, 6 Nov 2015 02:48:29 +0000 (18:48 -0800)]
mm: optimize PageHighMem() check
This came up when implementing HIGHMEM/PAE40 for ARC. The kmap() /
kmap_atomic() generated code seemed needlessly bloated due to the way the
PageHighMem() macro is implemented. It derives the exact zone for the page
and then does pointer subtraction with the first zone to infer the
zone_type. The pointer arithmetic in turn generates the code bloat.
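The optimization, roughly, is to use the zone index already encoded in page->flags rather than deriving and comparing zone pointers; a sketch of the idea (not the literal patch):

    /* before (sketch): derive the zone pointer, then do pointer arithmetic */
    static inline int PageHighMem_old(const struct page *page)
    {
        return is_highmem(page_zone(page));
    }

    /* after (sketch): test the zone index stored in page->flags directly */
    static inline int PageHighMem_new(const struct page *page)
    {
        return is_highmem_idx(page_zonenum(page));
    }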
Oleg Nesterov [Fri, 6 Nov 2015 02:48:26 +0000 (18:48 -0800)]
mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process()
Both "child->mm == mm" and "p->mm != mm" checks in oom_kill_process() are
wrong. task->mm can be NULL if the task is the exited group leader. This
means in particular that "kill sharing same memory" loop can miss a
process with a zombie leader which uses the same ->mm.
Note: the process_has_mm(child, p->mm) check is still not 100% correct,
p->mm can be NULL too. This is minor, but probably deserves a fix or a
comment anyway.
[akpm@linux-foundation.org: document process_shares_mm() a bit] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
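The helper that replaces those checks walks all threads of the process, so an exited leader with a NULL ->mm no longer hides a shared mm; roughly (a sketch of process_shares_mm(), not the literal code):

    /* sketch: does any live thread of process p use this mm? */
    static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
    {
        struct task_struct *t;

        for_each_thread(p, t) {
            struct mm_struct *t_mm = READ_ONCE(t->mm);

            if (t_mm)
                return t_mm == mm;      /* live threads all share one mm */
        }
        return false;
    }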
Oleg Nesterov [Fri, 6 Nov 2015 02:48:23 +0000 (18:48 -0800)]
mm/oom_kill: cleanup the "kill sharing same memory" loop
Purely cosmetic, but the complex "if" condition looks annoying to me.
Especially because it is not consistent with OOM_SCORE_ADJ_MIN check
which adds another if/continue.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Fri, 6 Nov 2015 02:48:20 +0000 (18:48 -0800)]
mm/oom_kill: remove the wrong fatal_signal_pending() check in oom_kill_process()
The fatal_signal_pending() was added to suppress unnecessary "sharing same
memory" message, but it can't 100% help anyway because it can be
false-negative; SIGKILL can be already dequeued.
And worse, it can be false-positive due to exec or coredump. exec is
mostly fine, but coredump is not. It is possible that the group leader
has the pending SIGKILL because its sub-thread originated the coredump;
in this case we must not skip this process.
We could probably add the additional ->group_exit_task check but this
patch just removes the wrong check along with pr_info().
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Fri, 6 Nov 2015 02:48:14 +0000 (18:48 -0800)]
mm: fix the racy mm->locked_vm change in acct_stack_growth()
"mm->locked_vm += grow" and vm_stat_account() in acct_stack_growth() are
not safe; multiple threads using the same ->mm can do this at the same
time, trying to expand different vmas under down_read(mmap_sem). This
means that one of the "locked_vm += grow" changes can be lost and we can
miss munlock_vma_pages_all() later.
Move this code into the caller(s) under mm->page_table_lock. All other
updates to ->locked_vm hold mmap_sem for writing.
Alexandru Moise [Fri, 6 Nov 2015 02:48:08 +0000 (18:48 -0800)]
mm/vmscan.c: fix types of some locals
In zone_reclaimable_pages(), `nr' is returned by a function which is
declared as returning "unsigned long", so declare it as such. Negative
values are meaningless here.
In zone_pagecache_reclaimable() we should also declare `delta' and
`nr_pagecache_reclaimable' as being unsigned longs because they're used to
store the values returned by zone_page_state() and
zone_unmapped_file_pages() which also happen to return unsigned integers.
[akpm@linux-foundation.org: make zone_pagecache_reclaimable() return ulong rather than long] Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Fri, 6 Nov 2015 02:48:05 +0000 (18:48 -0800)]
mm, oom: remove task_lock protecting comm printing
The oom killer takes task_lock() in a couple of places solely to protect
printing the task's comm.
A process's comm, including current's comm, may change due to
/proc/pid/comm or PR_SET_NAME.
The comm will always be NULL-terminated, so the worst race scenario would
only be during update. We can tolerate a comm being printed that is in
the middle of an update to avoid taking the lock.
Other locations in the kernel have already dropped task_lock() when
printing comm, so this is consistent.
Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Fri, 6 Nov 2015 02:48:02 +0000 (18:48 -0800)]
mm, compaction: distinguish contended status in tracepoints
Compaction returns prematurely with COMPACT_PARTIAL when contended or has
fatal signal pending. This is ok for the callers, but might be misleading
in the traces, as the usual reason to return COMPACT_PARTIAL is that we
think the allocation should succeed. After this patch we distinguish the
premature ending condition in the mm_compaction_finished and
mm_compaction_end tracepoints.
The contended status covers the following reasons:
- lock contention or need_resched() detected in async compaction
- fatal signal pending
- too many pages isolated in the zone (only for async compaction)
Further distinguishing the exact reason seems unnecessary for now.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Fri, 6 Nov 2015 02:47:59 +0000 (18:47 -0800)]
mm, compaction: export tracepoints zone names to userspace
Some compaction tracepoints use zone->name to print which zone is being
compacted. This works for in-kernel printing, but not userspace trace
printing of raw captured trace such as via trace-cmd report.
This patch uses zone_idx() instead of zone->name as the raw value, and
when printing, converts the zone_type to string using the appropriate EM()
macros and some ugly tricks to overcome the problem that half the values
depend on CONFIG_ options and one does not simply use #ifdef inside of
#define.
Vlastimil Babka [Fri, 6 Nov 2015 02:47:56 +0000 (18:47 -0800)]
mm, compaction: export tracepoints status strings to userspace
Some compaction tracepoints convert the integer return values to strings
using the compaction_status_string array. This works for in-kernel
printing, but not userspace trace printing of raw captured trace such as
via trace-cmd report.
This patch converts the private array to appropriate tracepoint macros
that result in proper userspace support.
Tetsuo Handa [Fri, 6 Nov 2015 02:47:54 +0000 (18:47 -0800)]
mm/oom_kill.c: suppress unnecessary "sharing same memory" message
oom_kill_process() sends SIGKILL to other thread groups sharing victim's
mm. But printing
"Kill process %d (%s) sharing same memory\n"
lines makes no sense if they already have pending SIGKILL. This patch
reduces the "Kill process" lines by printing that line with info level
only if SIGKILL is not pending.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Fri, 6 Nov 2015 02:47:51 +0000 (18:47 -0800)]
mm/oom_kill.c: fix potentially killing unrelated process
At the for_each_process() loop in oom_kill_process(), we are comparing
the address of the OOM victim's mm without holding a reference to that mm.
If there are a lot of processes to compare or a lot of "Kill process %d
(%s) sharing same memory" messages to print, the for_each_process() loop
could take a very long time.
It is possible that meanwhile the OOM victim exits and releases its mm,
and then an mm is allocated at the same address and assigned to some
unrelated process. When we hit such a race, the unrelated process will be
killed in error. To make sure that the OOM victim's mm does not go away
until the for_each_process() loop finishes, get a reference on the OOM
victim's mm before calling task_unlock(victim).
[oleg@redhat.com: several fixes] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Fri, 6 Nov 2015 02:47:44 +0000 (18:47 -0800)]
mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL
It was confirmed that a local unprivileged user can consume all memory
reserves and hang up the system using the time lag between when the OOM
killer sets TIF_MEMDIE on an OOM victim and when it sends SIGKILL to that
victim, because printk() inside the for_each_process() loop in
oom_kill_process() can consume many seconds when there are many thread
groups sharing the same memory.
The oom-depleter's thread group leader which got TIF_MEMDIE started
memset() in user space after the OOM killer set TIF_MEMDIE, and it was
free to abuse ALLOC_NO_WATERMARKS via TIF_MEMDIE for that memset() in user
space until SIGKILL was delivered. If SIGKILL is delivered before
TIF_MEMDIE is set, the oom-depleter can terminate without touching memory
reserves.
Although the possibility of hitting this time lag is very small for 3.19
and earlier kernels because TIF_MEMDIE is set immediately before sending
SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can
step between and allow memory allocations which are not needed for
terminating the OOM victim.
Fixes: 83363b917a29 ("oom: make sure that TIF_MEMDIE is set under task_lock") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A higher value can cause excessive swap IO and waste memory. A lower
value can prevent THPs from being collapsed, resulting in fewer pages
being collapsed into THPs, and lower memory access performance.
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jerome Marchand [Fri, 6 Nov 2015 02:47:29 +0000 (18:47 -0800)]
mm/memcontrol.c: fix order calculation in try_charge()
Since commit 6539cc053869 ("mm: memcontrol: fold mem_cgroup_do_charge()"),
the order to pass to mem_cgroup_oom() is calculated by passing the
number of pages to get_order() instead of the expected size in bytes.
AFAICT, it only affects the value displayed in the oom warning message.
This patch fixes this.
Michal said:
: We haven't noticed that just because the OOM is enabled only for page
: faults of order-0 (single page) and get_order work just fine. Thanks for
: noticing this. If we ever start triggering OOM on different orders this
: would be broken.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
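To see the difference, here is a tiny user-space model (the local get_order() mimics the kernel helper, 4KB pages assumed) comparing the wrong call with the intended one for a 2MB, i.e. 512-page, charge:

    /* order_demo.c - get_order(nr_pages) vs get_order(nr_pages * PAGE_SIZE) */
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    /* model of the kernel's get_order(): order of pages needed for 'size' bytes */
    static int get_order(unsigned long size)
    {
        int order = 0;

        size = (size - 1) >> PAGE_SHIFT;
        while (size) {
            order++;
            size >>= 1;
        }
        return order;
    }

    int main(void)
    {
        unsigned long nr_pages = 512;   /* a 2MB (THP-sized) charge */

        printf("get_order(nr_pages)             = %d\n", get_order(nr_pages));
        printf("get_order(nr_pages * PAGE_SIZE) = %d\n",
               get_order(nr_pages * PAGE_SIZE));
        return 0;
    }

The former reports order 0, the latter the intended order 9, which is why only the displayed value was ever affected for order-0 faults.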
Naoya Horiguchi [Fri, 6 Nov 2015 02:47:26 +0000 (18:47 -0800)]
mm: hwpoison: ratelimit messages from unpoison_memory()
Currently the kernel prints out the result of every single unpoison event,
which is not necessary because unpoison is purely a testing feature and
testers can get little or no information from lots of lines of unpoison
log storm. So this patch ratelimits printk in unpoison_memory().
This patch introduces a file-local ratelimit_state, which adds 64 bytes to
memory-failure.o. If we applied pr_info_ratelimited() to the 8 callsites
instead, 256 bytes would be added, so this is a win.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Junichi Nomura [Fri, 6 Nov 2015 02:47:23 +0000 (18:47 -0800)]
mm/filemap.c: make global sync not clear error status of individual inodes
filemap_fdatawait() is a function to wait for on-going writeback to
complete, but it also consumes and clears the error status of the mapping
that was set during writeback.
The latter functionality is critical for applications to detect writeback
errors with system calls like fsync(2)/fdatasync(2).
However, filemap_fdatawait() is also used by sync(2) and the FIFREEZE
ioctl, which don't check the error status of individual mappings.
As a result, fsync() may not be able to detect writeback error if events
happen in the following order:
Application System admin
----------------------------------------------------------
write data on page cache
Run sync command
writeback completes with error
filemap_fdatawait() clears error
fsync returns success
(but the data is not on disk)
This patch adds filemap_fdatawait_keep_errors() for call sites where
writeback error is not handled so that they don't clear error status.
Naoya Horiguchi [Fri, 6 Nov 2015 02:47:14 +0000 (18:47 -0800)]
mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
Currently there's no easy way to get per-process usage of hugetlb pages,
which is inconvenient because userspace applications which use hugetlb
typically want to control their processes on the basis of how much memory
(including hugetlb) they use. So this patch simply provides easy access
to the info via /proc/PID/status.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Joern Engel <joern@logfs.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
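With the new field in place, per-process hugetlb usage can be read directly from the status file; a trivial user-space sketch (everything except the field name is illustrative):

    /* hugetlb_status.c - print the HugetlbPages line of a process's status */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char path[64], line[256];

        snprintf(path, sizeof(path), "/proc/%s/status",
                 argc > 1 ? argv[1] : "self");

        FILE *f = fopen(path, "r");
        if (!f) { perror("fopen"); return 1; }

        while (fgets(line, sizeof(line), f))
            if (!strncmp(line, "HugetlbPages:", 13))
                fputs(line, stdout);    /* e.g. "HugetlbPages:  18432 kB" */

        fclose(f);
        return 0;
    }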
Naoya Horiguchi [Fri, 6 Nov 2015 02:47:11 +0000 (18:47 -0800)]
mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps
Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB),
which is inconvenient when we want to know per-task or per-vma based
hugetlb usage. To solve this, this patch adds new fields for hugetlb
usage to smaps.
Roman Gushchin [Fri, 6 Nov 2015 02:47:08 +0000 (18:47 -0800)]
mm: use only per-device readahead limit
The maximal readahead size is currently limited by two values:
1) a global 2MB constant (MAX_READAHEAD in max_sane_readahead())
2) a configurable per-device value (bdi->ra_pages)
There are devices which require a custom readahead limit. For instance,
for RAIDs it's calculated as the number of devices multiplied by the
chunk size times 2.
Readahead size can never be larger than bdi->ra_pages * 2
(POSIX_FADV_SEQUENTIAL doubles the readahead size).
If so, why do we need two limits?
I suggest completely removing this max_sane_readahead() stuff and using
the per-device readahead limit everywhere.
Also, using the right readahead size for RAID disks can significantly
increase i/o performance:
before:
dd if=/dev/md2 of=/dev/null bs=100M count=100
100+0 records in
100+0 records out 10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s
after:
$ dd if=/dev/md2 of=/dev/null bs=100M count=100
100+0 records in
100+0 records out 10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s
(It's an 8-disks RAID5 storage).
This patch doesn't change sys_readahead and madvise(MADV_WILLNEED)
behavior introduced by 6d2be915e589b58 ("mm/readahead.c: fix readahead
failure for memoryless NUMA nodes and limit readahead pages").
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
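The per-device value is what the bdi already exposes via sysfs, so tuning stays where it was; for illustration (path layout assumed, not part of the patch), a sketch that reads a block device's limit:

    /* ra_show.c - print a block device's readahead limit in KB.
     * Usage: ./ra_show md2   (reads /sys/block/md2/queue/read_ahead_kb)
     */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char path[128], buf[64];

        if (argc != 2) { fprintf(stderr, "usage: %s <blockdev>\n", argv[0]); return 1; }
        snprintf(path, sizeof(path), "/sys/block/%s/queue/read_ahead_kb", argv[1]);

        FILE *f = fopen(path, "r");
        if (!f) { perror("fopen"); return 1; }
        if (fgets(buf, sizeof(buf), f))
            printf("%s: read_ahead_kb = %s", argv[1], buf);
        fclose(f);
        return 0;
    }

The same value can be adjusted by writing to that sysfs file or with blockdev --setra (which takes 512-byte sectors).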
Yaowei Bai [Fri, 6 Nov 2015 02:47:06 +0000 (18:47 -0800)]
mm/page_alloc: remove unused parameter in init_currently_empty_zone()
Commit a2f3aa025766 ("[PATCH] Fix sparsemem on Cell") fixed an oops
experienced on the Cell architecture when init-time functions,
early_*(), are called at runtime by introducing an 'enum memmap_context'
parameter to memmap_init_zone() and init_currently_empty_zone(). This
parameter is intended to be used to tell whether the call of these two
functions is being made on behalf of a hotplug event, or happening at
boot-time. However, init_currently_empty_zone() does not use this
parameter at all, so remove it.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Fri, 6 Nov 2015 02:47:03 +0000 (18:47 -0800)]
mm, migrate: count pages failing all retries in vmstat and tracepoint
Migration tries up to 10 times to migrate pages that return -EAGAIN until
it gives up. If some pages fail all retries, they are counted towards the
number of failed pages that migrate_pages() returns. They should also be
counted in the /proc/vmstat pgmigrate_fail and in the mm_migrate_pages
tracepoint.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Raghavendra K T [Fri, 6 Nov 2015 02:46:29 +0000 (18:46 -0800)]
arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes
With the setup_nr_nodes(), we have already initialized
node_possible_map. So it is safe to use for_each_node here.
There are many places in the kernel that use a hardcoded 'for' loop with
nr_node_ids, because all other architectures have numa nodes populated
serially. That should be the reason we had maintained the same for
powerpc.
But, since sparse numa node ids are possible on powerpc, we unnecessarily
allocate memory for non-existent numa nodes.
For example, on a system with 0,1,16,17 as numa nodes, nr_node_ids=18 and
we allocate memory for nodes 2-14. With this patch we allocate memory
only for existing numa nodes.
The patch is boot tested on a 4 node tuleta, confirming with printks
that it works as expected.
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Blanchard <anton@samba.org> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Greg Kurz <gkurz@linux.vnet.ibm.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Raghavendra K T [Fri, 6 Nov 2015 02:46:26 +0000 (18:46 -0800)]
mm/list_lru.c: replace nr_node_ids for loop with for_each_node()
The functions used in the patch are in slowpath, which gets called
whenever alloc_super is called during mounts.
Though this should not make a difference for the architectures with
sequential numa node ids, for powerpc, which can potentially have sparse
node ids (e.g., a 4 node system having numa ids 0,1,16,17 is common), this
patch saves some unnecessary allocations for non-existing numa nodes.
Even without that saving, perhaps the patch makes the code more readable.
[vdavydov@parallels.com: take memcg_aware check outside for_each loop] Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Blanchard <anton@samba.org> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Greg Kurz <gkurz@linux.vnet.ibm.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jonathan Corbet [Fri, 6 Nov 2015 02:46:23 +0000 (18:46 -0800)]
mm: fix docbook comment for get_vaddr_frames()
get_vaddr_frames() has a comment that's *almost* a docbook comment; add
the missing star so that the tools will find it properly.
Signed-off-by: Jonathan Corbet <corbet@lwn.net> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Fri, 6 Nov 2015 02:46:20 +0000 (18:46 -0800)]
memcg: drop unnecessary cold-path tests from __memcg_kmem_bypass()
__memcg_kmem_bypass() decides whether a kmem allocation should be bypassed
to the root memcg. Some conditions that it tests are valid criteria
regarding who should be held accountable; however, there are a couple
unnecessary tests for cold paths - __GFP_NOFAIL and fatal_signal_pending().
The previous patch updated try_charge() to handle both __GFP_NOFAIL and
dying tasks correctly and the only thing these two tests are doing is
making accounting less accurate and sprinkling tests for cold path
conditions in the hot paths. There's nothing meaningful gained by these
extra tests.
This patch removes the two unnecessary tests from __memcg_kmem_bypass().
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Fri, 6 Nov 2015 02:46:17 +0000 (18:46 -0800)]
memcg: ratify and consolidate over-charge handling
try_charge() is the main charging logic of memcg. When it hits the limit
but either can't fail the allocation due to __GFP_NOFAIL or the task is
likely to free memory very soon, being OOM killed, has SIGKILL pending or
exiting, it "bypasses" the charge to the root memcg and returns -EINTR.
While this is one approach which can be taken for these situations, it has
several issues.
* It unnecessarily lies about the reality. The number itself doesn't
go over the limit but the actual usage does. memcg is either forced
to or actively chooses to go over the limit because that is the
right behavior under the circumstances, which is completely fine,
but, if at all avoidable, it shouldn't be misrepresenting what's
happening by sneaking the charges into the root memcg.
* Despite trying, we already do over-charge. kmemcg can't deal with
switching over to the root memcg by the point try_charge() returns
-EINTR, so it open-codes over-charging.
* It complicates the callers. Each try_charge() user has to handle
the weird -EINTR exception. memcg_charge_kmem() does the manual
over-charging. mem_cgroup_do_precharge() performs unnecessary
uncharging of root memcg, which BTW is inconsistent with what
memcg_charge_kmem() does but not broken as [un]charging are noops on
root memcg. mem_cgroup_try_charge() needs to switch the returned
cgroup to the root one.
The reality is that in memcg there are cases where we are forced and/or
willing to go over the limit. Each such case needs to be scrutinized and
justified but there definitely are situations where that is the right
thing to do. We already do this but with a superficial and inconsistent
disguise which leads to unnecessary complications.
This patch updates try_charge() so that it over-charges and returns 0 when
deemed necessary. -EINTR return is removed along with all special case
handling in the callers.
While at it, remove the local variable @ret, which was initialized to zero
and never changed, along with done: label which just returned the always
zero @ret.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Fri, 6 Nov 2015 02:46:14 +0000 (18:46 -0800)]
memcg: collect kmem bypass conditions into __memcg_kmem_bypass()
memcg_kmem_newpage_charge() and memcg_kmem_get_cache() are testing the
same series of conditions to decide whether to bypass kmem accounting.
Collect the tests into __memcg_kmem_bypass().
This is pure refactoring.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Fri, 6 Nov 2015 02:46:11 +0000 (18:46 -0800)]
memcg: punt high overage reclaim to return-to-userland path
Currently, try_charge() tries to reclaim memory synchronously when the
high limit is breached; however, if the allocation doesn't have
__GFP_WAIT, synchronous reclaim is skipped. If a process performs only
speculative allocations, it can blow way past the high limit. This is
actually easily reproducible by simply doing "find /". slab/slub
allocator tries speculative allocations first, so as long as there's
memory which can be consumed without blocking, it can keep allocating
memory regardless of the high limit.
This patch makes try_charge() always punt the over-high reclaim to the
return-to-userland path. If try_charge() detects that high limit is
breached, it adds the overage to current->memcg_nr_pages_over_high and
schedules execution of mem_cgroup_handle_over_high() which performs
synchronous reclaim from the return-to-userland path.
As long as kernel doesn't have a run-away allocation spree, this should
provide enough protection while making kmemcg behave more consistently.
It also has the following benefits.
- All over-high reclaims can use GFP_KERNEL regardless of the specific
gfp mask in use, e.g. GFP_NOFS, when the limit was breached.
- It copes with prio inversion. Previously, a low-prio task with
small memory.high might perform over-high reclaim with a bunch of
locks held. If a higher prio task needed any of these locks, it
would have to wait until the low prio task finished reclaim and
released the locks. By handing over-high reclaim to the task exit
path this issue can be avoided.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Fri, 6 Nov 2015 02:46:09 +0000 (18:46 -0800)]
memcg: flatten task_struct->memcg_oom
task_struct->memcg_oom is a sub-struct containing fields which are used
for async memcg oom handling. Most task_struct fields aren't packaged
this way and it can lead to unnecessary alignment paddings. This patch
flattens it.
Andrew Morton [Fri, 6 Nov 2015 02:46:03 +0000 (18:46 -0800)]
uaccess: reimplement probe_kernel_address() using probe_kernel_read()
probe_kernel_address() is basically the same as the (later added)
probe_kernel_read().
The return value on EFAULT is a bit different: probe_kernel_address()
returns number-of-bytes-not-copied whereas probe_kernel_read() returns
-EFAULT. All callers have been checked, none cared.
probe_kernel_read() can be overridden by the architecture whereas
probe_kernel_address() cannot. parisc, blackfin and um do this, to insert
additional checking. Hence this patch possibly fixes obscure bugs,
although there are only two probe_kernel_address() callsites outside
arch/.
My first attempt involved removing probe_kernel_address() entirely and
converting all callsites to use probe_kernel_read() directly, but that got
tiresome.
This patch shrinks mm/slab_common.o by 218 bytes. For a single
probe_kernel_address() callsite.
Cc: Steven Miao <realmz6@gmail.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
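The reimplementation is essentially a one-line wrapper, so architecture overrides of probe_kernel_read() are picked up automatically; roughly (a sketch, not the literal patch):

    /* sketch: probe_kernel_address() expressed in terms of probe_kernel_read() */
    #define probe_kernel_address(addr, retval) \
        probe_kernel_read(&(retval), (addr), sizeof(retval))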
Catalin Marinas [Fri, 6 Nov 2015 02:45:54 +0000 (18:45 -0800)]
mm: slab: only move management objects off-slab for sizes larger than KMALLOC_MIN_SIZE
On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc
configurations defining ARCH_DMA_MINALIGN to 128), the first
kmalloc_caches[] entry to be initialised after slab_early_init = 0 is
"kmalloc-128" with index 7. Depending on the debug kernel configuration,
sizeof(struct kmem_cache) can be larger than 128 resulting in an
INDEX_NODE of 8.
Commit 8fc9cf420b36 ("slab: make more slab management structure off the
slab") enables off-slab management objects for sizes starting with
PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration) and the creation
of the "kmalloc-128" cache would try to place the management objects
off-slab. However, since KMALLOC_MIN_SIZE is already 128 and
freelist_size == 32 in __kmem_cache_create(), kmalloc_slab(freelist_size)
returns NULL (kmalloc_caches[7] not populated yet). This triggers the
following bug on arm64:
kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283!
Internal error: Oops - BUG: 0 [#1] SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540
Hardware name: Juno (DT)
PC is at __kmem_cache_create+0x21c/0x280
LR is at __kmem_cache_create+0x210/0x280
[...]
Call trace:
__kmem_cache_create+0x21c/0x280
create_boot_cache+0x48/0x80
create_kmalloc_cache+0x50/0x88
create_kmalloc_caches+0x4c/0xf4
kmem_cache_init+0x100/0x118
start_kernel+0x214/0x33c
This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab
management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE.
Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> [3.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Yang [Fri, 6 Nov 2015 02:45:51 +0000 (18:45 -0800)]
mm/slub: calculate start order with reserved in consideration
In slab_order(), the order starts from max(min_order,
get_order(min_objects * size)). When (min_objects * size) has a different
order from (min_objects * size + reserved), it will skip this order via a
check in the loop.
This patch optimizes this a little by calculating the start order with
`reserved' in consideration and removing the check in loop.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Yang [Fri, 6 Nov 2015 02:45:48 +0000 (18:45 -0800)]
mm/slub: use get_order() instead of fls()
get_order() is easier to understand, so this patch just replaces the
fls() calculation with it.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Yang [Fri, 6 Nov 2015 02:45:46 +0000 (18:45 -0800)]
mm/slub: correct the comment in calculate_order()
In calculate_order(), it tries to calculate the best order by adjusting
the fraction and min_objects. For each min_objects value, fraction
iterates over 16, 8, 4, which means the acceptable waste increases from
1/16 to 1/8 to 1/4.
This patch corrects the comment to match the code.
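The loop structure the comment describes looks roughly like this (a simplified excerpt of calculate_order() in mm/slub.c, variable setup omitted):

  	while (min_objects > 1) {
  		unsigned int fraction = 16;

  		/* allow at most 1/16, then 1/8, then 1/4 of the slab as waste */
  		while (fraction >= 4) {
  			order = slab_order(size, min_objects,
  					   slub_max_order, fraction, reserved);
  			if (order <= slub_max_order)
  				return order;
  			fraction /= 2;
  		}
  		min_objects--;
  	}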
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandru Moise [Fri, 6 Nov 2015 02:45:43 +0000 (18:45 -0800)]
mm/slab_common.c: initialize kmem_cache pointer to NULL
The assignment to NULL within the error condition was written in a 2014
patch to suppress a compiler warning. However, it is cleaner to initialize
the kmem_cache pointer to NULL up front and simply return it in the error
case.
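Schematically, the pattern becomes (an abbreviated, hypothetical sketch, not the exact kmem_cache_create() body; create_the_cache() stands in for the real work):

  	struct kmem_cache *s = NULL;	/* initialized once, up front */
  	int err;

  	err = create_the_cache(&s);	/* hypothetical stand-in for the real work */
  	if (err) {
  		/* no "s = NULL;" needed here just to silence gcc */
  		return NULL;
  	}
  	return s;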
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per Cache Average Min Max Total
----------------------------------------------------------------------------
#Objects 5147 1 89068 324301
#Slabs 199 1 3886 12537
#PartSlab 12 0 240 778
%PartSlab 32% 0% 100% 6%
PartObjs 5 0 4569 18151
% PartObj 26% 0% 100% 5%
Memory 3171409 8192 127336448 199798784
Used 3001736 160 121429728 189109408
Loss 169672 0 5906720 10689376
Per Object Average Min Max
-----------------------------------------------------------
Memory 585 8 8192
User 583 8 8192
Loss 2 0 64
Slabs sorted by size
--------------------
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
ext4_inode_cache 69948 1736 127336448 3871/0/15 18 3 0 95 a
dentry 89068 288 26058752 3164/0/17 28 1 0 98 a
Slabs sorted by loss
--------------------
Name Objects Objsize Loss Slabs/Part/Cpu O/S O %Fr %Ef Flg
ext4_inode_cache 69948 1736 5906720 3871/0/15 18 3 0 95 a
inode_cache 11628 864 537472 642/0/4 18 2 0 94 a
Besides, store_size() does not use powers of two for G/M/K:
Per Cache Average Min Max Total
---------------------------------------------------------
#Objects 14.1K 1 227.8K 920.1K
#Slabs 533 1 11.7K 34.7K
#PartSlab 86 0 4.3K 5.6K
%PartSlab 24% 0% 100% 16%
PartObjs 17 0 129.3K 161.2K
% PartObj 17% 0% 100% 17%
Memory 8.7M 8.1K 384.7M 568.3M
Used 8.2M 160 366.5M 537.9M
Loss 468.8K 0 18.2M 30.4M
Per Object Average Min Max
---------------------------------------------
Memory 587 8 8.1K
User 584 8 8.1K
Loss 2 0 64
Slabs sorted by size
----------------------
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
ext4_inode_cache 211142 1736 384.7M 11732/40/10 18 3 0 95 a
Slabs sorted by loss
----------------------
Name Objects Objsize Loss Slabs/Part/Cpu O/S O %Fr %Ef Flg
ext4_inode_cache 211142 1736 18.2M 11732/40/10 18 3 0 95 a
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per Cache Average Min Max Total
----------------------------------------------------------------------------
#Objects 5147 1 89068 324301
#Slabs 199 1 3886 12537
#PartSlab 12 0 240 778
%PartSlab 32% 0% 100% 6%
PartObjs 5 0 4569 18151
% PartObj 26% 0% 100% 5%
Memory 3171409 8192 127336448 199798784
Used 3001736 160 121429728 189109408
Loss 169672 0 5906720 10689376
Per Object Average Min Max
-----------------------------------------------------------
Memory 585 8 8192
User 583 8 8192
Loss 2 0 64
Slabs sorted by size
--------------------
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
ext4_inode_cache 69948 1736 127336448 3871/0/15 18 3 0 95 a
dentry 89068 288 26058752 3164/0/17 28 1 0 98 a
Slabs sorted by loss
--------------------
Name Objects Objsize Loss Slabs/Part/Cpu O/S O %Fr %Ef Flg
ext4_inode_cache 69948 1736 5906720 3871/0/15 18 3 0 95 a
inode_cache 11628 864 537472 642/0/4 18 2 0 94 a
The last patch in the series addresses Linus' comment from
http://marc.info/?l=linux-mm&m=144148518703321&w=2
(well, it's been some time. sorry.)
The gnuplot script takes a slabinfo records file, where every record is a `slabinfo -X'
output. So the basic workflow is, for example, as follows:
while [ 1 ]; do slabinfo -X -N 2 >> stats; sleep 1; done
^C
slabinfo-gnuplot.sh stats
The last command will produce 3 png files (and 3 stats files):
-- graph of slabinfo totals
-- graph of slabs by size
-- graph of slabs by loss
It's also possible to select a range of records for plotting (a range of collected
slabinfo outputs) via `-r 10,100` (for example), and to compare totals from several
measurements (to visually compare slab behaviour over, say, the 10,50 range) using
pre-parsed totals files:
slabinfo-gnuplot.sh -r 10,50 -t stats-totals1 .. stats-totals2
This also, technically, supports ktest: upload the new slabinfo to the target,
collect the stats and give the resulting stats file to slabinfo-gnuplot.sh.
This patch (of 8):
Use getopt constants in `struct option' ->has_arg instead of numerical
representations.
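In other words, a sketch along these lines (the option names below are illustrative, not a quote of slabinfo's full option table):

  #include <getopt.h>

  static struct option opts[] = {
  	/* was: { "aliases", 0, NULL, 'a' }, and so on with bare 0/1/2 */
  	{ "aliases",	no_argument,		NULL, 'a' },
  	{ "lines",	required_argument,	NULL, 'N' },
  	{ "report",	no_argument,		NULL, 'r' },
  	{ NULL,		0,			NULL, 0 },
  };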
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/slab_common.c: do not warn that cache is busy on destroy more than once
Currently, when kmem_cache_destroy() is called for a global cache, we
print a warning for each per memcg cache attached to it that has active
objects (see shutdown_cache). This is redundant, because it gives no new
information and only clutters the log. If a cache being destroyed has
active objects, there must be a memory leak in the module that created the
cache, and it does not matter if the cache was used by users in memory
cgroups or not.
This patch moves the warning from shutdown_cache(), which is called for
shutting down both global and per memcg caches, to kmem_cache_destroy(),
so that the warning is only printed once if there are objects left in the
cache being destroyed.
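Schematically, the resulting flow looks like this (a heavily simplified sketch; locking, the deferred-release list and the helper signatures are abbreviated):

  void kmem_cache_destroy(struct kmem_cache *s)
  {
  	int err;

  	err = shutdown_memcg_caches(s);		/* per-memcg children first */
  	if (!err)
  		err = shutdown_cache(s);	/* then the global cache itself */

  	/* the single warning now lives here, not in shutdown_cache() */
  	if (err) {
  		pr_err("kmem_cache_destroy %s: Slab cache still has objects\n",
  		       s->name);
  		dump_stack();
  	}
  }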
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/slab_common.c: clear pointers to per memcg caches on destroy
Currently, we do not clear pointers to per memcg caches in the
memcg_params.memcg_caches array when a global cache is destroyed with
kmem_cache_destroy.
This is fine if the global cache does get destroyed. However, a cache can
be left on the list if it still has active objects when kmem_cache_destroy
is called (due to a memory leak). If this happens, the entries in the
array will point to already freed areas, which is likely to result in data
corruption when the cache is reused (via slab merging).
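A rough sketch of the idea (field and helper names approximate the code of that era and are not an exact quote): when a per-memcg cache is shut down, clear its slot in the root cache's memcg_caches array so nothing is left pointing at freed memory if the root cache later fails to go away.

  static void unlink_memcg_cache(struct kmem_cache *s)
  {
  	int idx = memcg_cache_id(s->memcg_params.memcg);
  	struct memcg_cache_array *arr;

  	arr = rcu_dereference_protected(
  		s->memcg_params.root_cache->memcg_params.memcg_caches,
  		lockdep_is_held(&slab_mutex));
  	arr->entries[idx] = NULL;	/* do not leave a stale pointer behind */
  }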
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_kmem_cache_create(), do_kmem_cache_shutdown(), and
do_kmem_cache_release() sound awkward for static helper functions that are
not supposed to be used outside slab_common.c. Rename them to
create_cache(), shutdown_cache(), and release_caches(), respectively.
This patch is a pure cleanup and does not introduce any functional
changes.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sparse apparently pretends to be gcc >= 4.9, yet isn't prepared to handle
all the function attributes supported by those gccs and complains loudly.
So hide the definition of __assume_aligned from it (so that the generic
one in compiler.h gets used).
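The fix amounts to something like this in include/linux/compiler-gcc.h (a sketch; sparse defines __CHECKER__, so the extra guard makes it fall through to the empty definition in compiler.h):

  #if GCC_VERSION >= 40900 && !defined(__CHECKER__)
  /*
   * Tell the optimizer that the returned pointer can be assumed aligned;
   * hidden from sparse, which claims gcc >= 4.9 but does not understand
   * the attribute.
   */
  #define __assume_aligned(a, ...) __attribute__((__assume_aligned__(a, ## __VA_ARGS__)))
  #endif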
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-By: Valdis Kletnieks <valdis.kletnieks@vt.edu> Cc: Christopher Li <sparse@chrisli.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>