Dan Streetman [Wed, 14 May 2014 00:02:12 +0000 (10:02 +1000)]
swap: change swap_info singly-linked list to list_head
The logic controlling the singly-linked list of swap_info_struct entries
for all active, i.e. swapon'ed, swap targets is rather complex, because:
-it stores the entries in priority order
-there is a pointer to the highest priority entry
-there is a pointer to the highest priority not-full entry
-there is a highest_priority_index variable set outside the swap_lock
-swap entries of equal priority should be used equally
this complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
where different priority swap targets are incorrectly used equally.
That bug probably could be solved with the existing singly-linked lists,
but I think it would only add more complexity to the already difficult to
understand get_swap_page() swap_list iteration logic.
The first patch changes from a singly-linked list to a doubly-linked list
using list_heads; the highest_priority_index and related code are removed
and get_swap_page() starts each iteration at the highest priority
swap_info entry, even if it's full. While this does introduce unnecessary
list iteration (i.e. Schlemiel the painter's algorithm) in the case where
one or more of the highest priority entries are full, the iteration and
manipulation code is much simpler and behaves correctly re: the above bug;
and the fourth patch removes the unnecessary iteration.
The second patch adds some minor plist helper functions; nothing new
really, just functions to match existing regular list functions. These
are used by the next two patches.
The third patch adds plist_requeue(), which is used by get_swap_page() in
the next patch - it performs the requeueing of same-priority entries
(which moves the entry to the end of its priority in the plist), so that
all equal-priority swap_info_structs get used equally.
The fourth patch converts the main list into a plist, and adds a new plist
that contains only swap_info entries that are both active and not full.
As Mel suggested using plists allows removing all the ordering code from
swap - plists handle ordering automatically. The list naming is also
clarified now that there are two lists, with the original list changed
from swap_list_head to swap_active_head and the new list named
swap_avail_head. A new spinlock is also added for the new list, so
swap_info entries can be added or removed from the new list immediately as
they become full or not full.
This patch (of 4):
Replace the singly-linked list tracking active, i.e. swapon'ed,
swap_info_struct entries with a doubly-linked list using struct
list_heads. Simplify the logic iterating and manipulating the list of
entries, especially get_swap_page(), by using standard list_head
functions, and removing the highest priority iteration logic.
The change fixes the bug:
https://lkml.org/lkml/2014/2/13/181
in which different priority swap entries after the highest priority entry
are incorrectly used equally in pairs. The swap behavior is now as
advertised, i.e. different priority swap entries are used in order, and
equal priority swap targets are used concurrently.
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Shaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Cc: Weijie Yang <weijieut@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Bob Liu <bob.liu@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jianyu Zhan [Wed, 14 May 2014 00:02:12 +0000 (10:02 +1000)]
mm: fold mlocked_vma_newpage() into its only call site
In previous commit(mm: use the light version __mod_zone_page_state in
mlocked_vma_newpage()) a irq-unsafe __mod_zone_page_state is used. And as
suggested by Andrew, to reduce the risks that new call sites incorrectly
using mlocked_vma_newpage() without knowing they are adding racing, this
patch folds mlocked_vma_newpage() into its only call site,
page_add_new_anon_rmap, to make it open-cocded for people to know what is
going on.
Suggested-by: Andrew Morton <akpm@linux-foundation.org> Suggested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jianyu Zhan [Wed, 14 May 2014 00:02:11 +0000 (10:02 +1000)]
mm: use the light version __mod_zone_page_state in mlocked_vma_newpage()
mlocked_vma_newpage() is called with pte lock held(a spinlock), which
implies preemtion disabled, and the vm stat counter is not modified from
interrupt context, so we need not use an irq-safe mod_zone_page_state()
here, using a light-weight version __mod_zone_page_state() would be OK.
This patch also documents __mod_zone_page_state() and some of its
callsites. The comment above __mod_zone_page_state() is from Hugh
Dickins, and acked by Christoph.
Most credits to Hugh and Christoph for the clarification on the usage of
the __mod_zone_page_state().
Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: replace remap_file_pages() syscall with emulation
remap_file_pages(2) was invented to be able efficiently map parts of huge
file into limited 32-bit virtual address space such as in database
workloads.
Nonlinear mappings are pain to support and it seems there's no legitimate
use-cases nowadays since 64-bit systems are widely available.
Let's drop it and get rid of all these special-cased code.
The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.
I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on. I've checked Debian code search and source of all
packages in ALT Linux. No real users: libc wrappers, mentions in strace,
gdb, valgrind and this kind of stuff.
There are few basic tests in LTP for the syscall. They work just fine
with emulation.
To test performance impact, I've written small test case which demonstrate
pretty much worst case scenario: map 4G shmfs file, write to begin of
every page pgoff of the page, remap pages in reverse order, read every
page.
The test creates 1 million of VMAs if emulation is in use, so I had to set
vm.max_map_count to 1100000 to avoid -ENOMEM.
The remap_file_pages() system call is used to create a nonlinear mapping,
that is, a mapping in which the pages of the file are mapped into a
nonsequential order in memory. The advantage of using remap_file_pages()
over using repeated calls to mmap(2) is that the former approach does not
require the kernel to create additional VMA (Virtual Memory Area) data
structures.
Supporting of nonlinear mapping requires significant amount of non-trivial
code in kernel virtual memory subsystem including hot paths. Also to get
nonlinear mapping work kernel need a way to distinguish normal page table
entries from entries with file offset (pte_file). Kernel reserves flag in
PTE for this purpose. PTE flags are scarce resource especially on some
CPU architectures. It would be nice to free up the flag for other usage.
Fortunately, there are not many users of remap_file_pages() in the wild.
It's only known that one enterprise RDBMS implementation uses the syscall
on 32-bit systems to map files bigger than can linearly fit into 32-bit
virtual address space. This use-case is not critical anymore since 64-bit
systems are widely available.
The plan is to deprecate the syscall and replace it with an emulation.
The emulation will create new VMAs instead of nonlinear mappings. It's
going to work slower for rare users of remap_file_pages() but ABI is
preserved.
One side effect of emulation (apart from performance) is that user can hit
vm.max_map_count limit more easily due to additional VMAs. See comment
for DEFAULT_MAX_MAP_COUNT for more details on the limit.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Jones <davej@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Armin Rigo <arigo@tunes.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Wed, 14 May 2014 00:02:10 +0000 (10:02 +1000)]
mm/compaction: avoid rescanning pageblocks in isolate_freepages
The compaction free scanner in isolate_freepages() currently remembers PFN
of the highest pageblock where it successfully isolates, to be used as the
starting pageblock for the next invocation. The rationale behind this is
that page migration might return free pages to the allocator when
migration fails and we don't want to skip them if the compaction
continues.
Since migration now returns free pages back to compaction code where they
can be reused, this is no longer a concern. This patch changes
isolate_freepages() so that the PFN for restarting is updated with each
pageblock where isolation is attempted. Using stress-highalloc from
mmtests, this resulted in 10% reduction of the pages scanned by the free
scanner.
Note that the somewhat similar functionality that records highest
successful pageblock in zone->compact_cached_free_pfn, remains unchanged.
This cache is used when the whole compaction is restarted, not for
multiple invocations of the free scanner during single compaction.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Wed, 14 May 2014 00:02:10 +0000 (10:02 +1000)]
mm/compaction: do not count migratepages when unnecessary
During compaction, update_nr_listpages() has been used to count remaining
non-migrated and free pages after a call to migrage_pages(). The
freepages counting has become unneccessary, and it turns out that
migratepages counting is also unnecessary in most cases.
The only situation when it's needed to count cc->migratepages is when
migrate_pages() returns with a negative error code. Otherwise, the
non-negative return value is the number of pages that were not migrated,
which is exactly the count of remaining pages in the cc->migratepages
list.
Furthermore, any non-zero count is only interesting for the tracepoint of
mm_compaction_migratepages events, because after that all remaining
unmigrated pages are put back and their count is set to 0.
This patch therefore removes update_nr_listpages() completely, and changes
the tracepoint definition so that the manual counting is done only when
the tracepoint is enabled, and only when migrate_pages() returns a
negative error code.
Furthermore, migrate_pages() and the tracepoints won't be called when
there's nothing to migrate. This potentially avoids some wasted cycles
and reduces the volume of uninteresting mm_compaction_migratepages events
where "nr_migrated=0 nr_failed=0". In the stress-highalloc mmtest, this
was about 75% of the events. The mm_compaction_isolate_migratepages event
is better for determining that nothing was isolated for migration, and
this one was just duplicating the info.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Wed, 14 May 2014 00:02:09 +0000 (10:02 +1000)]
mm, compaction: terminate async compaction when rescheduling
Async compaction terminates prematurely when need_resched(), see
compact_checklock_irqsave(). This can never trigger, however, if the
cond_resched() in isolate_migratepages_range() always takes care of the
scheduling.
If the cond_resched() actually triggers, then terminate this pageblock
scan for async compaction as well.
David Rientjes [Wed, 14 May 2014 00:02:09 +0000 (10:02 +1000)]
mm, thp: avoid excessive compaction latency during fault fix
mm-thp-avoid-excessive-compaction-latency-during-fault.patch excludes sync
compaction for all high order allocations other than thp. What we really
want to do is suppress sync compaction for thp, but only during the page
fault path.
Orders greater than PAGE_ALLOC_COSTLY_ORDER aren't necessarily going to
loop again so this is the only way to exhaust our capabilities before
declaring that we can't allocate.
David Rientjes [Wed, 14 May 2014 00:02:09 +0000 (10:02 +1000)]
mm, thp: avoid excessive compaction latency during fault
Synchronous memory compaction can be very expensive: it can iterate an
enormous amount of memory without aborting, constantly rescheduling,
waiting on page locks and lru_lock, etc, if a pageblock cannot be
defragmented.
Unfortunately, it's too expensive for transparent hugepage page faults and
it's much better to simply fallback to pages. On 128GB machines, we find
that synchronous memory compaction can take O(seconds) for a single thp
fault.
Now that async compaction remembers where it left off without strictly
relying on sync compaction, this makes thp allocations best-effort without
causing egregious latency during fault. We still need to retry async
compaction after reclaim, but this won't stall for seconds.
David Rientjes [Wed, 14 May 2014 00:02:08 +0000 (10:02 +1000)]
mm, compaction: embed migration mode in compact_control
We're going to want to manipulate the migration mode for compaction in the
page allocator, and currently compact_control's sync field is only a bool.
Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
depending on the value of this bool. Convert the bool to enum
migrate_mode and pass the migration mode in directly. Later, we'll want
to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to
avoid unnecessary latency.
This also alters compaction triggered from sysfs, either for the entire
system or for a node, to force MIGRATE_SYNC.
David Rientjes [Wed, 14 May 2014 00:02:08 +0000 (10:02 +1000)]
mm, compaction: add per-zone migration pfn cache for async compaction
Each zone has a cached migration scanner pfn for memory compaction so that
subsequent calls to memory compaction can start where the previous call
left off.
Currently, the compaction migration scanner only updates the per-zone
cached pfn when pageblocks were not skipped for async compaction. This
creates a dependency on calling sync compaction to avoid having subsequent
calls to async compaction from scanning an enormous amount of non-MOVABLE
pageblocks each time it is called. On large machines, this could be
potentially very expensive.
This patch adds a per-zone cached migration scanner pfn only for async
compaction. It is updated everytime a pageblock has been scanned in its
entirety and when no pages from it were successfully isolated. The cached
migration scanner pfn for sync compaction is updated only when called for
sync compaction.
David Rientjes [Wed, 14 May 2014 00:02:08 +0000 (10:02 +1000)]
mm, compaction: return failed migration target pages back to freelist
Greg reported that he found isolated free pages were returned back to the
VM rather than the compaction freelist. This will cause holes behind the
free scanner and cause it to reallocate additional memory if necessary
later.
He detected the problem at runtime seeing that ext4 metadata pages (esp
the ones read by "sbi->s_group_desc[i] = sb_bread(sb, block)") were
constantly visited by compaction calls of migrate_pages(). These pages
had a non-zero b_count which caused fallback_migrate_page() ->
try_to_release_page() -> try_to_free_buffers() to fail.
Memory compaction works by having a "freeing scanner" scan from one end of
a zone which isolates pages as migration targets while another "migrating
scanner" scans from the other end of the same zone which isolates pages
for migration.
When page migration fails for an isolated page, the target page is
returned to the system rather than the freelist built by the freeing
scanner. This may require the freeing scanner to continue scanning memory
after suitable migration targets have already been returned to the system
needlessly.
This patch returns destination pages to the freeing scanner freelist when
page migration fails. This prevents unnecessary work done by the freeing
scanner but also encourages memory to be as compacted as possible at the
end of the zone.
David Rientjes [Wed, 14 May 2014 00:02:08 +0000 (10:02 +1000)]
mm, migration: add destination page freeing callback
Memory migration uses a callback defined by the caller to determine how to
allocate destination pages. When migration fails for a source page,
however, it frees the destination page back to the system.
This patch adds a memory migration callback defined by the caller to
determine how to free destination pages. If a caller, such as memory
compaction, builds its own freelist for migration targets, this can reuse
already freed memory instead of scanning additional memory.
If the caller provides a function to handle freeing of destination pages,
it is called when page migration fails. If the caller passes NULL then
freeing back to the system will be handled as usual. This patch
introduces no functional change.
Vladimir Davydov [Wed, 14 May 2014 00:02:07 +0000 (10:02 +1000)]
memcg: memcg_kmem_create_cache: make memcg_name_buf statically allocated
It isn't worth complicating the code by allocating it on the first access,
because it only takes 256 bytes.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Wed, 14 May 2014 00:02:07 +0000 (10:02 +1000)]
memcg: get rid of memcg_create_cache_name
Instead of calling back to memcontrol.c from kmem_cache_create_memcg in
order to just create the name of a per memcg cache, let's allocate it in
place. We only need to pass the memcg name to kmem_cache_create_memcg for
that - everything else can be done in slab_common.c.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qiang Huang [Wed, 14 May 2014 00:02:06 +0000 (10:02 +1000)]
memcg: fold mem_cgroup_stolen
It is only used in __mem_cgroup_begin_update_page_stat(), the name is
confusing and 2 routines for one thing also confuse people, so fold this
function seems more clear.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Emil Medve [Wed, 14 May 2014 00:02:06 +0000 (10:02 +1000)]
arch/x86/mm/numa.c: use for_each_memblock()
Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jianyu Zhan [Wed, 14 May 2014 00:02:05 +0000 (10:02 +1000)]
mm: memcontrol: clean up memcg zoneinfo lookup
Memcg zoneinfo lookup sites have either the page, the zone, or the node id
and zone index, but sites that only have the zone have to look up the node
id and zone index themselves, whereas sites that already have those two
integers use a function for a simple pointer chase.
Provide mem_cgroup_zone_zoneinfo() that takes a zone pointer and let sites
that already have node id and zone index - all for each node, for each
zone iterators - use &memcg->nodeinfo[nid]->zoneinfo[zid].
Rename page_cgroup_zoneinfo() to mem_cgroup_page_zoneinfo() to match.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Catalin Marinas [Wed, 14 May 2014 00:02:04 +0000 (10:02 +1000)]
mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
Kmemleak could ignore memory blocks allocated via memblock_alloc() leading
to false positives during scanning. This patch adds the corresponding
callbacks and removes kmemleak_free_* calls in mm/nobootmem.c to avoid
duplication. The kmemleak_alloc() in mm/nobootmem.c is kept since
__alloc_memory_core_early() does not use memblock_alloc() directly.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Catalin Marinas [Wed, 14 May 2014 00:02:04 +0000 (10:02 +1000)]
mm/mempool.c: update the kmemleak stack trace for mempool allocations
When mempool_alloc() returns an existing pool object, kmemleak_alloc() is
no longer called and the stack trace corresponds to the original object
allocation. This patch updates the kmemleak allocation stack trace for
such objects to make it more useful for debugging.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Catalin Marinas [Wed, 14 May 2014 00:02:04 +0000 (10:02 +1000)]
lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
Since radix_tree_preload() stack trace is not always useful for debugging
an actual radix tree memory leak, this patch updates the kmemleak
allocation stack trace in the radix_tree_node_alloc() function.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Catalin Marinas [Wed, 14 May 2014 00:02:04 +0000 (10:02 +1000)]
mm: introduce kmemleak_update_trace()
The memory allocation stack trace is not always useful for debugging a
memory leak (e.g. radix_tree_preload). This function, when called,
updates the stack trace for an already allocated object.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cyrill Gorcunov [Wed, 14 May 2014 00:02:03 +0000 (10:02 +1000)]
mm: pgtable -- Require X86_64 for soft-dirty tracker, v2
v2 (by akpm@):
- guard helpers with CONFIG_HAVE_ARCH_SOFT_DIRTY on i386, otherwise
it fails to build because we've a generic definitions in
asm-generic/pgtable.h
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cyrill Gorcunov [Wed, 14 May 2014 00:02:03 +0000 (10:02 +1000)]
mm: x86 pgtable: require X86_64 for soft-dirty tracker
Tracking dirty status on 2 level pages requires very ugly macros and
taking into account how old the machines who can operate without PAE mode
only are, lets drop soft dirty tracker from them for code simplicity (note
I can't drop all the macros from 2 level pages by now since
_PAGE_BIT_PROTNONE and _PAGE_BIT_FILE are still used even without
tracker).
Linus proposed to completely rip off softdirty support on x86-32 (even
with PAE) and since for CRIU we're not planning to support native x86-32
mode, lets do that.
(Softdirty tracker is relatively new feature which is mostly used by CRIU
so I don't expect if such API change would cause problems for userspace).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Noonan <steven@uplinklabs.net> Cc: Rik van Riel <riel@redhat.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cleanups:
- move pte-related code to separate function. It's about half of the
function;
- get rid of some goto-logic;
- use 'return NULL' instead of 'return page' where page can only be
NULL;
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 14 May 2014 00:02:01 +0000 (10:02 +1000)]
vmscan: memcg: always use swappiness of the reclaimed memcg
Memory reclaim always uses swappiness of the reclaim target memcg (origin
of the memory pressure) or vm_swappiness for global memory reclaim. This
behavior was consistent (except for difference between global and hard
limit reclaim) because swappiness was enforced to be consistent within
each memcg hierarchy.
After "mm: memcontrol: remove hierarchy restrictions for swappiness and
oom_control" each memcg can have its own swappiness independent of
hierarchical parents, though, so the consistency guarantee is gone. This
can lead to an unexpected behavior. Say that a group is explicitly
configured to not swapout by memory.swappiness=0 but its memory gets
swapped out anyway when the memory pressure comes from its parent with a
It is also unexpected that the knob is meaningless without setting the
hard limit which would trigger the reclaim and enforce the swappiness.
There are setups where the hard limit is configured higher in the
hierarchy by an administrator and children groups are under control of
somebody else who is interested in the swapout behavior but not
necessarily about the memory limit.
From a semantic point of view swappiness is an attribute defining anon vs.
file proportional scanning of LRU which is memcg specific (unlike charges
which are propagated up the hierarchy) so it should be applied to the
particular memcg's LRU regardless where the memory pressure comes from.
This patch removes vmscan_swappiness() and stores the swappiness into the
scan_control structure. mem_cgroup_swappiness is then used to provide the
correct value before shrink_lruvec is called. The global vm_swappiness is
used for the root memcg.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dave Hansen [Wed, 14 May 2014 00:02:01 +0000 (10:02 +1000)]
mm: shrinker: add nid to tracepoint output
Now that we are doing NUMA-aware shrinking, and can have
shrinkers running in parallel, or working on individual nodes, it
seems like we should also be sticking the node in the output.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Dave Chinner <david@fromorbit.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
and noticed that "total_scan" can go negative in some cases. We
used to dump out the "total_scan" variable directly, but some of
the shrinker modifications along the way changed that.
This patch just dumps it out directly, again. It doesn't make
any sense to derive it from new_nr and nr any more since there
are now other shrinkers that can be running in parallel and
mucking with those values.
Here's an example of the negative numbers in the output:
> kswapd0-840 [000] 160.869398: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 10 new scan count 39 total_scan 29 last shrinker return val 256
> kswapd0-840 [000] 160.869618: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 39 new scan count 102 total_scan 63 last shrinker return val 256
> kswapd0-840 [000] 160.870031: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 102 new scan count 47 total_scan -55 last shrinker return val 768
> kswapd0-840 [000] 160.870464: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 47 new scan count 45 total_scan -2 last shrinker return val 768
> kswapd0-840 [000] 163.384144: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 45 new scan count 56 total_scan 11 last shrinker return val 0
> kswapd0-840 [000] 163.384297: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 56 new scan count 15 total_scan -41 last shrinker return val 256
> kswapd0-840 [000] 163.384414: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 15 new scan count 117 total_scan 102 last shrinker return val 0
> kswapd0-840 [000] 163.384657: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 117 new scan count 36 total_scan -81 last shrinker return val 512
> kswapd0-840 [000] 163.384880: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 36 new scan count 111 total_scan 75 last shrinker return val 256
> kswapd0-840 [000] 163.385256: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 111 new scan count 34 total_scan -77 last shrinker return val 768
> kswapd0-840 [000] 163.385598: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 34 new scan count 122 total_scan 88 last shrinker return val 512
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Dave Chinner <david@fromorbit.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 14 May 2014 00:01:59 +0000 (10:01 +1000)]
memcg: allow setting low_limit
Export memory.low_limit_in_bytes knob with the same rules as the hard
limit represented by limit_in_bytes knob (e.g. no limit to be set for the
root cgroup). There is no memsw alternative for low_limit_in_bytes
because the primary motivation behind this limit is to protect the working
set of the group and so considering swap doesn't make much sense. There
is also no kmem variant exported because we do not have any easy way to
protect kernel allocations now.
Please note that the low limit might exceed the hard limit which basically
means that the group is not reclaimable if there is other reclaim target
in the hierarchy under pressure.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 14 May 2014 00:01:59 +0000 (10:01 +1000)]
memcg, mm: introduce lowlimit reclaim
Previous discussions have shown that soft limits cannot be reformed
(http://lwn.net/Articles/555249/). This series introduces an alternative
approach for protecting memory allocated to processes executing within a
memory cgroup controller. It is based on a new tunable that was discussed
with Johannes and Tejun held during the kernel summit 2013 and at LSF
2014.
This patchset introduces such low limit that is functionally similar to a
minimum guarantee. Memcgs which are under their lowlimit are not
considered eligible for the reclaim (both global and hardlimit) unless all
groups under the reclaimed hierarchy are below the low limit when all of
them are considered eligible.
The previous version of the patchset posted as a RFC
(http://marc.info/?l=linux-mm&m=138677140628677&w=2) suggested a hard
guarantee without any fallback. More discussions led me to reconsidering
the default behavior and come up a more relaxed one. The hard requirement
can be added later based on a use case which really requires. It would be
controlled by memory.reclaim_flags knob which would specify whether to OOM
or fallback (default) when all groups are bellow low limit.
The default value of the limit is 0 so all groups are eligible by default
and an interested party has to explicitly set the limit.
The primary use case is to protect an amount of memory allocated to a
workload without it being reclaimed by an unrelated activity. In some
cases this requirement can be fulfilled by mlock but it is not suitable
for many loads and generally requires application awareness. Such
application awareness can be complex. It effectively forbids the use of
memory overcommit as the application must explicitly manage memory
residency.
With the low limit, such workloads can be placed in a memcg with a low
limit that protects the estimated working set.
The hierarchical behavior of the lowlimit is described in the first patch.
The second patch allows setting the lowlimit. The last 2 patches clarify
documentation about the memcg reclaim in gereneral (3rd patch) and low
limit (4th patch).
This patch (of 4)
This patch introduces low limit reclaim. The low_limit acts as a reclaim
protection because groups which are under their low_limit are considered
ineligible for reclaim. While hardlimit protects from using more memory
than allowed lowlimit protects from getting below memory assigned to the
group due to external memory pressure.
More precisely a group is considered eligible for the reclaim under a
specific hierarchy represented by its root only if the group is above its
low limit and the same applies to all parents up the hierarchy to the
root. Nevertheless the limit still might be ignored if all groups under
the reclaimed hierarchy are under their low limits. This will prevent
from OOM rather than protecting the memory.
Consider the following hierarchy with memory pressure coming from the
group A (hard limit reclaim - l-low_limit_in_bytes, u-usage_in_bytes,
h-limit_in_bytes):
root_mem_cgroup
.
_____/
/
A (l = 80 u=90 h=90)
/
/ \_________
/ \
B (l=0 u=50) C (l=50 u=40)
\
D (l=0 u=30)
A and B are reclaimable but C and D are not (D is protected by C).
The low_limit is 0 by default so every group is eligible. This patch
doesn't provide a way to set the limit yet although the core
infrastructure is there already.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Daeseok Youn [Wed, 14 May 2014 00:01:59 +0000 (10:01 +1000)]
mm/dmapool.c: remove redundant NULL check for dev in dma_pool_create()
"dev" cannot be NULL because it is already checked before calling
dma_pool_create().
If dev ever was NULL, the code would oops in dev_to_node() after enabling
CONFIG_NUMA.
It is possible that some driver is using dev==NULL and has never been run
on a NUMA machine. Such a driver is probably outdated, possibly buggy and
will need some attention if it starts triggering NULL derefs.
Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jianyu Zhan [Wed, 14 May 2014 00:01:58 +0000 (10:01 +1000)]
mm: introdule compound_head_by_tail()
Currently, in put_compound_page(), we have
======
if (likely(!PageTail(page))) { <------ (1)
if (put_page_testzero(page)) {
/*
¦* By the time all refcounts have been released
¦* split_huge_page cannot run anymore from under us.
¦*/
if (PageHead(page))
__put_compound_page(page);
else
__put_single_page(page);
}
return;
}
/* __split_huge_page_refcount can run under us */
page_head = compound_head(page); <------------ (2)
======
if at (1) , we fail the check, this means page is *likely* a tail page.
Then at (2), as compoud_head(page) is inlined, it is :
smp_rmb();
if (likely(PageTail(page)))
return head;
}
return page;
}
======
here, the (3) unlikely in the case is a negative hint, because it is
*likely* a tail page. So the check (3) in this case is not good, so I
introduce a helper for this case.
So this patch introduces compound_head_by_tail() which deals with a
possible tail page(though it could be spilt by a racy thread), and make
compound_head() a wrapper on it.
This patch has no functional change, and it reduces the object
size slightly:
text data bss dec hex filename
11003 1328 16 12347 303b mm/swap.o.orig
10971 1328 16 12315 301b mm/swap.o.patched
I've ran "perf top -e branch-miss" to observe branch-miss in this case.
As Michael points out, it's a slow path, so only very few times this case
happens. But I grep'ed the code base, and found there still are some
other call sites could be benifited from this helper. And given that it
only bloating up the source by only 5 lines, but with a reduced object
size. I still believe this helper deserves to exsit.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jianyu Zhan [Wed, 14 May 2014 00:01:58 +0000 (10:01 +1000)]
mm/swap.c: split put_compound_page()
Currently, put_compound_page() carefully handles tricky cases to avoid
racing with compound page releasing or splitting, which makes it quite
lenthy (about 200+ lines) and needs deep tab indention, which makes it
quite hard to follow and maintain.
Now based on two helpers introduced in the previous patch ("mm/swap.c:
introduce put_[un]refcounted_compound_page helpers for spliting
put_compound_page"), this patch replaces those two lengthy code paths with
these two helpers, respectively. Also, it has some comment rephrasing.
After this patch, the put_compound_page() is very compact, thus easy to
read and maintain.
After splitting, the object file is of same size as the original one.
Actually, I've diff'ed put_compound_page()'s orginal disassemble code and
the patched disassemble code, the are 100% the same!
This fact shows that this splitting has no functional change, but it
brings readability.
This patch and the previous one blow the code by 32 lines, mostly due to
comments.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jianyu Zhan [Wed, 14 May 2014 00:01:58 +0000 (10:01 +1000)]
mm/swap.c: introduce put_[un]refcounted_compound_page helpers for splitting put_compound_page()
Currently, put_compound_page() carefully handles tricky cases to avoid
racing with compound page releasing or splitting, which makes it quite
lenthy (about 200+ lines) and needs deep tab indention, which makes it
quite hard to follow and maintain.
This patch and the next patch refactor this function.
Based on the code skeleton of put_compound_page:
put_compound_pge:
if !PageTail(page)
put head page fastpath;
return;
/* else PageTail */
page_head = compound_head(page)
if !__compound_tail_refcounted(page_head)
put head page optimal path; <---(1)
return;
else
put head page slowpath; <--- (2)
return;
This patch introduces two helpers, put_[un]refcounted_compound_page,
handling the code path (1) and code path (2), respectively. They both are
tagged __always_inline, thus elmiating function call overhead, making them
operating the same way as before.
They are almost copied verbatim(except one place, a "goto out_put_single"
is expanded), with some comments rephrasing.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rasmus Villemoes [Wed, 14 May 2014 00:01:57 +0000 (10:01 +1000)]
mm: constify nmask argument to set_mempolicy()
The nmask argument to set_mempolicy() is const according to the user-space
header numaif.h, and since the kernel does indeed not modify it, it might
as well be declared const in the kernel.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rasmus Villemoes [Wed, 14 May 2014 00:01:57 +0000 (10:01 +1000)]
mm: constify nmask argument to mbind()
The nmask argument to mbind() is const according to the userspace header
numaif.h, and since the kernel does indeed not modify it, it might as well
be declared const in the kernel.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This doesn't really hurt but unnecessary and misleading. init_task is the
"swapper" thread == current, its ->mm is always NULL. And init_mm can
only be used as ->active_mm, not as ->mm.
mm_init_owner() has a single caller with this patch, perhaps it should
die. mm_init() can initialize ->owner under #ifdef.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Peter Chiang <pchiang@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oleg Nesterov [Wed, 14 May 2014 00:01:56 +0000 (10:01 +1000)]
memcg: optimize the "Search everything else" loop in mm_update_next_owner()
for_each_process_thread() is sub-optimal. All threads share the same
->mm, we can swicth to the next process once we found a thread with
->mm != NULL and ->mm != mm.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Peter Chiang <pchiang@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oleg Nesterov [Wed, 14 May 2014 00:01:56 +0000 (10:01 +1000)]
memcg: mm_update_next_owner() should skip kthreads
"Search through everything else" in mm_update_next_owner() can hit a
kthread which adopted this "mm" via use_mm(), it should not be used as
mm->owner. Add the PF_KTHREAD check.
While at it, change this code to use for_each_process_thread()
instead of deprecated do_each_thread/while_each_thread.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Peter Chiang <pchiang@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Wed, 14 May 2014 00:01:55 +0000 (10:01 +1000)]
brd: return -ENOSPC rather than -ENOMEM on page allocation failure
brd is effectively a thinly provisioned device. Thinly provisioned
devices return -ENOSPC when they can't write a new block. -ENOMEM is an
implementation detail that callers shouldn't know.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: Dave Chinner <david@fromorbit.com> Cc: Dheeraj Reddy <dheeraj.reddy@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Wed, 14 May 2014 00:01:54 +0000 (10:01 +1000)]
fs/block_dev.c: add bdev_read_page() and bdev_write_page()
A block device driver may choose to provide a rw_page operation. These
will be called when the filesystem is attempting to do page sized I/O to
page cache pages (ie not for direct I/O). This does preclude I/Os that
are larger than page size, so this may only be a performance gain for some
devices.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Tested-by: Dheeraj Reddy <dheeraj.reddy@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Wed, 14 May 2014 00:01:54 +0000 (10:01 +1000)]
fs/mpage.c: factor page_endio() out of mpage_end_io()
page_endio() takes care of updating all the appropriate page flags once
I/O has finished to a page. Switch to using mapping_set_error() instead
of setting AS_EIO directly; this will handle thin-provisioned devices
correctly.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dheeraj Reddy <dheeraj.reddy@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Wed, 14 May 2014 00:01:54 +0000 (10:01 +1000)]
fs/mpage.c: factor clean_buffers() out of __mpage_writepage()
__mpage_writepage() is over 200 lines long, has 20 local variables, four
goto labels and could desperately use simplification. Splitting
clean_buffers() into a helper function improves matters a little, removing
20+ lines from it.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dheeraj Reddy <dheeraj.reddy@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Wed, 14 May 2014 00:01:54 +0000 (10:01 +1000)]
fs/buffer.c: remove block_write_full_page_endio()
The last in-tree caller of block_write_full_page_endio() was removed in
January 2013. It's time to remove the EXPORT_SYMBOL, which leaves
block_write_full_page() as the only caller of
block_write_full_page_endio(), so inline block_write_full_page_endio()
into block_write_full_page().
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dheeraj Reddy <dheeraj.reddy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
NeilBrown [Wed, 14 May 2014 00:01:53 +0000 (10:01 +1000)]
mm/vmscan.c: avoid throttling reclaim for loop-back nfsd threads
When a loopback NFS mount is active and the backing device for the NFS
mount becomes congested, that can impose throttling delays on the nfsd
threads.
These delays significantly reduce throughput and so the NFS mount remains
congested.
This results in a livelock and the reduced throughput persists.
This livelock has been found in testing with the 'wait_iff_congested'
call, and could possibly be caused by the 'congestion_wait' call.
This livelock is similar to the deadlock which justified the introduction
of PF_LESS_THROTTLE, and the same flag can be used to remove this
livelock.
To minimise the impact of the change, we still throttle nfsd when the
filesystem it is writing to is congested, but not when some separate
filesystem (e.g. the NFS filesystem) is congested.
Signed-off-by: NeilBrown <neilb@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 14 May 2014 00:01:53 +0000 (10:01 +1000)]
mm: numa: add migrated transhuge pages to LRU the same way as base pages
Migration of misplaced transhuge pages uses page_add_new_anon_rmap() when
putting the page back as it avoided an atomic operations and added the new
page to the correct LRU. A side-effect is that the page gets marked
activated as part of the migration meaning that transhuge and base pages
are treated differently from an aging perspective than base page
migration.
This patch uses page_add_anon_rmap() and putback_lru_page() on completion
of a transhuge migration similar to base page migration. It would require
fewer atomic operations to use lru_cache_add without taking an additional
reference to the page. The downside would be that it's still different to
base page migration and unevictable pages may be added to the wrong LRU
for cleaning up later. Testing of the usual workloads did not show any
adverse impact to the change.
Vladimir Davydov [Wed, 14 May 2014 00:01:53 +0000 (10:01 +1000)]
memcg, slab: simplify synchronization scheme
At present, we have the following mutexes protecting data related to per
memcg kmem caches:
- slab_mutex. This one is held during the whole kmem cache creation
and destruction paths. We also take it when updating per root cache
memcg_caches arrays (see memcg_update_all_caches). As a result, taking
it guarantees there will be no changes to any kmem cache (including per
memcg). Why do we need something else then? The point is it is
private to slab implementation and has some internal dependencies with
other mutexes (get_online_cpus). So we just don't want to rely upon it
and prefer to introduce additional mutexes instead.
- activate_kmem_mutex. Initially it was added to synchronize
initializing kmem limit (memcg_activate_kmem). However, since we can
grow per root cache memcg_caches arrays only on kmem limit
initialization (see memcg_update_all_caches), we also employ it to
protect against memcg_caches arrays relocation (e.g. see
__kmem_cache_destroy_memcg_children).
- We have a convention not to take slab_mutex in memcontrol.c, but we
want to walk over per memcg memcg_slab_caches lists there (e.g. for
destroying all memcg caches on offline). So we have per memcg
slab_caches_mutex's protecting those lists.
Such a syncrhonization scheme has a number of flaws, for instance:
- We can't call kmem_cache_{destroy,shrink} while walking over a
memcg::memcg_slab_caches list due to locking order. As a result, in
mem_cgroup_destroy_all_caches we schedule the
memcg_cache_params::destroy work shrinking and destroying the cache.
- We don't have a mutex to synchronize per memcg caches destruction
between memcg offline (mem_cgroup_destroy_all_caches) and root cache
destruction (__kmem_cache_destroy_memcg_children). Currently we just
don't bother about it.
This patch simplifies it by substituting per memcg slab_caches_mutex's
with the global memcg_slab_mutex. It will be held whenever a new per
memcg cache is created or destroyed, so it protects per root cache
memcg_caches arrays and per memcg memcg_slab_caches lists. The locking
order is following:
This allows us to call kmem_cache_{create,shrink,destroy} under the
memcg_slab_mutex. As a result, we don't need memcg_cache_params::destroy
work any more - we can simply destroy caches while iterating over a per
memcg slab caches list.
Also using the global mutex simplifies synchronization between concurrent
per memcg caches creation/destruction, e.g. mem_cgroup_destroy_all_caches
vs __kmem_cache_destroy_memcg_children.
The downside of this is that we substitute per-memcg slab_caches_mutex's
with a hummer-like global mutex, but since we already take either the
slab_mutex or the cgroup_mutex along with a memcg::slab_caches_mutex, it
shouldn't hurt concurrency a lot.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@gmail.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Wed, 14 May 2014 00:01:53 +0000 (10:01 +1000)]
memcg, slab: merge memcg_{bind,release}_pages to memcg_{un}charge_slab
Currently we have two pairs of kmemcg-related functions that are called on
slab alloc/free. The first is memcg_{bind,release}_pages that count the
total number of pages allocated on a kmem cache. The second is
memcg_{un}charge_slab that {un}charge slab pages to kmemcg resource
counter. Let's just merge them to keep the code clean.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@gmail.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Wed, 14 May 2014 00:01:52 +0000 (10:01 +1000)]
memcg, slab: do not schedule cache destruction when last page goes away
This patchset is a part of preparations for kmemcg re-parenting. It
targets at simplifying kmemcg work-flows and synchronization.
First, it removes async per memcg cache destruction (see patches 1, 2).
Now caches are only destroyed on memcg offline. That means the caches
that are not empty on memcg offline will be leaked. However, they are
already leaked, because memcg_cache_params::nr_pages normally never drops
to 0 so the destruction work is never scheduled except kmem_cache_shrink
is called explicitly. In the future I'm planning reaping such dead caches
on vmpressure or periodically.
Second, it substitutes per memcg slab_caches_mutex's with the global
memcg_slab_mutex, which should be taken during the whole per memcg cache
creation/destruction path before the slab_mutex (see patch 3). This
greatly simplifies synchronization among various per memcg cache
creation/destruction paths.
I'm still not quite sure about the end picture, in particular I don't know
whether we should reap dead memcgs' kmem caches periodically or try to
merge them with their parents (see https://lkml.org/lkml/2014/4/20/38 for
more details), but whichever way we choose, this set looks like a
reasonable change to me, because it greatly simplifies kmemcg work-flows
and eases further development.
This patch (of 3):
After a memcg is offlined, we mark its kmem caches that cannot be deleted
right now due to pending objects as dead by setting the
memcg_cache_params::dead flag, so that memcg_release_pages will schedule
cache destruction (memcg_cache_params::destroy) as soon as the last slab
of the cache is freed (memcg_cache_params::nr_pages drops to zero).
I guess the idea was to destroy the caches as soon as possible, i.e.
immediately after freeing the last object. However, it just doesn't work
that way, because kmem caches always preserve some pages for the sake of
performance, so that nr_pages never gets to zero unless the cache is
shrunk explicitly using kmem_cache_shrink. Of course, we could account
the total number of objects on the cache or check if all the slabs
allocated for the cache are empty on kmem_cache_free and schedule
destruction if so, but that would be too costly.
Thus we have a piece of code that works only when we explicitly call
kmem_cache_shrink, but complicates the whole picture a lot. Moreover,
it's racy in fact. For instance, kmem_cache_shrink may free the last slab
and thus schedule cache destruction before it finishes checking that the
cache is empty, which can lead to use-after-free.
So I propose to remove this async cache destruction from
memcg_release_pages, and check if the cache is empty explicitly after
calling kmem_cache_shrink instead. This will simplify things a lot w/o
introducing any functional changes.
And regarding dead memcg caches (i.e. those that are left hanging around
after memcg offline for they have objects), I suppose we should reap them
either periodically or on vmpressure as Glauber suggested initially. I'm
going to implement this later.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@gmail.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 14 May 2014 00:01:52 +0000 (10:01 +1000)]
memcg: do not hang on OOM when killed by userspace OOM access to memory reserves
Eric has reported that he can see task(s) stuck in memcg OOM handler
regularly. The only way out is to
echo 0 > $GROUP/memory.oom_control
His usecase is:
- Setup a hierarchy with memory and the freezer (disable kernel oom and
have a process watch for oom).
- In that memory cgroup add a process with one thread per cpu.
- In one thread slowly allocate once per second I think it is 16M of ram
and mlock and dirty it (just to force the pages into ram and stay
there).
- When oom is achieved loop:
* attempt to freeze all of the tasks.
* if frozen send every task SIGKILL, unfreeze, remove the directory in
cgroupfs.
Eric has then pinpointed the issue to be memcg specific.
All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
Those that have received fatal signal will bypass the charge and should
continue on their way out. The tricky part is that the exit path might
trigger a page fault (e.g. exit_robust_list), thus the memcg charge,
while its memcg is still under OOM because nobody has released any charges
yet.
Unlike with the in-kernel OOM handler the exiting task doesn't get
TIF_MEMDIE set so it doesn't shortcut further charges of the killed task
and falls to the memcg OOM again without any way out of it as there are no
fatal signals pending anymore.
This patch fixes the issue by checking PF_EXITING early in
mem_cgroup_try_charge and bypass the charge same as if it had fatal
signal pending or TIF_MEMDIE set.
Normally exiting tasks (aka not killed) will bypass the charge now but
this should be OK as the task is leaving and will release memory and
increasing the memory pressure just to release it in a moment seems
dubious wasting of cycles. Besides that charges after exit_signals should
be rare.
I am bringing this patch again (rebased on the current mmotm tree). I
hope we can move forward finally. If there is still an opposition then
I would really appreciate a concurrent approach so that we can discuss
alternatives.
http://comments.gmane.org/gmane.linux.kernel.stable/77650 is a reference
to the followup discussion when the patch has been dropped from the mmotm
last time.
Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 14 May 2014 00:01:51 +0000 (10:01 +1000)]
mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL
throttle_direct_reclaim() is meant to trigger during swap-over-network
during which the min watermark is treated as a pfmemalloc reserve. It
throttes on the first node in the zonelist but this is flawed.
The user-visible impact is that a process running on CPU whose local
memory node has no ZONE_NORMAL will stall for prolonged periods of time,
possibly indefintely. This is due to throttle_direct_reclaim thinking the
pfmemalloc reserves are depleted when in fact they don't exist on that
node.
On a NUMA machine running a 32-bit kernel (I know) allocation requests
from CPUs on node 1 would detect no pfmemalloc reserves and the process
gets throttled. This patch adjusts throttling of direct reclaim to
throttle based on the first node in the zonelist that has a usable
ZONE_NORMAL or lower zone.
Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oleg Nesterov [Wed, 14 May 2014 00:01:51 +0000 (10:01 +1000)]
memcg: kill CONFIG_MM_OWNER
CONFIG_MM_OWNER makes no sense. It is not user-selectable, it is only
selected by CONFIG_MEMCG automatically. So we can kill this option in
init/Kconfig and do s/CONFIG_MM_OWNER/CONFIG_MEMCG/ globally.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jianyu Zhan [Wed, 14 May 2014 00:01:51 +0000 (10:01 +1000)]
mm/swap.c: clean up *lru_cache_add* functions
In mm/swap.c, __lru_cache_add() is exported, but actually there are no
users outside this file.
This patch unexports __lru_cache_add(), and makes it static. It also
exports lru_cache_add_file(), as it is use by cifs and fuse, which can
loaded as modules.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Bob Liu <bob.liu@oracle.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Performing vma lookups without taking the mm->mmap_sem is asking for
trouble. While doing the search, the vma in question can be modified or
even removed before returning to the caller. Take the lock (exclusively)
in order to avoid races while iterating through the vmacache and/or
rbtree.
Signed-off-by: Jonathan Gonzalez V <zeus@gnu.org> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Inki Dae <inki.dae@samsung.com> Cc: Joonyoung Shim <jy0922.shim@samsung.com> Cc: David Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Davidlohr Bueso [Wed, 14 May 2014 00:01:50 +0000 (10:01 +1000)]
arc: call find_vma with the mmap_sem held
Performing vma lookups without taking the mm->mmap_sem is asking for
trouble. While doing the search, the vma in question can be modified or
even removed before returning to the caller. Take the lock (shared) in
order to avoid races while iterating through the vmacache and/or rbtree.
Davidlohr Bueso [Wed, 14 May 2014 00:01:50 +0000 (10:01 +1000)]
mips: call find_vma with the mmap_sem held
Performing vma lookups without taking the mm->mmap_sem is asking for
trouble. While doing the search, the vma in question can be modified or
even removed before returning to the caller. Take the lock (exclusively)
in order to avoid races while iterating through the vmacache and/or
rbtree.
Updates two functions:
- process_fpemu_return()
- cteon_flush_cache_sigtramp()
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Ralf Baechle <ralf@linux-mips.org> Tested-by: Andreas Herrmann <andreas.herrmann@caviumnetworks.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Davidlohr Bueso [Wed, 14 May 2014 00:01:49 +0000 (10:01 +1000)]
m68k: call find_vma with the mmap_sem held in sys_cacheflush()
Performing vma lookups without taking the mm->mmap_sem is asking for
trouble. While doing the search, the vma in question can be modified or
even removed before returning to the caller. Take the lock (shared) in
order to avoid races while iterating through the vmacache and/or rbtree.
Vladimir Davydov [Wed, 14 May 2014 00:01:49 +0000 (10:01 +1000)]
Documentation/memcg: warn about incomplete kmemcg state
Kmemcg is currently under development and lacks some important features.
In particular, it does not have support of kmem reclaim on memory pressure
inside cgroup, which practically makes it unusable in real life. Let's
warn about it in both Kconfig and Documentation to prevent complaints
arising.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dave Hansen [Wed, 14 May 2014 00:01:49 +0000 (10:01 +1000)]
mm: debug: make bad_range() output more usable and readable
Nobody outputs memory addresses in decimal. PFNs are essentially
addresses, and they're gibberish in decimal. Output them in hex.
Also, add the nid and zone name to give a little more context to the
message.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Wed, 14 May 2014 00:01:49 +0000 (10:01 +1000)]
mm-compaction-cleanup-isolate_freepages-fix3
What I did here is taking end_pfn out of the loop and considering zone
boundary once. After then, we can just set previous pfn to end_pfn on
every iteration to move scanning window. With this change, we can remove
local variable, z_end_pfn.
Another things I did are removing max() operation and un-needed assignment
to isolate variable.
In addition, I change both the variable names, from pfn and end_pfn to
block_start_pfn and block_end_pfn, respectively. They represent their
meaning perfectly.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Sunghwan Yun <sunghwan.yun@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Wed, 14 May 2014 00:01:48 +0000 (10:01 +1000)]
mm-compaction-cleanup-isolate_freepages-fix 2
Cleanup detection of compaction scanners crossing in isolate_freepages().
To make sure compact_finished() observes scanners crossing, we can just
set free_pfn to migrate_pfn instead of confusing max() construct.
Suggested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sunghwan Yun <sunghwan.yun@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>