Dan Streetman [Thu, 26 Jun 2014 00:42:46 +0000 (10:42 +1000)]
mm/zpool: prevent zbud/zsmalloc from unloading when used
Add try_module_get() to zpool_create_pool(), and module_put() to
zpool_destroy_pool(). Without module usage counting, the driver module(s)
could be unloaded while their pool(s) were active, resulting in an oops
when zpool tried to access them.
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Streetman [Thu, 26 Jun 2014 00:42:46 +0000 (10:42 +1000)]
mm/zpool: update zswap to use zpool
Change zswap to use the zpool api instead of directly using zbud. Add a
boot-time param to allow selecting which zpool implementation to use, with
zbud as the default.
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Weijie Yang <weijie.yang@samsung.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Streetman [Thu, 26 Jun 2014 00:42:45 +0000 (10:42 +1000)]
mm/zpool: implement common zpool api to zbud/zsmalloc
Add zpool api.
zpool provides an interface for memory storage, typically of compressed
memory. Users can select which backend to use; currently the only
implementations are zbud, a low density implementation with up to two
compressed pages per storage page, and zsmalloc, a higher density
implementation with multiple compressed pages per storage page.
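As a rough user-space illustration of this kind of pluggable interface (names and signatures below are invented for the sketch; this is not the kernel's zpool code), a backend is selected through an operations table and callers only ever see the common API:

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical backend operations table; the real zpool API differs. */
    struct pool_ops {
            void *(*create)(void);
            void (*destroy)(void *priv);
            int (*alloc)(void *priv, size_t size, unsigned long *handle);
            void (*free)(void *priv, unsigned long handle);
    };

    struct pool {
            const struct pool_ops *ops;     /* e.g. a "zbud" or "zsmalloc" backend */
            void *priv;                     /* backend-private state */
    };

    static struct pool *pool_create(const struct pool_ops *ops)
    {
            struct pool *p = malloc(sizeof(*p));

            if (!p)
                    return NULL;
            p->ops = ops;
            p->priv = ops->create();
            return p;
    }

    static int pool_alloc(struct pool *p, size_t size, unsigned long *handle)
    {
            /* callers see one API; the selected backend does the real work */
            return p->ops->alloc(p->priv, size, handle);
    }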
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Streetman [Thu, 26 Jun 2014 00:42:45 +0000 (10:42 +1000)]
mm/zbud: change zbud_alloc size type to size_t
Change the type of the zbud_alloc() size param from unsigned int
to size_t.
Technically, this should not make any difference, as the zbud
implementation already restricts the size to well within either
type's limits; but as zsmalloc (and kmalloc) use size_t, and
zpool will use size_t, this brings the size parameter type
in line with zsmalloc/zpool.
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Seth Jennings <sjennings@variantweb.net> Cc: Weijie Yang <weijie.yang@samsung.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Streetman [Thu, 26 Jun 2014 00:42:45 +0000 (10:42 +1000)]
mm/zbud: zbud_alloc() minor param change
Change zbud to store the gfp_t flags passed at pool creation and use them for
each allocation; this brings the API closer to the existing zsmalloc
interface, and the only current zbud user (zswap) already uses the same gfp
flags for all allocations. Update zswap to use the changed interface.
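Schematically, the interface change looks roughly like this (illustrative prototypes only; the real zbud declarations may differ in details such as return types):

    /* before: gfp flags supplied on every allocation */
    int zbud_alloc(struct zbud_pool *pool, unsigned int size, gfp_t gfp,
                   unsigned long *handle);

    /* after: gfp flags captured once when the pool is created ... */
    struct zbud_pool *zbud_create_pool(gfp_t gfp, struct zbud_ops *ops);
    /* ... and reused internally for every allocation */
    int zbud_alloc(struct zbud_pool *pool, unsigned int size,
                   unsigned long *handle);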
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Seth Jennings <sjennings@variantweb.net> Cc: Weijie Yang <weijie.yang@samsung.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Davidlohr Bueso [Thu, 26 Jun 2014 00:42:44 +0000 (10:42 +1000)]
m68k: call find_vma with the mmap_sem held in sys_cacheflush()
Performing vma lookups without taking the mm->mmap_sem is asking for
trouble. While doing the search, the vma in question can be modified or
even removed before returning to the caller. Take the lock (shared) in
order to avoid races while iterating through the vmacache and/or rbtree.
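The locking pattern being added is, in rough outline (a sketch of the idea, not the exact m68k diff):

    /* take mmap_sem for reading so the vma cannot be modified or freed
     * while we look it up and inspect it */
    down_read(&current->mm->mmap_sem);
    vma = find_vma(current->mm, addr);
    if (vma && addr >= vma->vm_start && addr + len <= vma->vm_end) {
            /* operate on the range while the lock is held */
    }
    up_read(&current->mm->mmap_sem);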
Roman Pen [Thu, 26 Jun 2014 00:42:43 +0000 (10:42 +1000)]
fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
When wbc->sync_mode == WB_SYNC_ALL we need to do a data integrity write, so
mark the request as WRITE_SYNC.
akpm: afaict this change will cause the data integrity write bios to be
placed onto the second queue in cfq_io_cq.cfqq[], which presumably results
in special treatment. The documentation for REQ_SYNC is horrid.
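For reference, the shape of the fix is roughly (an illustrative sketch, not the exact fs/mpage.c change):

    int rw = WRITE;

    /* data integrity writeback must be tagged as synchronous */
    if (wbc->sync_mode == WB_SYNC_ALL)
            rw = WRITE_SYNC;
    /* 'rw' is then used when the bio is submitted */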
Signed-off-by: Roman Pen <r.peniaev@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A shared anonymous mapping created without MAP_NORESERVE holds a memory
reservation for the whole range of the shmem segment. Usually there is no way
to change its size, but /proc/<pid>/map_files/... (available if
CONFIG_CHECKPOINT_RESTORE=y) allows it.
This patch adjusts the memory reservation in shmem_setattr().
shmem: fix double uncharge in __shmem_file_setup()
If __shmem_file_setup() fails on the struct file allocation it uncharges the
memory commitment twice: first via shmem_unacct_size() and a second time
implicitly in shmem_evict_inode() when it kills the newly created inode.
This patch removes the shmem_unacct_size() call from the error path if the
inode has already been created.
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Thu, 26 Jun 2014 00:42:42 +0000 (10:42 +1000)]
mm, vmalloc: constify allocation mask
tmp_mask in the __vmalloc_area_node() iteration never changes so it can be
moved into function scope and marked with const. This causes the movl and
orl to only be done once per call rather than area->nr_pages times.
nested_gfp can also be marked const.
Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Max Asbock [Thu, 26 Jun 2014 00:42:41 +0000 (10:42 +1000)]
mm tracing: tell mm_migrate_pages event about numa_misplaced
The mm_migrate_pages trace event reports a reason for the migration,
typically as a symbolic string. The exception is the reason
MR_NUMA_MISPLACED for which it just displays the numeric value:
mm_migrate_pages: nr_succeeded=1 nr_failed=0 mode=MIGRATE_ASYNC reason=0x5
This patch makes the output consistent by introducing a string value for
MR_NUMA_MISPLACED. The event is then reported as: mm_migrate_pages:
nr_succeeded=1 nr_failed=0 mode=MIGRATE_ASYNC reason=numa_misplaced
Signed-off-by: Max Asbock <masbock@linux.vnet.ibm.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:41 +0000 (10:42 +1000)]
mm: vmscan: move swappiness out of scan_control
Swappiness is determined for each scanned memcg individually in
shrink_zone() and is not a parameter that applies throughout the reclaim
scan. Move it out of struct scan_control to prevent accidental use of a
stale value.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:41 +0000 (10:42 +1000)]
mm: vmscan: remove all_unreclaimable()
Direct reclaim currently calls shrink_zones() to reclaim all members of a
zonelist, and if that wasn't successful it does another pass through the
same zonelist to check overall reclaimability.
Just check reclaimability in shrink_zones() directly and propagate the
result through the return value. Then remove all_unreclaimable().
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:41 +0000 (10:42 +1000)]
mm: vmscan: rework compaction-ready signaling in direct reclaim
Page reclaim for a higher-order page runs until compaction is ready, then
aborts and signals this situation through the return value of
shrink_zones(). This is an oddly specific signal to encode in the return
value of shrink_zones(), though, and can be quite confusing.
Introduce sc->compaction_ready and signal the compactability of the zones
out-of-band to free up the return value of shrink_zones() for actual zone
reclaimability.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:41 +0000 (10:42 +1000)]
mm: vmscan: remove remains of kswapd-managed zone->all_unreclaimable
shrink_zones() has a special branch to skip the all_unreclaimable() check
during hibernation, because a frozen kswapd can't mark a zone
unreclaimable.
But ever since 6e543d5780e3 ("mm: vmscan: fix do_try_to_free_pages()
livelock"), determining a zone to be unreclaimable is done by directly
looking at its scan history and no longer relies on kswapd setting the
per-zone flag.
Remove this branch and let shrink_zones() check the reclaimability of the
target zones regardless of hibernation state.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <Kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the original code, zone_movable_is_highmem() assumes ZONE_MOVABLE is not
highmem if CONFIG_HAVE_MEMBLOCK_NODE_MAP is not set. In online_pages(), pages
are extracted from the previous zone before ZONE_MOVABLE, which is logically
inconsistent:
If HAVE_MEMBLOCK_NODE_MAP is turned off but HIGHMEM is on,
zone_movable_is_highmem() makes the movable zone not highmem, but
online_pages() extracts pages from ZONE_HIGHMEM.
This inconsistency doesn't cause a real problem currently, because all
architectures that support online_pages also have HAVE_MEMBLOCK_NODE_MAP.
However, fixing it makes the code clearer and also helps further work.
Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Zhen <zhangzhen@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Jiang Liu <liuj97@gmail.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:40 +0000 (10:42 +1000)]
page-cgroup: get rid of NR_PCG_FLAGS
It's not used anywhere today, so let's remove it.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:39 +0000 (10:42 +1000)]
page-cgroup: trivial cleanup
Add forward declarations for struct pglist_data, mem_cgroup.
Remove __init, __meminit from function prototypes and inline functions.
Remove redundant inclusion of bit_spinlock.h.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:39 +0000 (10:42 +1000)]
page-cgroup: fix flags definition
Since commit a9ce315aaec1f ("mm: memcontrol: rewrite uncharge API"),
PCG_* flags are used as bit masks, but they are still defined in an enum
as bit numbers. Fix it.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:39 +0000 (10:42 +1000)]
mm: memcontrol: rewrite uncharge API fix 2
It's not entirely clear whether do_swap_account or PCG_MEMSW is the
authoritative answer to whether a page is swap-accounted or not. This
currently leads to the following memsw counter underflow when swap
accounting is disabled:
Don't set PCG_MEMSW when swap accounting is disabled, so that uncharging
only has to look at this per-page flag.
mem_cgroup_swapout() could also fully rely on this flag, but as it can
bail out before even looking up the page_cgroup, check do_swap_account as
a performance optimization and only sanity test for PCG_MEMSW.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Michal Hocko <mhocko@suse.cz> Tested-by: Jet Chen <jet.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:39 +0000 (10:42 +1000)]
mm: memcontrol: rewrite uncharge API
The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.
Because anonymous and file pages were always charged before they had their
page->mapping established, uncharges had to happen when the page type
could still be known from the context; as in unmap for anonymous, page
cache removal for file and shmem pages, and swap cache truncation for swap
pages. However, these operations happen well before the page is actually
freed, and so a lot of synchronization is necessary:
- Charging, uncharging, page migration, and charge migration all need
to take a per-page bit spinlock as they could race with uncharging.
- Swap cache truncation happens during both swap-in and swap-out, and
possibly repeatedly before the page is actually freed. This means
that the memcg swapout code is called from many contexts that make
no sense and it has to figure out the direction from page state to
make sure memory and memory+swap are always correctly charged.
- On page migration, the old page might be unmapped but then reused,
so memcg code has to prevent untimely uncharging in that case.
Because this code - which should be a simple charge transfer - is so
special-cased, it is not reusable for replace_page_cache().
But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(), when we
know for sure that nobody is looking at the page anymore.
For page migration, introduce mem_cgroup_migrate(), which is called after
the migration is successful and the new page is fully rmapped. Because
the old page is no longer uncharged after migration, prevent double
charges by decoupling the page's memcg association (PCG_USED and
pc->mem_cgroup) from the page holding an actual charge. The new bits
PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
to the new page during migration.
mem_cgroup_migrate() is suitable for replace_page_cache() as well, which
gets rid of mem_cgroup_replace_page_cache().
Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
before the final put_page() in page reclaim.
Finally, page_cgroup changes are now protected by whatever protection the
page itself offers: anonymous pages are charged under the page table lock,
whereas page cache insertions, swapin, and migration hold the page lock.
Uncharging happens under full exclusion with no outstanding references.
Charging and uncharging also ensure that the page is off-LRU, which
serializes against charge migration. Remove the very costly page_cgroup
lock and set pc->flags non-atomically.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:38 +0000 (10:42 +1000)]
mm: memcontrol: rewrite charge API
The memcg charge API charges pages before they are rmapped - i.e. have an
actual "type" - and so every callsite needs its own set of charge and
uncharge functions to know what type is being operated on. Worse,
uncharge has to happen from a context that is still type-specific, rather
than at the end of the page's lifetime with exclusive access, and so
requires a lot of synchronization.
Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:
mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
pages from the memcg if necessary.
mem_cgroup_commit_charge() commits the page to the charge once it
has a valid page->mapping and PageAnon() reliably tells the type.
mem_cgroup_cancel_charge() aborts the transaction.
This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.
As pages need to be committed after rmap is established but before they
are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
additions again. Revive lru_cache_add_active_or_unevictable().
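As a self-contained illustration of the transaction shape described above (the names and types below are invented for the sketch and are not the memcg implementation):

    #include <stdbool.h>

    struct counter { long used, limit; };

    /* reserve the charge up front; may fail (the real code may reclaim here) */
    static bool try_charge(struct counter *c, long nr)
    {
            if (c->used + nr > c->limit)
                    return false;
            c->used += nr;
            return true;
    }

    /* commit once the page's final type/owner is known (after rmap) */
    static void commit_charge(struct counter *c, long nr)
    {
            (void)c;
            (void)nr;       /* bookkeeping against the now-known owner */
    }

    /* abort the transaction and return the reservation */
    static void cancel_charge(struct counter *c, long nr)
    {
            c->used -= nr;
    }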
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:38 +0000 (10:42 +1000)]
mm: memcontrol: do not acquire page_cgroup lock for kmem pages
Kmem page charging and uncharging is serialized by means of exclusive
access to the page. Do not take the page_cgroup lock and don't set
pc->flags atomically.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:38 +0000 (10:42 +1000)]
mm: memcontrol: remove ordering between pc->mem_cgroup and PageCgroupUsed
There is a write barrier between setting pc->mem_cgroup and
PageCgroupUsed, which was added to allow LRU operations to lookup the
memcg LRU list of a page without acquiring the page_cgroup lock.
But ever since 38c5d72f3ebe ("memcg: simplify LRU handling by new rule"),
pages are ensured to be off-LRU while charging, so nobody else is changing
LRU state while pc->mem_cgroup is being written, and there are no read
barriers anymore.
Remove the unnecessary write barrier.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:37 +0000 (10:42 +1000)]
mm: memcontrol: use root_mem_cgroup res_counter
Due to an old optimization to keep expensive res_counter changes at a
minimum, the root_mem_cgroup res_counter is never charged; there is no
limit at that level anyway, and any statistics can be generated on demand
by summing up the counters of all other cgroups.
However, with per-cpu charge caches, res_counter operations do not even
show up in profiles anymore, so this optimization is no longer necessary.
Remove it to simplify the code.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:37 +0000 (10:42 +1000)]
mm: memcontrol: catch root bypass in move precharge
When mem_cgroup_try_charge() returns -EINTR, it bypassed the charge to the
root memcg. But move precharging does not catch this and treats this case
as if no charge had happened, thus leaking a charge against root. Because
of an old optimization, the root memcg's res_counter is not actually
charged right now, but it's still an imbalance and subsequent patches will
charge the root memcg again.
Catch those bypasses to the root memcg and properly cancel them before
giving up the move.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:37 +0000 (10:42 +1000)]
mm: memcontrol: simplify move precharge function
The move precharge function does some baroque things: it tries raw
res_counter charging of the entire amount first, and then falls back to a
loop of one-by-one charges, with checks for pending signals and
cond_resched() batching.
Just use mem_cgroup_try_charge() without __GFP_WAIT for the first bulk
charge attempt. In the one-by-one loop, remove the signal check (this is
already checked in try_charge), and simply call cond_resched() after every
charge - it's not that expensive.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Thu, 26 Jun 2014 00:42:37 +0000 (10:42 +1000)]
mm: memcontrol: remove explicit OOM parameter in charge path
For the page allocator, __GFP_NORETRY implies that no OOM should be
triggered, whereas memcg has an explicit parameter to disable OOM.
The only callsites that want OOM disabled are THP charges and charge
moving. THP already uses __GFP_NORETRY and charge moving can use it as
well - one full reclaim cycle should be plenty. Switch it over, then
remove the OOM parameter.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:37 +0000 (10:42 +1000)]
mm: memcontrol: retry reclaim for oom-disabled and __GFP_NOFAIL charges
There is no reason why oom-disabled and __GFP_NOFAIL charges should try to
reclaim only once when every other charge tries several times before
giving up. Make them all retry the same number of times.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:36 +0000 (10:42 +1000)]
mm: huge_memory: use GFP_TRANSHUGE when charging huge pages
Transparent huge page charges prefer falling back to regular pages rather
than spending a lot of time in direct reclaim.
Desired reclaim behavior is usually declared in the gfp mask, but THP
charges use GFP_KERNEL and then rely on the fact that OOM is disabled for
THP charges, and that OOM-disabled charges don't retry reclaim. Needless
to say, this is anything but obvious and quite error prone.
Convert THP charges to use GFP_TRANSHUGE instead, which implies
__GFP_NORETRY, to indicate the low-latency requirement.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:36 +0000 (10:42 +1000)]
mm: memcontrol: reclaim at least once for __GFP_NORETRY
Currently, __GFP_NORETRY tries charging once and gives up before even
trying to reclaim. Bring the behavior on par with the page allocator and
reclaim at least once before giving up.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:36 +0000 (10:42 +1000)]
mm: memcontrol: rearrange charging fast path
The charging path currently starts out with OOM condition checks when OOM
is the rarest possible case.
Rearrange this code to run OOM/task dying checks only after trying the
percpu charge and the res_counter charge and bail out before entering
reclaim. Attempting a charge does not hurt an (oom-)killed task as much
as every charge attempt having to check OOM conditions. Also, only check
__GFP_NOFAIL when the charge would actually fail.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:36 +0000 (10:42 +1000)]
mm: memcontrol: fold mem_cgroup_do_charge()
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages. This drastically simplifies the code and
reduces charging and uncharging overhead. The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.
Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
executing in the root memcg). Before:
As you can see, the memcg footprint has shrunk quite a bit.
text data bss dec hex filename
37970 9892 400 48262 bc86 mm/memcontrol.o.old
35239 9892 400 45531 b1db mm/memcontrol.o
This patch (of 13):
This function was split out because mem_cgroup_try_charge() got too big.
But having essentially one sequence of operations arbitrarily split in
half is not good for reworking the code. Fold it back in.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Waiman Long [Thu, 26 Jun 2014 00:42:35 +0000 (10:42 +1000)]
mm, thp: replace smp_mb after atomic_add by smp_mb__after_atomic
In some architectures like x86, atomic_add() is a full memory barrier. In
that case, an additional smp_mb() is just a waste of time. This patch
replaces that smp_mb() by smp_mb__after_atomic() which will avoid the
redundant memory barrier in some architectures.
With a 3.16-rc1 based kernel, this patch reduced the execution time of
breaking 1000 transparent huge pages from 38,245us to 30,964us. A
reduction of 19% which is quite sizeable. It also reduces the %cpu time
of the __split_huge_page_refcount function in the perf profile from 2.18%
to 1.15%.
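In outline, the change replaces an unconditional barrier after the atomic with one that compiles away where the atomic already implies a full barrier (a sketch, not the exact __split_huge_page_refcount() hunk; 'v' stands in for the refcount being updated):

    atomic_add(count, &v);          /* already a full barrier on e.g. x86 */
    smp_mb__after_atomic();         /* was: smp_mb(); now a no-op where the
                                       atomic op implies the barrier */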
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Waiman Long [Thu, 26 Jun 2014 00:42:35 +0000 (10:42 +1000)]
mm, thp: move invariant bug check out of loop in __split_huge_page_map
In __split_huge_page_map(), the check for page_mapcount(page) is invariant
within the for loop. Because the macro is implemented using atomic_read(),
the redundant check cannot be optimized away by the compiler, leading to
unnecessary reads of the page structure.
This patch moves the invariant bug check out of the loop so that it is done
only once. On a 3.16-rc1 based kernel, a microbenchmark that broke up 1000
transparent huge pages using munmap() took 38,245us with the patch and
38,548us without it. The performance gain is about 1%.
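The transformation is the classic loop-invariant hoist; a small self-contained illustration (not the kernel code):

    #include <assert.h>

    static void process(const int *items, int n, int mapcount)
    {
            assert(mapcount > 0);           /* checked once; was inside the loop */

            for (int i = 0; i < n; i++) {
                    /* per-item work; no repeated re-read of the invariant */
                    (void)items[i];
            }
    }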
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- add comment on page_size_order()
- use compound_order(compound_head(page)) instead of huge_page_size_order()
- use page_pgoff() in rmap_walk_file() too
- use page_size_order() in kill_proc()
- fix space indent
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- fix wrong shift direction
- introduce page_size_order() and huge_page_size_order()
- move the declaration of PageHuge() to include/linux/hugetlb_inline.h
to avoid macro definition.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Thu, 26 Jun 2014 00:42:34 +0000 (10:42 +1000)]
mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
page->index stores the pagecache index when the page is mapped into a file
mapping region, and the index is in pagecache-size units, so it depends on
the page size. Some users of reverse mapping obviously assume that
page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
hugepages.
For example, consider that we have 3-hugepage vma and try to mbind the 2nd
hugepage to migrate to another node. Then the vma is split and
migrate_page() is called for the 2nd hugepage (belonging to the middle
vma). During the migrate operation, rmap_walk_anon() tries to find the
relevant vma to which the target hugepage belongs, but here we miscalculate
pgoff. So anon_vma_interval_tree_foreach() grabs an invalid vma, which fires
a VM_BUG_ON.
This patch introduces a new API that is usable both for normal page and
hugepage to get PAGE_SIZE offset from page->index. Users should clearly
distinguish page_index for pagecache index and page_pgoff for page offset.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:34 +0000 (10:42 +1000)]
mm, CMA: change cma_declare_contiguous() to obey coding convention
Conventionally, we put output parameters at the end of the parameter list and
put 'base' ahead of 'size', but cma_declare_contiguous() doesn't follow that,
so change it.
Additionally, move the cma_areas reference code down to the position where it
is really needed.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:34 +0000 (10:42 +1000)]
PPC, KVM: fix build failure due to removed file
Commit ('PPC, KVM, CMA: use general CMA reserved area management
framework') removes book3s_hv_cma.c but missed removing its entry in the
Makefile. Fix it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:33 +0000 (10:42 +1000)]
CMA: fix ARM build failure related to MAX_CMA_AREAS definition
If CMA is disabled, CONFIG_CMA_AREAS isn't defined, so a compile error
happens. To fix it, define MAX_CMA_AREAS even when CONFIG_CMA_AREAS isn't
defined.
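The fix is essentially a fallback definition along these lines (a sketch; the exact header may differ slightly):

    #ifdef CONFIG_CMA_AREAS
    #define MAX_CMA_AREAS   (1 + CONFIG_CMA_AREAS)
    #else
    #define MAX_CMA_AREAS   (0)
    #endif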
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:33 +0000 (10:42 +1000)]
CMA: generalize CMA reserved area management functionality
Currently there are two users of the CMA functionality: one is the DMA
subsystem and the other is KVM on powerpc. Each has its own code to manage
the CMA reserved area, even though the two look really similar. My guess is
that this is caused by differing needs in bitmap management: the KVM side
wants to maintain the bitmap not per page but at a coarser granularity, and
eventually uses a bitmap where one bit represents 64 pages.
When I implement CMA-related patches, I have to change both places, and that
is painful. I want to change this situation and reduce the future
code-management overhead with this patch.
This change could also help developers who want to use CMA in new features,
since they can use CMA easily without copying and pasting this reserved area
management code.
In previous patches, we have prepared some features to generalize CMA
reserved area management, and now it's time to do it. This patch moves the
core functions to mm/cma.c and changes the DMA APIs to use these functions.
There is no functional change in the DMA APIs.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:32 +0000 (10:42 +1000)]
DMA, CMA: support arbitrary bitmap granularity
PPC KVM's CMA area management requires arbitrary bitmap granularity: it wants
to reserve very large memory and manage the region with a bitmap in which one
bit covers several pages, to reduce management overhead. So support arbitrary
bitmap granularity for the following generalization.
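The granularity math is simple: with order_per_bit = k, one bitmap bit covers 2^k pages. A hedged sketch of the helpers this implies (illustrative names, not the kernel's):

    /* number of bitmap bits needed for 'pages' pages (pages assumed aligned) */
    static unsigned long bitmap_bits(unsigned long pages,
                                     unsigned int order_per_bit)
    {
            return pages >> order_per_bit;
    }

    /* bit index covering page frame 'pfn' within the CMA area */
    static unsigned long bitmap_bit_of(unsigned long pfn, unsigned long base_pfn,
                                       unsigned int order_per_bit)
    {
            return (pfn - base_pfn) >> order_per_bit;
    }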
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:32 +0000 (10:42 +1000)]
DMA, CMA: separate core CMA management codes from DMA APIs
To prepare for future generalization work on the CMA area management code, we
need to separate the core CMA management code from the DMA APIs. We will
extend these core functions to cover the requirements of PPC KVM's CMA area
management in the following patches. This separation lets us avoid touching
the DMA APIs while extending the core functions.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:32 +0000 (10:42 +1000)]
slab: set free_limit for dead caches to 0
We mustn't keep empty slabs on dead memcg caches, because otherwise they
will never be destroyed.
Currently, we check if the cache is dead in free_block and drop empty slabs
if so, irrespective of the node's free_limit. This can be pretty expensive.
So let's avoid this additional check by zeroing the nodes' free_limit for
dead caches in kmem_cache_shrink. This way no additional overhead is added
to the free hot path.
Note, since ->free_limit can be updated on cpu/memory hotplug, we must
handle it properly there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
slab: do not keep free objects/slabs on dead memcg caches
Since a dead memcg cache is destroyed only after the last slab allocated
to it is freed, we must disable caching of free objects/slabs for such
caches, otherwise they will be hanging around forever.
For SLAB that means we must disable per cpu free object arrays and make
free_block always discard empty slabs irrespective of node's free_limit.
To disable per cpu arrays, we free them on kmem_cache_shrink (see
drain_cpu_caches -> do_drain) and make __cache_free fall back to
free_block if there is no per cpu array. Also, we have to disable
allocation of per cpu arrays on cpu hotplug for dead caches (see
cpuup_prepare, __do_tune_cpucache).
After we have disabled caching of free objects/slabs, there is no need to
reap those caches periodically; doing so would only result in a slowdown.
So we also make cache_reap skip them.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
slub: kmem_cache_shrink: check if partial list is empty under list_lock
SLUB's implementation of kmem_cache_shrink skips nodes that have
nr_partial=0, because they surely don't have any empty slabs to free.
This check is done without holding any locks, therefore it can race with a
concurrent kfree adding an empty slab to a partial list. As a result, a
just-shrunk cache can have empty slabs.
This is unacceptable for kmemcg, which needs to be sure that there will be
no empty slabs on dead memcg caches after kmem_cache_shrink was called,
because otherwise we may leak a dead cache.
Let's fix this race by checking if node partial list is empty under
node->list_lock. Since the nr_partial!=0 branch of kmem_cache_shrink does
nothing if the list is empty, we can simply remove the nr_partial=0 check.
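In outline, the decision moves under the lock (a sketch of the idea, not the exact __kmem_cache_shrink() hunk):

    spin_lock_irqsave(&n->list_lock, flags);
    if (!list_empty(&n->partial)) {
            /* sort slabs and free the empty ones, still under list_lock */
    }
    spin_unlock_irqrestore(&n->list_lock, flags);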
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
slub: make dead memcg caches discard free slabs immediately
Since a dead memcg cache is destroyed only after the last slab allocated
to it is freed, we must disable caching of empty slabs for such caches,
otherwise they will be hanging around forever.
This patch makes SLUB discard dead memcg caches' slabs as soon as they
become empty. To achieve that, it disables per cpu partial lists for dead
caches (see put_cpu_partial) and forbids keeping empty slabs on per node
partial lists by setting cache's min_partial to 0 on kmem_cache_shrink,
which is always called on memcg offline (see memcg_unregister_all_caches).
Thanks to Joonsoo Kim.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
memcg: wait for kfree's to finish before destroying cache
kmem_cache_free doesn't expect that the cache can be destroyed as soon as
the object is freed, e.g. SLUB's implementation may want to update cache
stats after putting the object to the free list.
Therefore we should wait for all kmem_cache_free's to finish before
proceeding to cache destruction. Since both SLAB and SLUB versions of
kmem_cache_free are non-preemptable, we wait for rcu-sched grace period to
elapse.
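The resulting teardown ordering looks roughly like this (memcg_kmem_cache_teardown is a hypothetical helper name, not the actual kmemcg code):

    static void memcg_kmem_cache_teardown(struct kmem_cache *s)
    {
            /* kmem_cache_free() runs with preemption (or irqs) disabled in both
             * allocators, so a sched-RCU grace period waits out in-flight frees */
            synchronize_sched();
            kmem_cache_destroy(s);
    }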
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
slub: make slab_free non-preemptable
Since per memcg cache destruction is scheduled when the last slab is
freed, to avoid use-after-free in kmem_cache_free we should either
rearrange code in kmem_cache_free so that it won't dereference the cache
ptr after freeing the object, or wait for all kmem_cache_free's to
complete before proceeding to cache destruction.
The former approach isn't a good option from a future-development point of
view, because every modification to kmem_cache_free would then have to be
done with great care. Hence we should provide a way to wait for all
currently executing kmem_cache_free's to finish.
This patch makes SLUB's implementation of kmem_cache_free non-preemptable.
As a result, synchronize_sched() will work as a barrier against
kmem_cache_free's in flight, so that issuing it before cache destruction
will protect us against the use-after-free.
This won't affect performance of kmem_cache_free, because we already
disable preemption there, and this patch only moves preempt_enable to the
end of the function. Neither should it affect the system latency, because
kmem_cache_free is extremely short, even in its slow path.
SLAB's version of kmem_cache_free already proceeds with irqs disabled, so
we only add a comment explaining why it's necessary for kmemcg there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:30 +0000 (10:42 +1000)]
slub: don't fail kmem_cache_shrink if slab placement optimization fails
SLUB's kmem_cache_shrink not only removes empty slabs from the cache, but
also sorts slabs by the number of objects in-use to cope with
fragmentation. To achieve that, it tries to allocate a temporary array.
If it fails, it will abort the whole procedure.
This is unacceptable for kmemcg, where we want to be sure that all empty
slabs are removed from the cache on memcg offline, so let's just skip the
slab placement optimization step if the allocation fails, but still get
rid of empty slabs.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:30 +0000 (10:42 +1000)]
memcg: mark caches that belong to offline memcgs as dead
This will be used by the next patches.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:30 +0000 (10:42 +1000)]
memcg: destroy kmem caches when last slab is freed
When the memcg_cache_params->refcnt goes to 0, schedule the worker that
will unregister the cache. To prevent this from happening when the owner
memcg is alive, keep the refcnt incremented during memcg lifetime.
Note, this doesn't guarantee that the cache that belongs to a dead memcg
will go away as soon as the last object is freed, because SL[AU]B
implementation can cache empty slabs for performance reasons. Hence the
cache may be hanging around indefinitely after memcg offline. This is to
be resolved by the next patches.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:30 +0000 (10:42 +1000)]
memcg: cleanup memcg_cache_params refcnt usage
When a memcg is turned offline, some of its kmem caches can still have
active objects and therefore cannot be destroyed immediately. Currently,
we simply leak such caches along with the owner memcg, which is bad and
should be resolved.
It would be perfect if we could move all slab pages of such dead caches to
the root/parent cache on memcg offline. However, when I tried to implement
such re-parenting, Christoph pointed out that the overhead of this would be
unacceptable, at least for SLUB (see https://lkml.org/lkml/2014/5/13/446).
The problem with re-parenting of individual slabs is that it requires
tracking of all slabs allocated to a cache, but SLUB doesn't track full
slabs if !debug. Changing this behavior would result in significant
performance degradation of regular alloc/free paths, because it would make
alloc/free take per node list locks more often.
After pondering about this problem for some time, I think we should return
to dead caches self-destruction, i.e. scheduling cache destruction work
when the last slab page is freed.
This is the behavior we had before commit 5bd93da9917f ("memcg, slab:
simplify synchronization scheme"). The reason why it was removed was that
it simply didn't work, because SL[AU]B are implemented in such a way that
they don't discard empty slabs immediately, but prefer keeping them cached
for indefinite time to speed up further allocations.
However, we can change this w/o noticeable performance impact for both
SLAB and SLUB by making them drop free slabs as soon as they become empty.
Since dead caches should never be allocated from, removing empty slabs
from them shouldn't result in noticeable performance degradation.
So, this patch set reintroduces dead cache self-destruction and adds some
tweaks to SL[AU]B to prevent dead caches from hanging around indefinitely.
It is organized as follows:
- patches 1-3 reintroduce dead memcg cache self-destruction;
- patch 4 makes SLUB's version of kmem_cache_shrink always drop empty
slabs, even if it fails to allocate a temporary array;
- patches 5 and 6 fix possible use-after-free connected with
asynchronous cache destruction;
- patches 7 and 8 disable caching of empty slabs for dead memcg caches
for SLUB and SLAB respectively.
Note, this doesn't resolve the problem of memcgs pinned by dead kmem
caches. I'm planning to solve this by re-parenting dead kmem caches to
the parent memcg.
This patch (of 8):
Currently, we count the number of pages allocated to a per memcg cache in
memcg_cache_params->nr_pages. We only use this counter to find out if the
cache is empty and can be destroyed. So let's rename it to refcnt and
make it count not pages, but slabs so that we can use atomic_inc/dec
instead of atomic_add/sub in memcg_charge/uncharge_slab.
Also, as the number of slabs theoretically can be greater than INT_MAX,
let's use atomic_long for the counter.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: page_alloc: simplify drain_zone_pages by using min()
Instead of open-coding taking the minimum of two values, just use the min()
macro; that is what it is there for. While changing the function, also change
the type of the batch local variable to match the type of
per_cpu_pages::batch (which is int).
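The change is essentially (a sketch of the before/after, not the literal diff):

    /* before: open-coded minimum */
    to_drain = pcp->count >= batch ? batch : pcp->count;

    /* after: use the min() macro, with 'batch' now an int to match pcp->batch */
    to_drain = min(pcp->count, batch);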
Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Thu, 26 Jun 2014 00:42:29 +0000 (10:42 +1000)]
mm/memory.c: use entry = ACCESS_ONCE(*pte) in handle_pte_fault()
Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or orig_pte
upon which all subsequent decisions and pte_same() tests will be made.
I have no evidence that its lack is responsible for the mm/filemap.c:202
BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by trinity,
and I am not optimistic that it will fix it. But I have found no other
explanation, and ACCESS_ONCE() here will surely not hurt.
If gcc does re-access the pte before passing it down, then that would be
disastrous for correct page fault handling, and certainly could explain
the page_mapped() BUGs seen (concurrent fault causing page to be mapped in
a second time on top of itself: mapcount 2 for a single pte).
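The pattern added is a single forced load of the pte (a sketch of the idea, not the full handle_pte_fault() context):

    /* snapshot the pte once; all later decisions and pte_same() checks use
     * 'entry'/'orig_pte' rather than a possibly re-read *pte */
    entry = ACCESS_ONCE(*pte);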
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:29 +0000 (10:42 +1000)]
vmalloc: use rcu list iterator to reduce vmap_area_lock contention
Richard Yao reported a month ago that his system has trouble with
vmap_area_lock contention during performance analysis via /proc/meminfo.
Andrew asked why his analysis reads /proc/meminfo so heavily, but he didn't
answer it.
https://lkml.org/lkml/2014/4/10/416
Although I'm not sure whether this is the right usage, there is a solution
that reduces vmap_area_lock contention with no side effects: just use the
rcu list iterator in get_vmalloc_info().
RCU can be used in this function because the RCU protocol is already
respected by the writers, ever since Nick Piggin's commit db64fe02258f1
("mm: rewrite vmap layer") back in linux-2.6.28.
Specifically :
insertions use list_add_rcu(),
deletions use list_del_rcu() and kfree_rcu().
Note the rb tree is not used from rcu reader (it would not be safe),
only the vmap_area_list has full RCU protection.
Note that __purge_vmap_area_lazy() already uses this rcu protection.
: While rcu list traversal over the vmap_area_list is safe, this may
: arrive at different results than the spinlocked version. The rcu list
: traversal version will not be a 'snapshot' of a single, valid instant
: of the entire vmap_area_list, but rather a potential amalgam of
: different list states.
Joonsoo:
: Yes, you are right, but I don't think that we should be strict here.
: Meminfo is already not a 'snapshot' at specific time. While we try to get
: certain stats, the other stats can change. And, although we may arrive at
: different results than the spinlocked version, the difference would not be
: large and would not make serious side-effect.
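The rcu-protected walk described above looks roughly like this (a sketch, not the exact get_vmalloc_info() code):

    struct vmap_area *va;

    rcu_read_lock();
    list_for_each_entry_rcu(va, &vmap_area_list, list) {
            /* accumulate vmalloc statistics from 'va' without vmap_area_lock */
    }
    rcu_read_unlock();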
[edumazet@google.com: add more commit description] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Richard Yao <ryao@gentoo.org> Acked-by: Eric Dumazet <edumazet@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chen Yucong [Thu, 26 Jun 2014 00:42:28 +0000 (10:42 +1000)]
hwpoison: fix the handling path of the victimized page frame that belong to non-LRU
Until now, the kernel has used the same policy to handle victimized page
frames that belong to kernel space (reserved/slab subsystem) or are non-LRU
(unknown page state). In other words, the result of handling either of these
victimized page frames is (IGNORED | FAILED), and the return value of
memory_failure() is -EBUSY.
This patch avoids memory_failure() returning too soon just because
(!PageLRU(p)) is true, and it also ensures that action_result() can report
more precise information ("reserved kernel", "kernel slab", and "unknown page
state") instead of "non LRU", especially for memory errors detected by
memory scrubbing.
Wei Yang [Thu, 26 Jun 2014 00:42:27 +0000 (10:42 +1000)]
slub: reduce duplicate creation on the first object
When a kmem_cache is created with a ctor, each object in the kmem_cache is
initialized before use. In the SLUB implementation, the first object is
initialized twice.
This patch avoids the duplicate initialization of the first object.
Fixes commit 7656c72b5a63 ("SLUB: add macros for scanning objects in a
slab").
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Ryabinin [Thu, 26 Jun 2014 00:42:27 +0000 (10:42 +1000)]
mm: slub: SLUB_DEBUG=n: use the same alloc/free hooks as for SLUB_DEBUG=y
There are two versions of alloc/free hooks now - one for
CONFIG_SLUB_DEBUG=y and another one for CONFIG_SLUB_DEBUG=n.
I see no reason why calls to other debugging subsystems (LOCKDEP,
DEBUG_ATOMIC_SLEEP, KMEMCHECK and FAILSLAB) are hidden under SLUB_DEBUG.
All these features should work regardless of the SLUB_DEBUG config, as all of
them already have their own Kconfig options.
This also fixes failslab for the CONFIG_SLUB_DEBUG=n configuration. It simply
did not work before because the should_failslab() call was in a hook hidden
under "#ifdef CONFIG_SLUB_DEBUG #else".
Note: There is one concealed change in allocation path for SLUB_DEBUG=n
and all other debugging features disabled. The might_sleep_if() call can
generate some code even if DEBUG_ATOMIC_SLEEP=n. For PREEMPT_VOLUNTARY=y
might_sleep() inserts _cond_resched() call, but I think it should be ok.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Thu, 26 Jun 2014 00:42:27 +0000 (10:42 +1000)]
mm, slub: mark resiliency_test as init text
resiliency_test() is only called for bootstrap, so it may be moved to
init.text and freed after boot.
Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Ryabinin [Thu, 26 Jun 2014 00:42:27 +0000 (10:42 +1000)]
mm: slab.h: wrap the whole file with guarding macro
Guarding section:
#ifndef MM_SLAB_H
#define MM_SLAB_H
...
#endif
currently doesn't cover the whole of mm/slab.h. This seems to have been
unintentional.
Wrap the whole file by moving the closing #endif to the end of it.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/slab.c: In function 'slab_set_debugobj_lock_classes':
mm/slab.c:524: error: 'h' undeclared (first use in this function)
mm/slab.c:524: error: (Each undeclared identifier is reported only once
mm/slab.c:524: error: for each function it appears in.)
mm/slab.c:524: warning: left-hand operand of comma expression has no effect
mm/slab.c: In function 'cpuup_prepare':
mm/slab.c:1308: warning: passing argument 2 of 'slab_set_debugobj_lock_classes_node' makes pointer from integer without a cast
mm/slab.c:513: note: expected 'struct kmem_cache_node *' but argument is of type 'int'
Cc: Christoph Lameter <cl@gentwo.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > @@ -3759,8 +3746,8 @@ fail:
> > /* Cache is not active yet. Roll back what we did */
> > node--;
> > while (node >= 0) {
> > - if (cachep->node[node]) {
> > - n = cachep->node[node];
> > + if (get_node(cachep, node)) {
> > + n = get_node(cachep, node);
>
> Could you do this as following?
>
> n = get_node(cachep, node);
> if (n) {
> ...
> }
Sure....
Subject: slab: Fixes to earlier patch
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
slab: use get_node() and kmem_cache_node() functions
Use the two functions to simplify the code, avoiding numerous explicit
checks for whether a certain node is online.
Get rid of various repeated calculations of kmem_cache_node structures.
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ok, I went through the file and removed all the lines after
for_each_kmem_cache_node.
>
> > @@ -3407,11 +3401,7 @@ int __kmem_cache_shrink(struct kmem_cach
> > return -ENOMEM;
> >
> > flush_all(s);
> > - for_each_node_state(node, N_NORMAL_MEMORY) {
> > - n = get_node(s, node);
> > -
> > - if (!n->nr_partial)
> > - continue;
> > + for_each_kmem_cache_node(s, node, n) {
> >
> > for (i = 0; i < objects; i++)
> > INIT_LIST_HEAD(slabs_by_inuse + i);
>
> Is there any reason not to keep the !n->nr_partial check to avoid taking
> n->list_lock unnecessarily?
No this was simply a mistake the check needs to be preserved.
Subject: slub: Fix up earlier patch
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Make use of the new node functions in mm/slab.h to reduce code size and
simplify.
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>