Tim Chen [Wed, 22 Feb 2017 23:45:36 +0000 (15:45 -0800)]
mm/swap: free swap slots in batch
Add new functions that free unused swap slots in batches without the
need to reacquire swap info lock. This improves scalability and reduce
lock contention.
Link: http://lkml.kernel.org/r/c25e0fcdfd237ec4ca7db91631d3b9f6ed23824e.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tim Chen [Wed, 22 Feb 2017 23:45:33 +0000 (15:45 -0800)]
mm/swap: allocate swap slots in batches
Currently, the swap slots are allocated one page at a time, causing
contention to the swap_info lock protecting the swap partition on every
page being swapped.
This patch adds new functions get_swap_pages and scan_swap_map_slots to
request multiple swap slots at once. This will reduces the lock
contention on the swap_info lock. Also scan_swap_map_slots can operate
more efficiently as swap slots often occurs in clusters close to each
other on a swap device and it is quicker to allocate them together.
Link: http://lkml.kernel.org/r/9fec2845544371f62c3763d43510045e33d286a6.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huang, Ying [Wed, 22 Feb 2017 23:45:26 +0000 (15:45 -0800)]
mm/swap: split swap cache into 64MB trunks
The patch is to improve the scalability of the swap out/in via using
fine grained locks for the swap cache. In current kernel, one address
space will be used for each swap device. And in the common
configuration, the number of the swap device is very small (one is
typical). This causes the heavy lock contention on the radix tree of
the address space if multiple tasks swap out/in concurrently.
But in fact, there is no dependency between pages in the swap cache. So
that, we can split the one shared address space for each swap device
into several address spaces to reduce the lock contention. In the
patch, the shared address space is split into 64MB trunks. 64MB is
chosen to balance the memory space usage and effect of lock contention
reduction.
The size of struct address_space on x86_64 architecture is 408B, so with
the patch, 6528B more memory will be used for every 1GB swap space on
x86_64 architecture.
One address space is still shared for the swap entries in the same 64M
trunks. To avoid lock contention for the first round of swap space
allocation, the order of the swap clusters in the initial free clusters
list is changed. The swap space distance between the consecutive swap
clusters in the free cluster list is at least 64M. After the first
round of allocation, the swap clusters are expected to be freed
randomly, so the lock contention should be reduced effectively.
Link: http://lkml.kernel.org/r/735bab895e64c930581ffb0a05b661e01da82bc5.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huang, Ying [Wed, 22 Feb 2017 23:45:22 +0000 (15:45 -0800)]
mm/swap: add cluster lock
This patch is to reduce the lock contention of swap_info_struct->lock
via using a more fine grained lock in swap_cluster_info for some swap
operations. swap_info_struct->lock is heavily contended if multiple
processes reclaim pages simultaneously. Because there is only one lock
for each swap device. While in common configuration, there is only one
or several swap devices in the system. The lock protects almost all
swap related operations.
In fact, many swap operations only access one element of
swap_info_struct->swap_map array. And there is no dependency between
different elements of swap_info_struct->swap_map. So a fine grained
lock can be used to allow parallel access to the different elements of
swap_info_struct->swap_map.
In this patch, a spinlock is added to swap_cluster_info to protect the
elements of swap_info_struct->swap_map in the swap cluster and the
fields of swap_cluster_info. This reduced locking contention for
swap_info_struct->swap_map access greatly.
Because of the added spinlock, the size of swap_cluster_info increases
from 4 bytes to 8 bytes on the 64 bit and 32 bit system. This will use
additional 4k RAM for every 1G swap space.
Because the size of swap_cluster_info is much smaller than the size of
the cache line (8 vs 64 on x86_64 architecture), there may be false
cache line sharing between spinlocks in swap_cluster_info. To avoid the
false sharing in the first round of the swap cluster allocation, the
order of the swap clusters in the free clusters list is changed. So
that, the swap_cluster_info sharing the same cache line will be placed
as far as possible. After the first round of allocation, the order of
the clusters in free clusters list is expected to be random. So the
false sharing should be not serious.
Compared with a previous implementation using bit_spin_lock, the
sequential swap out throughput improved about 3.2%. Test was done on a
Xeon E5 v3 system. The swap device used is a RAM simulated PMEM
(persistent memory) device. To test the sequential swapping out, the
test case created 32 processes, which sequentially allocate and write to
the anonymous pages until the RAM and part of the swap device is used.
[ying.huang@intel.com: v5] Link: http://lkml.kernel.org/r/878tqeuuic.fsf_-_@yhuang-dev.intel.com
[minchan@kernel.org: initialize spinlock for swap_cluster_info] Link: http://lkml.kernel.org/r/1486434945-29753-1-git-send-email-minchan@kernel.org
[hughd@google.com: annotate nested locking for cluster lock] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702161050540.21773@eggly.anvils Link: http://lkml.kernel.org/r/dbb860bbd825b1aaba18988015e8963f263c3f0d.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huang, Ying [Wed, 22 Feb 2017 23:45:19 +0000 (15:45 -0800)]
mm/swap: fix kernel message in swap_info_get()
Patch series "mm/swap: Regular page swap optimizations", v5.
Times have changed. Coming generation of Solid state Block device
latencies are getting down to sub 100 usec, which is within an order of
magnitude of DRAM, and their performance is orders of magnitude higher
than the single- spindle rotational media we've swapped to historically.
This could benefit many usage scenearios. For example cloud providers
who overcommit their memory (as VM don't use all the memory
provisioned). Having a fast swap will allow them to be more aggressive
in memory overcommit and fit more VMs to a platform.
In our testing [see footnote], the median latency that the kernel adds
to a page fault is 15 usec, which comes quite close to the amount that
will be contributed by the underlying I/O devices.
The software latency comes mostly from contentions on the locks
protecting the radix tree of the swap cache and also the locks
protecting the individual swap devices. The lock contentions already
consumed 35% of cpu cycles in our test. In the very near future,
software latency will become the bottleneck to swap performnace as block
device I/O latency gets within the shouting distance of DRAM speed.
This patch set, reduced the median page fault latency from 15 usec to 4
usec (375% reduction) for DRAM based pmem block device.
This patch (of 9):
swap_info_get() is used not only in swap free code path but also in
page_swapcount(), etc. So the original kernel message in swap_info_get()
is not correct now. Fix it via replacing "swap_free" to "swap_info_get"
in the message.
Link: http://lkml.kernel.org/r/9b5f8bd6266f9da978c373f2384c8044df5e262c.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is all correct, the load region containing the PLT is marked as
executable. Note that the PLT starts at 0002b00c but the file mapping
ends at 0002aff8, so the PLT falls in the 0 fill section described by
the load header, and after a page boundary.
Unfortunately the generic ELF loader ignores the X bit in the load
headers when it creates the 0 filled non-file backed mappings. It
assumes all of these mappings are RW BSS sections, which is not the case
for PPC.
gcc/ld has an option (--secure-plt) to not do this, this is said to
incur a small performance penalty.
Currently, to support 32-bit binaries with PLT in BSS kernel maps
*entire brk area* with executable rights for all binaries, even
--secure-plt ones.
Stop doing that.
Teach the ELF loader to check the X bit in the relevant load header and
create 0 filled anonymous mappings that are executable if the load
header requests that.
Test program showing the difference in /proc/$PID/maps:
int main() {
char buf[16*1024];
char *p = malloc(123); /* make "[heap]" mapping appear */
int fd = open("/proc/self/maps", O_RDONLY);
int len = read(fd, buf, sizeof(buf));
write(1, buf, len);
printf("%p\n", p);
return 0;
}
mm/memory_hotplug: set magic number to page->freelist instead of page->lru.next
To identify that pages of page table are allocated from bootmem
allocator, magic number sets to page->lru.next.
But page->lru list is initialized in reserve_bootmem_region(). So when
calling free_pagetable(), the function cannot find the magic number of
pages. And free_pagetable() frees the pages by free_reserved_page() not
put_page_bootmem().
But if the pages are allocated from bootmem allocator and used as page
table, the pages have private flag. So before freeing the pages, we
should clear the private flag by put_page_bootmem().
Before applying the commit 7bfec6f47bb0 ("mm, page_alloc: check multiple
page fields with a single branch"), we could find the following visible
issue:
BUG: Bad page state in process kworker/u1024:1
page:ffffea103cfd8040 count:0 mapcount:0 mappi
flags: 0x6fffff80000800(private)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x800(private)
<snip>
Call Trace:
[...] dump_stack+0x63/0x87
[...] bad_page+0x114/0x130
[...] free_pages_prepare+0x299/0x2d0
[...] free_hot_cold_page+0x31/0x150
[...] __free_pages+0x25/0x30
[...] free_pagetable+0x6f/0xb4
[...] remove_pagetable+0x379/0x7ff
[...] vmemmap_free+0x10/0x20
[...] sparse_remove_one_section+0x149/0x180
[...] __remove_pages+0x2e9/0x4f0
[...] arch_remove_memory+0x63/0xc0
[...] remove_memory+0x8c/0xc0
[...] acpi_memory_device_remove+0x79/0xa5
[...] acpi_bus_trim+0x5a/0x8d
[...] acpi_bus_trim+0x38/0x8d
[...] acpi_device_hotplug+0x1b7/0x418
[...] acpi_hotplug_work_fn+0x1e/0x29
[...] process_one_work+0x152/0x400
[...] worker_thread+0x125/0x4b0
[...] kthread+0xd8/0xf0
[...] ret_from_fork+0x22/0x40
And the issue still silently occurs.
Until freeing the pages of page table allocated from bootmem allocator,
the page->freelist is never used. So the patch sets magic number to
page->freelist instead of page->lru.next.
[isimatu.yasuaki@jp.fujitsu.com: fix merge issue] Link: http://lkml.kernel.org/r/722b1cc4-93ac-dd8b-2be2-7a7e313b3b0b@gmail.com Link: http://lkml.kernel.org/r/2c29bd9f-5b67-02d0-18a3-8828e78bbb6f@gmail.com Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Yang [Wed, 22 Feb 2017 23:45:07 +0000 (15:45 -0800)]
mm/memblock.c: check return value of memblock_reserve() in memblock_virt_alloc_internal()
memblock_reserve() would add a new range to memblock.reserved in case
the new range is not totally covered by any of the current
memblock.reserved range. If the memblock.reserved is full and can't
resize, memblock_reserve() would fail.
This doesn't happen in real world now, I observed this during code
review. While theoretically, it has the chance to happen. And if it
happens, others would think this range of memory is still available and
may corrupt the memory.
This patch checks the return value and goto "done" after it succeeds.
Wei Yang [Wed, 22 Feb 2017 23:45:04 +0000 (15:45 -0800)]
mm/memblock.c: trivial code refine in memblock_is_region_memory()
memblock_is_region_memory() invoke memblock_search() to see whether the
base address is in the memory region. If it fails, idx would be -1.
Then, it returns 0.
If the memblock_search() returns a valid index, it means the base
address is guaranteed to be in the range memblock.memory.regions[idx].
Because of this, it is not necessary to check the base again.
Where the waitqueue_active() check is speculatively re-ordered to before
setting the actual condition (max_order), not seeing the threads that's
going to block; making us miss a wakeup. There are a couple of options
to fix this, including calling wq_has_sleepers() which adds a full
barrier, or unconditionally doing the wake_up_interruptible() and
serialize on the q->lock. However, to make use of the control
dependency, we just need to add L->L guarantees.
While this bug is theoretical, there have been other offenders of the
lockless waitqueue_active() in the past -- this is also documented in
the call itself.
Paul Burton [Wed, 22 Feb 2017 23:44:53 +0000 (15:44 -0800)]
mm: page_alloc: skip over regions of invalid pfns where possible
When using a sparse memory model memmap_init_zone() when invoked with
the MEMMAP_EARLY context will skip over pages which aren't valid - ie.
which aren't in a populated region of the sparse memory map. However if
the memory map is extremely sparse then it can spend a long time
linearly checking each PFN in a large non-populated region of the memory
map & skipping it in turn.
When CONFIG_HAVE_MEMBLOCK_NODE_MAP is enabled, we have sufficient
information to quickly discover the next valid PFN given an invalid one
by searching through the list of memory regions & skipping forwards to
the first PFN covered by the memory region to the right of the
non-populated region. Implement this in order to speed up
memmap_init_zone() for systems with extremely sparse memory maps.
James said "I have tested this patch on a virtual model of a Samurai CPU
with a sparse memory map. The kernel boot time drops from 109 to
62 seconds. "
Link: http://lkml.kernel.org/r/20161125185518.29885-1-paul.burton@imgtec.com Signed-off-by: Paul Burton <paul.burton@imgtec.com> Tested-by: James Hartley <james.hartley@imgtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 22 Feb 2017 23:44:50 +0000 (15:44 -0800)]
mm, compaction: add vmstats for kcompactd work
A "compact_daemon_wake" vmstat exists that represents the number of
times kcompactd has woken up. This doesn't represent how much work it
actually did, though.
It's useful to understand how much compaction work is being done by
kcompactd versus other methods such as direct compaction and explicitly
triggered per-node (or system) compaction.
This adds two new vmstats: "compact_daemon_migrate_scanned" and
"compact_daemon_free_scanned" to represent the number of pages kcompactd
has scanned as part of its migration scanner and freeing scanner,
respectively.
These values are still accounted for in the general
"compact_migrate_scanned" and "compact_free_scanned" for compatibility.
It could be argued that explicitly triggered compaction could also be
tracked separately, and that could be added if others find it useful.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612071749390.69852@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Steven Rostedt [Wed, 22 Feb 2017 23:44:47 +0000 (15:44 -0800)]
mm/mmzone.c: swap likely to unlikely as code logic is different for next_zones_zonelist()
Commit 682a3385e773 ("mm, page_alloc: inline the fast path of the
zonelist iterator") changed how next_zones_zonelist() is called, by
adding a static inline function to do the fast path. This function
adds:
Randy Dunlap [Wed, 22 Feb 2017 23:44:44 +0000 (15:44 -0800)]
mm: fix filemap.c kernel-doc warnings
Fix kernel-doc warnings in mm/filemap.c:
mm/filemap.c:993: warning: No description found for parameter '__page'
mm/filemap.c:993: warning: Excess function parameter 'page' description in '__lock_page'
Nicholas Piggin [Wed, 22 Feb 2017 23:44:41 +0000 (15:44 -0800)]
mm: un-export wake_up_page functions
These are no longer used outside mm/filemap.c, so un-export them and
make them static where possible. These were exported specifically for
NFS use in commit a4796e37c12e ("MM: export page_wakeup functions").
Link: http://lkml.kernel.org/r/20170103182234.30141-3-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 22 Feb 2017 23:44:33 +0000 (15:44 -0800)]
mm, vmscan: add mm_vmscan_inactive_list_is_low tracepoint
Currently we have tracepoints for both active and inactive LRU lists
reclaim but we do not have any which would tell us why we we decided to
age the active list. Without that it is quite hard to diagnose
active/inactive lists balancing. Add mm_vmscan_inactive_list_is_low
tracepoint to tell us this information.
Michal Hocko [Wed, 22 Feb 2017 23:44:30 +0000 (15:44 -0800)]
mm, vmscan: enhance mm_vmscan_lru_shrink_inactive tracepoint
mm_vmscan_lru_shrink_inactive will currently report the number of
scanned and reclaimed pages. This doesn't give us an idea how the
reclaim went except for the overall effectiveness though. Export and
show other counters which will tell us why we couldn't reclaim some
pages.
- nr_dirty, nr_writeback, nr_congested and nr_immediate tells
us how many pages are blocked due to IO
- nr_activate tells us how many pages were moved to the active
list
- nr_ref_keep reports how many pages are kept on the LRU due
to references (mostly for the file pages which are about to
go for another round through the inactive list)
- nr_unmap_fail - how many pages failed to unmap
All these are rather low level so they might change in future but the
tracepoint is already implementation specific so no tools should be
depending on its stability.
Michal Hocko [Wed, 22 Feb 2017 23:44:27 +0000 (15:44 -0800)]
mm, vmscan: extract shrink_page_list reclaim counters into a struct
shrink_page_list returns quite some counters back to its caller.
Extract the existing 5 into struct reclaim_stat because this makes the
code easier to follow and also allows further counters to be returned.
While we are at it, make all of them unsigned rather than unsigned long
as we do not really need full 64b for them (we never scan more than
SWAP_CLUSTER_MAX pages at once). This should reduce some stack space.
This patch shouldn't introduce any functional change.
Michal Hocko [Wed, 22 Feb 2017 23:44:24 +0000 (15:44 -0800)]
mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint
mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
from is file or anonymous but we do not know which LRU this is.
It is useful to know whether the list is active or inactive, since we
are using the same function to isolate pages from both of them and it's
hard to distinguish otherwise.
Link: http://lkml.kernel.org/r/20170104101942.4860-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 22 Feb 2017 23:44:21 +0000 (15:44 -0800)]
mm, vmscan: show the number of skipped pages in mm_vmscan_lru_isolate
mm_vmscan_lru_isolate shows the number of requested, scanned and taken
pages. This is mostly OK but on 32b systems the number of scanned pages
is quite misleading because it includes both the scanned and skipped
pages. Moreover the skipped part is scaled based on the number of taken
pages. Let's report the exact numbers without any additional logic and
add the number of skipped pages.
This should make the reported data much more easier to interpret.
Link: http://lkml.kernel.org/r/20170104101942.4860-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 22 Feb 2017 23:44:18 +0000 (15:44 -0800)]
mm, vmscan: add active list aging tracepoint
Our reclaim process has several tracepoints to tell us more about how
things are progressing. We are, however, missing a tracepoint to track
active list aging. Introduce mm_vmscan_lru_shrink_active which reports
the number of
- nr_taken is number of isolated pages from the active list
- nr_referenced pages which tells us that we are hitting referenced
pages which are deactivated. If this is a large part of the
reported nr_deactivated pages then we might be hitting into
the active list too early because they might be still part of
the working set. This might help to debug performance issues.
- nr_active pages which tells us how many pages are kept on the
active list - mostly exec file backed pages. A high number can
indicate that we might be trashing on executables.
Michal Hocko [Wed, 22 Feb 2017 23:44:15 +0000 (15:44 -0800)]
mm, vmscan: remove unused mm_vmscan_memcg_isolate
Patch series "vm, vmscan: enahance vmscan tracepoints", v2.
While debugging [2] I've realized that there is some room for
improvements in the tracepoints set we offer currently. I had hard
times to make any conclusion from the existing ones. The resulting
problem turned out to be active list aging [3] and we are missing at
least two tracepoints to debug such a problem.
Some existing tracepoints could export more information to see _why_ the
reclaim progress cannot be made not only _how much_ we could reclaim.
The later could be seen quite reasonably from the vmstat counters
already. It can be argued that we are showing too many implementation
details in those tracepoints but I consider them way too lowlevel
already to be usable by any kernel independent userspace. I would be
_really_ surprised if anything but debugging tools have used them.
Andrea Arcangeli [Wed, 22 Feb 2017 23:44:12 +0000 (15:44 -0800)]
mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock
pmd_trans_unstable does an atomic read on the pmd so it doesn't require
the pmd_lock for the same check.
This also removes the special assumption that the mmap_sem is hold for
writing if prot_numa is not set. userfaultfd will hold the mmap_sem
only for reading in change_pte_range like prot_numa, but it will not set
prot_numa.
This is always a valid micro-optimization regardless of userfaultfd.
[kirill@shutemov.name: drop unneeded pmd_trans_unstable(pmd) check after __split_huge_pmd()] Link: http://lkml.kernel.org/r/20170208120421.GE5578@node.shutemov.name Link: http://lkml.kernel.org/r/20161216144821.5183-43-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:44:10 +0000 (15:44 -0800)]
userfaultfd: selftest: test UFFDIO_ZEROPAGE on all memory types
This will verify -EINVAL is returned with hugetlbfs/shmem and it'll do a
functional test of UFFDIO_ZEROPAGE on anonymous memory.
Link: http://lkml.kernel.org/r/20161216144821.5183-42-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Wed, 22 Feb 2017 23:44:06 +0000 (15:44 -0800)]
userfaultfd: non-cooperative: selftest: add test for FORK, MADVDONTNEED and REMAP events
Add test for userfaultfd events used in non-cooperative scenario when
the process that monitors the userfaultfd and handles user faults is not
the same process that causes the page faults.
Link: http://lkml.kernel.org/r/20161216144821.5183-41-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Wed, 22 Feb 2017 23:44:04 +0000 (15:44 -0800)]
userfaultfd: non-cooperative: selftest: add ufd parameter to copy_page
With future addition of event tests, copy_page will be called with
different userfault file descriptors
Link: http://lkml.kernel.org/r/20161216144821.5183-40-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
userfaultfd_open will be needed by the non cooperative selftest.
Link: http://lkml.kernel.org/r/20161216144821.5183-39-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Userland developers asked to be notified immediately by the UFFDIO_API
ioctl if shmem missing mode is supported by userfaultfd in the running
kernel. This avoids the need to run UFFDIO_REGISTER on a shmem virtual
memory range to find out.
Link: http://lkml.kernel.org/r/20161216144821.5183-38-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:43:55 +0000 (15:43 -0800)]
userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY
If the atomic copy_user fails because of a real dangling userland
pointer, we won't go back into the shmem method, so when the method
returns it must not leave anything charged up, except the page itself.
Link: http://lkml.kernel.org/r/20161216144821.5183-37-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:43:52 +0000 (15:43 -0800)]
userfaultfd: shmem: avoid a lockup resulting from corrupted page->flags
Use the non atomic version of __SetPageUptodate while the page is still
private and not visible to lookup operations. Using the non atomic
version after the page is already visible to lookups is unsafe as there
would be concurrent lock_page operation modifying the page->flags while
it runs.
This solves a lockup in find_lock_entry with the userfaultfd_shmem
selftest.
Andrea Arcangeli [Wed, 22 Feb 2017 23:43:49 +0000 (15:43 -0800)]
userfaultfd: shmem: lock the page before adding it to pagecache
A VM_BUG_ON triggered on the shmem selftest.
Link: http://lkml.kernel.org/r/20161216144821.5183-36-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Wed, 22 Feb 2017 23:43:46 +0000 (15:43 -0800)]
userfaultfd: shmem: add userfaultfd_shmem test
The test verifies that anonymous shared mapping can be used with userfault
using the existing testing method. The shared memory area is allocated
using mmap(..., MAP_SHARED | MAP_ANONYMOUS, ...) and released using
madvise(MADV_REMOVE)
Link: http://lkml.kernel.org/r/20161216144821.5183-35-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 22 Feb 2017 23:43:43 +0000 (15:43 -0800)]
userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings
When userfaultfd hugetlbfs support was originally added, it followed the
pattern of anon mappings and did not support any vmas marked VM_SHARED.
As such, support was only added for private mappings.
Remove this limitation and support shared mappings. The primary
functional change required is adding pages to the page cache. More subtle
changes are required for huge page reservation handling in error paths. A
lengthy comment in the code describes the reservation handling.
[mike.kravetz@oracle.com: update] Link: http://lkml.kernel.org/r/c9c8cafe-baa7-05b4-34ea-1dfa5523a85f@oracle.com Link: http://lkml.kernel.org/r/1487195210-12839-1-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Wed, 22 Feb 2017 23:43:40 +0000 (15:43 -0800)]
userfaultfd: shmem: allow registration of shared memory ranges
Expand the userfaultfd_register/unregister routines to allow shared
memory VMAs.
Currently, there is no UFFDIO_ZEROPAGE and write-protection support for
shared memory VMAs, which is reflected in ioctl methods supported by
uffdio_register.
Link: http://lkml.kernel.org/r/20161216144821.5183-34-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Wed, 22 Feb 2017 23:43:37 +0000 (15:43 -0800)]
userfaultfd: shmem: add userfaultfd hook for shared memory faults
When processing a page fault in shared memory area for not present page,
check the VMA determine if faults are to be handled by userfaultfd. If
so, delegate the page fault to handle_userfault.
Link: http://lkml.kernel.org/r/20161216144821.5183-33-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Wed, 22 Feb 2017 23:43:34 +0000 (15:43 -0800)]
userfaultfd: shmem: use shmem_mcopy_atomic_pte for shared memory
The shmem_mcopy_atomic_pte implements low lever part of UFFDIO_COPY
operation for shared memory VMAs. It's based on mcopy_atomic_pte with
adjustments necessary for shared memory pages.
Link: http://lkml.kernel.org/r/20161216144821.5183-32-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:43:31 +0000 (15:43 -0800)]
userfaultfd: shmem: add tlbflush.h header for microblaze
It resolves this build error:
All errors (new ones prefixed by >>):
mm/shmem.c: In function 'shmem_mcopy_atomic_pte':
>> mm/shmem.c:2228:2: error: implicit declaration of function 'update_mmu_cache' [-Werror=implicit-function-declaration]
update_mmu_cache(dst_vma, dst_addr, dst_pte);
microblaze may have to be also updated to define it in asm/pgtable.h
like the other archs, then this header inclusion can be removed.
Link: http://lkml.kernel.org/r/20161216144821.5183-31-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Wed, 22 Feb 2017 23:43:28 +0000 (15:43 -0800)]
userfaultfd: shmem: introduce vma_is_shmem
Currently userfault relies on vma_is_anonymous and vma_is_hugetlb to
ensure compatibility of a VMA with userfault. Introduction of
vma_is_shmem allows detection if tmpfs backed VMAs, so that they may be
used with userfaultfd. Current implementation presumes usage of
vma_is_shmem only by slow path routines in userfaultfd, therefore the
vma_is_shmem is not made inline to leave the few remaining free bits in
vm_flags.
Link: http://lkml.kernel.org/r/20161216144821.5183-30-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Wed, 22 Feb 2017 23:43:25 +0000 (15:43 -0800)]
userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support
shmem_mcopy_atomic_pte is the low level routine that implements the
userfaultfd UFFDIO_COPY command. It is based on the existing
mcopy_atomic_pte routine with modifications for shared memory pages.
Link: http://lkml.kernel.org/r/20161216144821.5183-29-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Wed, 22 Feb 2017 23:43:22 +0000 (15:43 -0800)]
userfaultfd: introduce vma_can_userfault
Check whether a VMA can be used with userfault in more compact way
Link: http://lkml.kernel.org/r/20161216144821.5183-28-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Userland developers asked to be notified immediately by the UFFDIO_API
ioctl if hugetlbfs missing mode is supported by userfaultfd in the
running kernel. This avoids the need to run UFFDIO_REGISTER on a
hugetlbfs virtual memory range to find out.
Link: http://lkml.kernel.org/r/20161216144821.5183-27-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 22 Feb 2017 23:43:16 +0000 (15:43 -0800)]
userfaultfd: hugetlbfs: reserve count on error in __mcopy_atomic_hugetlb
If __mcopy_atomic_hugetlb exits with an error, put_page will be called
if a huge page was allocated and needs to be freed. If a reservation
was associated with the huge page, the PagePrivate flag will be set.
Clear PagePrivate before calling put_page/free_huge_page so that the
global reservation count is not incremented.
Link: http://lkml.kernel.org/r/20161216144821.5183-26-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:43:13 +0000 (15:43 -0800)]
userfaultfd: hugetlbfs: gup: support VM_FAULT_RETRY
Add support for VM_FAULT_RETRY to follow_hugetlb_page() so that
get_user_pages_unlocked/locked and "nonblocking/FOLL_NOWAIT" features
will work on hugetlbfs.
This is required for fully functional userfaultfd non-present support on
hugetlbfs.
Link: http://lkml.kernel.org/r/20161216144821.5183-25-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 22 Feb 2017 23:43:10 +0000 (15:43 -0800)]
userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges
Add routine userfaultfd_huge_must_wait which has the same functionality
as the existing userfaultfd_must_wait routine. Only difference is that
new routine must handle page table structure for hugepmd vmas.
Link: http://lkml.kernel.org/r/20161216144821.5183-24-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 22 Feb 2017 23:43:07 +0000 (15:43 -0800)]
userfaultfd: hugetlbfs: add userfaultfd_hugetlb test
Test userfaultfd hugetlb functionality by using the existing testing
method (in userfaultfd.c). Instead of an anonymous memeory, a hugetlbfs
file is mmap'ed private. In this way fallocate hole punch can be used
to release pages. This is because madvise(MADV_DONTNEED) is not
supported for huge pages.
Use the same file, but create wrappers for allocating ranges and
releasing pages. Compile userfaultfd.c with HUGETLB_TEST defined to
produce an executable to test userfaultfd hugetlb functionality.
Link: http://lkml.kernel.org/r/20161216144821.5183-23-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 22 Feb 2017 23:43:04 +0000 (15:43 -0800)]
userfaultfd: hugetlbfs: allow registration of ranges containing huge pages
Expand the userfaultfd_register/unregister routines to allow VM_HUGETLB
vmas. huge page alignment checking is performed after a VM_HUGETLB vma
is encountered.
Also, since there is no UFFDIO_ZEROPAGE support for huge pages do not
return that as a valid ioctl method for huge page ranges.
Link: http://lkml.kernel.org/r/20161216144821.5183-22-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When processing a hugetlb fault for no page present, check the vma to
determine if faults are to be handled via userfaultfd. If so, drop the
hugetlb_fault_mutex and call handle_userfault().
Link: http://lkml.kernel.org/r/20161216144821.5183-21-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The new routine copy_huge_page_from_user() uses kmap_atomic() to map
PAGE_SIZE pages. However, this prevents page faults in the subsequent
call to copy_from_user(). This is OK in the case where the routine is
copied with mmap_sema held. However, in another case we want to allow
page faults. So, add a new argument allow_pagefault to indicate if the
routine should allow page faults.
[dan.carpenter@oracle.com: unmap the correct pointer] Link: http://lkml.kernel.org/r/20170113082608.GA3548@mwanda
[akpm@linux-foundation.org: kunmap() takes a page*, per Hugh] Link: http://lkml.kernel.org/r/20161216144821.5183-20-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 22 Feb 2017 23:42:55 +0000 (15:42 -0800)]
userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
__mcopy_atomic_hugetlb performs the UFFDIO_COPY operation for huge
pages. It is based on the existing __mcopy_atomic routine for normal
pages. Unlike normal pages, there is no huge page support for the
UFFDIO_ZEROPAGE operation.
Link: http://lkml.kernel.org/r/20161216144821.5183-19-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 22 Feb 2017 23:42:52 +0000 (15:42 -0800)]
userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for userfaultfd support
hugetlb_mcopy_atomic_pte is the low level routine that implements the
userfaultfd UFFDIO_COPY command. It is based on the existing
mcopy_atomic_pte routine with modifications for huge pages.
Link: http://lkml.kernel.org/r/20161216144821.5183-18-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Wed, 22 Feb 2017 23:42:49 +0000 (15:42 -0800)]
userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support
userfaultfd UFFDIO_COPY allows user level code to copy data to a page at
fault time. The data is copied from user space to a newly allocated
huge page. The new routine copy_huge_page_from_user performs this copy.
Link: http://lkml.kernel.org/r/20161216144821.5183-17-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:46 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: wake userfaults after UFFDIO_UNREGISTER
Userfaults may still happen after the userfaultfd monitor thread
received a UFFD_EVENT_MADVDONTNEED until UFFDIO_UNREGISTER is run.
Wake any pending userfault within UFFDIO_UNREGISTER protected by the
mmap_sem for writing, so they will not be reported to userland leading
to UFFDIO_COPY returning -EINVAL (as the range was already unregistered)
and they will not hang permanently either.
Link: http://lkml.kernel.org/r/20161216144821.5183-16-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MADV_DONTNEED must be notified to userland before the pages are zapped.
This allows userland to immediately stop adding pages to the userfaultfd
ranges before the pages are actually zapped or there could be
non-zeropage leftovers as result of concurrent UFFDIO_COPY run in
between zap_page_range and madvise_userfault_dontneed (both
MADV_DONTNEED and UFFDIO_COPY runs under the mmap_sem for reading, so
they can run concurrently).
Link: http://lkml.kernel.org/r/20161216144821.5183-15-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 22 Feb 2017 23:42:40 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: add madvise() event for MADV_DONTNEED request
If the page is punched out of the address space the uffd reader should
know this and zeromap the respective area in case of the #PF event.
Link: http://lkml.kernel.org/r/20161216144821.5183-14-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Optimize the mremap_userfaultfd_complete() interface to pass only the
vm_userfaultfd_ctx pointer through the stack as a microoptimization.
Link: http://lkml.kernel.org/r/20161216144821.5183-13-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 22 Feb 2017 23:42:34 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: add mremap() event
The event denotes that an area [start:end] moves to different location.
Length change isn't reported as "new" addresses, if they appear on the
uffd reader side they will not contain any data and the latter can just
zeromap them.
Waiting for the event ACK is also done outside of mmap sem, as for fork
event.
Link: http://lkml.kernel.org/r/20161216144821.5183-12-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Wed, 22 Feb 2017 23:42:30 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: dup_userfaultfd: use mm_count instead of mm_users
Since commit d2005e3f41d4 ("userfaultfd: don't pin the user memory in
userfaultfd_file_create()") userfaultfd uses mm_count rather than
mm_users to pin mm_struct.
Make dup_userfaultfd consistent with this behaviour
Link: http://lkml.kernel.org/r/20161216144821.5183-11-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 22 Feb 2017 23:42:27 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: Add fork() event
When the mm with uffd-ed vmas fork()-s the respective vmas notify their
uffds with the event which contains a descriptor with new uffd. This
new descriptor can then be used to get events from the child and
populate its mm with data. Note, that there can be different uffd-s
controlling different vmas within one mm, so first we should collect all
those uffds (and ctx-s) in a list and then notify them all one by one
but only once per fork().
The context is created at fork() time but the descriptor, file struct
and anon inode object is created at event read time. So some trickery
is added to the userfaultfd_ctx_read() to handle the ctx queues' locking
vs file creation.
Another thing worth noticing is that the task that fork()-s waits for
the uffd event to get processed WITHOUT the mmap sem.
[aarcange@redhat.com: build warning fix] Link: http://lkml.kernel.org/r/20161216144821.5183-10-aarcange@redhat.com Link: http://lkml.kernel.org/r/20161216144821.5183-9-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:24 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: report all available features to userland
This will allow userland to probe all features available in the kernel.
It will however only enable the requested features in the open userfaultfd
context.
Link: http://lkml.kernel.org/r/20161216144821.5183-8-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 22 Feb 2017 23:42:21 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor
The custom events are queued in ctx->event_wqh not to disturb the
fast-path-ed PF queue-wait-wakeup functions.
The events to be generated (other than PF-s) are requested in UFFD_API
ioctl with the uffd_api.features bits. Those, known by the kernel, are
then turned on and reported back to the user-space.
Link: http://lkml.kernel.org/r/20161216144821.5183-7-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 22 Feb 2017 23:42:18 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: Split the find_userfault() routine
I will need one to lookup for userfaultfd_wait_queue-s in different
wait queue
Link: http://lkml.kernel.org/r/20161216144821.5183-6-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:15 +0000 (15:42 -0800)]
userfaultfd: use vma_is_anonymous
Cleanup the vma->vm_ops usage.
Side note: it would be more robust if vma_is_anonymous() would also
check that vm_flags hasn't VM_PFNMAP set.
Link: http://lkml.kernel.org/r/20161216144821.5183-5-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:12 +0000 (15:42 -0800)]
userfaultfd: convert BUG() to WARN_ON_ONCE()
Avoid BUG_ON()s and only WARN instead. This is just a cleanup, it can't
make any runtime difference. This BUG_ON has never triggered and cannot
trigger.
Link: http://lkml.kernel.org/r/20161216144821.5183-4-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:09 +0000 (15:42 -0800)]
userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP
Minor comment correction.
Link: http://lkml.kernel.org/r/20161216144821.5183-3-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:06 +0000 (15:42 -0800)]
userfaultfd: document _IOR/_IOW
Patch series "userfaultfd tmpfs/hugetlbfs/non-cooperative", v2
These userfaultfd features are finished and are ready for larger
exposure in -mm and upstream merging.
1) tmpfs non present userfault
2) hugetlbfs non present userfault
3) non cooperative userfault for fork/madvise/mremap
qemu development code is already exercising 2) and container postcopy
live migration needs 3).
1) is not currently used but there's a self test and we know some qemu
user for various reasons uses tmpfs as backing for KVM so it'll need it
too to use postcopy live migration with tmpfs memory.
All review feedback from the previous submit has been handled and the
fixes are included. There's no outstanding issue AFIK.
Upstream code just did a s/fe/vmf/ conversion in the page faults and
this has been converted as well incrementally.
In addition to the previous submits, this also wakes up stuck userfaults
during UFFDIO_UNREGISTER. The non cooperative testcase actually
reproduced this problem by getting stuck instead of quitting clean in
some rare case as it could call UFFDIO_UNREGISTER while some userfault
could be still in flight. The other option would have been to keep
leaving it up to userland to serialize itself and to patch the testcase
instead but the wakeup during unregister I think is preferable.
David also asked the UFFD_FEATURE_MISSING_HUGETLBFS and
UFFD_FEATURE_MISSING_SHMEM feature flags to be added so QEMU can avoid
to probe if the hugetlbfs/shmem missing support is available by calling
UFFDIO_REGISTER. QEMU already checks HUGETLBFS_MAGIC with fstatfs so if
UFFD_FEATURE_MISSING_HUGETLBFS is also set, it knows UFFDIO_REGISTER
will succeed (or if it fails, it's for some other more concerning
reason). There's no reason to worry about adding too many feature
flags. There are 64 available and worst case we've to bump the API if
someday we're really going to run out of them.
The round-trip network latency of hugetlbfs userfaults during postcopy
live migration is still of the order of dozen milliseconds on 10GBit if
at 2MB hugepage granularity so it's working perfectly and it should
provide for higher bandwidth or lower CPU usage (which makes it
interesting to add an option in the future to support THP granularity
too for anonymous memory, UFFDIO_COPY would then have to create THP if
alignment/len allows for it). 1GB hugetlbfs granularity will require
big changes in hugetlbfs to work so it's deferred for later.
This patch (of 42):
This adds proper documentation (inline) to avoid the risk of further
misunderstandings about the semantics of _IOW/_IOR and it also reminds
whoever will bump the UFFDIO_API in the future, to change the two ioctl
to _IOW.
This was found while implementing strace support for those ioctl,
otherwise we could have never found it by just reviewing kernel code and
testing it.
_IOC_READ or _IOC_WRITE alters nothing but the ioctl number itself, so
it's only worth fixing if the UFFDIO_API is bumped someday.
Link: http://lkml.kernel.org/r/20161216144821.5183-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: "Dmitry V. Levin" <ldv@altlinux.org> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 22 Feb 2017 23:42:03 +0000 (15:42 -0800)]
oom, trace: add compaction retry tracepoint
Higher order requests oom debugging is currently quite hard. We do have
some compaction points which can tell us how the compaction is operating
but there is no trace point to tell us about compaction retry logic.
This patch adds a one which will have the following format
we can see that the order 9 request is not retried even though we are in
the highest compaction priority mode becase the last compaction attempt
was withdrawn. This means that compaction_zonelist_suitable must have
returned false and there is no suitable zone to compact for this request
and so no need to retry further.
another example would be
<...>-3137 [001] .... 81.501689: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=failed retries=0 max_retries=16 should_retry=0
in this case the order-9 compaction failed to find any suitable block.
We do not retry anymore because this is a costly request and those do
not go below COMPACT_PRIO_SYNC_LIGHT priority.
Link: http://lkml.kernel.org/r/20161220130135.15719-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 22 Feb 2017 23:42:00 +0000 (15:42 -0800)]
oom, trace: add oom detection tracepoints
should_reclaim_retry is the central decision point for declaring the
OOM. It might be really useful to expose data used for this decision
making when debugging an unexpected oom situations.
Say we have an OOM report:
[ 52.264001] mem_eater invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
[ 52.267549] CPU: 3 PID: 3148 Comm: mem_eater Tainted: G W 4.8.0-oomtrace3-00006-gb21338b386d2 #1024
The above shows that we can quickly deduce that the reclaim stopped
making any progress (see no_progress_loops increased in each round) and
while there were still some 51 reclaimable pages they couldn't be
dropped for some reason (vmscan trace points would tell us more about
that part). available will represent reclaimable + free_pages scaled
down per no_progress_loops factor. This is essentially an optimistic
estimate of how much memory we would have when reclaiming everything.
This can be compared to min_wmark to get a rought idea but the
wmark_check tells the result of the watermark check which is more
precise (includes lowmem reserves, considers the order etc.). As we can
see no zone is eligible in the end and that is why we have triggered the
oom in this situation.
Please note that higher order requests might fail on the wmark_check
even when there is much more memory available than min_wmark - e.g.
when the memory is fragmented. A follow up tracepoint will help to
debug those situations.
Link: http://lkml.kernel.org/r/20161220130135.15719-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 22 Feb 2017 23:41:57 +0000 (15:41 -0800)]
mm, trace: extract COMPACTION_STATUS and ZONE_TYPE to a common header
COMPACTION_STATUS resp. ZONE_TYPE are currently used to translate enum
compact_result resp. struct zone index into their symbolic names for an
easier post processing. The follow up patch would like to reuse this as
well. The code involves some preprocessor black magic which is better not
duplicated elsewhere so move it to a common mm tracing relate header.
Link: http://lkml.kernel.org/r/20161220130135.15719-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Wed, 22 Feb 2017 23:41:51 +0000 (15:41 -0800)]
mm, page_alloc: avoid page_to_pfn() when merging buddies
On architectures that allow memory holes, page_is_buddy() has to perform
page_to_pfn() to check for the memory hole. After the previous patch,
we have the pfn already available in __free_one_page(), which is the
only caller of page_is_buddy(), so move the check there and avoid
page_to_pfn().
Link: http://lkml.kernel.org/r/20161216120009.20064-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Wed, 22 Feb 2017 23:41:48 +0000 (15:41 -0800)]
mm, page_alloc: don't convert pfn to idx when merging
In __free_one_page() we do the buddy merging arithmetics on "page/buddy
index", which is just the lower MAX_ORDER bits of pfn. The operations
we do that affect the higher bits are bitwise AND and subtraction (in
that order), where the final result will be the same with the higher
bits left unmasked, as long as these bits are equal for both buddies -
which must be true by the definition of a buddy.
We can therefore use pfn's directly instead of "index" and skip the
zeroing of >MAX_ORDER bits. This can help a bit by itself, although
compiler might be smart enough already. It also helps the next patch to
avoid page_to_pfn() for memory hole checks.
Link: http://lkml.kernel.org/r/20161216120009.20064-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 22 Feb 2017 23:41:45 +0000 (15:41 -0800)]
mm: throttle show_mem() from warn_alloc()
Tetsuo has been stressing OOM killer path with many parallel allocation
requests when he has noticed that it is not all that hard to swamp
kernel logs with warn_alloc messages caused by allocation stalls. Even
though the allocation stall message is triggered only once in 10s there
might be many different tasks hitting it roughly around the same time.
A big part of the output is show_mem() which can generate a lot of
output even on a small machines. There is no reason to show the state
of memory counter for each allocation stall, especially when multiple of
them are reported in a short time period. Chances are that not much has
changed since the last report. This patch simply rate limits show_mem
called from warn_alloc to only dump something once per second. This
should be enough to give us a clue why an allocation might be stalling
while burst of warnings will not swamp log with too much data.
While we are at it, extract all the show_mem related handling (filters)
into a separate function warn_alloc_show_mem. This will make the code
cleaner and as a bonus point we can distinguish which part of warn_alloc
got throttled due to rate limiting as ___ratelimit dumps the caller.
[akpm@linux-foundation.org: reduce scope of the ratelimit_states] Link: http://lkml.kernel.org/r/20161215101510.9030-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Wed, 22 Feb 2017 23:41:42 +0000 (15:41 -0800)]
tmpfs: change shmem_mapping() to test shmem_aops
Callers of shmem_mapping() are interested in whether the mapping is swap
backed - except for uprobes, which is interested in whether it should
use shmem_read_mapping_page(). All these callers are better served by a
shmem_mapping() which checks for shmem_aops, than the current version
which goes through several indirections to find where the inode lives -
and has the surprising effect that a private mmap of /dev/zero satisfies
both vma_is_anonymous() and shmem_mapping(), when that device node is on
devtmpfs. I don't think anything in the tree suffers from that
surprise, but it caught me out, and is better fixed.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052148530.13021@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:39 +0000 (15:41 -0800)]
slub: make sysfs directories for memcg sub-caches optional
SLUB creates a per-cache directory under /sys/kernel/slab which hosts a
bunch of debug files. Usually, there aren't that many caches on a
system and this doesn't really matter; however, if memcg is in use, each
cache can have per-cgroup sub-caches. SLUB creates the same directories
for these sub-caches under /sys/kernel/slab/$CACHE/cgroup.
Unfortunately, because there can be a lot of cgroups, active or
draining, the product of the numbers of caches, cgroups and files in
each directory can reach a very high number - hundreds of thousands is
commonplace. Millions and beyond aren't difficult to reach either.
What's under /sys/kernel/slab is primarily for debugging and the
information and control on the a root cache already cover its
sub-caches. While having a separate directory for each sub-cache can be
helpful for development, it doesn't make much sense to pay this amount
of overhead by default.
This patch introduces a boot parameter slub_memcg_sysfs which determines
whether to create sysfs directories for per-memcg sub-caches. It also
adds CONFIG_SLUB_MEMCG_SYSFS_ON which determines the boot parameter's
default value and defaults to 0.
[akpm@linux-foundation.org: kset_unregister(NULL) is legal] Link: http://lkml.kernel.org/r/20170204145203.GB26958@mtj.duckdns.org Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:36 +0000 (15:41 -0800)]
slab: use memcg_kmem_cache_wq for slab destruction operations
If there's contention on slab_mutex, queueing the per-cache destruction
work item on the system_wq can unnecessarily create and tie up a lot of
kworkers.
Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it
global and use that workqueue for the destruction work items too. While
at it, convert the workqueue from an unbound workqueue to a per-cpu one
with concurrency limited to 1. It's generally preferable to use per-cpu
workqueues and concurrency limit of 1 is safe enough.
This is suggested by Joonsoo Kim.
Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov@tarantool.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:33 +0000 (15:41 -0800)]
slab: remove slub sysfs interface files early for empty memcg caches
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
Each cache has a number of sysfs interface files under /sys/kernel/slab.
On a system with a lot of memory and transient memcgs, the number of
interface files which have to be removed once memory reclaim kicks in
can reach millions.
Link: http://lkml.kernel.org/r/20170117235411.9408-10-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:30 +0000 (15:41 -0800)]
slab: remove synchronous synchronize_sched() from memcg cache deactivation path
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
slub uses synchronize_sched() to deactivate a memcg cache.
synchronize_sched() is an expensive and slow operation and doesn't scale
when a huge number of caches are destroyed back-to-back. While there
used to be a simple batching mechanism, the batching was too restricted
to be helpful.
This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub
can use to schedule sched RCU callback instead of performing
synchronize_sched() synchronously while holding cgroup_mutex. While
this adds online cpus, mems and slab_mutex operations, operating on
these locks back-to-back from the same kworker, which is what's gonna
happen when there are many to deactivate, isn't expensive at all and
this gets rid of the scalability problem completely.
Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:27 +0000 (15:41 -0800)]
slab: introduce __kmemcg_cache_deactivate()
__kmem_cache_shrink() is called with %true @deactivate only for memcg
caches. Remove @deactivate from __kmem_cache_shrink() and introduce
__kmemcg_cache_deactivate() instead. Each memcg-supporting allocator
should implement it and it should deactivate and drain the cache.
This is to allow memcg cache deactivation behavior to further deviate
from simple shrinking without messing up __kmem_cache_shrink().
This is pure reorganization and doesn't introduce any observable
behavior changes.
v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.
Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:24 +0000 (15:41 -0800)]
slab: implement slab_root_caches list
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
slab_caches currently lists all caches including root and memcg ones.
This is the only data structure which lists the root caches and
iterating root caches can only be done by walking the list while
skipping over memcg caches. As there can be a huge number of memcg
caches, this can become very expensive.
This also can make /proc/slabinfo behave very badly. seq_file processes
reads in 4k chunks and seeks to the previous Nth position on slab_caches
list to resume after each chunk. With a lot of memcg cache churns on
the list, reading /proc/slabinfo can become very slow and its content
often ends up with duplicate and/or missing entries.
This patch adds a new list slab_root_caches which lists only the root
caches. When memcg is not enabled, it becomes just an alias of
slab_caches. memcg specific list operations are collected into
memcg_[un]link_cache().
Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov@tarantool.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:21 +0000 (15:41 -0800)]
slab: link memcg kmem_caches on their associated memory cgroup
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
While a memcg kmem_cache is listed on its root cache's ->children list,
there is no direct way to iterate all kmem_caches which are assocaited
with a memory cgroup. The only way to iterate them is walking all
caches while filtering out caches which don't match, which would be most
of them.
This makes memcg destruction operations O(N^2) where N is the total
number of slab caches which can be huge. This combined with the
synchronous RCU operations can tie up a CPU and affect the whole machine
for many hours when memory reclaim triggers offlining and destruction of
the stale memcgs.
This patch adds mem_cgroup->kmem_caches list which goes through
memcg_cache_params->kmem_caches_node of all kmem_caches which are
associated with the memcg. All memcg specific iterations, including
stat file access, are updated to use the new list instead.
Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:17 +0000 (15:41 -0800)]
slab: reorganize memcg_cache_params
We're going to change how memcg caches are iterated. In preparation,
clean up and reorganize memcg_cache_params.
* The shared ->list is replaced by ->children in root and
->children_node in children.
* ->is_root_cache is removed. Instead ->root_cache is moved out of
the child union and now used by both root and children. NULL
indicates root cache. Non-NULL a memcg one.
This patch doesn't cause any observable behavior changes.
Link: http://lkml.kernel.org/r/20170117235411.9408-5-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:14 +0000 (15:41 -0800)]
slab: remove synchronous rcu_barrier() call in memcg cache release path
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
SLAB_DESTORY_BY_RCU caches need to flush all RCU operations before
destruction because slab pages are freed through RCU and they need to be
able to dereference the associated kmem_cache. Currently, it's done
synchronously with rcu_barrier(). As rcu_barrier() is expensive
time-wise, slab implements a batching mechanism so that rcu_barrier()
can be done for multiple caches at the same time.
Unfortunately, the rcu_barrier() is in synchronous path which is called
while holding cgroup_mutex and the batching is too limited to be
actually helpful.
This patch updates the cache release path so that the batching is
asynchronous and global. All SLAB_DESTORY_BY_RCU caches are queued
globally and a work item consumes the list. The work item calls
rcu_barrier() only once for all caches that are currently queued.
* release_caches() is removed and shutdown_cache() now either directly
release the cache or schedules a RCU callback to do that. This
makes the cache inaccessible once shutdown_cache() is called and
makes it impossible for shutdown_memcg_caches() to do memcg-specific
cleanups afterwards. Move memcg-specific part into a helper,
unlink_memcg_cache(), and make shutdown_cache() call it directly.
Link: http://lkml.kernel.org/r/20170117235411.9408-4-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov@tarantool.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:11 +0000 (15:41 -0800)]
slub: separate out sysfs_slab_release() from sysfs_slab_remove()
Separate out slub sysfs removal and release, and call the former earlier
from __kmem_cache_shutdown(). There's no reason to defer sysfs removal
through RCU and this will later allow us to remove sysfs files way
earlier during memory cgroup offline instead of release.
Link: http://lkml.kernel.org/r/20170117235411.9408-3-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:08 +0000 (15:41 -0800)]
Revert "slub: move synchronize_sched out of slab_mutex on shrink"
Patch series "slab: make memcg slab destruction scalable", v3.
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.
I've seen machines which end up with hundred thousands of caches and
many millions of kernfs_nodes. The current code is O(N^2) on the total
number of caches and has synchronous rcu_barrier() and
synchronize_sched() in cgroup offline / release path which is executed
while holding cgroup_mutex. Combined, this leads to very expensive and
slow cache destruction operations which can easily keep running for half
a day.
This also messes up /proc/slabinfo along with other cache iterating
operations. seq_file operates on 4k chunks and on each 4k boundary
tries to seek to the last position in the list. With a huge number of
caches on the list, this becomes very slow and very prone to the list
content changing underneath it leading to a lot of missing and/or
duplicate entries.
This patchset addresses the scalability problem.
* Add root and per-memcg lists. Update each user to use the
appropriate list.
* Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched
and asynchronous.
* For dying empty slub caches, remove the sysfs files after
deactivation so that we don't end up with millions of sysfs files
without any useful information on them.
This patchset contains the following nine patches.
0001 reverts an existing optimization to prepare for the following
changes. 0002 is a prep patch. 0003 makes rcu_barrier() in release
path batched and asynchronous. 0004-0006 separate out the lists.
0007-0008 replace synchronize_sched() in slub destruction path with
call_rcu_sched(). 0009 removes sysfs files early for empty dying
caches. 0010 makes destruction work items use a workqueue with limited
concurrency.
This patch (of 10):
Revert 89e364db71fb5e ("slub: move synchronize_sched out of slab_mutex on
shrink").
With kmem cgroup support enabled, kmem_caches can be created and destroyed
frequently and a great number of near empty kmem_caches can accumulate if
there are a lot of transient cgroups and the system is not under memory
pressure. When memory reclaim starts under such conditions, it can lead
to consecutive deactivation and destruction of many kmem_caches, easily
hundreds of thousands on moderately large systems, exposing scalability
issues in the current slab management code. This is one of the patches to
address the issue.
Moving synchronize_sched() out of slab_mutex isn't enough as it's still
inside cgroup_mutex. The whole deactivation / release path will be
updated to avoid all synchronous RCU operations. Revert this insufficient
optimization in preparation to ease future changes.
Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Wed, 22 Feb 2017 23:41:05 +0000 (15:41 -0800)]
mm, slab: rename kmalloc-node cache to kmalloc-<size>
SLAB as part of its bootstrap pre-creates one kmalloc cache that can fit
the kmem_cache_node management structure, and puts it into the generic
kmalloc cache array (e.g. for 128b objects). The name of this cache is
"kmalloc-node", which is confusing for readers of /proc/slabinfo as the
cache is used for generic allocations (and not just the kmem_cache_node
struct) and it appears as the kmalloc-128 cache is missing.
An easy solution is to use the kmalloc-<size> name when pre-creating the
cache, which we can get from the kmalloc_info array.
[vbabka@suse.cz: export the whole kmalloc_info structure instead of just a name accessor, per Christoph Lameter] Link: http://lkml.kernel.org/r/54e80303-b814-4232-66d4-95b34d3eb9d0@suse.cz Link: http://lkml.kernel.org/r/20170203181008.24898-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
slub: do not merge cache if slub_debug contains a never-merge flag
In case CONFIG_SLUB_DEBUG_ON=n, find_mergeable() gets debug features from
commandline but never checks if there are features from the
SLAB_NEVER_MERGE set.
As a result selected by slub_debug caches are always mergeable if they
have been created without a custom constructor set or without one of the
SLAB_* debug features on.
This moves the SLAB_NEVER_MERGE check below the flags update from
commandline to make sure it won't merge the slab cache if one of the debug
features is on.
Link: http://lkml.kernel.org/r/20170101124451.GA4740@lp-laptop-d Signed-off-by: Grygorii Maistrenko <grygoriimkd@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prarit Bhargava [Wed, 22 Feb 2017 23:40:56 +0000 (15:40 -0800)]
kernel/watchdog.c: do not hardcode CPU 0 as the initial thread
When CONFIG_BOOTPARAM_HOTPLUG_CPU0 is enabled, the socket containing the
boot cpu can be replaced. During the hot add event, the message
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
is output implying that the NMI watchdog was disabled at some point. This
is not the case and the message has caused confusion for users of systems
that support the removal of the boot cpu socket.
The watchdog code is coded to assume that cpu 0 is always the first cpu to
initialize the watchdog, and the last to stop its watchdog thread. That
is not the case for initializing if cpu 0 has been removed and added. The
removal case has never been correct because the smpboot code will remove
the watchdog threads starting with the lowest cpu number.
This patch adds watchdog_cpus to track the number of cpus with active NMI
watchdog threads so that the first and last thread can be used to set and
clear the value of firstcpu_err. firstcpu_err is set when the first
watchdog thread is enabled, and cleared when the last watchdog thread is
disabled.
Cong Wang [Wed, 22 Feb 2017 23:40:53 +0000 (15:40 -0800)]
9p: fix a potential acl leak
posix_acl_update_mode() could possibly clear 'acl', if so we leak the
memory pointed by 'acl'. Save this pointer before calling
posix_acl_update_mode() and release the memory if 'acl' really gets
cleared.
Link: http://lkml.kernel.org/r/1486678332-2430-1-git-send-email-xiyou.wangcong@gmail.com Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reported-by: Mark Salyzyn <salyzyn@android.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kurz <groug@kaod.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Wed, 22 Feb 2017 23:40:50 +0000 (15:40 -0800)]
block: use for_each_thread() in sys_ioprio_set()/sys_ioprio_get()
IOPRIO_WHO_USER case in sys_ioprio_set()/sys_ioprio_get() are using
while_each_thread(), which is unsafe under RCU lock according to commit 0c740d0afc3bff0a ("introduce for_each_thread() to replace the buggy
while_each_thread()"). Use for_each_thread() (via
for_each_process_thread()) which is safe under RCU lock.
Eric Ren [Wed, 22 Feb 2017 23:40:44 +0000 (15:40 -0800)]
ocfs2: fix deadlock issue when taking inode lock at vfs entry points
Commit 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
results in a deadlock, as the author "Tariq Saeed" realized shortly
after the patch was merged. The discussion happened here
The reason why taking cluster inode lock at vfs entry points opens up a
self deadlock window, is explained in the previous patch of this series.
So far, we have seen two different code paths that have this issue.
1. do_sys_open
may_open
inode_permission
ocfs2_permission
ocfs2_inode_lock() <=== take PR
generic_permission
get_acl
ocfs2_iop_get_acl
ocfs2_inode_lock() <=== take PR
2. fchmod|fchmodat
chmod_common
notify_change
ocfs2_setattr <=== take EX
posix_acl_chmod
get_acl
ocfs2_iop_get_acl <=== take PR
ocfs2_iop_set_acl <=== take EX
Fixes them by adding the tracking logic (in the previous patch) for these
funcs above, ocfs2_permission(), ocfs2_iop_[set|get]_acl(),
ocfs2_setattr().
Link: http://lkml.kernel.org/r/20170117100948.11657-3-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Ren [Wed, 22 Feb 2017 23:40:41 +0000 (15:40 -0800)]
ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock
We are in the situation that we have to avoid recursive cluster locking,
but there is no way to check if a cluster lock has been taken by a precess
already.
Mostly, we can avoid recursive locking by writing code carefully.
However, we found that it's very hard to handle the routines that are
invoked directly by vfs code. For instance:
Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR):
do_sys_open
may_open
inode_permission
ocfs2_permission
ocfs2_inode_lock() <=== first time
generic_permission
get_acl
ocfs2_iop_get_acl
ocfs2_inode_lock() <=== recursive one
A deadlock will occur if a remote EX request comes in between two of
ocfs2_inode_lock(). Briefly describe how the deadlock is formed:
On one hand, OCFS2_LOCK_BLOCKED flag of this lockres is set in
BAST(ocfs2_generic_handle_bast) when downconvert is started on behalf of
the remote EX lock request. Another hand, the recursive cluster lock
(the second one) will be blocked in in __ocfs2_cluster_lock() because of
OCFS2_LOCK_BLOCKED. But, the downconvert never complete, why? because
there is no chance for the first cluster lock on this node to be
unlocked - we block ourselves in the code path.
The idea to fix this issue is mostly taken from gfs2 code.
1. introduce a new field: struct ocfs2_lock_res.l_holders, to keep track
of the processes' pid who has taken the cluster lock of this lock
resource;
2. introduce a new flag for ocfs2_inode_lock_full:
OCFS2_META_LOCK_GETBH; it means just getting back disk inode bh for
us if we've got cluster lock.
3. export a helper: ocfs2_is_locked_by_me() is used to check if we have
got the cluster lock in the upper code path.
The tracking logic should be used by some of the ocfs2 vfs's callbacks,
to solve the recursive locking issue cuased by the fact that vfs
routines can call into each other.
The performance penalty of processing the holder list should only be
seen at a few cases where the tracking logic is used, such as get/set
acl.
You may ask what if the first time we got a PR lock, and the second time
we want a EX lock? fortunately, this case never happens in the real
world, as far as I can see, including permission check,
(get|set)_(acl|attr), and the gfs2 code also do so.
[sfr@canb.auug.org.au remove some inlines] Link: http://lkml.kernel.org/r/20170117100948.11657-2-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Wed, 22 Feb 2017 23:40:38 +0000 (15:40 -0800)]
score: remove asm/current.h
... it's already using the generic version anyways, so just drop the file
as do the other archs that do not implement their own version of the
current macro.