Andy Lutomirski [Thu, 28 Jul 2016 22:48:14 +0000 (15:48 -0700)]
mm: track NR_KERNEL_STACK in KiB instead of number of stacks
Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
This only makes sense if each kernel stack exists entirely in one zone,
and allowing vmapped stacks could break this assumption.
Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
architectures. Keep it simple and use KiB.
Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Williams [Thu, 28 Jul 2016 22:48:08 +0000 (15:48 -0700)]
mm: CONFIG_ZONE_DEVICE stop depending on CONFIG_EXPERT
When it was first introduced CONFIG_ZONE_DEVICE depended on disabling
CONFIG_ZONE_DMA, a configuration choice reserved for "experts".
However, now that the ZONE_DMA conflict has been eliminated it no longer
makes sense to require CONFIG_EXPERT.
Link: http://lkml.kernel.org/r/146687646274.39261.14267596518720371009.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Eric Sandeen <sandeen@redhat.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memblock: include <asm/sections.h> instead of <asm-generic/sections.h>
asm-generic headers are generic implementations for architecture
specific code and should not be included by common code. Thus use the
asm/ version of sections.h to get at the linker sections.
mm, THP: clean up return value of madvise_free_huge_pmd
The definition of return value of madvise_free_huge_pmd is not clear
before. According to the suggestion of Minchan Kim, change the type of
return value to bool and return true if we do MADV_FREE successfully on
entire pmd page, otherwise, return false. Comments are added too.
Link: http://lkml.kernel.org/r/1467135452-16688-2-git-send-email-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Thu, 28 Jul 2016 22:47:40 +0000 (15:47 -0700)]
mm: bail out in shrink_inactive_list()
With node-lru, if there are enough reclaimable pages in highmem but
nothing in lowmem, VM can try to shrink inactive list although the
requested zone is lowmem.
The problem is that if the inactive list is full of highmem pages then a
direct reclaimer searching for a lowmem page waste CPU scanning
uselessly. It just burns out CPU. Even, many direct reclaimers are
stalled by too_many_isolated if lots of parallel reclaimer are going on
although there are no reclaimable memory in inactive list.
I tried the experiment 4 times in 32bit 2G 8 CPU KVM machine to get
elapsed time.
hackbench 500 process 2
= Old =
1st: 289s 2nd: 310s 3rd: 112s 4th: 272s
= Now =
1st: 31s 2nd: 132s 3rd: 162s 4th: 50s.
[akpm@linux-foundation.org: fixes per Mel] Link: http://lkml.kernel.org/r/1469433119-1543-1-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: account for skipped pages as a partial scan
Page reclaim determines whether a pgdat is unreclaimable by examining
how many pages have been scanned since a page was freed and comparing
that to the LRU sizes. Skipped pages are not reclaim candidates but
contribute to scanned. This can prematurely mark a pgdat as
unreclaimable and trigger an OOM kill.
This patch accounts for skipped pages as a partial scan so that an
unreclaimable pgdat will still be marked as such but by scaling the cost
of a skip, it'll avoid the pgdat being marked prematurely.
Link: http://lkml.kernel.org/r/1469110261-7365-6-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: consider whether to decivate based on eligible zones inactive ratio
Minchan Kim reported that with per-zone lru state it was possible to
identify that a normal zone with 8^M anonymous pages could trigger OOM
with non-atomic order-0 allocations as all pages in the zone were in the
active list.
IOW, (inactive_anon of node * inactive_ratio > active_anon of node) due
to highmem anonymous stat so VM never deactivates normal zone's
anonymous pages.
This patch is a modified version of Minchan's original solution but
based upon it. The problem with Minchan's patch is that any low zone
with an imbalanced list could force a rotation.
In this patch, a zone-constrained global reclaim will rotate the list if
the inactive/active ratio of all eligible zones needs to be corrected.
It is possible that higher zone pages will be initially rotated
prematurely but this is the safer choice to maintain overall LRU age.
Link: http://lkml.kernel.org/r/20160722090929.GJ10438@techsingularity.net Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: remove reclaim and compaction retry approximations
If per-zone LRU accounting is available then there is no point
approximating whether reclaim and compaction should retry based on pgdat
statistics. This is effectively a revert of "mm, vmstat: remove zone
and node double accounting by approximating retries" with the difference
that inactive/active stats are still available. This preserves the
history of why the approximation was retried and why it had to be
reverted to handle OOM kills on 32-bit systems.
Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the reintroduction of per-zone LRU stats, highmem_file_pages is
redundant so remove it.
[mgorman@techsingularity.net: wrong stat is being accumulated in highmem_dirtyable_memory] Link: http://lkml.kernel.org/r/20160725092324.GM10438@techsingularity.netLink: Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The order-0 allocation for normal zone failed while there are a lot of
reclaimable memory(i.e., anonymous memory with free swap). I wanted to
analyze the problem but it was hard because we removed per-zone lru stat
so I couldn't know how many of anonymous memory there are in normal/dma
zone.
When we investigate OOM problem, reclaimable memory count is crucial
stat to find a problem. Without it, it's hard to parse the OOM message
so I believe we should keep it.
mm, vmscan: release/reacquire lru_lock on pgdat change
With node-lru, the locking is based on the pgdat. As Minchan pointed
out, there is an opportunity to reduce LRU lock release/acquire in
check_move_unevictable_pages by only changing lock on a pgdat change.
mm, vmscan: remove redundant check in shrink_zones()
As pointed out by Minchan Kim, shrink_zones() checks for populated zones
in a zonelist but a zonelist can never contain unpopulated zones. While
it's not related to the node-lru series, it can be cleaned up now.
Link: http://lkml.kernel.org/r/1468853426-12858-2-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Suggested-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: Update all zone LRU sizes before updating memcg
Minchan Kim reported setting the following warning on a 32-bit system
although it can affect 64-bit systems.
WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110
mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty
Modules linked in:
CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x76/0xaf
__warn+0xea/0x110
? mem_cgroup_update_lru_size+0x103/0x110
warn_slowpath_fmt+0x3b/0x40
mem_cgroup_update_lru_size+0x103/0x110
isolate_lru_pages.isra.61+0x2e2/0x360
shrink_active_list+0xac/0x2a0
? __delay+0xe/0x10
shrink_node_memcg+0x53c/0x7a0
shrink_node+0xab/0x2a0
do_try_to_free_pages+0xc6/0x390
try_to_free_pages+0x245/0x590
LRU list contents and counts are updated separately. Counts are updated
before pages are added to the LRU and updated after pages are removed.
The warning above is from a check in mem_cgroup_update_lru_size that
ensures that list sizes of zero are empty.
The problem is that node-lru needs to account for highmem pages if
CONFIG_HIGHMEM is set. One impact of the implementation is that the
sizes are updated in multiple passes when pages from multiple zones were
isolated. This happens whether HIGHMEM is set or not. When multiple
zones are isolated, it's possible for a debugging check in memcg to be
tripped.
This patch forces all the zone counts to be updated before the memcg
function is called.
Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Minchan Kim <minchan@kernel.org> Reported-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, pagevec: release/reacquire lru_lock on pgdat change
With node-lru, the locking is based on the pgdat. Previously it was
required that a pagevec drain released one zone lru_lock and acquired
another zone lru_lock on every zone change. Now, it's only necessary if
the node changes. The end-result is fewer lock release/acquires if the
pages are all on the same node but in different zones.
Link: http://lkml.kernel.org/r/1468588165-12461-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Thu, 28 Jul 2016 22:47:08 +0000 (15:47 -0700)]
mm, page_alloc: fix dirtyable highmem calculation
When I tested vmscale in mmtest in 32bit, I found the benchmark was slow
down 0.5 times.
base node
1 global-1
User 12.98 16.04
System 147.61 166.42
Elapsed 26.48 38.08
With vmstat, I found IO wait avg is much increased compared to base.
The reason was highmem_dirtyable_memory accumulates free pages and
highmem_file_pages from HIGHMEM to MOVABLE zones which was wrong. With
that, dirth_thresh in throtlle_vm_write is always 0 so that it calls
congestion_wait frequently if writeback starts.
With this patch, it is much recovered.
base node fi
1 global-1 fix
User 12.98 16.04 13.78
System 147.61 166.42 143.92
Elapsed 26.48 38.08 29.64
Link: http://lkml.kernel.org/r/1468404004-5085-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmstat: remove zone and node double accounting by approximating retries
The number of LRU pages, dirty pages and writeback pages must be
accounted for on both zones and nodes because of the reclaim retry
logic, compaction retry logic and highmem calculations all depending on
per-zone stats.
Many lowmem allocations are immune from OOM kill due to a check in
__alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit 03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
exception is costly high-order allocations or allocations that cannot
fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
allocations then it would fall through to __alloc_pages_direct_compact.
This patch will blindly retry reclaim for zone-constrained allocations
in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal
but without per-zone stats there are not many alternatives. The impact
it that zone-constrained allocations may delay before considering the
OOM killer.
As there is no guarantee enough memory can ever be freed to satisfy
compaction, this patch avoids retrying compaction for zone-contrained
allocations.
In combination, that means that the per-node stats can be used when
deciding whether to continue reclaim using a rough approximation. While
it is possible this will make the wrong decision on occasion, it will
not infinite loop as the number of reclaim attempts is capped by
MAX_RECLAIM_RETRIES.
The final step is calculating the number of dirtyable highmem pages. As
those calculations only care about the global count of file pages in
highmem. This patch uses a global counter used instead of per-zone
stats as it is sufficient.
In combination, this allows the per-zone LRU and dirty state counters to
be removed.
[mgorman@techsingularity.net: fix acct_highmem_file_pages()] Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.netLink: Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Suggested by: Michal Hocko <mhocko@kernel.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmstat: print node-based stats in zoneinfo file
There are a number of stats that were previously accessible via zoneinfo
that are now invisible. While it is possible to create a new file for
the node stats, this may be missed by users. Instead this patch prints
the stats under the first populated zone in /proc/zoneinfo.
Link: http://lkml.kernel.org/r/1467970510-21195-34-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: vmstat: account per-zone stalls and pages skipped during reclaim
The vmstat allocstall was fairly useful in the general sense but
node-based LRUs change that. It's important to know if a stall was for
an address-limited allocation request as this will require skipping
pages from other zones. This patch adds pgstall_* counters to replace
allocstall. The sum of the counters will equal the old allocstall so it
can be trivially recalculated. A high number of address-limited
allocation requests may result in a lot of useless LRU scanning for
suitable pages.
As address-limited allocations require pages to be skipped, it's
important to know how much useless LRU scanning took place so this patch
adds pgskip* counters. This yields the following model
1. The number of address-space limited stalls can be accounted for (pgstall)
2. The amount of useless work required to reclaim the data is accounted (pgskip)
3. The total number of scans is available from pgscan_kswapd and pgscan_direct
so from that the ratio of useful to useless scans can be calculated.
[mgorman@techsingularity.net: s/pgstall/allocstall/] Link: http://lkml.kernel.org/r/1468404004-5085-3-git-send-email-mgorman@techsingularity.netLink: Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: vmstat: replace __count_zone_vm_events with a zone id equivalent
This is partially a preparation patch for more vmstat work but it also
has the slight advantage that __count_zid_vm_events is cheaper to
calculate than __count_zone_vm_events().
Link: http://lkml.kernel.org/r/1467970510-21195-32-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: page_alloc: cache the last node whose dirty limit is reached
If a page is about to be dirtied then the page allocator attempts to
limit the total number of dirty pages that exists in any given zone.
The call to node_dirty_ok is expensive so this patch records if the last
pgdat examined hit the dirty limits. In some cases, this reduces the
number of calls to node_dirty_ok().
Link: http://lkml.kernel.org/r/1467970510-21195-31-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, page_alloc: remove fair zone allocation policy
The fair zone allocation policy interleaves allocation requests between
zones to avoid an age inversion problem whereby new pages are reclaimed
to balance a zone. Reclaim is now node-based so this should no longer
be an issue and the fair zone allocation policy is not free. This patch
removes it.
Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: add classzone information to tracepoints
This is convenient when tracking down why the skip count is high because
it'll show what classzone kswapd woke up at and what zones are being
isolated.
Link: http://lkml.kernel.org/r/1467970510-21195-29-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit
The buffer_heads_over_limit limit in kswapd is inconsistent with direct
reclaim behaviour. It may force an an attempt to reclaim from all zones
and then not reclaim at all because higher zones were balanced than
required by the original request.
This patch will causes kswapd to consider reclaiming from all zones if
buffer_heads_over_limit. However, if there are eligible zones for the
allocation request that woke kswapd then no reclaim will occur even if
buffer_heads_over_limit. This avoids kswapd over-reclaiming just
because buffer_heads_over_limit.
[mgorman@techsingularity.net: fix comment about buffer_heads_over_limit] Link: http://lkml.kernel.org/r/1468404004-5085-2-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-28-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: avoid passing in `remaining' unnecessarily to prepare_kswapd_sleep()
As pointed out by Minchan Kim, the first call to prepare_kswapd_sleep()
always passes in 0 for `remaining' and the second call can trivially
check the parameter in advance.
Suggested-by: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1467970510-21195-27-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: avoid passing in classzone_idx unnecessarily to compaction_ready
The scan_control structure has enough information available for
compaction_ready() to make a decision. The classzone_idx manipulations
in shrink_zones() are no longer necessary as the highest populated zone
is no longer used to determine if shrink_slab should be called or not.
[mgorman@techsingularity.net remove redundant check in shrink_zones()] Link: http://lkml.kernel.org/r/1468588165-12461-3-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-26-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As reclaim is now per-node based, convert zone_reclaim to be
node_reclaim. It is possible that a node will be reclaimed multiple
times if it has multiple zones but this is unavoidable without caching
all nodes traversed so far. The documentation and interface to
userspace is the same from a configuration perspective and will will be
similar in behaviour unless the node-local allocation requests were also
limited to lower zones.
Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, page_alloc: wake kswapd based on the highest eligible zone
The ac_classzone_idx is used as the basis for waking kswapd and that is
based on the preferred zoneref. If the preferred zoneref's first zone
is lower than what is available on other nodes, it's possible that
kswapd is woken on a zone with only higher, but still eligible, zones.
As classzone_idx is strictly adhered to now, it causes a problem because
eligible pages are skipped.
For example, node 0 has only DMA32 and node 1 has only NORMAL. An
allocating context running on node 0 may wake kswapd on node 1 telling
it to skip all NORMAL pages.
Link: http://lkml.kernel.org/r/1467970510-21195-23-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: only wakeup kswapd once per node for the requested classzone
kswapd is woken when zones are below the low watermark but the wakeup
decision is not taking the classzone into account. Now that reclaim is
node-based, it is only required to wake kswapd once per node and only if
all zones are unbalanced for the requested classzone.
Note that one node might be checked multiple times if the zonelist is
ordered by node because there is no cheap way of tracking what nodes
have already been visited. For zone-ordering, each node should be
checked only once.
Link: http://lkml.kernel.org/r/1467970510-21195-22-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: move vmscan writes and file write accounting to the node
As reclaim is now node-based, it follows that page write activity due to
page reclaim should also be accounted for on the node. For consistency,
also account page writes and page dirtying on a per-node basis.
After this patch, there are a few remaining zone counters that may appear
strange but are fine. NUMA stats are still per-zone as this is a
user-space interface that tools consume. NR_MLOCK, NR_SLAB_*,
NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
potentially pin low memory and cannot trivially be reclaimed on demand.
This information is still useful for debugging a page allocation failure
warning.
Link: http://lkml.kernel.org/r/1467970510-21195-21-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone. This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted. Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.
[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting] Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reclaim makes decisions based on the number of pages that are mapped but
it's mixing node and zone information. Account NR_FILE_MAPPED and
NR_ANON_PAGES pages on the node.
Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, memcg: move memcg limit enforcement from zones to nodes
Memcg needs adjustment after moving LRUs to the node. Limits are
tracked per memcg but the soft-limit excess is tracked per zone. As
global page reclaim is based on the node, it is easy to imagine a
situation where a zone soft limit is exceeded even though the memcg
limit is fine.
This patch moves the soft limit tree the node. Technically, all the
variable names should also change but people are already familiar by the
meaning of "mz" even if "mn" would be a more appropriate name now.
Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: make shrink_node decisions more node-centric
Earlier patches focused on having direct reclaim and kswapd use data
that is node-centric for reclaiming but shrink_node() itself still uses
too much zone information. This patch removes unnecessary zone-based
information with the most important decision being whether to continue
reclaim or not. Some memcg APIs are adjusted as a result even though
memcg itself still uses some zone information.
[mgorman@techsingularity.net: optimization] Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: vmscan: do not reclaim from kswapd if there is any eligible zone
kswapd scans from highest to lowest for a zone that requires balancing.
This was necessary when reclaim was per-zone to fairly age pages on
lower zones. Now that we are reclaiming on a per-node basis, any
eligible zone can be used and pages will still be aged fairly. This
patch avoids reclaiming excessively unless buffer_heads are over the
limit and it's necessary to reclaim from a higher zone than requested by
the waker of kswapd to relieve low memory pressure.
[hillf.zj@alibaba-inc.com: Force kswapd reclaim no more than needed] Link: http://lkml.kernel.org/r/1466518566-30034-12-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-13-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: remove duplicate logic clearing node congestion and dirty state
Reclaim may stall if there is too much dirty or congested data on a
node. This was previously based on zone flags and the logic for
clearing the flags is in two places. As congestion/dirty tracking is
now tracked on a per-node basis, we can remove some duplicate logic.
Link: http://lkml.kernel.org/r/1467970510-21195-12-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: by default have direct reclaim only shrink once per node
Direct reclaim iterates over all zones in the zonelist and shrinking
them but this is in conflict with node-based reclaim. In the default
case, only shrink once per node.
Link: http://lkml.kernel.org/r/1467970510-21195-11-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: simplify the logic deciding whether kswapd sleeps
kswapd goes through some complex steps trying to figure out if it should
stay awake based on the classzone_idx and the requested order. It is
unnecessarily complex and passes in an invalid classzone_idx to
balance_pgdat(). What matters most of all is whether a larger order has
been requsted and whether kswapd successfully reclaimed at the previous
order. This patch irons out the logic to check just that and the end
result is less headache inducing.
Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The balance gap was introduced to apply equal pressure to all zones when
reclaiming for a higher zone. With node-based LRU, the need for the
balance gap is removed and the code is dead so remove it.
[vbabka@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO] Link: http://lkml.kernel.org/r/1467970510-21195-9-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
thinking of reclaim in terms of nodes but kswapd is still zone-centric.
This patch gets rid of many of the node-based versus zone-based
decisions.
o A node is considered balanced when any eligible lower zone is balanced.
This eliminates one class of age-inversion problem because we avoid
reclaiming a newer page just because it's in the wrong zone
o pgdat_balanced disappears because we now only care about one zone being
balanced.
o Some anomalies related to writeback and congestion tracking being based on
zones disappear.
o kswapd no longer has to take care to reclaim zones in the reverse order
that the page allocator uses.
o Most importantly of all, reclaim from node 0 with multiple zones will
have similar aging and reclaiming characteristics as every
other node.
Link: http://lkml.kernel.org/r/1467970510-21195-8-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: have kswapd only scan based on the highest requested zone
kswapd checks all eligible zones to see if they need balancing even if
it was woken for a lower zone. This made sense when we reclaimed on a
per-zone basis because we wanted to shrink zones fairly so avoid
age-inversion problems. Ideally this is completely unnecessary when
reclaiming on a per-node basis. In theory, there may still be anomalies
when all requests are for lower zones and very old pages are preserved
in higher zones but this should be the exceptional case.
Link: http://lkml.kernel.org/r/1467970510-21195-7-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmscan: begin reclaiming pages on a per-node basis
This patch makes reclaim decisions on a per-node basis. A reclaimer
knows what zone is required by the allocation request and skips pages
from higher zones. In many cases this will be ok because it's a
GFP_HIGHMEM request of some description. On 64-bit, ZONE_DMA32 requests
will cause some problems but 32-bit devices on 64-bit platforms are
increasingly rare. Historically it would have been a major problem on
32-bit with big Highmem:Lowmem ratios but such configurations are also
now rare and even where they exist, they are not encouraged. If it
really becomes a problem, it'll manifest as very low reclaim
efficiencies.
Link: http://lkml.kernel.org/r/1467970510-21195-6-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zone padding separates write-intensive fields used by page allocation,
compaction and vmstats but the comments are a little misleading and need
clarification.
Link: http://lkml.kernel.org/r/1467970510-21195-5-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.
Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic. Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes. It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.
Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies. For example, the scans are
per-zone but using per-node counters. We also mark a node as congested
when a zone is congested. This causes weird problems that are fixed
later but is easier to review.
In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions
1. Long-term isolation of highmem pages when reclaim is lowmem
When pages are skipped, they are immediately added back onto the LRU
list. If lowmem reclaim persisted for long periods of time, the same
highmem pages get continually scanned. The idea would be that lowmem
keeps those pages on a separate list until a reclaim for highmem pages
arrives that splices the highmem pages back onto the LRU. It potentially
could be implemented similar to the UNEVICTABLE list.
That would reduce the skip rate with the potential corner case is that
highmem pages have to be scanned and reclaimed to free lowmem slab pages.
2. Linear scan lowmem pages if the initial LRU shrink fails
This will break LRU ordering but may be preferable and faster during
memory pressure than skipping LRU pages.
Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Node-based reclaim requires node-based LRUs and locking. This is a
preparation patch that just moves the lru_lock to the node so later
patches are easier to review. It is a mechanical change but note this
patch makes contention worse because the LRU lock is hotter and direct
reclaim and kswapd can contend on the same lock even when reclaiming
from different zones.
Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, vmstat: add infrastructure for per-node vmstats
Patchset: "Move LRU page reclaim from zones to nodes v9"
This series moves LRUs from the zones to the node. While this is a
current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details.
Some of the broad motivations for this are;
1. The residency of a page partially depends on what zone the page was
allocated from. This is partially combatted by the fair zone allocation
policy but that is a partial solution that introduces overhead in the
page allocator paths.
2. Currently, reclaim on node 0 behaves slightly different to node 1. For
example, direct reclaim scans in zonelist order and reclaims even if
the zone is over the high watermark regardless of the age of pages
in that LRU. Kswapd on the other hand starts reclaim on the highest
unbalanced zone. A difference in distribution of file/anon pages due
to when they were allocated results can result in a difference in
again. While the fair zone allocation policy mitigates some of the
problems here, the page reclaim results on a multi-zone node will
always be different to a single-zone node.
it was scheduled on as a result.
3. kswapd and the page allocator scan zones in the opposite order to
avoid interfering with each other but it's sensitive to timing. This
mitigates the page allocator using pages that were allocated very recently
in the ideal case but it's sensitive to timing. When kswapd is allocating
from lower zones then it's great but during the rebalancing of the highest
zone, the page allocator and kswapd interfere with each other. It's worse
if the highest zone is small and difficult to balance.
4. slab shrinkers are node-based which makes it harder to identify the exact
relationship between slab reclaim and LRU reclaim.
The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have relatively low highmem:lowmem
ratios than we worried about in the past.
Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.
The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.
pagealloc
---------
This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v9
Min total-odr0-1 490.00 ( 0.00%) 457.00 ( 6.73%)
Min total-odr0-2 347.00 ( 0.00%) 329.00 ( 5.19%)
Min total-odr0-4 288.00 ( 0.00%) 273.00 ( 5.21%)
Min total-odr0-8 251.00 ( 0.00%) 239.00 ( 4.78%)
Min total-odr0-16 234.00 ( 0.00%) 222.00 ( 5.13%)
Min total-odr0-32 223.00 ( 0.00%) 211.00 ( 5.38%)
Min total-odr0-64 217.00 ( 0.00%) 208.00 ( 4.15%)
Min total-odr0-128 214.00 ( 0.00%) 204.00 ( 4.67%)
Min total-odr0-256 250.00 ( 0.00%) 230.00 ( 8.00%)
Min total-odr0-512 271.00 ( 0.00%) 269.00 ( 0.74%)
Min total-odr0-1024 291.00 ( 0.00%) 282.00 ( 3.09%)
Min total-odr0-2048 303.00 ( 0.00%) 296.00 ( 2.31%)
Min total-odr0-4096 311.00 ( 0.00%) 309.00 ( 0.64%)
Min total-odr0-8192 316.00 ( 0.00%) 314.00 ( 0.63%)
Min total-odr0-16384 317.00 ( 0.00%) 315.00 ( 0.63%)
Min total-odr1-1 742.00 ( 0.00%) 712.00 ( 4.04%)
Min total-odr1-2 562.00 ( 0.00%) 530.00 ( 5.69%)
Min total-odr1-4 457.00 ( 0.00%) 433.00 ( 5.25%)
Min total-odr1-8 411.00 ( 0.00%) 381.00 ( 7.30%)
Min total-odr1-16 381.00 ( 0.00%) 356.00 ( 6.56%)
Min total-odr1-32 372.00 ( 0.00%) 346.00 ( 6.99%)
Min total-odr1-64 372.00 ( 0.00%) 343.00 ( 7.80%)
Min total-odr1-128 375.00 ( 0.00%) 351.00 ( 6.40%)
Min total-odr1-256 379.00 ( 0.00%) 351.00 ( 7.39%)
Min total-odr1-512 385.00 ( 0.00%) 355.00 ( 7.79%)
Min total-odr1-1024 386.00 ( 0.00%) 358.00 ( 7.25%)
Min total-odr1-2048 390.00 ( 0.00%) 362.00 ( 7.18%)
Min total-odr1-4096 390.00 ( 0.00%) 362.00 ( 7.18%)
Min total-odr1-8192 388.00 ( 0.00%) 363.00 ( 6.44%)
This shows a steady improvement throughout. The primary benefit is from
reduced system CPU usage which is obvious from the overall times;
4.7.0-rc4 4.7.0-rc4
mmotm-20160623nodelru-v8
User 189.19 191.80
System 2604.45 2533.56
Elapsed 2855.30 2786.39
The vmstats also showed that the fair zone allocation policy was definitely
removed as can be seen here;
tiobench is a benchmark that artifically benefits if old pages remain resident
while new pages get reclaimed. The fair zone allocation policy mitigates this
problem so pages age fairly. While the benchmark has problems, it is important
that tiobench performance remains constant as it implies that page aging
problems that the fair zone allocation policy fixes are not re-introduced.
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v9
Min PotentialReadSpeed 89.65 ( 0.00%) 90.21 ( 0.62%)
Min SeqRead-MB/sec-1 82.68 ( 0.00%) 82.01 ( -0.81%)
Min SeqRead-MB/sec-2 72.76 ( 0.00%) 72.07 ( -0.95%)
Min SeqRead-MB/sec-4 75.13 ( 0.00%) 74.92 ( -0.28%)
Min SeqRead-MB/sec-8 64.91 ( 0.00%) 65.19 ( 0.43%)
Min SeqRead-MB/sec-16 62.24 ( 0.00%) 62.22 ( -0.03%)
Min RandRead-MB/sec-1 0.88 ( 0.00%) 0.88 ( 0.00%)
Min RandRead-MB/sec-2 0.95 ( 0.00%) 0.92 ( -3.16%)
Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.34 ( -6.29%)
Min RandRead-MB/sec-8 1.61 ( 0.00%) 1.60 ( -0.62%)
Min RandRead-MB/sec-16 1.80 ( 0.00%) 1.90 ( 5.56%)
Min SeqWrite-MB/sec-1 76.41 ( 0.00%) 76.85 ( 0.58%)
Min SeqWrite-MB/sec-2 74.11 ( 0.00%) 73.54 ( -0.77%)
Min SeqWrite-MB/sec-4 80.05 ( 0.00%) 80.13 ( 0.10%)
Min SeqWrite-MB/sec-8 72.88 ( 0.00%) 73.20 ( 0.44%)
Min SeqWrite-MB/sec-16 75.91 ( 0.00%) 76.44 ( 0.70%)
Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.14 ( -3.39%)
Min RandWrite-MB/sec-2 1.02 ( 0.00%) 1.03 ( 0.98%)
Min RandWrite-MB/sec-4 1.05 ( 0.00%) 0.98 ( -6.67%)
Min RandWrite-MB/sec-8 0.89 ( 0.00%) 0.92 ( 3.37%)
Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.93 ( 1.09%)
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 approx-v9
User 645.72 525.90
System 403.85 331.75
Elapsed 6795.36 6783.67
This shows that the series has little or not impact on tiobench which is
desirable and a reduction in system CPU usage. It indicates that the fair
zone allocation policy was removed in a manner that didn't reintroduce
one class of page aging bug. There were only minor differences in overall
reclaim activity
4.7.0-rc4 4.7.0-rc4
mmotm-20160623nodelru-v8
Minor Faults 645838 647465
Major Faults 573 640
Swap Ins 0 0
Swap Outs 0 0
DMA allocs 0 0
DMA32 allocs 4604145344190646
Normal allocs 7805307279887245
Movable allocs 0 0
Allocation stalls 24 67
Stall zone DMA 0 0
Stall zone DMA32 0 0
Stall zone Normal 0 2
Stall zone HighMem 0 0
Stall zone Movable 0 65
Direct pages scanned 10969 30609
Kswapd pages scanned 9337514493492094
Kswapd pages reclaimed 9337224393489370
Direct pages reclaimed 10969 30609
Kswapd efficiency 99% 99%
Kswapd velocity 13741.015 13781.934
Direct efficiency 100% 100%
Direct velocity 1.614 4.512
Percentage direct scans 0% 0%
kswapd activity was roughly comparable. There were differences in direct
reclaim activity but negligible in the context of the overall workload
(velocity of 4 pages per second with the patches applied, 1.6 pages per
second in the baseline kernel).
pgbench read-only large configuration on ext4
---------------------------------------------
pgbench is a database benchmark that can be sensitive to page reclaim
decisions. This also checks if removing the fair zone allocation policy
is safe
It basically shows that kswapd was active at roughly the same rate in
both kernels. There was also comparable slab scanning activity and direct
reclaim was avoided in both cases. There appears to be a large difference
in numbers of inodes reclaimed but the workload has few active inodes and
is likely a timing artifact.
stutter
-------
stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.
While this is not guaranteed in all cases, this particular test showed
a large reduction in direct reclaim activity. It's also worth noting
that no page writes were issued from reclaim context.
This series is not without its hazards. There are at least three areas
that I'm concerned with even though I could not reproduce any problems in
that area.
1. Reclaim/compaction is going to be affected because the amount of reclaim is
no longer targetted at a specific zone. Compaction works on a per-zone basis
so there is no guarantee that reclaiming a few THP's worth page pages will
have a positive impact on compaction success rates.
2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
are called is now different. This may or may not be a problem but if it
is, it'll be because shrinkers are not called enough and some balancing
is required.
3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
distributed between zones and the fair zone allocation policy used to do
something very similar for anon. The distribution is now different but not
necessarily in any way that matters but it's still worth bearing in mind.
VM statistic counters for reclaim decisions are zone-based. If the kernel
is to reclaim on a per-node basis then we need to track per-node
statistics but there is no infrastructure for that. The most notable
change is that the old node_page_state is renamed to
sum_zone_node_page_state. The new node_page_state takes a pglist_data and
uses per-node stats but none exist yet. There is some renaming such as
vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
of mod_state to mod_zone_state. Otherwise, this is mostly a mechanical
patch with no functional change. There is a lot of similarity between the
node and zone helpers which is unfortunate but there was no obvious way of
reusing the code and maintaining type safety.
Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 28 Jul 2016 22:45:19 +0000 (15:45 -0700)]
cpuset, mm: fix TIF_MEMDIE check in cpuset_change_task_nodemask
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") has added TIF_MEMDIE and PF_EXITING check but
it is checking the flag on the current task rather than the given one.
This doesn't make much sense and it is actually wrong. If the current
task which updates the nodemask of a cpuset got killed by the OOM killer
then a part of the cpuset cgroup processes would have incompatible
nodemask which is surprising to say the least.
The comment suggests the intention was to skip oom victim or an exiting
task so we should be checking the given task. But even then it would be
layering violation because it is the memory allocator to interpret the
TIF_MEMDIE meaning. Simply drop both checks. All tasks in the cpuset
should simply follow the same mask.
Link: http://lkml.kernel.org/r/1467029719-17602-3-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: Miao Xie <miaoxie@huawei.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 28 Jul 2016 22:45:16 +0000 (15:45 -0700)]
freezer, oom: check TIF_MEMDIE on the correct task
freezing_slow_path() is checking TIF_MEMDIE to skip OOM killed tasks.
It is, however, checking the flag on the current task rather than the
given one. This is really confusing because freezing() can be called
also on !current tasks. It would end up working correctly for its main
purpose because __refrigerator will be always called on the current task
so the oom victim will never get frozen. But it could lead to
surprising results when a task which is freezing a cgroup got oom killed
because only part of the cgroup would get frozen. This is highly
unlikely but worth fixing as the resulting code would be more clear
anyway.
Link: http://lkml.kernel.org/r/1467029719-17602-2-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: Miao Xie <miaoxie@huawei.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Thu, 28 Jul 2016 22:45:10 +0000 (15:45 -0700)]
mm: fix vm-scalability regression in cgroup-aware workingset code
Commit 23047a96d7cf ("mm: workingset: per-cgroup cache thrash
detection") added a page->mem_cgroup lookup to the cache eviction,
refault, and activation paths, as well as locking to the activation
path, and the vm-scalability tests showed a regression of -23%.
While the test in question is an artificial worst-case scenario that
doesn't occur in real workloads - reading two sparse files in parallel
at full CPU speed just to hammer the LRU paths - there is still some
optimizations that can be done in those paths.
Inline the lookup functions to eliminate calls. Also, page->mem_cgroup
doesn't need to be stabilized when counting an activation; we merely
need to hold the RCU lock to prevent the memcg from being freed.
Michal Hocko [Thu, 28 Jul 2016 22:45:04 +0000 (15:45 -0700)]
mm, oom: tighten task_will_free_mem() locking
"mm, oom: fortify task_will_free_mem" has dropped task_lock around
task_will_free_mem in oom_kill_process bacause it assumed that a
potential race when the selected task exits will not be a problem as the
oom_reaper will call exit_oom_victim.
Tetsuo was objecting that nommu doesn't have oom_reaper so the race
would be still possible. The code would be racy and lockup prone
theoretically in other aspects without the oom reaper anyway so I didn't
considered this a big deal. But it seems that further changes I am
planning in this area will benefit from stable task->mm in this path as
well. So let's drop find_lock_task_mm from task_will_free_mem and call
it from under task_lock as we did previously. Just pull the task->mm !=
NULL check inside the function.
Link: http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 28 Jul 2016 22:45:01 +0000 (15:45 -0700)]
mm, oom: hide mm which is shared with kthread or global init
The only case where the oom_reaper is not triggered for the oom victim
is when it shares the memory with a kernel thread (aka use_mm) or with
the global init. After "mm, oom: skip vforked tasks from being
selected" the victim cannot be a vforked task of the global init so we
are left with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users
are quite rare as well.
In order to help forward progress for the OOM killer, make sure that
this really rare case will not get in the way - we do this by hiding the
mm from the oom killer by setting MMF_OOM_REAPED flag for it.
oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
MMF_OOM_REAPED flag set to catch these oom victims.
After this patch we should guarantee forward progress for the OOM killer
even when the selected victim is sharing memory with a kernel thread or
global init as long as the victims mm is still alive.
Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 28 Jul 2016 22:44:58 +0000 (15:44 -0700)]
mm, oom_reaper: do not attempt to reap a task more than twice
oom_reaper relies on the mmap_sem for read to do its job. Many places
which might block readers have been converted to use down_write_killable
and that has reduced chances of the contention a lot. Some paths where
the mmap_sem is held for write can take other locks and they might
either be not prepared to fail due to fatal signal pending or too
impractical to be changed.
This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the
first attempt to reap a task's mm fails. If the flag is present after
the failure then we set MMF_OOM_REAPED to hide this mm from the oom
killer completely so it can go and chose another victim.
As a result a risk of OOM deadlock when the oom victim would be blocked
indefinetly and so the oom killer cannot make any progress should be
mitigated considerably while we still try really hard to perform all
reclaim attempts and stay predictable in the behavior.
Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 28 Jul 2016 22:44:55 +0000 (15:44 -0700)]
mm, oom: task_will_free_mem should skip oom_reaped tasks
The 0-day robot has encountered the following:
Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB
oom_reaper is trying to reap the same task again and again.
This is possible only when the oom killer is bypassed because of
task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
already set during select_bad_process. Teach task_will_free_mem to skip
over MMF_OOM_REAPED tasks as well because they will be unlikely to free
anything more.
Analyzed by Tetsuo Handa.
Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 28 Jul 2016 22:44:52 +0000 (15:44 -0700)]
mm, oom: fortify task_will_free_mem()
task_will_free_mem is rather weak. It doesn't really tell whether the
task has chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step into making
it more robust for multi-threaded applications so now we know that the
whole process is going down and probably drop the mm.
This patch builds on top for more complex scenarios where mm is shared
between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel
use_mm().
Make sure that all processes sharing the mm are killed or exiting. This
will allow us to replace try_oom_reaper by wake_oom_reaper because
task_will_free_mem implies the task is reapable now. Therefore all paths
which bypass the oom killer are now reapable and so they shouldn't lock up
the oom killer.
Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 28 Jul 2016 22:44:49 +0000 (15:44 -0700)]
mm, oom: kill all tasks sharing the mm
Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
process sharing the same mm is unkillable via OOM_ADJUST_MIN. After "mm,
oom_adj: make sure processes sharing mm have same view of oom_score_adj"
all such processes are sharing the same value so we shouldn't see such a
task at all (oom_badness would rule them out).
We can still encounter oom disabled vforked task which has to be killed as
well if we want to have other tasks sharing the mm reapable because it can
access the memory before doing exec. Killing such a task should be
acceptable because it is highly unlikely it has done anything useful
because it cannot modify any memory before it calls exec. An alternative
would be to keep the task alive and skip the oom reaper and risk all the
weird corner cases where the OOM killer cannot make forward progress
because the oom victim hung somewhere on the way to exit.
[rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task
the setting is inherently racy and we cannot do much about it without
introducing locks in hot paths] Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 28 Jul 2016 22:44:46 +0000 (15:44 -0700)]
mm, oom: skip vforked tasks from being selected
vforked tasks are not really sitting on any memory. They are sharing the
mm with parent until they exec into a new code. Until then it is just
pinning the address space. OOM killer will kill the vforked task along
with its parent but we still can end up selecting vforked task when the
parent wouldn't be selected. E.g. init doing vfork to launch a task or
vforked being a child of oom unkillable task with an updated oom_score_adj
to be killable.
Add a new helper to check whether a task is in the vfork sharing memory
with its parent and use it in oom_badness to skip over these tasks.
Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 28 Jul 2016 22:44:43 +0000 (15:44 -0700)]
mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
oom_score_adj is shared for the thread groups (via struct signal) but this
is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_SIGHAND) and so we can easily end up in a situation when some
processes update their oom_score_adj and confuse the oom killer. In the
worst case some of those processes might hide from the oom killer
altogether via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer
would then pick up those eligible but won't be allowed to kill others
sharing the same mm so the mm wouldn't release the mm and so the memory.
It would be ideal to have the oom_score_adj per mm_struct because that is
the natural entity OOM killer considers. But this will not work because
some programs are doing
vfork()
set_oom_adj()
exec()
We can achieve the same though. oom_score_adj write handler can set the
oom_score_adj for all processes sharing the same mm if the task is not in
the middle of vfork. As a result all the processes will share the same
oom_score_adj. The current implementation is rather pessimistic and
checks all the existing processes by default if there is more than 1
holder of the mm but we do not have any reliable way to check for external
users yet.
Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 28 Jul 2016 22:44:40 +0000 (15:44 -0700)]
proc, oom_adj: extract oom_score_adj setting into a helper
Currently we have two proc interfaces to set oom_score_adj. The legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj which both have their
specific handlers. Big part of the logic is duplicated so extract the
common code into __set_oom_adj helper. Legacy knob still expects some
details slightly different so make sure those are handled same way - e.g.
the legacy mode ignores oom_score_adj_min and it warns about the usage.
This patch shouldn't introduce any functional changes.
Link: http://lkml.kernel.org/r/1466426628-15074-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 28 Jul 2016 22:44:37 +0000 (15:44 -0700)]
proc, oom: drop bogus sighand lock
Oleg has pointed out that can simplify both oom_adj_{read,write} and
oom_score_adj_{read,write} even further and drop the sighand lock. The
main purpose of the lock was to protect p->signal from going away but this
will not happen since ea6d290ca34c ("signals: make task_struct->signal
immutable/refcountable").
The other role of the lock was to synchronize different writers,
especially those with CAP_SYS_RESOURCE. Introduce a mutex for this
purpose. Later patches will need this lock anyway.
Suggested-by: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/1466426628-15074-3-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 28 Jul 2016 22:44:34 +0000 (15:44 -0700)]
proc, oom: drop bogus task_lock and mm check
Series "Handle oom bypass more gracefully", V5
The following 10 patches should put some order to very rare cases of mm
shared between processes and make the paths which bypass the oom killer
oom reapable and therefore much more reliable finally. Even though mm
shared outside of thread group is rare (either vforked tasks for a short
period, use_mm by kernel threads or exotic thread model of
clone(CLONE_VM) without CLONE_SIGHAND) it is better to cover them. Not
only it makes the current oom killer logic quite hard to follow and
reason about it can lead to weird corner cases. E.g. it is possible to
select an oom victim which shares the mm with unkillable process or
bypass the oom killer even when other processes sharing the mm are still
alive and other weird cases.
Patch 1 drops bogus task_lock and mm check from oom_{score_}adj_write.
This can be considered a bug fix with a low impact as nobody has noticed
for years.
Patch 2 drops sighand lock because it is not needed anymore as pointed
by Oleg.
Patch 3 is a clean up of oom_score_adj handling and a preparatory work
for later patches.
Patch 4 enforces oom_adj_score to be consistent between processes
sharing the mm to behave consistently with the regular thread groups.
This can be considered a user visible behavior change because one thread
group updating oom_score_adj will affect others which share the same mm
via clone(CLONE_VM). I argue that this should be acceptable because we
already have the same behavior for threads in the same thread group and
sharing the mm without signal struct is just a different model of
threading. This is probably the most controversial part of the series,
I would like to find some consensus here. There were some suggestions
to hook some counter/oom_score_adj into the mm_struct but I feel that
this is not necessary right now and we can rely on proc handler +
oom_kill_process to DTRT. I can be convinced otherwise but I strongly
think that whatever we do the userspace has to have a way to see the
current oom priority as consistently as possible.
Patch 5 makes sure that no vforked task is selected if it is sharing the
mm with oom unkillable task.
Patch 6 ensures that all user tasks sharing the mm are killed which in
turn makes sure that all oom victims are oom reapable.
Patch 7 guarantees that task_will_free_mem will always imply reapable
bypass of the oom killer.
Patch 8 is new in this version and it addresses an issue pointed out by
0-day OOM report where an oom victim was reaped several times.
Patch 9 puts an upper bound on how many times oom_reaper tries to reap a
task and hides it from the oom killer to move on when no progress can be
made. This will give an upper bound to how long an oom_reapable task
can block the oom killer from selecting another victim if the oom_reaper
is not able to reap the victim.
Patch 10 tries to plug the (hopefully) last hole when we can still lock
up when the oom victim is shared with oom unkillable tasks (kthreads and
global init). We just try to be best effort in that case and rather
fallback to kill something else than risk a lockup.
This patch (of 10):
Both oom_adj_write and oom_score_adj_write are using task_lock, check for
task->mm and fail if it is NULL. This is not needed because the
oom_score_adj is per signal struct so we do not need mm at all. The code
has been introduced by 3d5992d2ac7d ("oom: add per-mm oom disable count")
but we do not do per-mm oom disable since c9f01245b6a7 ("oom: remove
oom_disable_count").
The task->mm check is even not correct because the current thread might
have exited but the thread group might be still alive - e.g. thread group
leader would lead that echo $VAL > /proc/pid/oom_score_adj would always
fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj would
succeed. This is unexpected at best.
Remove the lock along with the check to fix the unexpected behavior and
also because there is not real need for the lock in the first place.
Link: http://lkml.kernel.org/r/1466426628-15074-2-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add braces to avoid "ambiguous ‘else’" compiler warnings
Some of our "for_each_xyz()" macro constructs make gcc unhappy about
lack of braces around if-statements inside or outside the loop, because
the loop construct itself has a "if-then-else" statement inside of it.
The resulting warnings look something like this:
drivers/gpu/drm/i915/i915_debugfs.c: In function ‘i915_dump_lrc’:
drivers/gpu/drm/i915/i915_debugfs.c:2103:6: warning: suggest explicit braces to avoid ambiguous ‘else’ [-Wparentheses]
if (ctx != dev_priv->kernel_context)
^
even if the code itself is fine.
Since the warning is fairly easy to avoid by adding a braces around the
if-statement near the for_each_xyz() construct, do so, rather than
disabling the otherwise potentially useful warning.
(The if-then-else statements used in the "for_each_xyz()" constructs are
designed to be inherently safe even with no braces, but in this case
it's quite understandable that gcc isn't really able to tell that).
This finally leaves the standard "allmodconfig" build with just a
handful of remaining warnings, so new and valid warnings hopefully will
stand out.
Newer versions of gcc warn about the use of __builtin_return_address()
with a non-zero argument when "-Wall" is specified:
kernel/trace/trace_irqsoff.c: In function ‘stop_critical_timings’:
kernel/trace/trace_irqsoff.c:433:86: warning: calling ‘__builtin_return_address’ with a nonzero argument is unsafe [-Wframe-address]
stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
[ .. repeats a few times for other similar cases .. ]
It is true that a non-zero argument is somewhat dangerous, and we do not
actually have very many uses of that in the kernel - but the ftrace code
does use it, and as Stephen Rostedt says:
"We are well aware of the danger of using __builtin_return_address() of
> 0. In fact that's part of the reason for having the "thunk" code in
x86 (See arch/x86/entry/thunk_{64,32}.S). [..] it adds extra frames
when tracking irqs off sections, to prevent __builtin_return_address()
from accessing bad areas. In fact the thunk_32.S states: 'Trampoline to
trace irqs off. (otherwise CALLER_ADDR1 might crash)'."
For now, __builtin_return_address() with a non-zero argument is the best
we can do, and the warning is not helpful and can end up making people
miss other warnings for real problems.
So disable the frame-address warning on compilers that need it.
Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'hsi-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi
Pull HSI updates from Sebastian Reichel:
- proper runtime pm support for omap-ssi and ssi-protocol
- misc fixes
* tag 'hsi-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi: (24 commits)
HSI: omap_ssi: drop pm_runtime_irq_safe
HSI: omap_ssi_port: use rpm autosuspend API
HSI: omap_ssi: call msg->complete() from process context
HSI: omap_ssi_port: ensure clocks are kept enabled during transfer
HSI: omap_ssi_port: replace pm_runtime_put_sync with non-sync variant
HSI: omap_ssi_port: avoid calling runtime_pm_*_sync inside spinlock
HSI: omap_ssi_port: avoid pm_runtime_get_sync in ssi_start_dma and ssi_start_pio
HSI: omap_ssi_port: switch to threaded pio irq
HSI: omap_ssi_core: remove pm_runtime_get_sync call from tasklet
HSI: omap_ssi_core: use pm_runtime_put instead of pm_runtime_put_sync
HSI: omap_ssi_port: prepare start_tx/stop_tx for blocking pm_runtime calls
HSI: core: switch port event notifier from atomic to blocking
HSI: omap_ssi_port: replace wkin_cken with atomic bitmap operations
HSI: omap_ssi: convert cawake irq handler to thread
HSI: ssi_protocol: fix ssip_xmit invocation
HSI: ssi_protocol: replace spin_lock with spin_lock_bh
HSI: ssi_protocol: avoid ssi_waketest call with held spinlock
HSI: omap_ssi: do not reset module
HSI: omap_ssi_port: remove useless newline
hsi: Only descend into hsi directory when CONFIG_HSI is set
...
Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
Pull random driver updates from Ted Ts'o:
"A number of improvements for the /dev/random driver; the most
important is the use of a ChaCha20-based CRNG for /dev/urandom, which
is faster, more efficient, and easier to make scalable for
silly/abusive userspace programs that want to read from /dev/urandom
in a tight loop on NUMA systems.
This set of patches also improves entropy gathering on VM's running on
Microsoft Azure, and will take advantage of a hw random number
generator (if present) to initialize the /dev/urandom pool"
(It turns out that the random tree hadn't been in linux-next this time
around, because it had been dropped earlier as being too quiet. Oh
well).
* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
random: strengthen input validation for RNDADDTOENTCNT
random: add backtracking protection to the CRNG
random: make /dev/urandom scalable for silly userspace programs
random: replace non-blocking pool with a Chacha20-based CRNG
random: properly align get_random_int_hash
random: add interrupt callback to VMBus IRQ handler
random: print a warning for the first ten uninitialized random users
random: initialize the non-blocking pool via add_hwgenerator_randomness()
Merge tag 'media/v4.8-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media documentation updates from Mauro Carvalho Chehab:
"This patch series does the conversion of all media documentation stuff
to Restrutured Text markup format and add them to the
Documentation/index.rst file.
The media documentation was grouped into 4 books:
- media uAPI
- media kAPI
- V4L driver-specific documentation
- DVB driver-specific documentation
It also contains several documentation improvements and one fixup
patch for a core issue with cropcap.
PS. After this patch series, the media DocBook is deprecated and
should be removed. I'll add such patch on a future pull request"
* tag 'media/v4.8-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (322 commits)
[media] cx23885-cardlist.rst: add a new card
[media] doc-rst: add some needed escape codes
[media] doc-rst: kapi: use :c:func: instead of :cpp:func
doc-rst: kernel-doc: fix a change introduced by mistake
[media] v4l2-ioctl.h add debug info for struct v4l2_ioctl_ops
[media] dvb_ringbuffer.h: some documentation improvements
[media] v4l2-ctrls.h: fully document the header file
[media] doc-rst: Fix some typedef ugly warnings
[media] doc-rst: reorganize the kAPI v4l2 chapters
[media] rename v4l2-framework.rst to v4l2-intro.rst
[media] move V4L2 clocks to a separate .rst file
[media] v4l2-fh.rst: add cross references and markups
[media] v4l2-fh.rst: add fh contents from v4l2-framework.rst
[media] v4l2-fh.h: add documentation for it
[media] v4l2-event.rst: add cross-references and markups
[media] v4l2-event.h: document all functions
[media] v4l2-event.rst: add text from v4l2-framework.rst
[media] v4l2-framework.rst: remove videobuf quick chapter
[media] v4l2-dev: add cross-references and improve markup
[media] doc-rst: move v4l2-dev doc to a separate file
...
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This update includes the usual round of driver updates (fcoe, lpfc,
ufs, qla2xxx, hisi_sas). The most important other change is removing
the flag to allow non-blk_mq on a per host basis (it's unused); there
is still a global module parameter for all of SCSI just in case.
The rest are an assortment of minor fixes and typo updates"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (101 commits)
scsi:libsas: fix oops caused by assigning a freed task to ->lldd_task
fnic: pci_dma_mapping_error() doesn't return an error code
scsi: lpfc: avoid harmless comparison warning
fcoe: implement FIP VLAN responder
fcoe: Rename 'fip_frame' to 'fip_vn2vn_notify_frame'
lpfc: call lpfc_sli_validate_fcp_iocb() with the hbalock held
scsi: ufs: remove unnecessary goto label
hpsa: change hpsa_passthru_ioctl timeout
hpsa: correct skipping masked peripherals
qla2xxx: Update driver version to 8.07.00.38-k
qla2xxx: Fix BBCR offset
qla2xxx: Fix duplicate message id.
qla2xxx: Disable the adapter and skip error recovery in case of register disconnect.
qla2xxx: Separate ISP type bits out from device type.
qla2xxx: Correction to function qla26xx_dport_diagnostics().
qla2xxx: Add support to handle Loop Init error Asynchronus event.
qla2xxx: Let DPORT be enabled purely by nvram.
qla2xxx: Add bsg interface to support statistics counter reset.
qla2xxx: Add bsg interface to support D_Port Diagnostics.
qla2xxx: Check for device state before unloading the driver.
...
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
"Updates for the input subsystem. This contains the following new
drivers promised in the last merge window:
- driver for touchscreen controller found in Surface 3
- driver for Pegasus Notetaker tablet
- driver for Atmel Captouch Buttons
- driver for Raydium I2C touchscreen controllers
- powerkey driver for HISI 65xx SoC
plus a few fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (40 commits)
Input: tty/vt/keyboard - use memdup_user()
Input: pegasus_notetaker - set device mode in reset_resume() if in use
Input: pegasus_notetaker - cancel workqueue's work in suspend()
Input: pegasus_notetaker - fix usb_autopm calls to be balanced
Input: pegasus_notetaker - handle usb control msg errors
Input: wacom_w8001 - handle errors from input_mt_init_slots()
Input: wacom_w8001 - resolution wasn't set for ABS_MT_POSITION_X/Y
Input: pixcir_ts - add support for axis inversion / swapping
Input: icn8318 - use of_touchscreen helpers for inverting / swapping axes
Input: edt-ft5x06 - add support for inverting / swapping axes
Input: of_touchscreen - add support for inverted / swapped axes
Input: synaptics-rmi4 - use the RMI_F11_REL_BYTES define in rmi_f11_rel_pos_report
Input: synaptics-rmi4 - remove unneeded variable
Input: synaptics-rmi4 - remove pointer to rmi_function in f12_data
Input: synaptics-rmi4 - support regulator supplies
Input: raydium_i2c_ts - check CRC of incoming packets
Input: xen-kbdfront - prefer xenbus_write() over xenbus_printf() where possible
Input: fix a double word "is is" in include/linux/input.h
Input: add powerkey driver for HISI 65xx SoC
Input: apanel - spelling mistake - "skiping" -> "skipping"
...
Merge branch 'i2c/for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c updates from Wolfram Sang:
"Here is the I2C pull request for 4.8:
- the core and i801 driver gained support for SMBus Host Notify
- core support for more than one address in DT
- i2c_add_adapter() has now better error messages. We can remove all
error messages from drivers calling it as a next step.
- bigger updates to rk3x driver to support rk3399 SoC
- the at24 eeprom driver got refactored and can now read special
variants with unique serials or fixed MAC addresses.
The rest is regular driver updates and bugfixes"
* 'i2c/for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (66 commits)
i2c: i801: use IS_ENABLED() instead of checking for built-in or module
Documentation: i2c: slave: give proper example for pm usage
Documentation: i2c: slave: describe buffer problems a bit better
i2c: bcm2835: Don't complain on -EPROBE_DEFER from getting our clock
i2c: i2c-smbus: drop useless stubs
i2c: efm32: fix a failure path in efm32_i2c_probe()
Revert "i2c: core: Cleanup I2C ACPI namespace"
Revert "i2c: core: Add function for finding the bus speed from ACPI"
i2c: Update the description of I2C_SMBUS
i2c: i2c-smbus: fix i2c_handle_smbus_host_notify documentation
eeprom: at24: tweak the loop_until_timeout() macro
eeprom: at24: add support for at24mac series
eeprom: at24: support reading the serial number for 24csxx
eeprom: at24: platform_data: use BIT() macro
eeprom: at24: split at24_eeprom_write() into specialized functions
eeprom: at24: split at24_eeprom_read() into specialized functions
eeprom: at24: hide the read/write loop behind a macro
eeprom: at24: call read/write functions via function pointers
eeprom: at24: coding style fixes
eeprom: at24: move at24_read() below at24_eeprom_write()
...
Merge tag 'spi-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi updates from Mark Brown:
"Quite a lot of cleanup and maintainence work going on this release in
various drivers, and also a fix for a nasty locking issue in the core:
- A fix for locking issues when external drivers explicitly locked
the bus with spi_bus_lock() - we were using the same lock to both
control access to the physical bus in multi-threaded I/O operations
and exclude multiple callers.
Confusion between these two caused us to have scenarios where we
were dropping locks. These are fixed by splitting into two
separate locks like should have been done originally, making
everything much clearer and correct.
- Support for DMA in spi_flash_read().
- Support for instantiating spidev on ACPI systems, including some
test devices used in Windows validation.
- Use of the core DMA mapping functionality in the McSPI driver.
- Start of support for ThunderX SPI controllers, involving a very big
set of changes to the Cavium driver.
- Support for Braswell, Exynos 5433, Kaby Lake, Merrifield, RK3036,
RK3228, RK3368 controllers"
* tag 'spi-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (64 commits)
spi: Split bus and I/O locking
spi: octeon: Split driver into Octeon specific and common parts
spi: octeon: Move include file from arch/mips to drivers/spi
spi: octeon: Put register offsets into a struct
spi: octeon: Store system clock freqency in struct octeon_spi
spi: octeon: Convert driver to use readq()/writeq() functions
spi: pic32-sqi: fixup wait_for_completion_timeout return handling
spi: pic32: fixup wait_for_completion_timeout return handling
spi: rockchip: limit transfers to (64K - 1) bytes
spi: xilinx: Return IRQ_NONE if no interrupts were detected
spi: xilinx: Handle errors from platform_get_irq()
spi: s3c64xx: restore removed comments
spi: s3c64xx: add Exynos5433 compatible for ioclk handling
spi: s3c64xx: use error code from clk_prepare_enable()
spi: s3c64xx: rename goto labels to meaningful names
spi: s3c64xx: document the clocks and the clock-name property
spi: s3c64xx: add exynos5433 spi compatible
spi: s3c64xx: fix reference leak to master in s3c64xx_spi_remove()
spi: spi-sh: Remove deprecated create_singlethread_workqueue
spi: spi-topcliff-pch: Remove deprecated create_singlethread_workqueue
...
Merge tag 'leds_for_4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds
Pull LED updates from Jacek Anaszewski:
"New LED class driver:
- LED driver for TI LP3952 6-Channel Color LED
LED core improvements:
- Only descend into leds directory when CONFIG_NEW_LEDS is set
- Add no-op gpio_led_register_device when LED subsystem is disabled
- MAINTAINERS: Add file patterns for led device tree bindings
LED Trigger core improvements:
- return error if invalid trigger name is provided via sysfs
LED class drivers improvements
- is31fl32xx: define complete i2c_device_id table
- is31fl32xx: fix typo in id and match table names
- leds-gpio: Set of_node for created LED devices
- pca9532: Add device tree support
Conversion of IDE trigger to common disk trigger:
- leds: convert IDE trigger to common disk trigger
- leds: documentation: 'ide-disk' to 'disk-activity'
- unicore32: use the new LED disk activity trigger
- parisc: use the new LED disk activity trigger
- mips: use the new LED disk activity trigger
- arm: use the new LED disk activity trigger
- powerpc: use the new LED disk activity trigger"
* tag 'leds_for_4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds:
leds: is31fl32xx: define complete i2c_device_id table
leds: is31fl32xx: fix typo in id and match table names
leds: LED driver for TI LP3952 6-Channel Color LED
leds: leds-gpio: Set of_node for created LED devices
leds: triggers: return error if invalid trigger name is provided via sysfs
leds: Only descend into leds directory when CONFIG_NEW_LEDS is set
leds: Add no-op gpio_led_register_device when LED subsystem is disabled
unicore32: use the new LED disk activity trigger
parisc: use the new LED disk activity trigger
mips: use the new LED disk activity trigger
arm: use the new LED disk activity trigger
powerpc: use the new LED disk activity trigger
leds: documentation: 'ide-disk' to 'disk-activity'
leds: convert IDE trigger to common disk trigger
leds: pca9532: Add device tree support
MAINTAINERS: Add file patterns for led device tree bindings
Merge tag 'for-linus-4.8' of git://git.code.sf.net/p/openipmi/linux-ipmi
Pull IPMI updates from Corey Minyard:
"Remove some old cruft that was disabled by default a long time ago.
No modern hardware should need this, and anybody who really doesn't
have something to automatically detect IPMI can add the device by hand
on the module commandline or hot add it"
* tag 'for-linus-4.8' of git://git.code.sf.net/p/openipmi/linux-ipmi:
ipmi: remove trydefaults parameter and default init
Merge tag 'edac_for_4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
Pull EDAC updates from Borislav Petkov:
"This last cycle, Thor was busy adding Arria10 eth FIFO support to the
altera_edac driver along with other improvements. We have two
cleanups/fixes too.
Summary:
- Altera Arria10 ethernet FIFO buffer support (Thor Thayer)
- Minor cleanups"
* tag 'edac_for_4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
ARM: dts: Add Arria10 Ethernet EDAC devicetree entry
EDAC, altera: Add Arria10 Ethernet EDAC support
EDAC, altera: Add Arria10 ECC memory init functions
Documentation: dt: socfpga: Add Arria10 Ethernet binding
EDAC, altera: Drop some ifdeffery
EDAC, altera: Add panic flag check to A10 IRQ
EDAC, altera: Check parent status for Arria10 EDAC block
EDAC, altera: Make all private data structures static
EDAC: Correct channel count limit
EDAC, amd64_edac: Init opstate at the proper time during init
EDAC, altera: Handle Arria10 SDRAM child node
EDAC, altera: Add ECC Manager IRQ controller support
Documentation: dt: socfpga: Add interrupt-controller to ecc-manager
Several build configurations had already disabled this warning because
it generates a lot of false positives. But some had not, and it was
still enabled for "allmodconfig" builds, for example.
Looking at the warnings produced, every single one I looked at was a
false positive, and the warnings are frequent enough (and big enough)
that they can easily hide real problems that you don't notice in the
noise generated by -Wmaybe-uninitialized.
The warning is good in theory, but this is a classic case of a warning
that causes more problems than the warning can solve.
If gcc gets better at avoiding false positives, we may be able to
re-enable this warning. But as is, we're better off without it, and I
want to be able to see the *real* warnings.
1) Unified UDP encapsulation offload methods for drivers, from
Alexander Duyck.
2) Make DSA binding more sane, from Andrew Lunn.
3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.
4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.
5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
packets as soon as the device sees them, with the option to mirror
the packet on TX via the same interface. From Brenden Blanco and
others.
6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.
7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.
8) Simplify netlink conntrack entry layout, from Florian Westphal.
9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
Schimmel, Yotam Gigi, and Jiri Pirko.
10) Add SKB array infrastructure and convert tun and macvtap over to it.
From Michael S Tsirkin and Jason Wang.
11) Support qdisc packet injection in pktgen, from John Fastabend.
12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.
13) Add NV congestion control support to TCP, from Lawrence Brakmo.
14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.
15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.
16) Support MPLS over IPV4, from Simon Horman.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
xgene: Fix build warning with ACPI disabled.
be2net: perform temperature query in adapter regardless of its interface state
l2tp: Correctly return -EBADF from pppol2tp_getname.
net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
net: ipmr/ip6mr: update lastuse on entry change
macsec: ensure rx_sa is set when validation is disabled
tipc: dump monitor attributes
tipc: add a function to get the bearer name
tipc: get monitor threshold for the cluster
tipc: make cluster size threshold for monitoring configurable
tipc: introduce constants for tipc address validation
net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
MAINTAINERS: xgene: Add driver and documentation path
Documentation: dtb: xgene: Add MDIO node
dtb: xgene: Add MDIO node
drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
drivers: net: xgene: Use exported functions
drivers: net: xgene: Enable MDIO driver
drivers: net: xgene: Add backward compatibility
drivers: net: phy: xgene: Add MDIO driver
...
Merge tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from David Vrabel:
"Features and fixes for 4.8-rc0:
- ACPI support for guests on ARM platforms.
- Generic steal time support for arm and x86.
- Support cases where kernel cpu is not Xen VCPU number (e.g., if
in-guest kexec is used).
- Use the system workqueue instead of a custom workqueue in various
places"
* tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (47 commits)
xen: add static initialization of steal_clock op to xen_time_ops
xen/pvhvm: run xen_vcpu_setup() for the boot CPU
xen/evtchn: use xen_vcpu_id mapping
xen/events: fifo: use xen_vcpu_id mapping
xen/events: use xen_vcpu_id mapping in events_base
x86/xen: use xen_vcpu_id mapping when pointing vcpu_info to shared_info
x86/xen: use xen_vcpu_id mapping for HYPERVISOR_vcpu_op
xen: introduce xen_vcpu_id mapping
x86/acpi: store ACPI ids from MADT for future usage
x86/xen: update cpuid.h from Xen-4.7
xen/evtchn: add IOCTL_EVTCHN_RESTRICT
xen-blkback: really don't leak mode property
xen-blkback: constify instance of "struct attribute_group"
xen-blkfront: prefer xenbus_scanf() over xenbus_gather()
xen-blkback: prefer xenbus_scanf() over xenbus_gather()
xen: support runqueue steal time on xen
arm/xen: add support for vm_assist hypercall
xen: update xen headers
xen-pciback: drop superfluous variables
xen-pciback: short-circuit read path used for merging write values
...
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- Kexec support for arm64
- Kprobes support
- Expose MIDR_EL1 and REVIDR_EL1 CPU identification registers to sysfs
- Trapping of user space cache maintenance operations and emulation in
the kernel (CPU errata workaround)
- Clean-up of the early page tables creation (kernel linear mapping,
EFI run-time maps) to avoid splitting larger blocks (e.g. pmds) into
smaller ones (e.g. ptes)
- VDSO support for CLOCK_MONOTONIC_RAW in clock_gettime()
- ARCH_HAS_KCOV enabled for arm64
- Optimise IP checksum helpers
- SWIOTLB optimisation to only allocate/initialise the buffer if the
available RAM is beyond the 32-bit mask
- Properly handle the "nosmp" command line argument
- Fix for the initialisation of the CPU debug state during early boot
- vdso-offsets.h build dependency workaround
- Build fix when RANDOMIZE_BASE is enabled with MODULES off
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (64 commits)
arm64: arm: Fix-up the removal of the arm64 regs_query_register_name() prototype
arm64: Only select ARM64_MODULE_PLTS if MODULES=y
arm64: mm: run pgtable_page_ctor() on non-swapper translation table pages
arm64: mm: make create_mapping_late() non-allocating
arm64: Honor nosmp kernel command line option
arm64: Fix incorrect per-cpu usage for boot CPU
arm64: kprobes: Add KASAN instrumentation around stack accesses
arm64: kprobes: Cleanup jprobe_return
arm64: kprobes: Fix overflow when saving stack
arm64: kprobes: WARN if attempting to step with PSTATE.D=1
arm64: debug: remove unused local_dbg_{enable, disable} macros
arm64: debug: remove redundant spsr manipulation
arm64: debug: unmask PSTATE.D earlier
arm64: localise Image objcopy flags
arm64: ptrace: remove extra define for CPSR's E bit
kprobes: Add arm64 case in kprobe example module
arm64: Add kernel return probes support (kretprobes)
arm64: Add trampoline code for kretprobes
arm64: kprobes instruction simulation support
arm64: Treat all entry code as non-kprobe-able
...
Pull tile architecture updates from Chris Metcalf:
"A few stray changes"
* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
tile: Define AT_VECTOR_SIZE_ARCH for ARCH_DLINFO
tile: support gcc 7 optimization to use __multi3
tile 32-bit big-endian: fix bugs in syscall argument order
tile: allow disabling CONFIG_EARLY_PRINTK
Merge tag 'dlm-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set includes two trivial changes, one to use kmemdup and another
to control the log level of recovery messages"
* tag 'dlm-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: Use kmemdup instead of kmalloc and memcpy
dlm: add log_info config option
Merge tag 'for-f2fs-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"The major change in this version is mitigating cpu overheads on write
paths by replacing redundant inode page updates with mark_inode_dirty
calls. And we tried to reduce lock contentions as well to improve
filesystem scalability. Other feature is setting F2FS automatically
when detecting host-managed SMR.
Enhancements:
- ioctl to move a range of data between files
- inject orphan inode errors
- avoid flush commands congestion
- support lazytime
Bug fixes:
- return proper results for some dentry operations
- fix deadlock in add_link failure
- disable extent_cache for fcollapse/finsert"
* tag 'for-f2fs-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (68 commits)
f2fs: clean up coding style and redundancy
f2fs: get victim segment again after new cp
f2fs: handle error case with f2fs_bug_on
f2fs: avoid data race when deciding checkpoin in f2fs_sync_file
f2fs: support an ioctl to move a range of data blocks
f2fs: fix to report error number of f2fs_find_entry
f2fs: avoid memory allocation failure due to a long length
f2fs: reset default idle interval value
f2fs: use blk_plug in all the possible paths
f2fs: fix to avoid data update racing between GC and DIO
f2fs: add maximum prefree segments
f2fs: disable extent_cache for fcollapse/finsert inodes
f2fs: refactor __exchange_data_block for speed up
f2fs: fix ERR_PTR returned by bio
f2fs: avoid mark_inode_dirty
f2fs: move i_size_write in f2fs_write_end
f2fs: fix to avoid redundant discard during fstrim
f2fs: avoid mismatching block range for discard
f2fs: fix incorrect f_bfree calculation in ->statfs
f2fs: use percpu_rw_semaphore
...
Merge tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"The major addition is the new iomap based block mapping
infrastructure. We've been kicking this about locally for years, but
there are other filesystems want to use it too (e.g. gfs2). Now it
is fully working, reviewed and ready for merge and be used by other
filesystems.
There are a lot of other fixes and cleanups in the tree, but those are
XFS internal things and none are of the scale or visibility of the
iomap changes. See below for details.
I am likely to send another pull request next week - we're just about
ready to merge some new functionality (on disk block->owner reverse
mapping infrastructure), but that's a huge chunk of code (74 files
changed, 7283 insertions(+), 1114 deletions(-)) so I'm keeping that
separate to all the "normal" pull request changes so they don't get
lost in the noise.
Summary of changes in this update:
- generic iomap based IO path infrastructure
- generic iomap based fiemap implementation
- xfs iomap based Io path implementation
- buffer error handling fixes
- tracking of in flight buffer IO for unmount serialisation
- direct IO and DAX io path separation and simplification
- shortform directory format definition changes for wider platform
compatibility
- various buffer cache fixes
- cleanups in preparation for rmap merge
- error injection cleanups and fixes
- log item format buffer memory allocation restructuring to prevent
rare OOM reclaim deadlocks
- sparse inode chunks are now fully supported"
* tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (53 commits)
xfs: remove EXPERIMENTAL tag from sparse inode feature
xfs: bufferhead chains are invalid after end_page_writeback
xfs: allocate log vector buffers outside CIL context lock
libxfs: directory node splitting does not have an extra block
xfs: remove dax code from object file when disabled
xfs: skip dirty pages in ->releasepage()
xfs: remove __arch_pack
xfs: kill xfs_dir2_inou_t
xfs: kill xfs_dir2_sf_off_t
xfs: split direct I/O and DAX path
xfs: direct calls in the direct I/O path
xfs: stop using generic_file_read_iter for direct I/O
xfs: split xfs_file_read_iter into buffered and direct I/O helpers
xfs: remove s_maxbytes enforcement in xfs_file_read_iter
xfs: kill ioflags
xfs: don't pass ioflags around in the ioctl path
xfs: track and serialize in-flight async buffers against unmount
xfs: exclude never-released buffers from buftarg I/O accounting
xfs: don't reset b_retries to 0 on every failure
xfs: remove extraneous buffer flag changes
...
Tony Camuso [Wed, 22 Jun 2016 18:22:28 +0000 (14:22 -0400)]
ipmi: remove trydefaults parameter and default init
Parameter trydefaults=1 causes the ipmi_init to initialize ipmi through
the legacy port io space that was designated for ipmi. Architectures
that do not map legacy port io can panic when trydefaults=1.
Rather than implement build-time conditional exceptions for each
architecture that does not map legacy port io, we have removed legacy
port io from the driver.
Parameter 'trydefaults' has been removed. Attempts to use it hereafter
will evoke the "Unknown symbol in module, or unknown parameter" message.
The patch was built against a number of architectures and tested for
regressions and functionality on x86_64 and ARM64.
Signed-off-by: Tony Camuso <tcamuso@redhat.com>
Removed the config entry and the address source entry for default,
since neither were used any more.
* topic/docs-next: (322 commits)
[media] cx23885-cardlist.rst: add a new card
[media] doc-rst: add some needed escape codes
[media] doc-rst: kapi: use :c:func: instead of :cpp:func
doc-rst: kernel-doc: fix a change introduced by mistake
[media] v4l2-ioctl.h add debug info for struct v4l2_ioctl_ops
[media] dvb_ringbuffer.h: some documentation improvements
[media] v4l2-ctrls.h: fully document the header file
[media] doc-rst: Fix some typedef ugly warnings
[media] doc-rst: reorganize the kAPI v4l2 chapters
[media] rename v4l2-framework.rst to v4l2-intro.rst
[media] move V4L2 clocks to a separate .rst file
[media] v4l2-fh.rst: add cross references and markups
[media] v4l2-fh.rst: add fh contents from v4l2-framework.rst
[media] v4l2-fh.h: add documentation for it
[media] v4l2-event.rst: add cross-references and markups
[media] v4l2-event.h: document all functions
[media] v4l2-event.rst: add text from v4l2-framework.rst
[media] v4l2-framework.rst: remove videobuf quick chapter
[media] v4l2-dev: add cross-references and improve markup
[media] doc-rst: move v4l2-dev doc to a separate file
...
arm64: arm: Fix-up the removal of the arm64 regs_query_register_name() prototype
Commit 0a8ea52c3eb1 ("arm64: Add HAVE_REGS_AND_STACK_ACCESS_API
feature") inadvertently removed the arch/arm prototype instead of the
arm64 one introduced by the original patch. There should not be any
bisection issues since this function is not called from anywhere else
(it could as well be removed from arch/arm at some point).
David S. Miller [Wed, 27 Jul 2016 06:19:29 +0000 (23:19 -0700)]
xgene: Fix build warning with ACPI disabled.
drivers/net/ethernet/apm/xgene/xgene_enet_hw.c: In function 'xgene_enet_phy_connect':
drivers/net/ethernet/apm/xgene/xgene_enet_hw.c:759:22: warning: unused variable 'adev' [-Wunused-variable]
Fixes: 8089a96f601b ("drivers: net: xgene: Add backward compatibility") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
be2net: perform temperature query in adapter regardless of its interface state
The be2net driver performs fw temperature queries on be_worker() routine,
which is executed each second for each be_adapter. There is a frequency
threshold to avoid fw query to happens at each call to be_worker();
instead, currently a fw query occurs once in 64 runs of the procedure.
Nevertheless, this fw temperature query is invoked only for adapters which
interface is up, so we can see I/O errors on read of hwmon counters from
userspace (from tools like lm-sensors) in case we have adapters' functions
which interface is down.
This patch moves the fw query code to be invoked even if interface is down.
No functional changes were introduced.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Acked-by: Sathya Perla <sathya.perla@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>