git.karo-electronics.de Git - karo-tx-linux.git/log

mm: vmscan: flatten kswapd priority loop

kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
pages have been reclaimed or the pgdat is considered balanced.  It then
rechecks if it needs to restart at DEF_PRIORITY and whether high-order
reclaim needs to be reset.  This is not wrong per-se but it is confusing
to follow and forcing kswapd to stay at DEF_PRIORITY may require several
restarts before it has scanned enough pages to meet the high watermark
even at 100% efficiency.  This patch irons out the logic a bit by
controlling when priority is raised and removing the "goto loop_again".

This patch has kswapd raise the scanning priority until it is scanning
enough pages that it could meet the high watermark in one shrink of the
LRU lists if it is able to reclaim at 100% efficiency.  It will not raise
the scanning prioirty higher unless it is failing to reclaim any pages.

To avoid infinite looping for high-order allocation requests kswapd will
not reclaim for high-order allocations when it has reclaimed at least
twice the number of pages as the allocation request.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: vmscan: obey proportional scanning requirements for kswapd

Simplistically, the anon and file LRU lists are scanned proportionally
depending on the value of vm.swappiness although there are other factors
taken into account by get_scan_count(). The patch "mm: vmscan: Limit the
number of pages kswapd reclaims" limits the number of pages kswapd
reclaims but it breaks this proportional scanning and may evenly shrink
anon/file LRUs regardless of vm.swappiness.

This patch preserves the proportional scanning and reclaim. It does mean
that kswapd will reclaim more than requested but the number of pages will
be related to the high watermark.

[mhocko@suse.cz: Correct proportional reclaim for memcg and simplify]
[kamezawa.hiroyu@jp.fujitsu.com: Recalculate scan based on target]
[hannes@cmpxchg.org: Account for already scanned pages properly]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: vmscan: limit the number of pages kswapd reclaims at each priority

This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.

Changelog since V3
o Drop the slab shrink changes in light of Glaubers series and
  discussions highlighted that there were a number of potential
  problems with the patch. (mel)
o Rebased to 3.10-rc1

Changelog since V2
o Preserve ratio properly for proportional scanning (kamezawa)

Changelog since V1
o Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY (andi)
o Reformat comment in shrink_page_list (andi)
o Clarify some comments (dhillf)
o Rework how the proportional scanning is preserved
o Add PageReclaim check before kswapd starts writeback
o Reset sc.nr_reclaimed on every full zone scan

Kswapd and page reclaim behaviour has been screwy in one way or the other
for a long time.  Very broadly speaking it worked in the far past because
machines were limited in memory so it did not have that many pages to scan
and it stalled congestion_wait() frequently to prevent it going completely
nuts.  In recent times it has behaved very unsatisfactorily with some of
the problems compounded by the removal of stall logic and the introduction
of transparent hugepage support with high-order reclaims.

There are many variations of bugs that are rooted in this area.  One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap.  Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed.  In other cases an application starts and kswapd hits 100% CPU
usage for prolonged periods of time and so on.  There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it.  It's compounded
by the problem that it can be very workload and machine specific.

This series aims at addressing some of the worst of these problems without
attempting to fundmentally alter how page reclaim works.

Patches 1-2 limits the number of pages kswapd reclaims while still obeying
the anon/file proportion of the LRUs it should be scanning.

Patches 3-4 control how and when kswapd raises its scanning priority and
deletes the scanning restart logic which is tricky to follow.

Patch 5 notes that it is too easy for kswapd to reach priority 0 when
scanning and then reclaim the world. Down with that sort of thing.

Patch 6 notes that kswapd starts writeback based on scanning priority which
is not necessarily related to dirty pages. It will have kswapd
writeback pages if a number of unqueued dirty pages have been
recently encountered at the tail of the LRU.

Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
to reduce LRU churn and the likelihood that it'll reclaim young
clean pages or push applications to swap. It will cause kswapd
to block on IO if it detects that pages being reclaimed under
writeback are recycling through the LRU before the IO completes.

Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
are applied.

This was tested using memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times.  It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress.  The expectation is that the IO should
have little or no impact on memcachetest which is running entirely in
memory.

                                        3.10.0-rc1                  3.10.0-rc1
                                           vanilla            lessdisrupt-v4
Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)

Note how the vanilla kernels performance collapses when there is enough IO
taking place in the background.  This drop in performance is part of what
users complain of when they start backups.  Note how the swapin and major
fault figures indicate that processes were being pushed to swap
prematurely.  With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.

20 iterations of this test were run in total and averaged.  Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration.  So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground.  There are six blocks
of information reported here

memcachetest is the transactions/second reported by memcachetest. In
the vanilla kernel note that performance drops from around
22K/sec to just under 4K/second when there is 2385M of IO going
on in the background. This is one type of performance collapse
users complain about if a large cp or backup starts in the
background

io-duration refers to how long it takes for the background IO to
complete. It's showing that with the patched kernel that the IO
completes faster while not interfering with the memcache
workload

swaptotal is the total amount of swap traffic. With the patched kernel,
the total amount of swapping is much reduced although it is
still not zero.

swapin in this case is an indication as to whether we are swap trashing.
The closer the swapin/swapout ratio is to 1, the worse the
trashing is.  Note with the patched kernel that there is no swapin
activity indicating that all the pages swapped were really inactive
unused pages.

minorfaults are just minor faults. An increased number of minor faults
can indicate that page reclaim is unmapping the pages but not
swapping them out before they are faulted back in. With the
patched kernel, there is only a small change in minor faults

majorfaults are just major faults in the target workload and a high
number can indicate that a workload is being prematurely
swapped. With the patched kernel, major faults are much reduced. As
there are no swapin's recorded so it's not being swapped. The likely
explanation is that that libraries or configuration files used by
the workload during startup get paged out by the background IO.

Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.

                            3.10.0-rc1  3.10.0-rc1
                               vanilla lessdisrupt-v4
Page Ins                       1234608      101892
Page Outs                     12446272    11810468
Swap Ins                        283406           0
Swap Outs                       698469       27882
Direct pages scanned                 0      136480
Kswapd pages scanned           6266537     5369364
Kswapd pages reclaimed         1088989      930832
Direct pages reclaimed               0      120901
Kswapd efficiency                  17%         17%
Kswapd velocity               5398.371    4635.115
Direct efficiency                 100%         88%
Direct velocity                  0.000     117.817
Percentage direct scans             0%          2%
Page writes by reclaim         1655843     4009929
Page writes file                957374     3982047
Page writes anon                698469       27882
Page reclaim immediate            5245        1745
Page rescued immediate               0           0
Slabs scanned                    33664       25216
Direct inode steals                  0           0
Kswapd inode steals              19409         778
Kswapd skipped wait                  0           0
THP fault alloc                     35          30
THP collapse alloc                 472         401
THP splits                          27          22
THP fault fallback                   0           0
THP collapse fail                    0           1
Compaction stalls                    0           4
Compaction success                   0           0
Compaction failures                  0           4
Page migrate success                 0           0
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                      0           0
NUMA PTE updates                     0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA pages migrated                  0           0
AutoNUMA cost                        0           0

Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world.  ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd

     23 tclsh-3367
     38 memcachetest-13733
     49 memcachetest-12443
     57 tee-3368
   1541 dd-13826
   1981 dd-12539

A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage.  There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU.  This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.

This patch:

The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed.  It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.

This patch limits the number of pages kswapd will reclaim to the high
watermark.  Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities.  The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/THP: deposit the transpare huge pgtable before set_pmd

Architectures like powerpc use the deposited pgtable to store hash index
values. We need to make the deposted pgtable is visible to other cpus
before we are ready to take a hash fault.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

thp: define HPAGE_PMD_* constants as BUILD_BUG() if !THP

Currently, HPAGE_PMD_* constans rely on PMD_SHIFT regardless of
CONFIG_TRANSPARENT_HUGEPAGE.  PMD_SHIFT is not defined everywhere (e.g.
arm nommu case).

It means we can't use anything like this in generic code:

        if (PageTransHuge(page))
                zero_huge_user(page, 0, HPAGE_PMD_SIZE);
        else
                clear_highpage(page);

For !THP case, PageTransHuge() is 0 and compiler can eliminate
zero_huge_user() call.  But it still need to be valid C expression, means
HPAGE_PMD_SIZE has to expand to something compiler can understand.

Previously, HPAGE_PMD_* were defined to BUILD_BUG() for !THP. Let's come
back to it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/THP: don't use HPAGE_SHIFT in transparent hugepage code

For architectures like powerpc that support multiple explicit hugepage
sizes, HPAGE_SHIFT indicate the default explicit hugepage shift.  For THP
to work the hugepage size should be same as PMD_SIZE.  So use PMD_SHIFT
directly.  So move the define outside CONFIG_TRANSPARENT_HUGEPAGE #ifdef
because we want to use these defines in generic code with if
(pmd_trans_huge()) conditional.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/THP: withdraw the pgtable after pmdp related operations

For architectures like ppc64 we look at deposited pgtable when calling
pmdp_get_and_clear. So do the pgtable_trans_huge_withdraw after finishing
pmdp related operations.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/THP: add pmd args to pgtable deposit and withdraw APIs

This will be later used by powerpc THP support. In powerpc we want to use
pgtable for storing the hash index values. So instead of adding them to
mm_context list, we would like to store them in the second half of pmd

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/thp: use the correct function when updating access flags

We should use pmdp_set_access_flags to update access flags.  Archs like
powerpc use extra checks(_PAGE_BUSY) when updating a hugepage PTE.  A
set_pmd_at doesn't do those checks.  We should use set_pmd_at only when
updating a none hugepage PTE.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>a
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: don't re-init pageset in zone_pcp_update()

When memory hotplug is triggered, we call pageset_init() on
per-cpu-pagesets which both contain pages and are in use, causing both the
leakage of those pages and (potentially) bad behaviour if a page is
allocated from a pageset while it is being cleared.

Avoid this by factoring out pageset_set_high_and_batch() (which contains
all needed logic too set a pageset's ->high and ->batch inrespective of
system state) from zone_pageset_init() and using the new
pageset_set_high_and_batch() instead of zone_pageset_init() in
zone_pcp_update().

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch()

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init()

Previously, zone_pcp_update() called pageset_set_batch() directly,
essentially assuming that percpu_pagelist_fraction == 0. Correct this by
calling zone_pageset_init(), which chooses the appropriate ->batch and
->high calculations.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: relocate comment to be directly above code it refers to.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch()

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate high

Simply moves calculation of the new 'high' value outside the
for_each_possible_cpu() loop, as it does not depend on the cpu.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine()

zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a
percpu pageset based on a zone's ->managed_pages. We don't need to drain
the entire percpu pageset just to modify these fields.

This lets us avoid calling setup_pageset() (and the draining required to
call it) and instead allows simply setting the fields' values (with some
attention paid to memory barriers to prevent the relationship between
->batch and ->high from being thrown off).

This does change the behavior of zone_pcp_update() as the percpu pagesets
will not be drained when zone_pcp_update() is called (they will end up
being shrunk, not completely drained, later when a 0-order page is freed
in free_hot_cold_page()).

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE

pcp->batch could change at any point, avoid relying on it being a stable
value.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: insert memory barriers to allow async update of pcp batch and high

Introduce pageset_update() to perform a safe transision from one set of
pcp->{batch,high} to a new set using memory barriers.

This ensures that batch is always set to a safe value (1) prior to
updating high, and ensure that high is fully updated before setting the
real value of batch. It avoids ->batch ever rising above ->high.

Suggested by Gilad Ben-Yossef in these threads:

https://lkml.org/lkml/2013/4/9/23
https://lkml.org/lkml/2013/4/10/49

Also reproduces his proposed comment.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high

Because we are going to rely upon a careful transision between old and new
->high and ->batch values using memory barriers and will remove
stop_machine(), we need to prevent multiple updaters from interweaving
their memory writes.

Add a simple mutex to protect both update loops.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: factor out setting of pcp->high and pcp->batch

"Problems" with the current code:

1: there is a lack of synchronization in setting ->high and ->batch in
   percpu_pagelist_fraction_sysctl_handler()

2: stop_machine() in zone_pcp_update() is unnecissary.

3: zone_pcp_update() does not consider the case where
   percpu_pagelist_fraction is non-zero

To fix:

1: add memory barriers, a safe ->batch value, an update side mutex when
   updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users
   that expect a stable value.

2: avoid draining pages in zone_pcp_update(), rely upon the memory
   barriers added to fix #1

3: factor out quite a few functions, and then call the appropriate one.

Note that it results in a change to the behavior of zone_pcp_update(),
which is used by memory_hotplug.  I'm rather certain that I've diserned
(and preserved) the essential behavior (changing ->high and ->batch), and
only eliminated unneeded actions (draining the per cpu pages), but this
may not be the case.

Further note that the draining of pages that previously took place in
zone_pcp_update() occured after repeated draining when attempting to
offline a page, and after the offline has "succeeded".  It appears that
the draining was added to zone_pcp_update() to avoid refactoring
setup_pageset() into 2 funtions.

This patch:

Creates pageset_set_batch() for use in setup_pageset().
pageset_set_batch() imitates the functionality of
setup_pagelist_highmark(), but uses the boot time
(percpu_pagelist_fraction == 0) calculations for determining ->high based
on ->batch.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

uio: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT

(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
as a inline funcion vma_pages() in linux/mm.h, so using it.

Signed-off-by: Libin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ncpfs: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT

(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
as a inline funcion vma_pages() in linux/mm.h, so using it.

Signed-off-by: Libin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT

(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
as a inline funcion vma_pages() in linux/mm.h, so using it.

Signed-off-by: Libin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

swap: swapin_nr_pages() can be static

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

swap: add a simple detector for inappropriate swapin readahead

This is a patch to improve swap readahead algorithm. It's from Hugh and I
slightly changed it.

Hugh's original changelog:

swapin readahead does a blind readahead, whether or not the swapin
is sequential.  This may be ok on harddisk, because large reads have
relatively small costs, and if the readahead pages are unneeded they
can be reclaimed easily - though, what if their allocation forced
reclaim of useful pages?  But on SSD devices large reads are more
expensive than small ones: if the readahead pages are unneeded,
reading them in caused significant overhead.

This patch adds very simplistic random read detection.  Stealing
the PageReadahead technique from Konstantin Khlebnikov's patch,
avoiding the vma/anon_vma sophistications of Shaohua Li's patch,
swapin_nr_pages() simply looks at readahead's current success
rate, and narrows or widens its readahead window accordingly.
There is little science to its heuristic: it's about as stupid
as can be whilst remaining effective.

The table below shows elapsed times (in centiseconds) when running
a single repetitive swapping load across a 1000MB mapping in 900MB
ram with 1GB swap (the harddisk tests had taken painfully too long
when I used mem=500M, but SSD shows similar results for that).

Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes
his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1
patch which Shaohua showed to be defective; HughNew this Nov 14
patch, with page_cluster as usual at default of 3 (8-page reads);
HughPC4 this same patch with page_cluster 4 (16-page reads);
HughPC0 with page_cluster 0 (1-page reads: no readahead).

HDD for swapping to harddisk, SSD for swapping to VertexII SSD.
Seq for sequential access to the mapping, cycling five times around;
Rand for the same number of random touches.  Anon for a MAP_PRIVATE
anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

One weakness of Shaohua's vma/anon_vma approach was that it did
not optimize Shmem: seen below.  Konstantin's approach was perhaps
mistuned, 50% slower on Seq: did not compete and is not shown below.

HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     73921   76210   75611   76904   78191  121542
Seq Shmem    73601   73176   73855   72947   74543  118322
Rand Anon   895392  831243  871569  845197  846496  841680
Rand Shmem 1058375 1053486  827935  764955  764376  756489

SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     24634   24198   24673   25107   21614   70018
Seq Shmem    24959   24932   25052   25703   22030   69678
Rand Anon    43014   26146   28075   25989   26935   25901
Rand Shmem   45349   45215   28249   24268   24138   24332

These tests are, of course, two extremes of a very simple case:
under heavier mixed loads I've not yet observed any consistent
improvement or degradation, and wider testing would be welcome.

Shaohua Li:

Test shows Vanilla is slightly better in sequential workload than Hugh's patch.
I observed with Hugh's patch sometimes the readahead size is shrinked too fast
(from 8 to 1 immediately) in sequential workload if there is no hit. And in
such case, continuing doing readahead is good actually.

I don't prepare a sophisticated algorithm for the sequential workload because
so far we can't guarantee sequential accessed pages are swap out sequentially.
So I slightly change Hugh's heuristic - don't shrink readahead size too fast.

Here is my test result (unit second, 3 runs average):
Vanilla Hugh New
Seq 356 370 360
Random 4525 2447 2444

Attached graph is the swapin/swapout throughput I collected with 'vmstat 2'.
The first part is running a random workload (till around 1200 of the x-axis)
and the second part is running a sequential workload. swapin and swapout
throughput are almost identical in steady state in both workloads. These are
expected behavior. while in Vanilla, swapin is much bigger than swapout
especially in random workload (because wrong readahead).

Original patches by: Shaohua Li and Konstantin Khlebnikov.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm-remove-compressed-copy-from-zram-in-memory-fix-2-fix

invert unlikely() test, augment comment, 80-col cleanup

Cc: Artem Savkov <artem.savkov@gmail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

non-swapcache pages in end_swap_bio_read()

There is no guarantee that page in end_swap_bio_read is in swapcache so we
need to check it before calling page_swap_info().  Otherwise kernel hits a
bug on like the one below.  Introduced in "mm: remove compressed copy from
zram in-memory"

kernel BUG at mm/swapfile.c:2361!
invalid opcode: 0000 [#1] SMP
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.0-rc4-next-20130607+ #61
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
task: ffff88001e5ccfc0 ti: ffff88001e5ea000 task.ti: ffff88001e5ea000
RIP: 0010:[<ffffffff811462eb>]  [<ffffffff811462eb>] page_swap_info+0xab/0xb0
RSP: 0000:ffff88001ec03c78  EFLAGS: 00010246
RAX: 0100000000000009 RBX: ffffea0000794780 RCX: 0000000000000c0b
RDX: 0000000000000046 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88001ec03c88 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000001 R14: ffff88001e7f6200 R15: 0000000000001000
FS:  0000000000000000(0000) GS:ffff88001ec00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000000240b000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffffea0000794780 ffff88001e7f6200 ffff88001ec03cb8 ffffffff81145486
ffff88001e5cd5c0 ffff88001c02cd20 0000000000001000 0000000000000000
ffff88001ec03cc8 ffffffff81199518 ffff88001ec03d28 ffffffff81518ec3
Call Trace:
<IRQ>
[<ffffffff81145486>] end_swap_bio_read+0x96/0x130
[<ffffffff81199518>] bio_endio+0x18/0x40
[<ffffffff81518ec3>] blk_update_request+0x213/0x540
[<ffffffff81518fa0>] ? blk_update_request+0x2f0/0x540
[<ffffffff817986a6>] ? ata_hsm_qc_complete+0x46/0x130
[<ffffffff81519212>] blk_update_bidi_request+0x22/0x90
[<ffffffff8151b9ea>] blk_end_bidi_request+0x2a/0x80
[<ffffffff8151ba8b>] blk_end_request+0xb/0x10
[<ffffffff817693aa>] scsi_io_completion+0xaa/0x6b0
[<ffffffff817608d8>] scsi_finish_command+0xc8/0x130
[<ffffffff81769aff>] scsi_softirq_done+0x13f/0x160
[<ffffffff81521ebc>] blk_done_softirq+0x7c/0x90
[<ffffffff81049030>] __do_softirq+0x130/0x3f0
[<ffffffff810d454e>] ? handle_irq_event+0x4e/0x70
[<ffffffff81049405>] irq_exit+0xa5/0xb0
[<ffffffff81003cb1>] do_IRQ+0x61/0xe0
[<ffffffff81c2832f>] common_interrupt+0x6f/0x6f
<EOI>
[<ffffffff8107ebff>] ? local_clock+0x4f/0x60
[<ffffffff81c27f85>] ? _raw_spin_unlock_irq+0x35/0x50
[<ffffffff81c27f7b>] ? _raw_spin_unlock_irq+0x2b/0x50
[<ffffffff81078bd0>] finish_task_switch+0x80/0x110
[<ffffffff81078b93>] ? finish_task_switch+0x43/0x110
[<ffffffff81c2525c>] __schedule+0x32c/0x8c0
[<ffffffff81c2c010>] ? notifier_call_chain+0x150/0x150
[<ffffffff81c259d4>] schedule+0x24/0x70
[<ffffffff81c25d42>] schedule_preempt_disabled+0x22/0x30
[<ffffffff81093645>] cpu_startup_entry+0x335/0x380
[<ffffffff81c1ed7e>] start_secondary+0x217/0x219
Code: 69 bc 16 82 48 c7 c7 77 bc 16 82 31 c0 49 c1 ec 39 49 c1 e9 10 41 83 e1 01 e8 6c d2 ad 00 5b 4a 8b 04 e5 e0 bf 14 83 41 5c c9 c3 <0f> 0b eb fe 90 48 8b 07 55 48 89 e5 a9 00 00 01 00 74 12 e8 3d
RIP  [<ffffffff811462eb>] page_swap_info+0xab/0xb0
RSP <ffff88001ec03c78>

Signed-off-by: Artem Savkov <artem.savkov@gmail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm-remove-compressed-copy-from-zram-in-memory-fix

tweak comment text

Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: remove compressed copy from zram in-memory

Swap subsystem does lazy swap slot free with expecting the page would be
swapped out again so we can avoid unnecessary write.

But the problem in in-memory swap(ex, zram) is that it consumes memory
space until vm_swap_full(ie, used half of all of swap device) condition
meet.  It could be bad if we use multiple swap device, small in-memory
swap and big storage swap or in-memory swap alone.

This patch makes swap subsystem free swap slot as soon as swap-read is
completed and make the swapcache page dirty so the page should be written
out the swap device to reclaim it.  It means we never lose it.

I tested this patch with kernel compile workload.

1. before

compile time : 9882.42
zram max wasted space by fragmentation: 13471881 byte
memory space consumed by zram: 174227456 byte
the number of slot free notify: 206684

2. after

compile time : 9653.90
zram max wasted space by fragmentation: 11805932 byte
memory space consumed by zram: 154001408 byte
the number of slot free notify: 426972

* changelog from v3
  * Rebased on next-20130508

* changelog from v1
  * Add more comment

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: remove free_area_cache

Since all architectures have been converted to use vm_unmapped_area(),
there is no remaining use for the free_area_cache.

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, memcg: don't take task_lock in task_in_mem_cgroup

For processes that have detached their mm's, task_in_mem_cgroup()
unnecessarily takes task_lock() when rcu_read_lock() is all that is
necessary to call mem_cgroup_from_task().

While we're here, switch task_in_mem_cgroup() to return bool.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

pagemap: prepare to reuse constant bits with page-shift

In order to reuse bits from pagemap entries gracefully, we leave the
entries as is but on pagemap open emit a warning in dmesg, that bits 55-60
are about to change in a couple of releases.  Next, if a user issues
soft-dirty clear command via the clear_refs file (it was disabled before
v3.9) we assume that he's aware of the new pagemap format, note that fact
and report the bits in pagemap in the new manner.

The "migration strategy" looks like this then:

1. existing users are not affected -- they don't touch soft-dirty feature, thus
   see old bits in pagemap, but are warned and have time to fix themselves
2. those who use soft-dirty know about new pagemap format
3. some time soon we get rid of any signs of page-shift in pagemap as well as
   this trick with clear-soft-dirty affecting pagemap format.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

soft-dirty: call mmu notifiers when write-protecting ptes

As noticed by Xiao, since soft-dirty clear command modifies page tables we
have to flush tlbs and call mmu notifiers.  While the former is done by
the clear_refs engine itself, the latter is to be done.

One thing to note about this -- in order not to call per-page invalidate
notifier (_all_ address space is about to be changed), the
_invalidate_range_start and _end are used.  But for those start and end
are not known exactly.  To address this, the same trick as in exit_mmap()
is used -- start is 0 and end is (unsigned long)-1.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: soft-dirty bits for user memory changes tracking

The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should

  1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
  2. Wait some time.
  3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)

To do this tracking, the writable bit is cleared from PTEs when the
soft-dirty bit is. Thus, after this, when the task tries to modify a page
at some virtual address the #PF occurs and the kernel sets the soft-dirty
bit on the respective PTE.

Note, that although all the task's address space is marked as r/o after the
soft-dirty bits clear, the #PF-s that occur after that are processed fast.
This is so, since the pages are still mapped to physical memory, and thus
all the kernel does is finds this fact out and puts back writable, dirty
and soft-dirty bits on the PTE.

Another thing to note, is that when mremap moves PTEs they are marked with
soft-dirty as well, since from the user perspective mremap modifies the
virtual memory at mremap's new address.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

pagemap-introduce-pagemap_entry_t-without-pmshift-bits-v4

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

pagemap: introduce pagemap_entry_t without pmshift bits

These bits are always constant (== PAGE_SHIFT) and just occupy space in
the entry. Moreover, in next patch we will need to report one more bit in
the pagemap, but all bits are already busy on it.

That said, describe the pagemap entry that has 6 more free zero bits.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

clear_refs: introduce private struct for mm_walk

In the next patch the clear-refs-type will be required in
clear_refs_pte_range funciton, so prepare the walk->private to carry this
info.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

clear_refs: sanitize accepted commands declaration

This is the implementation of the soft-dirty bit concept that should help
keep track of changes in user memory, which in turn is very-very required
by the checkpoint-restore project (http://criu.org).

To create a dump of an application(s) we save all the information about it
to files, and the biggest part of such dump is the contents of tasks' memory.
However, there are usage scenarios where it's not required to get _all_ the
task memory while creating a dump. For example, when doing periodical dumps,
it's only required to take full memory dump only at the first step and then
take incremental changes of memory. Another example is live migration. We
copy all the memory to the destination node without stopping all tasks, then
stop them, check for what pages has changed, dump it and the rest of the state,
then copy it to the destination node. This decreases freeze time significantly.

That said, some help from kernel to watch how processes modify the contents
of their memory is required.

The proposal is to track changes with the help of new soft-dirty bit this way:

1. First do "echo 4 > /proc/$pid/clear_refs".
   At that point kernel clears the soft dirty _and_ the writable bits from all
   ptes of process $pid. From now on every write to any page will result in #pf
   and the subsequent call to pte_mkdirty/pmd_mkdirty, which in turn will set
   the soft dirty flag.

2. Then read the /proc/$pid/pagemap2 and check the soft-dirty bit reported there
   (the 55'th one). If set, the respective pte was written to since last call
   to clear refs.

The soft-dirty bit is the _PAGE_BIT_HIDDEN one. Although it's used by kmemcheck,
the latter one marks kernel pages with it, while the former bit is put on user
pages so they do not conflict to each other.

This patch:

A new clear-refs type will be added in the next patch, so prepare
code for that.

[akpm@linux-foundation.org: don't assume that sizeof(enum clear_refs_types) == sizeof(int)]
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

crypto: sanitize argument for format string

The template lookup interface does not provide a way to use format
strings, so make sure that the interface cannot be abused accidentally.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

watchdog: trigger all-cpu backtrace when locked up and going to panic

Send an NMI to all CPUs when a lockup is detected and the lockup watchdog
code is configured to panic. This gives us a fairly uptodate snapshot of
all CPUs in the system.

This lets us get stack trace of all CPUs which makes life easier trying to
debug a deadlock, and the NMI doesn't change anything since the next step
is a kernel panic.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

llist: llist_add() can use llist_add_batch()

llist_add(new, head) can simply use llist_add_batch(new, new, head),
no need to duplicate the code.

This obviously uninlines llist_add() and to me this is a win. But we
can make llist_add_batch() inline if this is desirable, in this case
gcc can notice that new_first == new_last if the caller is llist_add().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

llist: fix/simplify llist_add() and llist_add_batch()

1. This is mostly theoretical, but llist_add*() need ACCESS_ONCE().

   Otherwise it is not guaranteed that the first cmpxchg() uses the
   same value for old_entry and new_last->next.

2. These helpers cache the result of cmpxchg() and read the initial
   value of head->first before the main loop. I do not think this
   makes sense. In the likely case cmpxchg() succeeds, otherwise
   it doesn't hurt to reload head->first.

   I think it would be better to simplify the code and simply read
   ->first before cmpxchg().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fput: turn "list_head delayed_fput_list" into llist_head

fput() and delayed_fput() can use llist and avoid the locking.

This is unlikely path, it is not that this change can improve
the performance, but this way the code looks simpler.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

block: restore /proc/partitions to not display non-partitionable removable devices

We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.

Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.

Signed-off-by: Josh Hunt <johunt@akamai.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

block: do not pass disk names as format strings

Disk names may contain arbitrary strings, so they must not be interpreted
as format strings. It seems that only md allows arbitrary strings to be
used for disk names, but this could allow for a local memory corruption
from uid 0 into ring 0.

CVE-2013-2851

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/cdrom/cdrom.c: use kzalloc() for failing hardware

In drivers/cdrom/cdrom.c mmc_ioctl_cdrom_read_data() allocates a memory
area with kmalloc in line 2885.

2885         cgc->buffer = kmalloc(blocksize, GFP_KERNEL);
2886         if (cgc->buffer == NULL)
2887                 return -ENOMEM;

In line 2908 we can find the copy_to_user function:

2908         if (!ret && copy_to_user(arg, cgc->buffer, blocksize))

The cgc->buffer is never cleaned and initialized before this function.  If
ret = 0 with the previous basic block, it's possible to display some
memory bytes in kernel space from userspace.

When we read a block from the disk it normally fills the ->buffer but if
the drive is malfunctioning there is a chance that it would only be
partially filled.  The result is an leak information to userspace.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

block/compat_ioctl.c: do not leak info to user-space

There is a hole in struct hd_geometry, so we have to zero the struct on
stack before copying it to user-space.

Signed-off-by: Cong Wang <amwang@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/cdrom/gdrom.c: fix device number leak

Without this patch, gdrom_major will leak when gd.cd_info alloc fails.

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/net/irda/donauboe.c: convert to module_pci_driver

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/scsi/mvumi.c: convert to module_pci_driver

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/scsi/initio.c: convert to module_pci_driver

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/scsi/dmx3191d.c: convert to module_pci_driver

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/scsi/dc395x.c: convert to module_pci_driver

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/scsi/a100u2w.c: convert to module_pci_driver

Use module_pci_driver instead of init/exit, make code clean.

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

lglock: update lockdep annotations to report recursive local locks

Oleg Nesterov recently noticed that the lockdep annotations in lglock.c
are not sufficient to detect some obvious deadlocks, such as
lg_local_lock(LOCK) + lg_local_lock(LOCK) or spin_lock(X) +
lg_local_lock(Y) vs lg_local_lock(Y) + spin_lock(X).

Both issues are easily fixed by indicating to lockdep that lglock's local
locks are not recursive. We shouldn't use the rwlock acquire/release
functions here, as lglock doesn't share the same semantics. Instead we
can base our lockdep annotations on the lock_acquire_shared (for local
lglock) and lock_acquire_exclusive (for global lglock) helpers.

I am not proposing new lglock specific helpers as I don't see the point of
the existing second level of helpers :)

Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

lockdep: introduce lock_acquire_exclusive/shared helper macros

In lockdep.h, the spinlock/mutex/rwsem/rwlock/lock_map acquire macros have
different definitions based on the value of CONFIG_PROVE_LOCKING. We have
separate ifdefs for each of these definitions, which seems redundant.

Introduce lock_acquire_{exclusive,shared,shared_recursive} helpers which
will have different definitions based on CONFIG_PROVE_LOCKING. Then all
other helper macros can be defined based on the above ones, which reduces
the amount of ifdefined code.

Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

include/linux/sched.h: don't use task->pid/tgid in same_thread_group/has_group_leader_pid

task_struct->pid/tgid should go away.

1. Change same_thread_group() to use task->signal for comparison.

2. Change has_group_leader_pid(task) to compare task_pid(task) with
signal->leader_pid.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sergey Dyasly <dserrg@gmail.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

softirq: use _RET_IP_

Use the already defined macro to pass the function return address.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/ocfs2/cluster/tcp.c: free sc->sc_page in sc_kref_release()

There is a memory leak in sc_kref_release(). When free struct
o2net_sock_container (sc), we should release sc->sc_page.

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/ocfs2/journal.h: add bits_wanted while calculating credits in ocfs2_calc_extend_credits

While adding extends to a file, the credits are calculated incorrectly and
if the requested clusters is more than one (or more because we used a
conservative limit) then we run out of journal credits and we hit an
assert in journalling code.

The function parameter bits_wanted variable was not used at all.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: fix mutex_unlock and possible memory leak in ocfs2_remove_btree_range

In ocfs2_remove_btree_range, when calling ocfs2_lock_refcount_tree and
ocfs2_prepare_refcount_change_for_del failed, it goes to out and then
tries to call mutex_unlock without mutex_lock before. And when calling
ocfs2_reserve_blocks_for_rec_trunc failed, it should free ref_tree before
return.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: remove unecessary variable needs_checkpoint

Code cleanup: needs_checkpoint is assigned to but never used. Delete the
variable.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Jeff Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: add missing dlm_put() in dlm_begin_reco_handler()

dlm_begin_reco_handler() returns without putting dlm when dlm recovery
state is DLM_RECO_STATE_FINALIZE.

Signed-off-by: joyce <xuejiufei@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: should not use le32_add_cpu to set ocfs2_dinode i_flags

If we use le32_add_cpu to set ocfs2_dinode i_flags, it may lead to the
corresponding flag corrupted. So we should change it to bitwise and/or
operation.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: shencanquan <shencanquan@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/ocfs2/dlm/dlmrecovery.c:dlm_request_all_locks(): ret should be int instead of enum

In dlm_request_all_locks, ret is type enum.  But o2net_send_message
returns a type int value.  Then it will never run into the following error
branch.  So we should change the ret type from enum to int.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/ocfs2/dlm/dlmrecovery.c: remove duplicate declarations

Below 3 functions have already been declared in dlmcommon.h, so we have
no need to declare them again in dlmrecovery.c.
dlm_complete_recovery_thread
dlm_launch_recovery_thread
dlm_kick_recovery_thread

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

isdn: clean up debug format string usage

Avoid unneeded local string buffers for constructing debug output. Also
cleans up debug calls that contain a single parameter so that they cannot
be accidentally parsed as format strings.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/atm/he.c: convert to module_pci_driver

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: Chas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mISDN: add support for group membership check

This patch adds a module parameter to allow a group access to the mISDN
devices. Otherwise, unpriviledged users on systems with ISDN hardware
have the ability to dial out, potentially causing expensive bills.

Based on a different implementation by Patrick Koppen.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Cc: Patrick Koppen <isdn4linux@koppen.de>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/net/ethernet/ibm/ehea/ehea_main.c: add alias entry for portN properties

Use separate table for alias entries in the ehea module, otherwise the
probe() function will operate on the separate ports instead of the
lhea-"root" entry of the device-tree

Addresses https://bugzilla.novell.com/show_bug.cgi?id=435215

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Olaf Hering <ohering@suse.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

configfs: use capped length for ->store_attribute()

The difference between "count" and "len" is that "len" is capped at 4095.
Changing it like this makes it match how sysfs_write_file() is
implemented.

This is a static analysis patch. I haven't found any store_attribute()
functions where this change makes a difference.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

virtio_balloon: leak_balloon(): only tell host if we got pages deflated

balloon_page_dequeue() can return NULL. If it does for the first page
being freed then leak_balloon() will create a scatter list with len=0.
Which in turn seems to generate an invalid virtio request.

I didn't get this in practice, I found it by code review. On the other
hand, such an invalid virtio request will cause errors in QEMU and
fill_balloon() also performs the same check implemented by this commit.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/mtd/chips/gen_probe.c: refactor call to request_module()

This reduces the size of the stack frame when calling request_module().
Performing the sprintf before the call is not needed.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/ide/delkin_cb.c: convert to module_pci_driver

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

scripts/setlocalversion on write-protected source tree

I just stumbled across another[0] issue when scripts/setlocalversion
operates on a write-protected source tree.  Back then[0] the source tree
was on an read-only NFS share, so "test -w" was introduced before "git
update-index" was run.

This time, the source tree is on read/write NFS share, but the permissions
are world-readable and only a specific user (or root) can write.  Thus,
"test -w ." returns "0" and then runs "git update-index", producing the
following message (on a dirty tree):

  fatal: Unable to create '/usr/local/src/linux-git/.git/index.lock': Permission denied

While it says "fatal", compilation continues just fine.

However, I don't think a kernel compilation should alter the source tree
(or the .git directory) in any way and I don't see how removing "git
update-index" could do any harm.  The Mercurial and SVN routines in
scripts/setlocalversion don't have any tree-modifying commands, AFAICS.
So, maybe the patch below would be acceptable.

[0] https://patchwork.kernel.org/patch/29718/

Signed-off-by: Christian Kujau <lists@nerdbynature.de>
Cc: Nico Schottelius <nico-linuxsetlocalversion@schottelius.org>
Cc: Michal Marek <mmarek@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/infiniband/core/cm.c: convert to using idr_alloc_cyclic()

commit 3e6628c4b347 ("idr: introduce idr_alloc_cyclic()") adds a new
idr_alloc_cyclic routine and converts several of these users to it. This
is just a missed one - add it.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

hrtimer: one more expiry time overflow check in hrtimer_interrupt

When executing a date command to set the system date and time to a few
seconds before the 2038 problem expiration time, we got a WARN_ON_ONCE()
like this:

  root@renesas:~# date -s "2038-1-19 3:14:00"
  Tue Jan 19 03:14:00 GMT 2038
  (then wait for 7-8 seconds)
  root@renesas:~# [   27.662658] ------------[ cut here ]------------
  [   27.667297] WARNING: at kernel/time/clockevents.c:209 clockevents_program_event+0x3c/0x138()
  [   27.675720] Modules linked in:
  [   27.678802] [<c00130ec>] (unwind_backtrace+0x0/0xe0) from [<c001f4d8>] (warn_slowpath_common+0x4c/0x64)
  [   27.688201] [<c001f4d8>] (warn_slowpath_common+0x4c/0x64) from [<c001f508>] (warn_slowpath_null+0x18/0x1c)
  [   27.697845] [<c001f508>] (warn_slowpath_null+0x18/0x1c) from [<c00549bc>] (clockevents_program_event+0x3c/0x138)
  [   27.708007] [<c00549bc>] (clockevents_program_event+0x3c/0x138) from [<c005510c>] (tick_program_event+0x2c/0x34)
  [   27.718170] [<c005510c>] (tick_program_event+0x2c/0x34) from [<c003fa98>] (hrtimer_interrupt+0x268/0x2a8)
  [   27.727752] [<c003fa98>] (hrtimer_interrupt+0x268/0x2a8) from [<c00180c8>] (cmt_timer_interrupt+0x2c/0x34)
  [   27.737396] [<c00180c8>] (cmt_timer_interrupt+0x2c/0x34) from [<c0066748>] (handle_irq_event_percpu+0xb0/0x2a8)
  [   27.747467] [<c0066748>] (handle_irq_event_percpu+0xb0/0x2a8) from [<c0066998>] (handle_irq_event+0x58/0x74)
  [   27.757293] [<c0066998>] (handle_irq_event+0x58/0x74) from [<c0068f24>] (handle_fasteoi_irq+0xc0/0x148)
  [   27.766662] [<c0068f24>] (handle_fasteoi_irq+0xc0/0x148) from [<c0066014>] (generic_handle_irq+0x20/0x30)
  [   27.776245] [<c0066014>] (generic_handle_irq+0x20/0x30) from [<c000ef54>] (handle_IRQ+0x60/0x84)
  [   27.785003] [<c000ef54>] (handle_IRQ+0x60/0x84) from [<c0009334>] (gic_handle_irq+0x34/0x4c)
  [   27.793426] [<c0009334>] (gic_handle_irq+0x34/0x4c) from [<c000e2c0>] (__irq_svc+0x40/0x70)
  [   27.801788] Exception stack(0xc04aff68 to 0xc04affb0)
  [   27.806823] ff60:                   00000000 f0100000 00000001 00000000 c04ae000 c04ec388
  [   27.815002] ff80: c04b604c c0840d80 40004059 412fc093 00000000 00000000 c04ce140 c04affb0
  [   27.823150] ffa0: c000f064 c000f068 60000013 ffffffff
  [   27.828216] [<c000e2c0>] (__irq_svc+0x40/0x70) from [<c000f068>] (default_idle+0x24/0x2c)
  [   27.836395] [<c000f068>] (default_idle+0x24/0x2c) from [<c000f338>] (cpu_idle+0x74/0xc8)
  [   27.844451] [<c000f338>] (cpu_idle+0x74/0xc8) from [<c048c6d4>] (start_kernel+0x248/0x288)
  [   27.852722] ---[ end trace 9d8ad385bde80fd3 ]---
  [   27.857330] hrtimer: interrupt took 0 ns

This is triggered with our v3.4-based custom ARM kernel, but we confirmed
that v3.10-rc can still have the same problem.

I found a similar issue fixed in v3.9 by Prarit Bhargava in commit
8f294b5a13 ("hrtimer: Add expiry time overflow check in
hrtimer_interrupt", 2013-04-08).  It tried to resolve a overflow issue
detected around 1970 + 100 seconds.

On the other hand, we have another call site of tick_program_event() at
the bottom of hrtimer_interrupt().  The warning this time is triggered
there, so we need to apply the same fix to it.

Signed-off-by: Shinya Kuribayashi <shinya.kuribayashi.px@renesas.com>
Reported-by: Hiroyuki Yokoyama <hiroyuki.yokoyama.vx@renesas.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

kernel/timer.c: fix jiffies wrap behavior of round_jiffies*()

Make sure that the round_jiffies*() functions return a time that is
in the future when the jiffies counter has recently wrapped.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

posix_timers: fix racy timer delta caching on task exit

When a task exits, we perform a caching of the remaining cputime delta
before expiring of its timers.

This is done from the following places:

* When the task is reaped. We iterate through its list of
  posix cpu timers and store the remaining timer delta to
  the timer struct instead of the absolute value.
  (See posix_cpu_timers_exit() / posix_cpu_timers_exit_group() )

* When we call posix_cpu_timer_get() or posix_cpu_timer_schedule().
  If the timer's task is considered dying when watched from these
  places, the same conversion from absolute to relative expiry time
  is performed. Then the given task's reference is released.
  (See clear_dead_task() ).

The relevance of this caching is questionable but this is another
and deeper debate.

The big issue here is that these two sources of caching don't mix
up very well together.

More specifically, the caching can easily be done twice, resulting
in a wrong delta as it gets spuriously substracted a second time by
the elapsed clock. This can happen in the following scenario:

1) The task exits and gets reaped: we call posix_cpu_timers_exit()
   and the absolute timer expiry values are converted to a relative
   delta.

2) timer_gettime() -> posix_cpu_timer_get() is called and relies on
   clear_dead_task() because  tsk->exit_state == EXIT_DEAD.
   The delta gets substracted again by the elapsed clock and we return
   a wrong result.

To fix this, just remove the caching done on task reaping time.  It
doesn't bring much value on its own.  The caching done from
posix_cpu_timer_get/schedule is enough.

And it would also be hard to get it really right: we could make it put and
clear the target task in the timer struct so that readers know if they are
dealing with a relative cached of absolute value.  But it would be racy.
The only safe way to do it would be to lock the itimer->it_lock so that we
know nobody reads the cputime expiry value while we modify it and its
target task reference.  Doing so would involve some funny workarounds to
avoid circular lock against the sighand lock.  There is just no reason to
maintain this.

The user visible effect of this patch can be observed by running the
following code: it creates a subthread that launches a posix cputimer
which expires after 10 seconds. But then the subthread only busy loops for 2
seconds and exits. The parent reaps the subthread and read the timer value.
Its expected value should the be the initial timer's expiration value
minus the cputime elapsed in the subthread. Roughly 10 - 2 = 8 seconds:

#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>

static timer_t id;
static struct itimerspec val = { .it_value.tv_sec = 10, }, new;

static void *thread(void *unused)
{
int err;
struct timeval start, end, diff;

timer_create(CLOCK_THREAD_CPUTIME_ID, NULL, &id);
if (err < 0) {
perror("Can't create timer\n");
return NULL;
}

/* Arm 10 sec timer */
err = timer_settime(id, 0, &val, NULL);
if (err < 0) {
perror("Can't set timer\n");
return NULL;
}

/* Exit after 2 seconds of execution */
gettimeofday(&start, NULL);
        do {
gettimeofday(&end, NULL);
timersub(&end, &start, &diff);
} while (diff.tv_sec < 2);

return NULL;
}

int main(int argc, char **argv)
{
pthread_t pthread;
int err;

err = pthread_create(&pthread, NULL, thread, NULL);
if (err) {
perror("Can't create thread\n");
return -1;
}
pthread_join(pthread, NULL);
/* Just wait a little bit to make sure the child got reaped */
sleep(1);
err = timer_gettime(id, &new);
if (err)
perror("Can't get timer value\n");
printf("%d %ld\n", new.it_value.tv_sec, new.it_value.tv_nsec);

return 0;
}

Before the patch:

       $ ./posix_cpu_timers
       6 2278074

After the patch:

      $ ./posix_cpu_timers
      8 1158766

Before the patch, the elapsed time got two more seconds spuriously accounted.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

posix-timers: correctly get dying task time sample in posix_cpu_timer_schedule()

In order to re-arm a timer after it fired, we take a sample of the current
process or thread cputime.

If the task is dying though, we don't arm anything but we cache the
remaining timer expiration delta for further reads.

Something similar is performed in posix_cpu_timer_get() but here we forget
to take the process wide cputime sample before caching it.

As a result we are storing random stack content, leading every further
reads of that timer to return junk values.

Fix this by taking the appropriate sample in the case of process wide
timers.

This probably doesn't matter much in practice because, at this stage, the
thread is the last one in the group and we reached exit_notify().  This
implies that we called exit_itimers() and there should be no more timers
to handle for that task.

So this is likely dead code anyway but let's fix the current logic
and the warning that came along:

    kernel/posix-cpu-timers.c: In function 'posix_cpu_timer_schedule':
    kernel/posix-cpu-timers.c:1127: warning: 'now' may be used uninitialized in this function

Then we can start to think further about cleaning up that code.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests: add basic posix timers selftests

Add some initial basic tests on a few posix timers interface such as
setitimer() and timer_settime().

These simply check that expiration happens in a reasonable timeframe after
expected elapsed clock time (user time, user + system time, real time,
...).

This is helpful for finding basic breakages while hacking
on this subsystem.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

posix_cpu_timers: consolidate expired timers check

Consolidate the common code amongst per thread and per process timers list
on tick time.

List traversal, expiry check and subsequent updates can be shared in a
common helper.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

posix_cpu_timers: consolidate timer list cleanups

Cleaning up the posix cpu timers on task exit shares some common code
among timer list types, most notably the list traversal and expiry time
update.

Unify this in a common helper.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

posix_cpu_timer: consolidate expiry time type

The posix cpu timer expiry time is stored in a union of two types: a 64
bits field if we rely on scheduler precise accounting, or a cputime_t if
we rely on jiffies.

This results in quite some duplicate code and special cases to handle the
two types.

Just unify this into a single 64 bits field. cputime_t can always fit
into it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers-iommu-msm_iommu_devc-fix-leak-and-clean-up-error-paths-fix

remove now-unneeded initialization of ctx_drvdata, remove unneeded braces

Cc: David Brown <davidb@codeaurora.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Libo Chen <clbchenlibo.chen@huawei.com>
Cc: Libo Chen <libo.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/iommu/msm_iommu_dev.c: fix leak and clean up error paths

Fix two obvious problems:

1. We have registered msm_iommu_driver first, and need unregister it
when registered msm_iommu_ctx_driver fail

2. We don`t need to kfree drvdata before kzalloc successful.

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Acked-by: David Brown <davidb@codeaurora.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fsnotify: update comments concerning locking scheme

There have been changes in the locking scheme of fsnotify but the comments
in the source code have not been updated yet. This patch corrects this.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

inotify: fix race when adding a new watch

In inotify_new_watch() the number of watches for a group is compared
against the max number of allowed watches and increased afterwards. The
check and incrementation is not done atomically, so it is possible for
multiple concurrent threads to pass the check and increment the number of
marks above the allowed max.

This patch uses an inotify groups mark_lock to ensure that both check and
incrementation are done atomic. Furthermore we dont have to worry about
the race that allows a concurrent thread to add a watch just after
inotify_update_existing_watch() returned with -ENOENT anymore, since this
is also synchronized by the groups mark mutex now.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

dnotify: replace dnotify_mark_mutex with mark mutex of dnotify_group

There is no need to use a special mutex to protect against the fcntl/close
race (see dnotify.c for a description of this race). Instead the
dnotify_groups mark mutex can be used.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fanotify: put duplicate code for adding vfsmount/inode marks into an own function

The code under the groups mark_mutex in fanotify_add_inode_mark() and
fanotify_add_vfsmount_mark() is almost identical. So put it into a
seperate function.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fanotify: fix races when adding/removing marks

For both adding an event to an existing mark and destroying a mark we
first have to find it via fsnotify_find_[inode|vfsmount]_mark().  But
getting the mark and adding an event (or destroying it) is not done
atomically.  This opens a race where a thread is about to destroy a mark
while another thread still finds the same mark and adds an event to its
mask although it will be destroyed.

Another race exists concerning the excess of a groups number of marks
limit: When a mark is added the number of group marks is checked against
the max number of marks per group and increased afterwards.  Since check
and increment is also not done atomically, this may result in 2 or more
processes passing the check at the same time and increasing the number of
group marks above the allowed limit.

With this patch both races are avoided by doing the concerning operations
with the groups mark mutex locked.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fanotify: info leak in copy_event_to_user()

The ->reserverd field isn't cleared so we leak one byte of stack
information to userspace.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cyber2000fb: avoid palette corruption at higher clocks

When 1280x1024@75Hz mode is set, console palette is not set properly -
sometimes the background is white, sometimes yellow and text colors are
also messed up.  This does not happen at 1280x1024@60Hz and below.

It seems that the HW needs some time before setting the palette - maybe
the PLL needs more time to lock at higher speeds.  This patch fixes the
problem but without knowing what register to check for PLL lock(?), the
delay might be excessive.

On Fri, 28 Jan 2011 18:15:37 +0000
Russell King <rmk@arm.linux.org.uk> wrote:

> On Tue, Jan 18, 2011 at 01:14:24PM -0800, Andrew Morton wrote:
> > Russell, I have an (old) note here that this is awaiting an ack from
> > yourself?
>
> Well, I can reproduce this problem on the Netwinders here.  I'm not sure
> that we should delay all mode switches by one second - and any attempt
> to reduce this value does result in the palette not being set correctly.
>
> For 1280x1024-75, the dotclock is 135MHz, which gives a PLL values of
> 0x41 and 0x06.  That's: M=0x41+1, N=0x06+1, P=0x00 (top 2 bits of 0x06)
> -> Q=1
>
> Fpll = 14.31818MHz * M / N
> Fout = Fpll / Q
>
> The PLL itself is formed by dividing the 14-ish MHz frequency by N and
> phase comparing the output of the VCO, divided by M, and adjusting the
> VCO until the two correlate.  As VCOs typically tend to have a limited
> range, it's normal to divide the output frequency to produce a greater
> range - and in this case that's done by Q.
>
> For the 800x600-100 copied from /etc/fb.modes, this has a dotclock of
> 67.5MHz, which is exactly half this rate.  The PLL values for this are:
> M=0x41+1, N=0x06+1, P=0x01, giving PLL values of 0x41 and 0x46.
>
> Booting with 800x600-100 does not suffer the problem.  So it's not
> related to PLL lock time.  There's something else going on.
>
> Another experiment I tried was forcing the PLL values to produce 108MHz
> instead of 135MHz.  108MHz is the dotclock for 1280x1024-60.  This too
> doesn't suffer the problem.
>
> I've also tried chosing other delay values.  100ms is too short and
> produces the problem, but 1s works.  1s for a PLL to lock is a hell of
> a time, especially for a PLL operating in the MHz range.
>
> I've tried setting the PLL to a known good freqency, and then switching
> to 135MHz - the problem persists.  It's not like 135MHz is reaching the
> limits - it'll go up to 206MHz.
>
> So, I don't think this has anything to do with PLL locking.  I think
> there's something else going on which isn't immediately obvious - maybe
> bandwidth starvation preventing us from writing properly to the palette?
> As it's a horrible VGA, where you write the same register multiple times
> I wouldn't be surprised if some writes were going missing.
>
> I'll see if I can play around with it some more this evening, but I've
> spent an awful long time on just this issue already this afternoon...
>
> I think further investigation needs to happen on this patch before it's
> acceptable.  Or maybe we should prevent the cyberpro coming up in

Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/video/acornfb.c: remove dead code

acornfb checks for HAS_VIDC while support for that macro was removed in
v2.6.23 (when the arm26 port was removed). So we can remove a bit of dead
code.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/video/imxfb.c: make local symbols static

These symbols are used only in this file. Make them static.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/video/udlfb.c: make local symbol static

'dlfb_handle_damage' is used only in this file. Make it static.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/video/udlfb.c: use NULL instead of 0

Pointer variables should be initialized with NULL instead of 0.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/video/smscufx.c: use NULL instead of 0

'info' is a pointer. Use NULL instead of 0.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Acked-by: Steve Glendinning <steve.glendinning@shawell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/media/pci/pt1/pt1.c: convert to module_pci_driver

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>