git.karo-electronics.de Git - karo-tx-linux.git/log

hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages

Use an mmu_gather instead of a temporary linked list for accumulating
pages when we unmap a hugepage range. This also allows us to get rid of
i_mmap_mutex unmap_hugepage_range in the following patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

hugetlbfs: add an inline helper for finding hstate index

Add an inline helper and use it in the code.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

hugetlbfs: don't use ERR_PTR with VM_FAULT* values

The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the
VM_FAULT_* values from MAX_ERRNO.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

hugetlb: rename max_hstate to hugetlb_max_hstate

This patchset implements a memory controller extension to control HugeTLB
allocations.  The extension allows to limit the HugeTLB usage per control
group and enforces the controller limit during page fault.  Since HugeTLB
doesn't support page reclaim, enforcing the limit at page fault time
implies that, the application will get SIGBUS signal if it tries to access
HugeTLB pages beyond its limit.  This requires the application to know
beforehand how much HugeTLB memory it would require for its use.

The goal is to control how many HugeTLB pages a group of tasks can
allocate.  It can be looked at as an extension of the existing quota
interface which limits the number of HugeTLB pages per hugetlbfs
superblock.  An HPC job scheduler requires jobs to specify their resource
requirements in the job file.  Once their requirements can be met, job
schedulers (such as SLURM) will schedule the job.  We need to make sure
that the jobs won't consume more resources than requested.  If they do we
should either error out or kill the application.

This patch:

Rename max_hstate to hugetlb_max_hstate.  We will be using this from other
subsystems like memcg in later patches.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: vmscan: remove reclaim_mode_t

There is little motiviation for reclaim_mode_t once RECLAIM_MODE_[A]SYNC
and lumpy reclaim have been removed. This patch gets rid of
reclaim_mode_t as well and improves the documentation about what
reclaim/compaction is and when it is triggered.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ying Han <yinghan@google.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: vmscan: do not stall on writeback during memory compaction

This patch stops reclaim/compaction entering sync reclaim as this was only
intended for lumpy reclaim and an oversight. Page migration has its own
logic for stalling on writeback pages if necessary and memory compaction
is already using it.

Waiting on page writeback is bad for a number of reasons but the primary
one is that waiting on writeback to a slow device like USB can take a
considerable length of time. Page reclaim instead uses
wait_iff_congested() to throttle if too many dirty pages are being
scanned.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ying Han <yinghan@google.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: vmscan: remove lumpy reclaim

This series removes lumpy reclaim and some stalling logic that was
unintentionally being used by memory compaction.  The end result is that
stalling on dirty pages during page reclaim now depends on
wait_iff_congested().

Four kernels were compared

3.3.0     vanilla
3.4.0-rc2 vanilla
3.4.0-rc2 lumpyremove-v2 is patch one from this series
3.4.0-rc2 nosync-v2r3 is the full series

Removing lumpy reclaim saves almost 900 bytes of text whereas the full
series removes 1200 bytes.

   text    data     bss     dec     hex filename
6740375 1927944 2260992 10929311 a6c49f vmlinux-3.4.0-rc2-vanilla
6739479 1927944 2260992 10928415 a6c11f vmlinux-3.4.0-rc2-lumpyremove-v2
6739159 1927944 2260992 10928095 a6bfdf vmlinux-3.4.0-rc2-nosync-v2

There are behaviour changes in the series and so tests were run with
monitoring of ftrace events.  This disrupts results so the performance
results are distorted but the new behaviour should be clearer.

fs-mark running in a threaded configuration showed little of interest as
it did not push reclaim aggressively

FS-Mark Multi Threaded
                        3.3.0-vanilla       rc2-vanilla       lumpyremove-v2r3       nosync-v2r3
Files/s  min           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
Files/s  mean          3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
Files/s  stddev        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)
Files/s  max           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
Overhead min      508667.00 ( 0.00%)   521350.00 (-2.49%)   544292.00 (-7.00%)   547168.00 (-7.57%)
Overhead mean     551185.00 ( 0.00%)   652690.73 (-18.42%)   991208.40 (-79.83%)   570130.53 (-3.44%)
Overhead stddev    18200.69 ( 0.00%)   331958.29 (-1723.88%)  1579579.43 (-8578.68%)     9576.81 (47.38%)
Overhead max      576775.00 ( 0.00%)  1846634.00 (-220.17%)  6901055.00 (-1096.49%)   585675.00 (-1.54%)
MMTests Statistics: duration
Sys Time Running Test (seconds)             309.90    300.95    307.33    298.95
User+Sys Time Running Test (seconds)        319.32    309.67    315.69    307.51
Total Elapsed Time (seconds)               1187.85   1193.09   1191.98   1193.73

MMTests Statistics: vmstat
Page Ins                                       80532       82212       81420       79480
Page Outs                                  111434984   111456240   111437376   111582628
Swap Ins                                           0           0           0           0
Swap Outs                                          0           0           0           0
Direct pages scanned                           44881       27889       27453       34843
Kswapd pages scanned                        25841428    25860774    25861233    25843212
Kswapd pages reclaimed                      25841393    25860741    25861199    25843179
Direct pages reclaimed                         44881       27889       27453       34843
Kswapd efficiency                                99%         99%         99%         99%
Kswapd velocity                            21754.791   21675.460   21696.029   21649.127
Direct efficiency                               100%        100%        100%        100%
Direct velocity                               37.783      23.375      23.031      29.188
Percentage direct scans                           0%          0%          0%          0%

ftrace showed that there was no stalling on writeback or pages submitted
for IO from reclaim context.

postmark was similar and while it was more interesting, it also did not
push reclaim heavily.

POSTMARK
                                     3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
Transactions per second:               16.00 ( 0.00%)    20.00 (25.00%)    18.00 (12.50%)    17.00 ( 6.25%)
Data megabytes read per second:        18.80 ( 0.00%)    24.27 (29.10%)    22.26 (18.40%)    20.54 ( 9.26%)
Data megabytes written per second:     35.83 ( 0.00%)    46.25 (29.08%)    42.42 (18.39%)    39.14 ( 9.24%)
Files created alone per second:        28.00 ( 0.00%)    38.00 (35.71%)    34.00 (21.43%)    30.00 ( 7.14%)
Files create/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
Files deleted alone per second:       556.00 ( 0.00%)  1224.00 (120.14%)  3062.00 (450.72%)  6124.00 (1001.44%)
Files delete/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)

MMTests Statistics: duration
Sys Time Running Test (seconds)             113.34    107.99    109.73    108.72
User+Sys Time Running Test (seconds)        145.51    139.81    143.32    143.55
Total Elapsed Time (seconds)               1159.16    899.23    980.17   1062.27

MMTests Statistics: vmstat
Page Ins                                    13710192    13729032    13727944    13760136
Page Outs                                   43071140    42987228    42733684    42931624
Swap Ins                                           0           0           0           0
Swap Outs                                          0           0           0           0
Direct pages scanned                               0           0           0           0
Kswapd pages scanned                         9941613     9937443     9939085     9929154
Kswapd pages reclaimed                       9940926     9936751     9938397     9928465
Direct pages reclaimed                             0           0           0           0
Kswapd efficiency                                99%         99%         99%         99%
Kswapd velocity                             8576.567   11051.058   10140.164    9347.109
Direct efficiency                               100%        100%        100%        100%
Direct velocity                                0.000       0.000       0.000       0.000

It looks like here that the full series regresses performance but as
ftrace showed no usage of wait_iff_congested() or sync reclaim I am
assuming it's a disruption due to monitoring.  Other data such as memory
usage, page IO, swap IO all looked similar.

Running a benchmark with a plain DD showed nothing very interesting.  The
full series stalled in wait_iff_congested() slightly less but stall times
on vanilla kernels were marginal.

Running a benchmark that hammered on file-backed mappings showed stalls
due to congestion but not in sync writebacks

MICRO
                                     3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
MMTests Statistics: duration
Sys Time Running Test (seconds)             308.13    294.50    298.75    299.53
User+Sys Time Running Test (seconds)        330.45    316.28    318.93    320.79
Total Elapsed Time (seconds)               1814.90   1833.88   1821.14   1832.91

MMTests Statistics: vmstat
Page Ins                                      108712      120708       97224      110344
Page Outs                                  155514576   156017404   155813676   156193256
Swap Ins                                           0           0           0           0
Swap Outs                                          0           0           0           0
Direct pages scanned                         2599253     1550480     2512822     2414760
Kswapd pages scanned                        69742364    71150694    68839041    69692533
Kswapd pages reclaimed                      34824488    34773341    34796602    34799396
Direct pages reclaimed                         53693       94750       61792       75205
Kswapd efficiency                                49%         48%         50%         49%
Kswapd velocity                            38427.662   38797.901   37799.972   38022.889
Direct efficiency                                 2%          6%          2%          3%
Direct velocity                             1432.174     845.464    1379.807    1317.446
Percentage direct scans                           3%          2%          3%          3%
Page writes by reclaim                             0           0           0           0
Page writes file                                   0           0           0           0
Page writes anon                                   0           0           0           0
Page reclaim immediate                             0           0           0        1218
Page rescued immediate                             0           0           0           0
Slabs scanned                                  15360       16384       13312       16384
Direct inode steals                                0           0           0           0
Kswapd inode steals                             4340        4327        1630        4323

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                 0          0          0          0
Direct time   congest     waited               0ms        0ms        0ms        0ms
Direct full   congest     waited                 0          0          0          0
Direct number conditional waited               900        870        754        789
Direct time   conditional waited               0ms        0ms        0ms       20ms
Direct full   conditional waited                 0          0          0          0
KSwapd number congest     waited              2106       2308       2116       1915
KSwapd time   congest     waited          139924ms   157832ms   125652ms   132516ms
KSwapd full   congest     waited              1346       1530       1202       1278
KSwapd number conditional waited             12922      16320      10943      14670
KSwapd time   conditional waited               0ms        0ms        0ms        0ms
KSwapd full   conditional waited                 0          0          0          0

Reclaim statistics are not radically changed.  The stall times in kswapd
are massive but it is clear that it is due to calls to congestion_wait()
and that is almost certainly the call in balance_pgdat().  Otherwise
stalls due to dirty pages are non-existant.

I ran a benchmark that stressed high-order allocation.  This is very
artifical load but was used in the past to evaluate lumpy reclaim and
compaction.  Generally I look at allocation success rates and latency
figures.

STRESS-HIGHALLOC
                 3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
Pass 1          81.00 ( 0.00%)    28.00 (-53.00%)    24.00 (-57.00%)    28.00 (-53.00%)
Pass 2          82.00 ( 0.00%)    39.00 (-43.00%)    38.00 (-44.00%)    43.00 (-39.00%)
while Rested    88.00 ( 0.00%)    87.00 (-1.00%)    88.00 ( 0.00%)    88.00 ( 0.00%)

MMTests Statistics: duration
Sys Time Running Test (seconds)             740.93    681.42    685.14    684.87
User+Sys Time Running Test (seconds)       2922.65   3269.52   3281.35   3279.44
Total Elapsed Time (seconds)               1161.73   1152.49   1159.55   1161.44

MMTests Statistics: vmstat
Page Ins                                     4486020     2807256     2855944     2876244
Page Outs                                    7261600     7973688     7975320     7986120
Swap Ins                                       31694           0           0           0
Swap Outs                                      98179           0           0           0
Direct pages scanned                           53494       57731       34406      113015
Kswapd pages scanned                         6271173     1287481     1278174     1219095
Kswapd pages reclaimed                       2029240     1281025     1260708     1201583
Direct pages reclaimed                          1468       14564       16649       92456
Kswapd efficiency                                32%         99%         98%         98%
Kswapd velocity                             5398.133    1117.130    1102.302    1049.641
Direct efficiency                                 2%         25%         48%         81%
Direct velocity                               46.047      50.092      29.672      97.306
Percentage direct scans                           0%          4%          2%          8%
Page writes by reclaim                       1616049           0           0           0
Page writes file                             1517870           0           0           0
Page writes anon                               98179           0           0           0
Page reclaim immediate                        103778       27339        9796       17831
Page rescued immediate                             0           0           0           0
Slabs scanned                                1096704      986112      980992      998400
Direct inode steals                              223      215040      216736      247881
Kswapd inode steals                           175331       61548       68444       63066
Kswapd skipped wait                            21991           0           1           0
THP fault alloc                                    1         135         125         134
THP collapse alloc                               393         311         228         236
THP splits                                        25          13           7           8
THP fault fallback                                 0           0           0           0
THP collapse fail                                  3           5           7           7
Compaction stalls                                865        1270        1422        1518
Compaction success                               370         401         353         383
Compaction failures                              495         869        1069        1135
Compaction pages moved                        870155     3828868     4036106     4423626
Compaction move failure                        26429       23865       29742       27514

Success rates are completely hosed for 3.4-rc2 which is almost certainly
due to [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled].
I expected this would happen for kswapd and impair allocation success
rates (https://lkml.org/lkml/2012/1/25/166) but I did not anticipate this
much a difference: 80% less scanning, 37% less reclaim by kswapd

In comparison, reclaim/compaction is not aggressive and gives up easily
which is the intended behaviour.  hugetlbfs uses __GFP_REPEAT and would be
much more aggressive about reclaim/compaction than THP allocations are.
The stress test above is allocating like neither THP or hugetlbfs but is
much closer to THP.

Mainline is now impaired in terms of high order allocation under heavy
load although I do not know to what degree as I did not test with
__GFP_REPEAT.  Keep this in mind for bugs related to hugepage pool
resizing, THP allocation and high order atomic allocation failures from
network devices.

In terms of congestion throttling, I see the following for this test

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                 3          0          0          0
Direct time   congest     waited               0ms        0ms        0ms        0ms
Direct full   congest     waited                 0          0          0          0
Direct number conditional waited               957        512       1081       1075
Direct time   conditional waited               0ms        0ms        0ms        0ms
Direct full   conditional waited                 0          0          0          0
KSwapd number congest     waited                36          4          3          5
KSwapd time   congest     waited            3148ms      400ms      300ms      500ms
KSwapd full   congest     waited                30          4          3          5
KSwapd number conditional waited             88514        197        332        542
KSwapd time   conditional waited            4980ms        0ms        0ms        0ms
KSwapd full   conditional waited                49          0          0          0

The "conditional waited" times are the most interesting as this is
directly impacted by the number of dirty pages encountered during scan.
As lumpy reclaim is no longer scanning contiguous ranges, it is finding
fewer dirty pages.  This brings wait times from about 5 seconds to 0.
kswapd itself is still calling congestion_wait() so it'll still stall but
it's a lot less.

In terms of the type of IO we were doing, I see this

FTrace Reclaim Statistics: mm_vmscan_writepage
Direct writes anon  sync                         0          0          0          0
Direct writes anon  async                        0          0          0          0
Direct writes file  sync                         0          0          0          0
Direct writes file  async                        0          0          0          0
Direct writes mixed sync                         0          0          0          0
Direct writes mixed async                        0          0          0          0
KSwapd writes anon  sync                         0          0          0          0
KSwapd writes anon  async                    91682          0          0          0
KSwapd writes file  sync                         0          0          0          0
KSwapd writes file  async                   822629          0          0          0
KSwapd writes mixed sync                         0          0          0          0
KSwapd writes mixed async                        0          0          0          0

In 3.2, kswapd was doing a bunch of async writes of pages but
reclaim/compaction was never reaching a point where it was doing sync IO.
This does not guarantee that reclaim/compaction was not calling
wait_on_page_writeback() but I would consider it unlikely.  It indicates
that merging patches 2 and 3 to stop reclaim/compaction calling
wait_on_page_writeback() should be safe.

This patch:

Lumpy reclaim had a purpose but in the mind of some, it was to kick the
system so hard it trashed.  For others the purpose was to complicate
vmscan.c.  Over time it was giving softer shoes and a nicer attitude but
memory compaction needs to step up and replace it so this patch sends
lumpy reclaim to the farm.

The tracepoint format changes for isolating LRU pages with this patch
applied.  Furthermore reclaim/compaction can no longer queue dirty pages
in pageout() if the underlying BDI is congested.  Lumpy reclaim used this
logic and reclaim/compaction was using it in error.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ying Han <yinghan@google.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: remove swap token code

The swap token code no longer fits in with the current VM model.  It does
not play well with cgroups or the better NUMA placement code in
development, since we have only one swap token globally.

It also has the potential to mess with scalability of the system, by
increasing the number of non-reclaimable pages on the active and inactive
anon LRU lists.

Last but not least, the swap token code has been broken for a year without
complaints, as reported by Konstantin Khlebnikov.  This suggests we no
longer have much use for it.

The days of sub-1G memory systems with heavy use of swap are over.  If we
ever need thrashing reducing code in the future, we will have to implement
something that does scale.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Bob Picco <bpicco@meloft.net>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, thp: allow fallback when pte_alloc_one() fails for huge pmd

The transparent hugepages feature is careful to not invoke the oom killer
when a hugepage cannot be allocated.

pte_alloc_one() failing in __do_huge_pmd_anonymous_page(), however,
currently results in VM_FAULT_OOM which invokes the pagefault oom killer
to kill a memory-hogging task.

This is unnecessary since it's possible to drop the reference to the
hugepage and fallback to allocating a small page.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, thp: remove unnecessary ret variable

The "ret" variable is unnecessary in __do_huge_pmd_anonymous_page(), so
remove it.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/hugetlb.c: use long vars instead of int in region_count()

args f & t and fields from & to of struct file_region are defined as long.
Use long instead of int to type the temp vars.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mempolicy.c: use enum value MPOL_REBIND_ONCE in mpol_rebind_policy()

We have enum definition in mempolicy.h: MPOL_REBIND_ONCE. It should
replace the magic number 0 for step comparison in function
mpol_rebind_policy.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memory_failure: let the compiler add the function name

These things tend to get out of sync with time so let the compiler
automatically enter the current function name using __func__.

No functional change.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

brlocks/lglocks: cleanups

lglocks and brlocks are currently generated with some complicated macros
in lglock.h.  But there's no reason to not just use common utility
functions and put all the data into a common data structure.

Since there are at least two users it makes sense to share this code in a
library.  This is also easier maintainable than a macro forest.

This will also make it later possible to dynamically allocate lglocks and
also use them in modules (this would both still need some additional, but
now straightforward, code)

In general the users now look more like normal function calls with
pointers, not magic macros.

The patch is rather large because I move over all users in one go to keep
it bisectable.  This impacts the VFS somewhat in terms of lines changed.
But no actual behaviour change.

[akpm@linux-foundation.org: checkpatch fixes]
[levinsasha928@gmail.com: fix dup_mnt_ns()]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fsnotify: handle subfiles' perm events

Recently I'm working on fanotify and found the following strange
behaviors.

I wrote a program to set fanotify_mark on "/tmp/block" and FAN_DENY
all events notified.

fanotify_mask = FAN_ALL_EVENTS | FAN_ALL_PERM_EVENTS | FAN_EVENT_ON_CHILD:
$ cd /tmp/block; cat foo
cat: foo: Operation not permitted

Operation on the file is blocked as expected.

But,

fanotify_mask = FAN_ALL_PERM_EVENTS | FAN_EVENT_ON_CHILD:
$ cd /tmp/block; cat foo
aaa

It's not blocked anymore. This is confusing behavior. Also reading
commit "fsnotify: call fsnotify_parent in perm events", it seems like
fsnotify should handle subfiles' perm events as well as the other notify
events.

With this patch, regardless of FAN_ALL_EVENTS set or not:
$ cd /tmp/block; cat foo
cat: foo: Operation not permitted

Operation on the file is now blocked properly.

FS_OPEN_PERM and FS_ACCESS_PERM are not listed on FS_EVENTS_POSS_ON_CHILD.
Due to fsnotify_inode_watches_children() check, if you only specify only
these events as fsnotify_mask, you don't get subfiles' perm events
notified.

This patch add the events to FS_EVENTS_POSS_ON_CHILD to get them notified
even if only these events are specified to fsnotify_mask.

Signed-off-by: Naohiro Aota <naota@elisp.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fsnotify: remove unused parameter from send_to_group()

We don't use "mnt" anymore in send_to_group() after 1968f5eed5 ("fanotify:
use both marks when possible") was applied.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs: hardlink creation restrictions

On systems that have user-writable directories on the same partition as
system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp.  The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e.  a root process follows a hardlink created by another
user).  Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.

The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.

Many Linux users are surprised when they learn they can link to files they
have no access to, so this change appears to follow the doctrine of "least
surprise".  Additionally, this change does not violate POSIX, which states
"the implementation may require that the calling process has permission to
access the existing file"[1].

This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for a
while.  Otherwise, the change has been undisruptive while in use in Ubuntu
for the last 1.5 years.

This patch is based on the patch in Openwall and grsecurity.  I have added
a sysctl to enable the protected behavior, documentation, and an audit
notification.

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279

[akpm@linux-foundation.org: uninline may_linkat() and audit_log_link_denied()]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Rik van Riel <riel@redhat.com>
Cc: Federica Teodori <federica.teodori@googlemail.com>
Cc: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@redhat.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs: symlink restrictions on sticky directories

A longstanding class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp.  The common method of exploitation of this flaw is
to cross privilege boundaries when following a given symlink (i.e.  a root
process follows a symlink belonging to another user).  For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp

The solution is to permit symlinks to only be followed when outside a
sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.

Some pointers to the history of earlier discussion that I could find:

1996 Aug, Zygo Blaxell
  http://marc.info/?l=bugtraq&m=87602167419830&w=2
1996 Oct, Andrew Tridgell
  http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
1997 Dec, Albert D Cahalan
  http://lkml.org/lkml/1997/12/16/4
2005 Feb, Lorenzo Hernández García-Hierro
  http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
2010 May, Kees Cook
  https://lkml.org/lkml/2010/5/30/144

Past objections and rebuttals could be summarized as:

- Violates POSIX.
   - POSIX didn't consider this situation and it's not useful to follow
     a broken specification at the cost of security.
- Might break unknown applications that use this feature.
   - Applications that break because of the change are easy to spot and
     fix. Applications that are vulnerable to symlink ToCToU by not having
     the change aren't. Additionally, no applications have yet been found
     that rely on this behavior.
- Applications should just use mkstemp() or O_CREATE|O_EXCL.
   - True, but applications are not perfect, and new software is written
     all the time that makes these mistakes; blocking this flaw at the
     kernel is a single solution to the entire class of vulnerability.
- This should live in the core VFS.
   - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
- This should live in an LSM.
   - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)

This patch is based on the patch in Openwall and grsecurity, along with
suggestions from Al Viro.  I have added a sysctl to enable the protected
behavior, documentation, and an audit notification.

[akpm@linux-foundation.org: move sysctl_protected_sticky_symlinks declaration into .h]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Rik van Riel <riel@redhat.com>
Cc: Federica Teodori <federica.teodori@googlemail.com>
Cc: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@redhat.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

vfs: increment iversion when a file is truncated

When a file is truncated with truncate()/ftruncate() and then closed,
iversion is not updated.  This patch uses ATTR_SIZE flag as an indication
to increment iversion.

Mimi said:

On fput(), i_version is used to detect and flag files that have changed
and need to be re-measured in the IMA measurement policy.  When a file
is truncated with truncate()/ftruncate() and then closed, i_version is
not updated.  As a result, although the file has changed, it will not be
re-measured and added to the IMA measurement list on subsequent access.

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@intel.com>
Acked-by: Mimi Zohar <zohar@us.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/scsi/atp870u.c: fix bad use of udelay

The ACARD driver calls udelay() with a value > 2000, which leads to
to the following compilation error on ARM:
  ERROR: "__bad_udelay" [drivers/scsi/atp870u.ko] undefined!
  make[1]: *** [__modpost] Error 1

This is because udelay is defined on ARM, roughly speaking, as

#define udelay(n) ((n) > 2000 ? __bad_udelay() : \
__const_udelay((n) * ((2199023U*HZ)>>11)))

The argument to __const_udelay is the number of jiffies to wait divided by
4, but this does not work unless the multiplication does not overflow, and
that is what the build error is designed to prevent.  The intended
behavior can be achieved by using mdelay to call udelay multiple times in
a loop.

[jn: adding context]
Signed-off-by: Martin Michlmayr <tbm@cyrius.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/scsi/ufs: fix incorrect return value about SUCCESS and FAILED

Currently the UFS host driver has returned incorrect values for SUCCESS
and FAILED. Fix it to return the correct value to the upper layer.

Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Acked-by: Santosh Y <santoshsy@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/scsi/ufs: fix evaluation of task_failed status

Else FAILED would be set even if task_result was originally equal to
UPIU_TASK_MANAGEMENT_FUNC_SUCCEEDED.

Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Venkatraman S <svenkatr@ti.com>
Reviewed-by: Namjae Jeon <linkinjeon@gmail.com>
Acked-by: Santosh Y <santoshsy@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/scsi/ufs: reverse the ufshcd_is_device_present logic

Otherwise it counter intuitively returns 0 if device is present.

Signed-off-by: Venkatraman S <svenkatr@ti.com>
Reviewed-by: Namjae Jeon <linkinjeon@gmail.com>
Acked-by: Santosh Y <santoshsy@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/scsi/ufs: use module_pci_driver

Use macro module_pci_driver and get rid of boilerplate code. No
functional changes.

Signed-off-by: Venkatraman S <svenkatr@ti.com>
Acked-by: Santosh Y <santoshsy@gmail.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: use bitmap_weight()

Use bitmap_weight() instead of reinventing the wheel.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: use find_last_bit()

We already have find_last_bit(). So just use it as described in the
comment.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

blackfin: use set_current_blocked() and block_sigmask()

As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.

Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

connector/userns: replace netlink uses of cap_raised with capable.

In 2009 Philip Reiser notied that a few users of netlink connector
interface needed a capability check and added the idiom
cap_raised(nsp->eff_cap, CAP_SYS_ADMIN) to a few of them, on the premise
that netlink was asynchronous.

In 2011 Patrick McHardy noticed we were being silly because netlink is
synchronous and removed eff_cap from the netlink_skb_params and changed
the idiom to cap_raised(current_cap(), CAP_SYS_ADMIN).

Looking at those spots with a fresh eye we should be calling
capable(CAP_SYS_ADMIN). The only reason I can see for not calling capable
is that it once appeared we were not in the same task as the caller which
would have made calling capable() impossible.

In the initial user_namespace the only difference between between
cap_raised(current_cap(), CAP_SYS_ADMIN) and capable(CAP_SYS_ADMIN) are a
few sanity checks and the fact that capable(CAP_SYS_ADMIN) sets
PF_SUPERPRIV if we use the capability.

Since we are going to be using root privilege setting PF_SUPERPRIV seems
the right thing to do.

The motivation for this that patch is that in a child user namespace
cap_raised(current_cap(),...) tests your capabilities with respect to that
child user namespace not capabilities in the initial user namespace and
thus will allow processes that should be unprivielged to use the kernel
services that are only protected with cap_raised(current_cap(),..).

To fix possible user_namespace issues and to just clean up the code
replace cap_raised(current_cap(), CAP_SYS_ADMIN) with
capable(CAP_SYS_ADMIN).

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

unicore32: use block_sigmask()

Use the new helper function introduced in commit 5e6292c0f28f ("signal:
add block_sigmask() for adding sigmask to current->blocked") which
centralises the code for updating current->blocked after successfully
delivering a signal and reduces the amount of duplicate code across
architectures. In the past some architectures got this code wrong, so
using this helper function should stop that from happening again.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/thermal/spear_thermal.c: add Device Tree probing capability

SPEAr platforms now support DT and so must convert all drivers to support
DT. This patch adds DT probing support for SPEAr thermal sensor driver
and updates its documentation too.

Also, as SPEAr is the only user of this driver and is only available with
DT, make this an only DT driver. So, platform_data is completely removed
and passed via DT now.

Signed-off-by: Viresh Kumar <viresh.kumar@st.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Vincenzo Frascino <vincenzo.frascino@st.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

h8300: use set_current_blocked() and block_sigmask()

As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.

Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

score: use set_current_blocked() and block_sigmask()

As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.

Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

score: don't mask signals if we fail to setup signal stack

If setup_rt_frame() returns -EFAULT then we must not block any signals
in the current process.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

microblaze: use set_current_blocked() and block_sigmask()

As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.

Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

microblaze: fix signal masking

There are a couple of problems with the current signal code,

1. If we failed to setup the signal stack frame then we should not be
   masking any signals.

2. ka->sa.sa_mask is only added to the current blocked signals list if
   SA_NODEFER is set in ka->sa.sa_flags.  If we successfully setup the
   signal frame and are going to run the handler then we must honour
   sa_mask.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

microblaze: no need to reset handler if SA_ONESHOT

get_signal_to_deliver() already resets the signal handler if SA_ONESHOT is
set in ka->sa.sa_flags, there's no need to do it again in handle_signal().
Furthermore, because we were modifying ka->sa.sa_handler (which is a copy
of sighand->action[]) instead of sighand->action[] the original code
actually had no effect on signal delivery.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

microblaze: don't reimplement force_sigsegv()

Instead of open coding the sequence from force_sigsegv() just call it.
This also fixes a bug because we were modifying ka->sa.sa_handler (which
is a copy of sighand->action[]), whereas the intention of the code was to
modify sighand->action[] directly.

As the original code was working with a copy it had no effect on signal
delivery.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ia64: use set_current_blocked() and block_sigmask()

As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.

Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

timeconst.pl: remove deprecated defined(@array)

The use of defined() on arrays and hashes has been deprecated since perl
5.6, but until 5.17.6 it only warned on lexicals, not package globals.

Signed-off-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

percpu-remove-percpu_xxx-functions-fix

Cc: Alex Shi <alex.shi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

percpu: remove percpu_xxx() functions

There are no percpu_xxx callers remaining

Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

net: use this_cpu_xxx replace percpu_xxx funcs

percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them
for further code clean up.

And in preempt safe scenario, __this_cpu_xxx funcs has a bit better
performance since __this_cpu_xxx has no redundant preempt_disable()

Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

x86: change percpu_read_stable() to this_cpu_read_stable()

It has no function change. It's a preparation for percpu_xxx serial
function change.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

x86-use-this_cpu_xxx-to-replace-percpu_xxx-funcs-fix

Cc: Alex Shi <alex.shi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

x86: use this_cpu_xxx to replace percpu_xxx funcs

Since percpu_xxx() serial functions are duplicate with this_cpu_xxx().
Removing percpu_xxx() definition and replacing them by this_cpu_xxx() in
code.

And further more, as Christoph Lameter's requirement, I try to use
__this_cpu_xx to replace this_cpu_xxx if it is in preempt safe scenario.
The preempt safe scenarios include:
1, in irq/softirq/nmi handler
2, protected by preempt_disable
3, protected by spin_lock
4, if the code context imply that it is preempt safe, like the code is
follows or be followed a preempt safe code.

BTW, In fact, this_cpu_xxx are same as __this_cpu_xxx since all funcs
implement in a single instruction for x86 machine. But it maybe other
platforms' performance.

[akpm@linux-foundation.org: fix build]
[sfr@canb.auug.org.au: arch/x86/include/asm/desc.h: fix smp_processor_id's need for this_cpu_read]
Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cris: select GENERIC_ATOMIC64

Cris doesn't implement atomic64 operations neither, should select
GENERIC_ATOMIC64.

Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cris: use set_current_blocked() and block_sigmask()

As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.

Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Mikael Starvik <starvik@axis.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cpuidle: add checks to avoid NULL pointer dereference

The existing check for dev == NULL in __cpuidle_register_device() is
rendered useless because dev is dereferenced before the check itself.
Moreover, correctly speaking, it is the job of the callers of this
function, i.e., cpuidle_register_device() & cpuidle_enable_device() (which
also happen to be exported functions) to ensure that
__cpuidle_register_device() is called with a non-NULL dev.

So add the necessary dev == NULL checks in the two callers and remove the
(useless) check from __cpuidle_register_device().

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cpuidle: remove unused hrtimer_peek_ahead_timers() call

cpuidle: remove unused hrtimer_peek_ahead_timers() call

  commit 9a6558371bcd01c2973b7638181db4ccc34eab4f
  Author: Arjan van de Ven <arjan@linux.intel.com>
  Date:   Sun Nov 9 12:45:10 2008 -0800

     regression: disable timer peek-ahead for 2.6.28

     It's showing up as regressions; disabling it very likely just papers
     over an underlying issue, but time is running out for 2.6.28, lets get
     back to this for 2.6.29

Many years has passed since 2008, so it seems ok to remove whole `#if 0' block.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Trinabh Gupta <g.trinabh@gmail.com>
Cc: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mn10300: use set_current_blocked() and block_sigmask()

As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.

Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

m68k: use set_current_blocked() and block_sigmask()

As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.

Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

m32r: use set_current_blocked() and block_sigmask()

As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.

Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kyle McMartin <kyle@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

avr32: use block_sigmask()

Use the new helper function introduced in commit 5e6292c0f28f ("signal:
add block_sigmask() for adding sigmask to current->blocked") which
centralises the code for updating current->blocked after successfully
delivering a signal and reduces the amount of duplicate code across
architectures.

In the past some architectures got this code wrong, so using this helper
function should stop that from happening again.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Havard Skinnemoen <hskinnemoen@gmail.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

avr32: use set_current_blocked() in handle_signal/sys_rt_sigreturn

It is wrong to change ->blocked directly, see e6fa16ab. Change
handle_signal() and sys_rt_sigreturn() to use the right helper,
set_current_blocked().

Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Acked-by: Havard Skinnemoen <hskinnemoen@gmail.com>
Reviewed-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

avr32: don't mask signals in the error path

The current handle_signal() implementation is broken - it will mask
signals if we fail to setup the signal stack frame, which isn't the
desired behaviour, we should only be masking signals if we succeed in
setting up the stack frame. It looks like this code was copied from the
old (broken) arm implementation but wasn't updated when the arm code was
fixed in commit a6c61e9dfdd0 ("[ARM] 3168/1: Update ARM signal delivery
and masking").

Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Acked-by: Havard Skinnemoen <hskinnemoen@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/xen/Kconfig: fix Kconfig layout

Fit it into 80 columns so that it is readable in menuconfig.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

arch/x86/kernel/apic/io_apic.c: move io_apic_level_ack_pending() inside CONFIG_GENERIC_PENDING_IRQ

x86_64 allnoconfig:

arch/x86/kernel/apic/io_apic.c:382: warning: 'io_apic_level_ack_pending' defined but not used

Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

arch/x86/include/asm/spinlock.h: fix comment

This comment is no longer true. We support up to 2^16 CPUs because
__ticket_t is an u16 if NR_CPUS is larger than 256.

Signed-off-by: Richard Weinberger <richard@nod.at>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

intel_mid_powerbtn: mark irq as IRQF_NO_SUSPEND

So that the power button still wakes up the platform.

Signed-off-by: Pierre Tardy <pierre.tardy@intel.com>
Tested-by: Kangkai Yin <kangkai.yin@intel.com>
Tested-by: Yong Wang <yong.y.wang@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

arch/x86/platform/iris/iris.c: register a platform device and a platform driver

This makes the iris driver use the platform API, so it is properly exposed
in /sys.

[akpm@linux-foundation.org: remove commented-out code, add missing space to printk, clean up code layout]
Signed-off-by: Shérab <Sebastien.Hinderer@ens-lyon.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

arch/x86/platform/geode/net5501.c: change active_low to 0 for LED driver

It seems that there was an error with the active_low = 1 for the
LED, since it should be set to 0 (meaning that active is high,
since 0 is false, hence the confusion.

The wiki article about it confuses it, since it contradicts itself,
regarding what turns on the LED.

I have tested 3.4-rc2 on my net5501 with this patch, and it makes the LED
behave correctly, where "none" turns it off, and "default-on" turns it on,
when echoed onto the trigger "file" in /sys/class/leds.

Signed-off-by: Bjarke Istrup Pedersen <gurligebis@gentoo.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Philip Prindeville <philipp@redfish-solutions.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

umem: fix up unplugging

Fix a regression introduced by 7eaceaccab5f40 ("block: remove per-queue
plugging"). In that patch, Jens removed the whole mm_unplug_device()
function, which used to be the trigger to make umem start to work.

We need to implement unplugging to make umem start to work, or I/O
will never be triggered.

Signed-off-by: Tao Guo <Tao.Guo@emc.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/leds: correct __devexit annotations

__devexit functions are discarded without CONFIG_HOTPLUG, so they need to
be referenced carefully. A __devexit function may also not be called from
a __devinit function.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: free spare array to avoid memory leak

When the last event is unregistered, there is no need to keep the spare
array anymore. So free it to avoid memory leak.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

namespaces, pid_ns: fix leakage on fork() failure

Fork() failure post namespace creation for a child cloned with
CLONE_NEWPID leaks pid_namespace/mnt_cache due to proc being mounted
during creation, but not unmounted during cleanup. Call
pid_ns_release_proc() during cleanup.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

hugetlb: prevent BUG_ON in hugetlb_fault() -> hugetlb_cow()

66aebce747eaf ("hugetlb: fix race condition in hugetlb_fault()") added
code to avoid a race condition by elevating the page refcount in
hugetlb_fault() while calling hugetlb_cow().

However, one code path in hugetlb_cow() includes an assertion that the
page count is 1, whereas it may now also have the value 2 in this path.

The consensus is that this BUG_ON has served its purpose, so rather than
extending it to cover both cases, we just remove it.

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@vger.kernel.org> [3.0.29+, 3.2.16+, 3.3.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: fix division by 0 in percpu_pagelist_fraction()

percpu_pagelist_fraction_sysctl_handler() has only considered -EINVAL as a
possible error from proc_dointvec_minmax(). If any other error is
returned, it would proceed to divide by zero since
percpu_pagelist_fraction wasn't getting initialized at any point. For
example, writing 0 bytes into the proc file would trigger the issue.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

proc/pid/pagemap: correctly report non-present ptes and holes between vmas

Reset the current pagemap-entry if the current pte isn't present, or if
current vma is over. Otherwise pagemap reports last entry again and
again.

non-present pte reporting was broken in commit v3.3-3738-g092b50b
("pagemap: introduce data structure for pagemap entry")

reporting for holes was broken in commit v3.3-3734-g5aaabe8
("pagemap: avoid splitting thp when reading /proc/pid/pagemap")

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/gpu/drm/gma500/mdfld_device.c: fix build

drivers/gpu/drm/gma500/mdfld_device.c:675: error: expected '}' before ';' token

Repairs "cdv: continue synching up with updated reference code".

Cc: Alan Cox <alan@linux.intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

pagemap.h: fix warning about possibly used before init var

Commit f56f821feb7b36223f309e0ec05986bb137ce418 (linux-next)

    "mm: extend prefault helpers to fault in more than PAGE_SIZE"

added in the new functions:

fault_in_multipages_writeable
fault_in_multipages_readable

However, we currently see:

  include/linux/pagemap.h:492: warning: 'ret' may be used uninitialized in this function
  include/linux/pagemap.h:492: note: 'ret' was declared here

Unlike a lot of gcc nags, this one appears somewhat legit.  i.e.  passing
in an invalid negative value of "size" does make it look like all the
conditionals in there would be bypassed and the uninitialized value would
be returned.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Merge remote-tracking branch 'dma-mapping/dma-mapping-next'

Conflicts:
arch/x86/include/asm/dma-mapping.h

Merge remote-tracking branch 'ep93xx/ep93xx-cleanup'

Merge remote-tracking branch 'arm-soc/for-next'

Conflicts:
arch/arm/mach-ixp2000/enp2611.c
arch/arm/mach-ixp2000/include/mach/platform.h
arch/arm/mach-ixp2000/ixdp2400.c
arch/arm/mach-ixp2000/ixdp2800.c
arch/arm/mach-ixp2000/ixdp2x01.c
arch/arm/mach-ixp2000/pci.c
arch/arm/mach-ixp23xx/include/mach/platform.h
arch/arm/mach-ixp23xx/ixdp2351.c
arch/arm/mach-ixp23xx/pci.c
arch/arm/mach-ixp23xx/roadrunner.c
arch/arm/mach-lpc32xx/common.c
drivers/mfd/ab5500-core.c
lib/Makefile

Merge remote-tracking branch 'gpio/gpio/next'

Merge remote-tracking branch 'modem-shm/for-next'

Conflicts:
drivers/Kconfig

Merge remote-tracking branch 'vhost/linux-next'

Conflicts:
drivers/net/virtio_net.c

Merge remote-tracking branch 'moduleh/for-sfr'

Merge remote-tracking branch 'tegra/for-next'

Merge remote-tracking branch 'pinctrl/for-next'

Conflicts:
Documentation/driver-model/devres.txt
drivers/pinctrl/core.c

Merge remote-tracking branch 'writeback/writeback-for-next'

Merge remote-tracking branch 'tmem/linux-next'

Merge remote-tracking branch 'char-misc/char-misc-next'

Merge remote-tracking branch 'staging/staging-next'

Conflicts:
drivers/Kconfig
drivers/Makefile
drivers/staging/android/Kconfig
drivers/staging/android/Makefile
drivers/staging/line6/driver.c

Merge remote-tracking branch 'usb/usb-next'

Conflicts:
drivers/input/tablet/wacom_wac.c
drivers/usb/core/inode.c
drivers/usb/host/ehci-tegra.c

Merge remote-tracking branch 'tty/tty-next'

Merge remote-tracking branch 'driver-core/driver-core-next'

Merge remote-tracking branch 'regmap/for-next'

Conflicts:
drivers/base/regmap/regmap.c

Merge remote-tracking branch 'drivers-x86/linux-next'

Merge remote-tracking branch 'workqueues/for-next'

Merge remote-tracking branch 'percpu/for-next'

Merge remote-tracking branch 'xen-two/linux-next'

Merge remote-tracking branch 'kvm-ppc/kvm-ppc-next'

Merge remote-tracking branch 'kvm/linux-next'

Conflicts:
Documentation/feature-removal-schedule.txt

Merge remote-tracking branch 'kmemleak/kmemleak'

Merge remote-tracking branch 'rcu/rcu/next'

Merge remote-tracking branch 'tip/auto-latest'

Conflicts:
arch/x86/Kconfig

Merge remote-tracking branch 'spi/spi/next'

Merge remote-tracking branch 'edac-amd/for-next'

Merge remote-tracking branch 'fsnotify/for-next'

Merge remote-tracking branch 'pm/linux-next'