lib/genalloc.c: fix overflow of ending address of memory chunk
In struct gen_pool_chunk, end_addr means the end address of memory chunk
(inclusive), but in the implementation it is treated as address + size of
memory chunk (exclusive), so it points to the address plus one instead of
correct ending address.
The ending address of memory chunk plus one will cause overflow on the
memory chunk including the last address of memory map, e.g. when starting
address is 0xFFF00000 and size is 0x100000 on 32bit machine, ending
address will be 0x100000000.
Use correct ending address like starting address + size - 1.
[akpm@linux-foundation.org: add comment to struct gen_pool_chunk:end_addr] Signed-off-by: Joonyoung Shim <jy0922.shim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/hotplug: remove unnecessary BUG_ON in __offline_pages()
I think we can remove "BUG_ON(start_pfn >= end_pfn)" in __offline_pages(),
because in memory_block_action() "nr_pages = PAGES_PER_SECTION * sections_per_block"
is always greater than 0.
In v2.6.32, If info->length==0, this way may hit this BUG_ON().
acpi_memory_disable_device()
remove_memory(info->start_addr, info->length)
offline_pages()
A later Fujitsu patch renamed this function and the BUG_ON() is
unnecessary.
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 11 Sep 2013 21:21:40 +0000 (14:21 -0700)]
mm, vmalloc: use well-defined find_last_bit() func
Our intention in here is to find last_bit within the region to flush.
There is well-defined function, find_last_bit() for this purpose and its
performance may be slightly better than current implementation. So change
it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmstat: create separate function to fold per cpu diffs into local counters
The main idea behind this patchset is to reduce the vmstat update overhead
by avoiding interrupt enable/disable and the use of per cpu atomics.
This patch (of 3):
It is better to have a separate folding function because
refresh_cpu_vm_stats() also does other things like expire pages in the
page allocator caches.
If we have a separate function then refresh_cpu_vm_stats() is only called
from the local cpu which allows additional optimizations.
The folding function is only called when a cpu is being downed and
therefore no other processor will be accessing the counters. Also
simplifies synchronization.
[akpm@linux-foundation.org: fix UP build] Signed-off-by: Christoph Lameter <cl@linux.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 11 Sep 2013 21:21:26 +0000 (14:21 -0700)]
mm, page_alloc: add unlikely macro to help compiler optimization
We rarely allocate a page with ALLOC_NO_WATERMARKS and it is used in slow
path. For helping compiler optimization, add unlikely macro to
ALLOC_NO_WATERMARKS checking.
This patch doesn't have any effect now, because gcc already optimize this
properly. But we cannot assume that gcc always does right and nobody
re-evaluate if gcc do proper optimization with their change, for example,
it is not optimized properly on v3.10. So adding compiler hint here is
reasonable.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 11 Sep 2013 21:21:18 +0000 (14:21 -0700)]
mm, hugetlb: decrement reserve count if VM_NORESERVE alloc page cache
If a vma with VM_NORESERVE allocate a new page for page cache, we should
check whether this area is reserved or not. If this address is already
reserved by other process(in case of chg == 0), we should decrement
reserve count, because this allocated page will go into page cache and
currently, there is no way to know that this page comes from reserved pool
or not when releasing inode. This may introduce over-counting problem to
reserved count. With following example code, you can easily reproduce
this situation.
Assume 2MB, nr_hugepages = 100
size = 20 * MB;
flag = MAP_SHARED;
p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
if (p == MAP_FAILED) {
fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
return -1;
}
After finish the program, run 'cat /proc/meminfo'. You can see below
result.
HugePages_Free: 100
HugePages_Rsvd: 1
To fix this, we should check our mapping type and tracked region. If our
mapping is VM_NORESERVE, VM_MAYSHARE and chg is 0, this imply that current
allocated page will go into page cache which is already reserved region
when mapping is created. In this case, we should decrease reserve count.
As implementing above, this patch solve the problem.
[akpm@linux-foundation.org: fix spelling in comment] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 11 Sep 2013 21:21:07 +0000 (14:21 -0700)]
mm, hugetlb: remove decrement_hugepage_resv_vma()
Now, Checking condition of decrement_hugepage_resv_vma() and
vma_has_reserves() is same, so we can clean-up this function with
vma_has_reserves(). Additionally, decrement_hugepage_resv_vma() has only
one call site, so we can remove function and embed it into
dequeue_huge_page_vma() directly. This patch implement it.
Joonsoo Kim [Wed, 11 Sep 2013 21:21:06 +0000 (14:21 -0700)]
mm, hugetlb: add VM_NORESERVE check in vma_has_reserves()
If we map the region with MAP_NORESERVE and MAP_SHARED, we can skip to
check reserve counting and eventually we cannot be ensured to allocate a
huge page in fault time. With following example code, you can easily find
this situation.
Assume 2MB, nr_hugepages = 100
fd = hugetlbfs_unlinked_fd();
if (fd < 0)
return 1;
size = 200 * MB;
flag = MAP_SHARED;
p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
if (p == MAP_FAILED) {
fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
return -1;
}
During executing sleep(10), run 'cat /proc/meminfo' on another process.
HugePages_Free: 99
HugePages_Rsvd: 100
Number of free should be higher or equal than number of reserve, but this
aren't. This represent that non reserved shared mapping steal a reserved
page. Non reserved shared mapping should not eat into reserve space.
If we consider VM_NORESERVE in vma_has_reserve() and return 0 which mean
that we don't have reserved pages, then we check that we have enough free
pages in dequeue_huge_page_vma(). This prevent to steal a reserved page.
With this change, above test generate a SIGBUG which is correct, because
all free pages are reserved and non reserved shared mapping can't get a
free page.
Joonsoo Kim [Wed, 11 Sep 2013 21:21:04 +0000 (14:21 -0700)]
mm, hugetlb: do not use a page in page cache for cow optimization
Currently, we use a page with mapped count 1 in page cache for cow
optimization. If we find this condition, we don't allocate a new page and
copy contents. Instead, we map this page directly. This may introduce a
problem that writting to private mapping overwrite hugetlb file directly.
You can find this situation with following code.
We can see that "AFTER STEAL PRIVATE WRITE: c", not "AFTER STEAL PRIVATE
WRITE: s". If we turn off this optimization to a page in page cache, the
problem is disappeared.
So, I change the trigger condition of optimization. If this page is not
AnonPage, we don't do optimization. This makes this optimization turning
off for a page cache.
Joonsoo Kim [Wed, 11 Sep 2013 21:21:00 +0000 (14:21 -0700)]
mm, hugetlb: fix and clean-up node iteration code to alloc or free
Current node iteration code have a minor problem which do one more node
rotation if we can't succeed to allocate. For example, if we start to
allocate at node 0, we stop to iterate at node 0. Then we start to
allocate at node 1 for next allocation.
I introduce new macros "for_each_node_mask_to_[alloc|free]" and fix and
clean-up node iteration code to alloc or free. This makes code more
understandable.
Joonsoo Kim [Wed, 11 Sep 2013 21:20:50 +0000 (14:20 -0700)]
mm, hugetlb: move up the code which check availability of free huge page
In this time we are holding a hugetlb_lock, so hstate values can't be
changed. If we don't have any usable free huge page in this time, we
don't need to proceed with the processing. So move this code up.
Johannes Weiner [Wed, 11 Sep 2013 21:20:49 +0000 (14:20 -0700)]
mm: revert "page-writeback.c: subtract min_free_kbytes from dirtyable memory"
This reverts commit 75f7ad8e043d. It was the result of a problem
observed with a 3.2 kernel and merged in 3.9, while the issue had been
resolved upstream in 3.3 (commit ab8fabd46f81: "mm: exclude reserved
pages from dirtyable memory").
The "reserved pages" are a superset of min_free_kbytes, thus this change
is redundant and confusing. Revert it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Paul Szabo <psz@maths.usyd.edu.au> Cc: Rik van Riel <riel@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 11 Sep 2013 21:20:47 +0000 (14:20 -0700)]
mm: page_alloc: fair zone allocator policy
Each zone that holds userspace pages of one workload must be aged at a
speed proportional to the zone size. Otherwise, the time an individual
page gets to stay in memory depends on the zone it happened to be
allocated in. Asymmetry in the zone aging creates rather unpredictable
aging behavior and results in the wrong pages being reclaimed, activated
etc.
But exactly this happens right now because of the way the page allocator
and kswapd interact. The page allocator uses per-node lists of all zones
in the system, ordered by preference, when allocating a new page. When
the first iteration does not yield any results, kswapd is woken up and the
allocator retries. Due to the way kswapd reclaims zones below the high
watermark while a zone can be allocated from when it is above the low
watermark, the allocator may keep kswapd running while kswapd reclaim
ensures that the page allocator can keep allocating from the first zone in
the zonelist for extended periods of time. Meanwhile the other zones
rarely see new allocations and thus get aged much slower in comparison.
The result is that the occasional page placed in lower zones gets
relatively more time in memory, even gets promoted to the active list
after its peers have long been evicted. Meanwhile, the bulk of the
working set may be thrashing on the preferred zone even though there may
be significant amounts of memory available in the lower zones.
Even the most basic test -- repeatedly reading a file slightly bigger than
memory -- shows how broken the zone aging is. In this scenario, no single
page should be able stay in memory long enough to get referenced twice and
activated, but activation happens in spades:
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 0
nr_active_file 8
nr_inactive_file 1582
nr_active_file 11994
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 70
nr_inactive_file 258753
nr_active_file 443214
nr_inactive_file 149793
nr_active_file 12021
Fix this with a very simple round robin allocator. Each zone is allowed a
batch of allocations that is proportional to the zone's size, after which
it is treated as full. The batch counters are reset when all zones have
been tried and the allocator enters the slowpath and kicks off kswapd
reclaim. Allocation and reclaim is now fairly spread out to all
available/allowable zones:
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 174
nr_active_file 4865
nr_inactive_file 53
nr_active_file 860
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 666622
nr_active_file 4988
nr_inactive_file 190969
nr_active_file 937
When zone_reclaim_mode is enabled, allocations will now spread out to all
zones on the local node, not just the first preferred zone (which on a 4G
node might be a tiny Normal zone).
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paul Bolle <paul.bollee@gmail.com> Cc: Zlatko Calusic <zcalusic@bitsync.net> Tested-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 11 Sep 2013 21:20:46 +0000 (14:20 -0700)]
mm: page_alloc: rearrange watermark checking in get_page_from_freelist
Allocations that do not have to respect the watermarks are rare
high-priority events. Reorder the code such that per-zone dirty limits
and future checks important only to regular page allocations are ignored
in these extraordinary situations.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paul Bolle <paul.bollee@gmail.com> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 11 Sep 2013 21:20:39 +0000 (14:20 -0700)]
mm: vmscan: fix numa reclaim balance problem in kswapd
The way the page allocator interacts with kswapd creates aging imbalances,
where the amount of time a userspace page gets in memory under reclaim
pressure is dependent on which zone, which node the allocator took the
page frame from.
#1 fixes missed kswapd wakeups on NUMA systems, which lead to some
nodes falling behind for a full reclaim cycle relative to the other
nodes in the system
#3 fixes an interaction where kswapd and a continuous stream of page
allocations keep the preferred zone of a task between the high and
low watermark (allocations succeed + kswapd does not go to sleep)
indefinitely, completely underutilizing the lower zones and
thrashing on the preferred zone
These patches are the aging fairness part of the thrash-detection based
file LRU balancing. Andrea recommended to submit them separately as they
are bugfixes in their own right.
The following test ran a foreground workload (memcachetest) with
background IO of various sizes on a 4 node 8G system (similar results were
observed with single-node 4G systems):
BAS FAIRALLO
BASE FAIRALLOC
User 287.78 460.97
System 2151.67 3142.51
Elapsed 9737.00 8879.34
BAS FAIRALLO
BASE FAIRALLOC
Minor Faults 5372192557188551
Major Faults 392195 15157
Swap Ins 2994854 112770
Swap Outs 4907092 134982
Direct pages scanned 0 41824
Kswapd pages scanned 329750638128269
Kswapd pages reclaimed 63230697093495
Direct pages reclaimed 0 41824
Kswapd efficiency 19% 87%
Kswapd velocity 3386.573 915.414
Direct efficiency 100% 100%
Direct velocity 0.000 4.710
Percentage direct scans 0% 0%
Zone normal velocity 2011.338 550.661
Zone dma32 velocity 1365.623 369.221
Zone dma velocity 9.612 0.242
Page writes by reclaim 18732404.000 614807.000
Page writes file 13825312 479825
Page writes anon 4907092 134982
Page reclaim immediate 85490 5647
Sector Reads 12080532 483244
Sector Writes 8874050865438876
Page rescued immediate 0 0
Slabs scanned 82560 12160
Direct inode steals 0 0
Kswapd inode steals 24401 40013
Kswapd skipped wait 0 0
THP fault alloc 6 8
THP collapse alloc 5481 5812
THP splits 75 22
THP fault fallback 0 0
THP collapse fail 0 0
Compaction stalls 0 54
Compaction success 0 45
Compaction failures 0 9
Page migrate success 881492 82278
Page migrate failure 0 0
Compaction pages isolated 0 60334
Compaction migrate scanned 0 53505
Compaction free scanned 0 1537605
Compaction cost 914 86
NUMA PTE updates 4673823141988419
NUMA hint faults 3117556424213387
NUMA hint local faults 104273936411593
NUMA pages migrated 881492 55344
AutoNUMA cost 156221 121361
The overall runtime was reduced, throughput for both the foreground
workload as well as the background IO improved, major faults, swapping and
reclaim activity shrunk significantly, reclaim efficiency more than
quadrupled.
This patch:
When the page allocator fails to get a page from all zones in its given
zonelist, it wakes up the per-node kswapds for all zones that are at their
low watermark.
However, with a system under load the free pages in a zone can fluctuate
enough that the allocation fails but the kswapd wakeup is also skipped
while the zone is still really close to the low watermark.
When one node misses a wakeup like this, it won't be aged before all the
other node's zones are down to their low watermarks again. And skipping a
full aging cycle is an obvious fairness problem.
Kswapd runs until the high watermarks are restored, so it should also be
woken when the high watermarks are not met. This ages nodes more equally
and creates a safety margin for the page counter fluctuation.
By using zone_balanced(), it will now check, in addition to the watermark,
if compaction requires more order-0 pages to create a higher order page.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paul Bolle <paul.bollee@gmail.com> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In collapse_huge_page() there is a race window between releasing the
mmap_sem read lock and taking the mmap_sem write lock, so find_vma() may
return NULL. So check the return value to avoid NULL pointer dereference.
mm/page_alloc.c: fix the value of fallback_migratetype in alloc_extfrag tracepoint()
In the current code, the value of fallback_migratetype that is printed
using the mm_page_alloc_extfrag tracepoint, is the value of the
migratetype *after* it has been set to the preferred migratetype (if the
ownership was changed). Obviously that wouldn't have been the original
intent. (We already have a separate 'change_ownership' field to tell
whether the ownership of the pageblock was changed from the
fallback_migratetype to the preferred type.)
The intent of the fallback_migratetype field is to show the migratetype
from which we borrowed pages in order to satisfy the allocation request.
So fix the code to print that value correctly.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan@kernel.org> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/page_allo.c: restructure free-page stealing code and fix a bug
The free-page stealing code in __rmqueue_fallback() is somewhat hard to
follow, and has an incredible amount of subtlety hidden inside!
First off, there is a minor bug in the reporting of change-of-ownership of
pageblocks. Under some conditions, we try to move upto
'pageblock_nr_pages' no. of pages to the preferred allocation list. But
we change the ownership of that pageblock to the preferred type only if we
manage to successfully move atleast half of that pageblock (or if
page_group_by_mobility_disabled is set).
However, the current code ignores the latter part and sets the
'migratetype' variable to the preferred type, irrespective of whether we
actually changed the pageblock migratetype of that block or not. So, the
page_alloc_extfrag tracepoint can end up printing incorrect info (i.e.,
'change_ownership' might be shown as 1 when it must have been 0).
So fixing this involves moving the update of the 'migratetype' variable to
the right place. But looking closer, we observe that the 'migratetype'
variable is used subsequently for checks such as "is_migrate_cma()".
Obviously the intent there is to check if the *fallback* type is
MIGRATE_CMA, but since we already set the 'migratetype' variable to
start_migratetype, we end up checking if the *preferred* type is
MIGRATE_CMA!!
To make things more interesting, this actually doesn't cause a bug in
practice, because we never change *anything* if the fallback type is CMA.
So, restructure the code in such a way that it is trivial to understand
what is going on, and also fix the above mentioned bug. And while at it,
also add a comment explaining the subtlety behind the migratetype used in
the call to expand().
[akpm@linux-foundation.org: remove unneeded `inline', small coding-style fix] Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan@kernel.org> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shaohua Li [Wed, 11 Sep 2013 21:20:32 +0000 (14:20 -0700)]
swap: make cluster allocation per-cpu
swap cluster allocation is to get better request merge to improve
performance. But the cluster is shared globally, if multiple tasks are
doing swap, this will cause interleave disk access. While multiple tasks
swap is quite common, for example, each numa node has a kswapd thread
doing swap and multiple threads/processes doing direct page reclaim.
ioscheduler can't help too much here, because tasks don't send swapout IO
down to block layer in the meantime. Block layer does merge some IOs, but
a lot not, depending on how many tasks are doing swapout concurrently. In
practice, I've seen a lot of small size IO in swapout workloads.
We makes the cluster allocation per-cpu here. The interleave disk access
issue goes away. All tasks swapout to their own cluster, so swapout will
become sequential, which can be easily merged to big size IO. If one CPU
can't get its per-cpu cluster (for example, there is no free cluster
anymore in the swap), it will fallback to scan swap_map. The CPU can
still continue swap. We don't need recycle free swap entries of other
CPUs.
In my test (swap to a 2-disk raid0 partition), this improves around 10%
swapout throughput, and request size is increased significantly.
How does this impact swap readahead is uncertain though. On one side,
page reclaim always isolates and swaps several adjancent pages, this will
make page reclaim write the pages sequentially and benefit readahead. On
the other side, several CPU write pages interleave means the pages don't
live _sequentially_ but relatively _near_. In the per-cpu allocation
case, if adjancent pages are written by different cpus, they will live
relatively _far_. So how this impacts swap readahead depends on how many
pages page reclaim isolates and swaps one time. If the number is big,
this patch will benefit swap readahead. Of course, this is about
sequential access pattern. The patch has no impact for random access
pattern, because the new cluster allocation algorithm is just for SSD.
Alternative solution is organizing swap layout to be per-mm instead of
this per-cpu approach. In the per-mm layout, we allocate a disk range for
each mm, so pages of one mm live in swap disk adjacently. per-mm layout
has potential issues of lock contention if multiple reclaimers are swap
pages from one mm. For a sequential workload, per-mm layout is better to
implement swap readahead, because pages from the mm are adjacent in disk.
But per-cpu layout isn't very bad in this workload, as page reclaim always
isolates and swaps several pages one time, such pages will still live in
disk sequentially and readahead can utilize this. For a random workload,
per-mm layout isn't beneficial of request merge, because it's quite
possible pages from different mm are swapout in the meantime and IO can't
be merged in per-mm layout. while with per-cpu layout we can merge
requests from any mm. Considering random workload is more popular in
workloads with swap (and per-cpu approach isn't too bad for sequential
workload too), I'm choosing per-cpu layout.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shaohua Li [Wed, 11 Sep 2013 21:20:31 +0000 (14:20 -0700)]
swap: fix races exposed by swap discard
The previous patch can expose races, according to Hugh:
swapoff was sometimes failing with "Cannot allocate memory", coming from
try_to_unuse()'s -ENOMEM: it needs to allow for swap_duplicate() failing
on a free entry temporarily SWAP_MAP_BAD while being discarded.
We should use ACCESS_ONCE() there, and whenever accessing swap_map
locklessly; but rather than peppering it throughout try_to_unuse(), just
declare *swap_map with volatile.
try_to_unuse() is accustomed to *swap_map going down racily, but not
necessarily to it jumping up from 0 to SWAP_MAP_BAD: we'll be safer to
prevent that transition once SWP_WRITEOK is switched off, when it's a
waste of time to issue discards anyway (swapon can do a whole discard).
Another issue is:
In swapin_readahead(), read_swap_cache_async() can read a bad swap entry,
because we don't check if readahead swap entry is bad. This doesn't break
anything but such swapin page is wasteful and can only be freed at page
reclaim. We should avoid read such swap entry. And in discard, we mark
swap entry SWAP_MAP_BAD and then switch it to normal when discard is
finished. If readahead reads such swap entry, we have the same issue, so
we much check if swap entry is bad too.
Thanks Hugh to inspire swapin_readahead could use bad swap entry.
[include Hugh's patch 'swap: fix swapoff ENOMEMs from discard'] Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shaohua Li [Wed, 11 Sep 2013 21:20:30 +0000 (14:20 -0700)]
swap: make swap discard async
swap can do cluster discard for SSD, which is good, but there are some
problems here:
1. swap do the discard just before page reclaim gets a swap entry and
writes the disk sectors. This is useless for high end SSD, because an
overwrite to a sector implies a discard to original sector too. A
discard + overwrite == overwrite.
2. the purpose of doing discard is to improve SSD firmware garbage
collection. Idealy we should send discard as early as possible, so
firmware can do something smart. Sending discard just after swap entry
is freed is considered early compared to sending discard before write.
Of course, if workload is already bound to gc speed, sending discard
earlier or later doesn't make
3. block discard is a sync API, which will delay scan_swap_map()
significantly.
4. Write and discard command can be executed parallel in PCIe SSD.
Making swap discard async can make execution more efficiently.
This patch makes swap discard async and moves discard to where swap entry
is freed. Discard and write have no dependence now, so above issues can
be avoided. Idealy we should do discard for any freed sectors, but some
SSD discard is very slow. This patch still does discard for a whole
cluster.
My test does a several round of 'mmap, write, unmap', which will trigger a
lot of swap discard. In a fusionio card, with this patch, the test
runtime is reduced to 18% of the time without it, so around 5.5x faster.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shaohua Li [Wed, 11 Sep 2013 21:20:28 +0000 (14:20 -0700)]
swap: change block allocation algorithm for SSD
I'm using a fast SSD to do swap. scan_swap_map() sometimes uses up to
20~30% CPU time (when cluster is hard to find, the CPU time can be up to
80%), which becomes a bottleneck. scan_swap_map() scans a byte array to
search a 256 page cluster, which is very slow.
Here I introduced a simple algorithm to search cluster. Since we only
care about 256 pages cluster, we can just use a counter to track if a
cluster is free. Every 256 pages use one int to store the counter. If
the counter of a cluster is 0, the cluster is free. All free clusters
will be added to a list, so searching cluster is very efficient. With
this, scap_swap_map() overhead disappears.
This might help low end SD card swap too. Because if the cluster is
aligned, SD firmware can do flash erase more efficiently.
We only enable the algorithm for SSD. Hard disk swap isn't fast enough
and has downside with the algorithm which might introduce regression (see
below).
The patch slightly changes which cluster is choosen. It always adds free
cluster to list tail. This can help wear leveling for low end SSD too.
And if no cluster found, the scan_swap_map() will do search from the end
of last cluster. So if no cluster found, the scan_swap_map() will do
search from the end of last free cluster, which is random. For SSD, this
isn't a problem at all.
Another downside is the cluster must be aligned to 256 pages, which will
reduce the chance to find a cluster. I would expect this isn't a big
problem for SSD because of the non-seek penality. (And this is the reason
I only enable the algorithm for SSD).
Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chen Gang [Wed, 11 Sep 2013 21:20:27 +0000 (14:20 -0700)]
mm/page_alloc.c: use '__paginginit' instead of '__init'
set_pageblock_order() may be called when memory hotplug, so need use
'__paginginit' instead of '__init'.
The related warning:
The function __meminit .free_area_init_node() references
a function __init .set_pageblock_order().
If .set_pageblock_order is only used by .free_area_init_node then
annotate .set_pageblock_order with a matching annotation.
Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jerry Zhou [Wed, 11 Sep 2013 21:20:26 +0000 (14:20 -0700)]
mm: fix negative left shift count when PAGE_SHIFT > 20
When PAGE_SHIFT > 20, the result of "20 - PAGE_SHIFT" is negative. The
previous calculating here will generate an unexpected result. In
addition, if PAGE_SIZE >= 1MB, The memory size of "numentries" was
already integral multiple of 1MB.
Signed-off-by: Jerry Zhou <uulinux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Wed, 11 Sep 2013 21:20:24 +0000 (14:20 -0700)]
mm: vmstats: track TLB flush stats on UP too
The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
for SMP.
UP systems do not do remote TLB flushes, so compile those counters out on
UP.
arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly. This is
probably an optimization since both the mtrr code and __flush_tlb() write
cr4. It would probably be safe to make that a flush_tlb_all() (and then
get these statistics), but the mtrr code is ancient and I'm hesitant to
touch it other than to just stick in the counters.
[akpm@linux-foundation.org: tweak comments] Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Wed, 11 Sep 2013 21:20:23 +0000 (14:20 -0700)]
mm: vmstats: tlb flush counters
I was investigating some TLB flush scaling issues and realized that we do
not have any good methods for figuring out how many TLB flushes we are
doing.
It would be nice to be able to do these in generic code, but the
arch-independent calls don't explicitly specify whether we actually need
to do remote flushes or not. In the end, we really need to know if we
actually _did_ global vs. local invalidations, so that leaves us with few
options other than to muck with the counters from arch-specific code.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/zswap.c: get swapper address_space by using macro
There is a proper macro to get the corresponding swapper address space
from a swap entry. Instead of directly accessing "swapper_spaces" array,
use the "swap_address_space" macro.
Signed-off-by: Sunghan Suh <sunghan.suh@samsung.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: mmap_region: kill correct_wcount/inode, use allow_write_access()
correct_wcount and inode in mmap_region() just complicate the code. This
boolean was needed previously, when deny_write_access() was called before
vma_merge(), now we can simply check VM_DENYWRITE and do
allow_write_access() if it is set.
allow_write_access() checks file != NULL, so this is safe even if it was
possible to use VM_DENYWRITE && !file. Just we need to ensure we use the
same file which was deny_write_access()'ed, so the patch also moves "file
= vma->vm_file" down after allow_write_access().
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Colin Cross <ccross@android.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: shift VM_GROWS* check from mmap_region() to do_mmap_pgoff()
mmap() doesn't allow the non-anonymous mappings with VM_GROWS* bit set.
In particular this means that mmap_region()->vma_merge(file, vm_flags)
must always fail if "vm_flags & VM_GROWS" is set incorrectly.
So it does not make sense to check VM_GROWS* after we already allocated
the new vma, the only caller, do_mmap_pgoff(), which can pass this flag
can do the check itself.
And this looks a bit more correct, mmap_region() already unmapped the
old mapping at this stage. But if mmap() is going to fail, it should
avoid do_munmap() if possible.
Note: we check VM_GROWS at the end to ensure that do_mmap_pgoff() won't
return EINVAL in the case when it currently returns another error code.
Vladimir Cernov [Wed, 11 Sep 2013 21:20:15 +0000 (14:20 -0700)]
mm/madvise.c: fix coding-style errors
This fixes following errors:
- ERROR: "(foo*)" should be "(foo *)"
- ERROR: "foo ** bar" should be "foo **bar"
Signed-off-by: Vladimir Cernov <gg.kaspersky@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: mempolicy: turn vma_set_policy() into vma_dup_policy()
Simple cleanup. Every user of vma_set_policy() does the same work, this
looks a bit annoying imho. And the new trivial helper which does
mpol_dup() + vma_set_policy() to simplify the callers.
The driver core clears the driver data to NULL after device_release or
on probe failure. Thus, it is not needed to manually clear the device
driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Jean Delvare <khali@linux-fr.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Miller [Wed, 11 Sep 2013 21:20:11 +0000 (14:20 -0700)]
cciss: set max scatter gather entries to 32 on P600
At one time we used to set the maximum number of scatter gather elements
on all Smart Array controllers to 32. At some point in time the
firmware began to write the "appropriate" value for each controller into
the config table. The cciss driver would then read that and set
h->maxsgentries.
On the P600 that value is 544. Under some workloads a significant
performance reduction may result. This patch forces the P600 to use
only 32 scatter gather elements. Other controllers are not affected.
Signed-off-by: Mike Miller <mike.miller@hp.com> Signed-off-by: Dwight (Bud) Brown <bubrown@redhat.com> Signed-off-by: Tomas Henzl <thenzl@redhat.com> Acked-by: Stephen M. Cameron <steve.cameron@hp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cai Zhiyong [Wed, 11 Sep 2013 21:20:09 +0000 (14:20 -0700)]
block: support embedded device command line partition
Read block device partition table from command line. The partition used
for fixed block device (eMMC) embedded device. It is no MBR, save
storage space. Bootloader can be easily accessed by absolute address of
data on the block device. Users can easily change the partition.
This code reference MTD partition, source "drivers/mtd/cmdlinepart.c"
About the partition verbose reference
"Documentation/block/cmdline-partition.txt"
[akpm@linux-foundation.org: fix printk text]
[yongjun_wei@trendmicro.com.cn: fix error return code in parse_parts()] Signed-off-by: Cai Zhiyong <caizhiyong@huawei.com> Cc: Karel Zak <kzak@redhat.com> Cc: "Wanglin (Albert)" <albert.wanglin@huawei.com> Cc: Marius Groeger <mag@sysgo.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Brian Norris <computersforpeace@gmail.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jie Liu [Wed, 11 Sep 2013 21:20:05 +0000 (14:20 -0700)]
ocfs2: fix the end cluster offset of FIEMAP
Call fiemap ioctl(2) with given start offset as well as an desired mapping
range should show extents if possible. However, we somehow figure out the
end offset of mapping via 'mapping_end -= cpos' before iterating the
extent records which would cause problems if the given fiemap length is
too small to a cluster size, e.g,
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reported-by: David Weber <wb@munzinger.de> Tested-by: David Weber <wb@munzinger.de> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: Mark Fashen <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Wed, 11 Sep 2013 21:20:04 +0000 (14:20 -0700)]
ocfs2: remove unused variable ip in dlmfs_get_root_inode()
Variable ip in dlmfs_get_root_inode() is defined but not used. So clean
it up.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In o2hb_shutdown_slot() and o2hb_check_slot(), since event is defined as
local, it is only valid during the call stack. So the following tiny race
case may happen in a multi-volumes mounted environment:
o2hb-vol1 o2hb-vol2
1) o2hb_shutdown_slot
allocate local event1
2) queue_node_event
add event1 to global o2hb_node_events
3) o2hb_shutdown_slot
allocate local event2
4) queue_node_event
add event2 to global o2hb_node_events
5) o2hb_run_event_list
delete event1 from o2hb_node_events
6) o2hb_run_event_list
event1 empty, return
7) o2hb_shutdown_slot
event1 lifecycle ends
8) o2hb_fire_callbacks
event1 is already *invalid*
This patch lets it wait on o2hb_callback_sem when another thread is firing
callbacks. And for performance consideration, we only call
o2hb_run_event_list when there is an event queued.
Signed-off-by: Joyce <xuejiufei@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Wed, 11 Sep 2013 21:20:01 +0000 (14:20 -0700)]
ocfs2: avoid possible NULL pointer dereference in o2net_accept_one()
Since o2nm_get_node_by_num() may return NULL, we add this check in
o2net_accept_one() to avoid possible NULL pointer dereference.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Wed, 11 Sep 2013 21:20:00 +0000 (14:20 -0700)]
ocfs2: adjust code style for o2net_handler_tree_lookup()
Code in o2net_handler_tree_lookup() may be corrupted by mistake. So
adjust it to promote readability.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Younger Liu [Wed, 11 Sep 2013 21:19:59 +0000 (14:19 -0700)]
ocfs2: free path in ocfs2_remove_inode_range()
In ocfs2_remove_inode_range(), there is a memory leak. The variable path
has allocated memory with ocfs2_new_path_from_et(), but it is not free.
Signed-off-by: Younger Liu <younger.liu@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Wed, 11 Sep 2013 21:19:58 +0000 (14:19 -0700)]
ocfs2: fix possible double free in ocfs2_reflink_xattr_rec
In ocfs2_reflink_xattr_rec(), meta_ac and data_ac are allocated by calling
ocfs2_lock_reflink_xattr_rec_allocators().
Once an error occurs when allocating *data_ac, it frees *meta_ac which is
allocated before. Here it mistakenly sets meta_ac to NULL but *meta_ac.
Then ocfs2_reflink_xattr_rec() will try to free meta_ac again which is
already invalid.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2/dlm: force clean refmap when doing local cleanup
dlm_do_local_recovery_cleanup() should force clean refmap if the owner of
lockres is UNKNOWN. Otherwise node may hang when umounting filesystems.
Here's the situation:
Node1 Node2
dlmlock()
-> dlm_get_lock_resource()
send DLM_MASTER_REQUEST_MSG to
other nodes.
trying to master this lockres,
return MAYBE.
selected as the master of lockresA,
set mle->master to Node1,
and do assert_master,
send DLM_ASSERT_MASTER_MSG to Node2.
Node 2 has interest on lockresA
and return
DLM_ASSERT_RESPONSE_MASTERY_REF
then something happened and
Node2 crashed.
Receiving DLM_ASSERT_RESPONSE_MASTERY_REF, set Node2 into refmap, and keep
sending DLM_ASSERT_MASTER_MSG to other nodes
o2hb found node2 down, calling dlm_hb_node_down() -->
dlm_do_local_recovery_cleanup() the master of lockresA is still UNKNOWN,
no need to call dlm_free_dead_locks().
Set the master of lockresA to Node1, but Node2 stills remains in refmap.
When Node1 umount, it found that the refmap of lockresA is not empty and
attempted to migrate it to Node2, But Node2 is already down, so umount
hang, trying to migrate lockresA again and again.
Signed-off-by: joyce <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Younger Liu [Wed, 11 Sep 2013 21:19:56 +0000 (14:19 -0700)]
ocfs2: free meta_ac and data_ac when ocfs2_start_trans fails in ocfs2_xattr_set()
In ocfs2_xattr_set(), if ocfs2_start_trans failed, meta_ac and data_ac
should be free. Otherwise, It would lead to a memory leak.
Signed-off-by: Younger Liu <younger.liu@huawei.com> Cc: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Wed, 11 Sep 2013 21:19:55 +0000 (14:19 -0700)]
ocfs2: add the missing return value check of ocfs2_xattr_get_clusters
In ocfs2_xattr_value_attach_refcount(), if error occurs when calling
ocfs2_xattr_get_clusters(), it will go with unexpected behavior since
local variables p_cluster, num_clusters and ext_flags are declared without
initialization.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jie Liu [Wed, 11 Sep 2013 21:19:53 +0000 (14:19 -0700)]
ocfs2: fix a memory leak in __ocfs2_move_extents()
The ocfs2 path is not properly freed which leads to a memory leak at
__ocfs2_move_extents().
This patch stops the leaks of the ocfs2_path structure.
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Younger Liu <younger.liu@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Wed, 11 Sep 2013 21:19:52 +0000 (14:19 -0700)]
ocfs2: add missing return value check of ocfs2_get_clusters()
In ocfs2_attach_refcount_tree() and ocfs2_duplicate_extent_list(), if
error occurs when calling ocfs2_get_clusters(), it will go with
unexpected behavior as local variables p_cluster, num_clusters and
ext_flags are declared without initialization.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Wed, 11 Sep 2013 21:19:51 +0000 (14:19 -0700)]
ocfs2: clean up dead code in ocfs2_acl_from_xattr()
In ocfs2_acl_from_xattr(), if size is less than sizeof(struct
posix_acl_entry), it returns ERR_PTR(-EINVAL) directly. Then assign (size
/ sizeof(struct posix_acl_entry)) to count which will be at least 1, that
means the following branch (count < 0) and (count == 0) will never be
true.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2: use list_for_each_entry() instead of list_for_each()
[dan.carpenter@oracle.com: fix up some NULL dereference bugs] Signed-off-by: Dong Fang <yp.fangdong@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Younger Liu [Wed, 11 Sep 2013 21:19:47 +0000 (14:19 -0700)]
ocfs2: ac_bits_wanted should be local_alloc_bits when returns -ENOSPC
There is an issue in reserving and claiming space for localalloc, When
localalloc space is not enough, it would claim space from global_bitmap.
And if there is not enough free space in global_bitmap, the size of
claiming space would set to half of orignal size and retry.
The issue is as follows: osb->local_alloc_bits is set to half of orignal
size in ocfs2_recalc_la_window(), but ac->ac_bits_wanted is set to
osb->local_alloc_default_bits which is not changed. localalloc always
reserves and claims local_alloc_default_bits space and returns ENOSPC.
So, ac->ac_bits_wanted should be osb->local_alloc_bits which would be
changed.
Signed-off-by: Younger Liu <younger.liu@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2: dlm_request_all_locks() should deal with the status sent from target node
dlm_request_all_locks() should deal with the status sent from target node
if DLM_LOCK_REQUEST_MSG is sent successfully, or recovery master will fall
into endless loop, waiting for other nodes to send locks and
DLM_RECO_DATA_DONE_MSG to me.
NodeA NodeB
selected as recovery master
dlm_remaster_locks()
->dlm_request_all_locks()
send DLM_LOCK_REQUEST_MSG to nodeA
It happened that NodeA cannot alloc memory when it processes this
message. dlm_request_all_locks_handler() do not queue
dlm_request_all_locks_worker and returns -ENOMEM. It will never send
locks and DLM_RECO_DATA_DONE_MSG to NodeB.
NodeB do not deal with the status
sent from nodeA, and will fall in
endless loop waiting for the
recovery state of NodeA to be
changed.
Signed-off-by: joyce <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Jeff Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Junxiao Bi [Wed, 11 Sep 2013 21:19:45 +0000 (14:19 -0700)]
ocfs2: use i_size_read() to access i_size
Though ocfs2 uses inode->i_mutex to protect i_size, there are both
i_size_read/write() and direct accesses. Clean up all direct access to
eliminate confusion.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Younger Liu [Wed, 11 Sep 2013 21:19:44 +0000 (14:19 -0700)]
ocfs2: lighten up allocate transaction
The issue scenario is as following:
When fallocating a very large disk space for a small file,
__ocfs2_extend_allocation attempts to get a very large transaction. For
some journal sizes, there may be not enough room for this transaction,
and the fallocate will fail.
The patch below extends & restarts the transaction as necessary while
allocating space, and should work with even the smallest journal. This
patch refers ext4 resize.
Test:
# mkfs.ocfs2 -b 4K -C 32K -T datafiles /dev/sdc
...(jounral size is 32M)
# mount.ocfs2 /dev/sdc /mnt/ocfs2/
# touch /mnt/ocfs2/1.log
# fallocate -o 0 -l 400G /mnt/ocfs2/1.log
fallocate: /mnt/ocfs2/1.log: fallocate failed: Cannot allocate memory
# tail -f /var/log/messages
[ 7372.278591] JBD: fallocate wants too many credits (2051 > 2048)
[ 7372.278597] (fallocate,6438,0):__ocfs2_extend_allocation:709 ERROR: status = -12
[ 7372.278603] (fallocate,6438,0):ocfs2_allocate_unwritten_extents:1504 ERROR: status = -12
[ 7372.278607] (fallocate,6438,0):__ocfs2_change_file_space:1955 ERROR: status = -12
^C
With this patch, the test works well.
Signed-off-by: Younger Liu <younger.liu@huawei.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The driver core clears the driver data to NULL after device_release or
on probe failure. Thus, it is not needed to manually clear the device
driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: David Brown <davidb@codeaurora.org> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Suman Anna <s-anna@ti.com> Acked-by: Libo Chen <libo.chen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Bolle [Wed, 11 Sep 2013 21:19:42 +0000 (14:19 -0700)]
drivers/video/acornfb.c: remove dead code
acornfb checks for HAS_VIDC while support for that macro was removed in
v2.6.23 (when the arm26 port was removed). So we can remove a bit of
dead code.
fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks
do_fork() denies CLONE_THREAD | CLONE_PARENT if NEWUSER | NEWPID.
Then later copy_process() denies CLONE_SIGHAND if the new process will
be in a different pid namespace (task_active_pid_ns() doesn't match
current->nsproxy->pid_ns).
This looks confusing and inconsistent. CLONE_NEWPID is very similar to
the case when ->pid_ns was already unshared, we want the same
restrictions so copy_process() should also nack CLONE_PARENT.
And it would be better to deny CLONE_NEWUSER && CLONE_SIGHAND as well
just for consistency.
Kill the "CLONE_NEWUSER | CLONE_NEWPID" check in do_fork() and change
copy_process() to do the same check along with ->pid_ns check we already
have.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Colin Walters <walters@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pidns: kill the unnecessary CLONE_NEWPID in copy_process()
Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
unshare(CLONE_NEWPID)") nacks CLONE_NEWPID if the forking process
unshared pid_ns. This is correct but unnecessary, copy_pid_ns() does
the same check.
Remove the CLONE_NEWPID check to cleanup the code and prepare for the
next change.
Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
pid_ns, this obviously breaks vfork:
Change this check to use CLONE_SIGHAND instead. This also forbids
CLONE_THREAD automatically, and this is what the comment implies.
We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it
would be safer to not do this. The current check denies CLONE_SIGHAND
implicitely and there is no reason to change this.
Eric said "CLONE_SIGHAND is fine. CLONE_THREAD would be even better.
Having shared signal handling between two different pid namespaces is
the case that we are fundamentally guarding against."
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Colin Walters <walters@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Wed, 11 Sep 2013 21:19:37 +0000 (14:19 -0700)]
include/linux/smp.h:on_each_cpu(): switch back to a C function
Revert commit c846ef7deba2 ("include/linux/smp.h:on_each_cpu(): switch
back to a macro"). It turns out that the problematic linux/irqflags.h
include was fixed within ia64 and mn10300.
Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
"This includes one bpf/jit bug fix where the jit compiler could
sometimes write generated code out of bounds of the allocated memory
area.
The rest of the patches are only cleanups and minor improvements"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/irq: reduce size of external interrupt handler hash array
s390/compat,uid16: use current_cred()
s390/ap_bus: use and-mask instead of a cast
s390/ftrace: avoid pointer arithmetics with function pointers
s390: make various functions static, add declarations to header files
s390/compat signal: add couple of __force annotations
s390/mm: add __releases()/__acquires() annotations to gmap_alloc_table()
s390: keep Kconfig sorted
s390/irq: rework irq subclass handling
s390/irq: use hlists for external interrupt handler array
s390/dumpstack: convert print_symbol to %pSR
s390/perf: Remove print_hex_dump_bytes() debug output
s390: update defconfig
s390/bpf,jit: fix address randomization
Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kconfig updates from Michal Marek:
"This is the kconfig part of kbuild for v3.12-rc1:
- post-3.11 search code fixes and micro-optimizations
- CONFIG_MODULES is no longer a special case; this is needed to
eventually fix the bug that using KCONFIG_ALLCONFIG breaks
allmodconfig
- long long is used to store hex and int values
- make silentoldconfig no longer warns when a symbol changes from
tristate to bool (it's a job for make oldconfig)
- scripts/diffconfig updated to work with newer Pythons
- scripts/config does not rely on GNU sed extensions"
* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
kconfig: do not allow more than one symbol to have 'option modules'
kconfig: regenerate bison parser
kconfig: do not special-case 'MODULES' symbol
diffconfig: Update script to support python versions 2.5 through 3.3
diffconfig: Gracefully exit if the default config files are not present
modules: do not depend on kconfig to set 'modules' option to symbol MODULES
kconfig: silence warning when parsing auto.conf when a symbol has changed type
scripts/config: use sed's POSIX interface
kconfig: switch to "long long" for sanity
kconfig: simplify symbol-search code
kconfig: don't allocate n+1 elements in temporary array
kconfig: minor style fixes in symbol-search code
kconfig/[mn]conf: shorten title in search-box
kconfig: avoid multiple calls to strlen
Documentation/kconfig: more concise and straightforward search explanation
Merge tag 'for-v3.12' of git://git.infradead.org/battery-2.6
Pull battery/power supply driver updates from Anton Vorontsov:
"New drivers:
- APM X-Gene system reboot driver by Feng Kan and Loc Ho (APM).
- Qualcomm MSM reboot/poweroff driver by Abhimanyu Kapur (Codeaurora).
- Texas Instruments BQ24190 charger driver by Mark A. Greer (Animal
Creek Technologies).
- Texas Instruments TWL4030 MADC battery driver by Lukas Märdian and
Marek Belisko (Golden Delicious Computers). The driver is used on
Freerunner GTA04 phones.
Highlighted fixes and improvements:
- Suspend/wakeup logic improvements: power supply objects will block
system suspend until all power supply events are processed. Thanks
to Zoran Markovic (Linaro), Arve Hjonnevag and Todd Poynor (Google)"
* tag 'for-v3.12' of git://git.infradead.org/battery-2.6:
rx51_battery: Fix channel number when reading adc value
power: Add twl4030_madc battery driver.
bq24190_charger: Workaround SS definition problem on i386 builds
power_supply: Prevent suspend until power supply events are processed
vexpress-poweroff: Should depend on the required infrastructure
twl4030-charger: Fix compiler warning with regulator_enable()
rx51_battery: Replace hardcoded channels values.
bq24190_charger: Add support for TI BQ24190 Battery Charger
ab8500-charger: We print an unintended error message
max8925_power: Fix missing of_node_put
power_supply: Replace strict_strtol() with kstrtol()
power: Add APM X-Gene system reboot driver
power_supply: tosa_battery: Get rid of irq_to_gpio usage
power supply: collie_battery: Convert to use dev_pm_ops
power_supply: Make goldfish_battery depend on GOLDFISH || COMPILE_TEST
power: reset: Add msm restart support
MAINTAINERS: drivers/power: add entry for SmartReflex AVS drivers
Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc fixes from Ben Herrenschmidt:
"Here are a handful of small powerpc fixes.
A couple of section mismatches (always worth fixing), a missing export
of a new symbol causing build failures of modules, a page fault
deadlock fix (interestingly that bug has been around for a LONG time,
though it seems to be more easily triggered by KVM) and fixing pseries
default idle loop in the absence of the cpuidle drivers (such as
during boot)"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Default arch idle could cede processor on pseries
fbdev/ps3fb: Fix section mismatch warning for ps3fb_probe
powerpc: Fix section mismatch warning for prom_rtas_call
powerpc: Fix possible deadlock on page fault
powerpc: Export cpu_to_chip_id() to fix build error
Merge tag 'stable/for-linus-3.12-rc0-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
"This pull I usually do after rc1 is out but because we have a nice
amount of fixes, some bootup related fixes for ARM, and it is early in
the cycle we figured to do it now to help with tracking of potential
regressions.
The simple ones are the ARM ones - one of the patches fell through the
cracks, other fixes a bootup issue (unconditionally using Xen
functions). Then a fix for a regression causing preempt count being
off (patch causing this went in v3.12).
Lastly are the fixes to make Xen PVHVM guests use PV ticketlocks (Xen
PV already does).
The enablement of that was supposed to be part of the x86 spinlock
merge in commit 816434ec4a67 ("The biggest change here are
paravirtualized ticket spinlocks (PV spinlocks), which bring a nice
speedup on various benchmarks...") but unfortunatly it would cause
hang when booting Xen PVHVM guests. Yours truly got all of the bugs
fixed last week and they (six of them) are included in this pull.
Bug-fixes:
- Boot on ARM without using Xen unconditionally
- On Xen ARM don't run cpuidle/cpufreq
- Fix regression in balloon driver, preempt count warnings
- Fixes to make PVHVM able to use pv ticketlock.
- Revert Xen PVHVM disabling pv ticketlock (aka, re-enable pv ticketlocks)"
* tag 'stable/for-linus-3.12-rc0-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/spinlock: Don't use __initdate for xen_pv_spin
Revert "xen/spinlock: Disable IRQ spinlock (PV) allocation on PVHVM"
xen/spinlock: Don't setup xen spinlock IPI kicker if disabled.
xen/smp: Update pv_lock_ops functions before alternative code starts under PVHVM
xen/spinlock: We don't need the old structure anymore
xen/spinlock: Fix locking path engaging too soon under PVHVM.
xen/arm: disable cpuidle and cpufreq when linux is running as dom0
xen/p2m: Don't call get_balloon_scratch_page() twice, keep interrupts disabled for multicalls
ARM: xen: only set pm function ptrs for Xen guests
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Daniel had some fixes queued up, that were delayed, the stolen memory
ones and vga arbiter ones are quite useful, along with his usual bunch
of stuff, nothing for HSW outputs yet.
The one nouveau fix is for a regression I caused with the poweroff stuff"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (30 commits)
drm/nouveau: fix oops on runtime suspend/resume
drm/i915: Delay disabling of VGA memory until vgacon->fbcon handoff is done
drm/i915: try not to lose backlight CBLV precision
drm/i915: Confine page flips to BCS on Valleyview
drm/i915: Skip stolen region initialisation if none is reserved
drm/i915: fix gpu hang vs. flip stall deadlocks
drm/i915: Hold an object reference whilst we shrink it
drm/i915: fix i9xx_crtc_clock_get for multiplied pixels
drm/i915: handle sdvo input pixel multiplier correctly again
drm/i915: fix hpd work vs. flush_work in the pageflip code deadlock
drm/i915: fix up the relocate_entry refactoring
drm/i915: Fix pipe config warnings when dealing with LVDS fixed mode
drm/i915: Don't call sg_free_table() if sg_alloc_table() fails
i915: Update VGA arbiter support for newer devices
vgaarb: Fix VGA decodes changes
vgaarb: Don't disable resources that are not owned
drm/i915: Pin pages whilst mapping the dma-buf
drm/i915: enable trickle feed on Haswell
x86: add early quirk for reserving Intel graphics stolen memory v5
drm/i915: split PCI IDs out into i915_drm.h v4
...
Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"This was a very quiet cycle! Just a few bugfixes and some cleanup"
* 'nfsd-next' of git://linux-nfs.org/~bfields/linux:
rpc: let xdr layer allocate gssproxy receieve pages
rpc: fix huge kmalloc's in gss-proxy
rpc: comment on linux_cred encoding, treat all as unsigned
rpc: clean up decoding of gssproxy linux creds
svcrpc: remove unused rq_resused
nfsd4: nfsd4_create_clid_dir prints uninitialized data
nfsd4: fix leak of inode reference on delegation failure
Revert "nfsd: nfs4_file_get_access: need to be more careful with O_RDWR"
sunrpc: prepare NFS for 2038
nfsd4: fix setlease error return
nfsd: nfs4_file_get_access: need to be more careful with O_RDWR
Merge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon cleanups from Guenter Roeck:
"Minor cleanup in ina2xx and hwmon-vid drivers; no functional changes"
* tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (ina2xx) Remove casting the return value which is a void pointer
hwmon: (hwmon-vid) Add __maybe_unused attribute to dummy variable
Merge branch 'x86/jumplabel' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 jumplabel changes from Peter Anvin:
"One more x86 tree for this merge window. This tree improves the
handling of jump labels, so that most of the time we don't have to do
a massive initial patching run.
Furthermore, we will error out of the jump label is not what is
expected, eg if it has been corrupted or tampered with"
* 'x86/jumplabel' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/jump-label: Show where and what was wrong on errors
x86/jump-label: Add safety checks to jump label conversions
x86/jump-label: Do not bother updating nops if they are correct
x86/jump-label: Use best default nops for inital jump label calls
powerpc: Default arch idle could cede processor on pseries
When adding cpuidle support to pSeries, we introduced two
regressions:
- The new cpuidle backend driver only works under hypervisors
supporting the "SLPLAR" option, which isn't the case of the
old POWER4 hypervisor and the HV "light" used on js2x blades
- The cpuidle driver registers fairly late, meaning that for
a significant portion of the boot process, we end up having
all threads spinning. This slows down the boot process and
increases the overall resource usage if the hypervisor has
shared processors.
This fixes both by implementing a "default" idle that will cede
to the hypervisor when possible, in a very simple way without
all the bells and whisles of cpuidle.
Reported-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Acked-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: <stable@vger.kernel.org>
Vladimir Murzin [Tue, 10 Sep 2013 16:46:30 +0000 (18:46 +0200)]
fbdev/ps3fb: Fix section mismatch warning for ps3fb_probe
While cross-building for PPC64 I've got
WARNING: drivers/video/built-in.o(.text+0x9f9ca): Section mismatch in
reference from the function .ps3fb_probe() to th e variable
.init.data:ps3fb_fix The function .ps3fb_probe() references the
variable __initdata ps3fb_fix. This is often because .ps3fb_probe
lacks a __initdata annotation or the annotation of ps3fb_fix is wrong.
WARNING: drivers/video/built-in.o(.text+0x9f9d2): Section mismatch in
reference from the function .ps3fb_probe() to the variable
.init.data:ps3fb_fix The function .ps3fb_probe() references the
variable __initdata ps3fb_fix. This is often because .ps3fb_probe
lacks a __initdata annotation or the annotation of ps3fb_fix is wrong.
WARNING: drivers/built-in.o(.text+0xe222a): Section mismatch in
reference from the function .ps3fb_probe() to the variable
.init.data:ps3fb_fix The function .ps3fb_probe() references the
variable __initdata ps3fb_fix. This is often because .ps3fb_probe
lacks a __initdata annotation or the annotation of ps3fb_fix is wrong.
WARNING: drivers/built-in.o(.text+0xe2232): Section mismatch in
reference from the function .ps3fb_probe() to the variable
.init.data:ps3fb_fix The function .ps3fb_probe() references the
variable __initdata ps3fb_fix. This is often because .ps3fb_probe
lacks a __initdata annotation or the annotation of ps3fb_fix is wrong.
WARNING: vmlinux.o(.text+0x561d4a): Section mismatch in reference from
the function .ps3fb_probe() to the variable .init.data:ps3fb_fix The
function .ps3fb_probe() references the variable __initdata ps3fb_fix.
This is often because .ps3fb_probe lacks a __initdata annotation or
the annotation of ps3fb_fix is wrong.
Mismatch was introduced with 48c68c4f "Drivers: video: remove __dev*
attributes."
Remove __init data annotation from ps3fb_fix.
Signed-off-by: Vladimir Murzin <murzin.v@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Vladimir Murzin [Tue, 10 Sep 2013 16:42:07 +0000 (18:42 +0200)]
powerpc: Fix section mismatch warning for prom_rtas_call
While cross-building for PPC64 I've got
WARNING: vmlinux.o(.text.unlikely+0x1ba): Section mismatch in
reference from the function .prom_rtas_call() to the variable
.init.data:dt_string_start The function .prom_rtas_call() references
the variable __initdata dt_string_start. This is often because
.prom_rtas_call lacks a __initdata annotation or the annotation of
dt_string_start is wrong.
WARNING: vmlinux.o(.meminit.text+0xeb0): Section mismatch in reference
from the function .free_area_init_core.isra.47() to the function
.init.text:.set_pageblock_order() The function __meminit
.free_area_init_core.isra.47() references a function __init
.set_pageblock_order(). If .set_pageblock_order is only used by
.free_area_init_core.isra.47 then annotate .set_pageblock_order with a
matching annotation.
Fix it by proper annotation of prom_rtas_call.
Signed-off-by: Vladimir Murzin <murzin.v@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull CRIS updates from Jesper Nilsson:
"Mostly cleanup and removal of unused configs"
* tag 'cris-for-3.12' of git://jni.nu/cris:
CRIS: drop unused Kconfig symbols
CRIS: Add kvm_para.h which includes generic file
CRIS: remove unused current_regs
CRIS: Remove last traces of legacy RTC drivers
CRIS: remove "config OOM_REBOOT"
Merge tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux
Pull device tree core updates from Grant Likely:
"Generally minor changes. A bunch of bug fixes, particularly for
initialization and some refactoring. Most notable change if feeding
the entire flattened tree into the random pool at boot. May not be
significant, but shouldn't hurt either"
Tim Bird questions whether the boot time cost of the random feeding may
be noticeable. And "add_device_randomness()" is definitely not some
speed deamon of a function.
* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux:
of/platform: add error reporting to of_amba_device_create()
irq/of: Fix comment typo for irq_of_parse_and_map
of: Feed entire flattened device tree into the random pool
of/fdt: Clean up casting in unflattening path
of/fdt: Remove duplicate memory clearing on FDT unflattening
gpio: implement gpio-ranges binding document fix
of: call __of_parse_phandle_with_args from of_parse_phandle
of: introduce of_parse_phandle_with_fixed_args
of: move of_parse_phandle()
of: move documentation of of_parse_phandle_with_args
of: Fix missing memory initialization on FDT unflattening
of: consolidate definition of early_init_dt_alloc_memory_arch()
of: Make of_get_phy_mode() return int i.s.o. const int
include: dt-binding: input: create a DT header defining key codes.
of/platform: Staticize of_platform_device_create_pdata()
of: Specify initrd location using 64-bit
dt: Typo fix
OF: make of_property_for_each_{u32|string}() use parameters if OF is not enabled
Merge branch 'for-linus' of git://git.infradead.org/users/vkoul/slave-dma
Pull slave-dmaengine updates from Vinod Koul:
"This pull brings:
- Andy's DW driver updates
- Guennadi's sh driver updates
- Pl08x driver fixes from Tomasz & Alban
- Improvements to mmp_pdma by Daniel
- TI EDMA fixes by Joel
- New drivers:
- Hisilicon k3dma driver
- Renesas rcar dma driver
- New API for publishing slave driver capablities
- Various fixes across the subsystem by Andy, Jingoo, Sachin etc..."
* 'for-linus' of git://git.infradead.org/users/vkoul/slave-dma: (94 commits)
dma: edma: Remove limits on number of slots
dma: edma: Leave linked to Null slot instead of DUMMY slot
dma: edma: Find missed events and issue them
ARM: edma: Add function to manually trigger an EDMA channel
dma: edma: Write out and handle MAX_NR_SG at a given time
dma: edma: Setup parameters to DMA MAX_NR_SG at a time
dmaengine: pl330: use dma_set_max_seg_size to set the sg limit
dmaengine: dma_slave_caps: remove sg entries
dma: replace devm_request_and_ioremap by devm_ioremap_resource
dma: ste_dma40: Fix potential null pointer dereference
dma: ste_dma40: Remove duplicate const
dma: imx-dma: Remove redundant NULL check
dma: dmagengine: fix function names in comments
dma: add driver for R-Car HPB-DMAC
dma: k3dma: use devm_ioremap_resource() instead of devm_request_and_ioremap()
dma: imx-sdma: Staticize sdma_driver_data structures
pch_dma: Add MODULE_DEVICE_TABLE
dmaengine: PL08x: Add cyclic transfer support
dmaengine: PL08x: Fix reading the byte count in cctl
dmaengine: PL08x: Add support for different maximum transfer size
...
Merge tag 'mmc-updates-for-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc
Pull MMC updates from Chris Ball:
"MMC highlights for 3.12:
Core:
- Support Allocation Units 8MB-64MB in SD3.0, previous max was 4MB.
- The slot-gpio helper can now handle GPIO debouncing card-detect.
- Read supported voltages from DT "voltage-ranges" property.
Drivers:
- dw_mmc: Add support for ARC architecture, and support exynos5420.
- mmc_spi: Support CD/RO GPIOs.
- sh_mobile_sdhi: Add compatibility for more Renesas SoCs.
- sh_mmcif: Add DT support for DMA channels"
* tag 'mmc-updates-for-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (50 commits)
Revert "mmc: tmio-mmc: Remove .set_pwr() callback from platform data"
mmc: dw_mmc: Add support for ARC
mmc: sdhci-s3c: initialize host->quirks2 for using quirks2
mmc: sdhci-s3c: fix the wrong register value, when clock is disabled
mmc: esdhc: add support to get voltage from device-tree
mmc: sdhci: get voltage from sdhc host
mmc: core: parse voltage from device-tree
mmc: omap_hsmmc: use the generic config for omap2plus devices
mmc: omap_hsmmc: clear status flags before starting a new command
mmc: dw_mmc: exynos: Add a new compatible string for exynos5420
mmc: sh_mmcif: revision-specific CLK_CTRL2 handling
mmc: sh_mmcif: revision-specific Command Completion Signal handling
mmc: sh_mmcif: add support for Device Tree DMA bindings
mmc: sh_mmcif: move header include from header into .c
mmc: SDHI: add DT compatibility strings for further SoCs
mmc: dw_mmc-pci: enable bus-mastering mode
mmc: dw_mmc-pci: get resources from a proper BAR
mmc: tmio-mmc: Remove .set_pwr() callback from platform data
mmc: tmio-mmc: Remove .get_cd() callback from platform data
mmc: sh_mobile_sdhi: Remove .set_pwr() callback from platform data
...
Merge tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device-mapper updates from Mike Snitzer:
"Add the ability to collect I/O statistics on user-defined regions of a
device-mapper device. This dm-stats code required the reintroduction
of a div64_u64_rem() helper, but as a separate method that doesn't
slow down div64_u64() -- especially on 32-bit systems.
Allow the error target to replace request-based DM devices (e.g.
multipath) in addition to bio-based DM devices.
Various other small code fixes and improvements to thin-provisioning,
DM cache and the DM ioctl interface"
* tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm stripe: silence a couple sparse warnings
dm: add statistics support
dm thin: always return -ENOSPC if no_free_space is set
dm ioctl: cleanup error handling in table_load
dm ioctl: increase granularity of type_lock when loading table
dm ioctl: prevent rename to empty name or uuid
dm thin: set pool read-only if breaking_sharing fails block allocation
dm thin: prefix pool error messages with pool device name
dm: allow error target to replace bio-based and request-based targets
math64: New separate div64_u64_rem helper
dm space map: optimise sm_ll_dec and sm_ll_inc
dm btree: prefetch child nodes when walking tree for a dm_btree_del
dm btree: use pop_frame in dm_btree_del to cleanup code
dm cache: eliminate holes in cache structure
dm cache: fix stacking of geometry limits
dm thin: fix stacking of geometry limits
dm thin: add data block size limits to Documentation
dm cache: add data block size limits to code and Documentation
dm cache: document metadata device is exclussive to a cache
dm: stop using WQ_NON_REENTRANT
Pull md update from Neil Brown:
"Headline item is multithreading for RAID5 so that more IO/sec can be
supported on fast (SSD) devices. Also TILE-Gx SIMD suppor for RAID6
calculations and an assortment of bug fixes"
* tag 'md/3.12' of git://neil.brown.name/md:
raid5: only wakeup necessary threads
md/raid5: flush out all pending requests before proceeding with reshape.
md/raid5: use seqcount to protect access to shape in make_request.
raid5: sysfs entry to control worker thread number
raid5: offload stripe handle to workqueue
raid5: fix stripe release order
raid5: make release_stripe lockless
md: avoid deadlock when dirty buffers during md_stop.
md: Don't test all of mddev->flags at once.
md: Fix apparent cut-and-paste error in super_90_validate
raid6/test: replace echo -e with printf
RAID: add tilegx SIMD implementation of raid6
md: fix safe_mode buglet.
md: don't call md_allow_write in get_bitmap_file.